Package 'forestRK' reference manual

Title:	Implements the Forest-R.K. Algorithm for Classification Problems
Description:	Provides functions that calculate common types of splitting criteria used in random forests for classification problems, as well as functions that make predictions based on a single tree or a Forest-R.K. model; the package also provides functions to generate an importance plot for a Forest-R.K. model, as well as a 2D multidimensional-scaling plot of data points that are colour coded by the class types predicted by the Forest-R.K. model. This package is based on: Bernard, S., Heutte, L., Adam, S., (2008, ISBN:978-3-540-85983-3) "Forest-R.K.: A New Random Forest Induction Method", Fourth International Conference on Intelligent Computing, September 2008, Shanghai, China, pp.430-437.
Authors:	Hyunjin Cho [aut, cre], Rebecca Su [ctb]
Maintainer:	Hyunjin Cho <[email protected]>
License:	GPL (>= 3) | file LICENSE
Version:	0.0-5
Built:	2025-03-05 04:02:22 UTC
Source:	https://github.com/h56cho/forestrk

Performs bootstrap sampling of the (training) dataset

Description

Performs bootstrap sampling of our (training) dataset; this function is used inside of the forestRK function.

Usage

 bstrap(dat = data.frame(), nbags, samp.size)

Arguments

`dat`	a numericized data frame that stores both the covariates of the observations and their numericized class types `y`; `dat` should contain no `NA` or `NaN`'s.
`nbags`	the number of bags or the number of bootstrap samples that we want to generate.
`samp.size`	the number of samples that each bag (individual bootstrap sample) should contain.

Value

A list containing data frames of bootstrap samples generated from dat.
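
As a rough illustration of the idea, the following base R sketch draws bootstrap samples from a data frame. The helper name `bootstrap_sketch` is hypothetical, and sampling rows with replacement is an assumption about how such a function might work; it is not the package's implementation.

```r
# Hypothetical helper, not part of forestRK: draws `nbags` bootstrap samples,
# each with `samp.size` rows sampled with replacement from `dat`.
bootstrap_sketch <- function(dat, nbags, samp.size) {
  lapply(seq_len(nbags), function(i) {
    idx <- sample(nrow(dat), size = samp.size, replace = TRUE)
    dat[idx, , drop = FALSE]
  })
}

set.seed(1)
bags <- bootstrap_sketch(iris, nbags = 3, samp.size = 10)
length(bags)        # number of bags
sapply(bags, nrow)  # number of rows in each bag
```

The result is a plain list, so individual bags can be retrieved with `bags[[1]]`, `bags[[2]]`, and so on.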

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  # covariates of training data set
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new
  # combine the covariates x with class types y
  b <- data.frame(cbind(x.train, y.train))

  ## bstrap function example
  bootstrap.sample <- bstrap(dat = b, nbags = 20, samp.size = 30)

Constructs a classification tree on the (training) dataset, by implementing the RK (Random 'K') algorithm

Description

Constructs a classification tree based on the dataset of interest by implementing the RK (Random 'K') algorithm.

The package rapportools is loaded internally when this function is called, so that its is.boolean function can be used to check one of the stopping criteria at the beginning of the function. The forestRK functions used inside construct.treeRK are criteria.calculator and cutoff.node.and.covariate.index.finder.

The construct.treeRK output is one of the arguments that is used to call the pred.treeRK function.

DESCRIPTIONS OF THE RETURNED VALUES:

The hierarchical flag of a rktree (construct.treeRK()$flag) is constructed in the following way:

(1) the first entry of the flag, "r", denotes the root; (2) each subsequent string of the flag is constructed so that a final "x" denotes the left child node of the node represented by the characters before that "x", and a final "y" denotes the right child node of the node represented by the characters before that "y".

For example, the flag "rxyx" is the left child node of the node represented by "rxy".
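
The naming rule above can be sketched in a few lines of base R. The helper `describe_flag` is hypothetical and only decodes a single flag string; it is not a function of the package.

```r
# Hypothetical helper: decodes one flag string such as "rxyx" into its
# parent flag, the side of the parent it sits on, and its depth in the tree.
describe_flag <- function(flag) {
  n <- nchar(flag)
  if (n == 1) {
    return(list(parent = NA_character_, side = "root", depth = 0))
  }
  last <- substr(flag, n, n)
  list(
    parent = substr(flag, 1, n - 1),               # characters before the last one
    side   = if (last == "x") "left" else "right", # "x" = left, "y" = right
    depth  = n - 1
  )
}

describe_flag("rxyx")
```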

x.node.list and y.new.node.list are the lists of child nodes (for x and y, respectively) of the rktree, listed in an order consistent with the order of the nodes in the rktree's hierarchical flag.

covariate.split is a matrix that lists the numericized covariate names that were used for the splits to construct the rktree. The first entry of covariate.split is NA, which stands for the condition at the root. The number immediately underneath NA is the numericized covariate name that was used for the first split in the rktree, and the number below that is the numericized covariate name that was used for the second split, etc. If the numericized covariate name listed under covariate.split is the number "n", this corresponds to the "n"-th covariate or the name of the "n"-th column of the data frame x.train.

value.at.split is a vector that lists the actual values of the covariates at which the splits occurred while constructing the rktree. The first entry of value.at.split is NA, which denotes the root prior to any splits. As an example of how to interpret value.at.split: if the second entry of covariate.split is 4 and the second entry of value.at.split is 0.5, then the first split of the rktree occurred on the covariate in the 4th column of the data frame x.train, and the exact criterion for that first split was (4th covariate value) <= 0.5 vs. (4th covariate value) > 0.5.
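
To make the interpretation concrete, the sketch below applies a split of this form to the iris covariates. The two vectors are made-up values that mirror the example in the text, not output from construct.treeRK.

```r
# Hypothetical values mirroring the example in the text (not real output).
covariate.split <- c(NA, 4)
value.at.split  <- c(NA, 0.5)

x <- iris[, 1:4]
j <- covariate.split[2]          # column index used for the first split
threshold <- value.at.split[2]   # value at which the first split occurs

left  <- x[x[[j]] <= threshold, , drop = FALSE]  # left child node
right <- x[x[[j]] >  threshold, , drop = FALSE]  # right child node

names(x)[j]                                 # name of the 4th covariate
c(left = nrow(left), right = nrow(right))   # sizes of the two child nodes
```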

amt.decrease.criteria is a matrix that lists the amount of decrease in the splitting criterion (Entropy or Gini Index) after each split occurred. The first entry of amt.decrease.criteria is NA, which denotes the condition at the root (no split). For example, if the second entry of amt.decrease.criteria is 0.91 and entropy was set to TRUE, then after the first split the Entropy of the original node decreased by 0.91.

num.obs is a matrix that stores the number of observations contained within a parent node prior to the split; the matrix starts with the entry NA, to reflect the condition at the root. The 2nd entry of num.obs gives the number of observations contained within the parent node on which the 1st split took place while the rktree was built; the 3rd entry gives the number of observations contained within the parent node on which the 2nd split took place, and so on.

Usage

 construct.treeRK(x.train = data.frame(), y.new.train = c(),
                  min.num.obs.end.node.tree = 5, entropy = TRUE)

Arguments

`x.train`	a numericized data frame of covariates of the data on which we want to build our rktree models (typically the training data); this data frame can be obtained by applying the `x.organizer` function. `x.train` should contain no `NA` or `NaN`'s.
`y.new.train`	a vector of numericized class types of the observations from the dataset on which we want to build our rktree models (typically the training data). `y.new.train` should contain no `NA` or `NaN`'s.
`min.num.obs.end.node.tree`	the minimum number of observations that we want each end node of our rktree to contain. Default is set to '5'.
`entropy`	`TRUE` if Entropy is used as the splitting criteria; `FALSE` if Gini Index is used as the splitting criteria. Default is set to `TRUE`.

Value

A list containing the following items:

`covariate.names`	a vector of the names of all covariates that we consider in our model.
`l`	length of the hierarchical flag.
`x.node.list`	a list containing a series of children nodes produced from the numericized data frame `x.train` as the `rktree` model was building up.
`y.new.node.list`	a list containing a series of children nodes produced from the numericized vector of class type `y.new.train` as the `rktree` model was building up.
`flag`	hierarchical flag that characterizes each split in the `rktree`.
`covariate.split`	a matrix that lists numericized covariates used for each split as the `rktree` was built.
`value.at.split`	a vector that lists the values at which each node of the `rktree` was split.
`amt.decrease.criteria`	a matrix that lists the amount of decrease in splitting criteria after each split as the `rktree` was built.
`num.obs`	a matrix that stores the number of observations contained in each parent node right before each split.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # Construct a tree
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  tree.entropy <- construct.treeRK(x.train, y.train)
  tree.gini <- construct.treeRK(x.train, y.train,
                                min.num.obs.end.node.tree = 6, entropy = FALSE)
  tree.entropy$covariate.names
  tree.gini$flag # ...etc...

Calculates Entropy or Gini Index of a node after a given split

Description

Calculates Entropy or Gini Index of a particular node after a particular split; this function is called within the construct.treeRK function.

The argument split.record is a kidids_split object from the package partykit; the function kidids_split splits the data according to the criteria specified by a user ahead of time, and returns a vector storing the index of the split group (group "1" or "2") that each observation from the original data belongs to after the split has occurred.

For more information about the function, please see the partykit documentation.
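
One common way to score a node after a split is the size-weighted average of the criterion over the two child groups. The sketch below illustrates that idea with Entropy; the helper names are hypothetical, and both the weighting and the base-2 logarithm are assumptions rather than a statement of what the package computes.

```r
# Hypothetical helper: Shannon entropy of a vector of class labels (log base 2).
entropy_sketch <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical helper: size-weighted average entropy of the child groups,
# where `group` holds the group index ("1" or "2") of each observation.
entropy_after_split_sketch <- function(y, group) {
  parts   <- split(y, group)
  weights <- lengths(parts) / length(y)
  sum(weights * sapply(parts, entropy_sketch))
}

y <- c(1, 1, 2, 2)
entropy_after_split_sketch(y, group = c(1, 1, 2, 2))  # both children are pure
entropy_after_split_sketch(y, group = c(1, 2, 1, 2))  # both children stay mixed
```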

Usage

 criteria.after.split.calculator(x.node = data.frame(), y.new.node = c(),
                                 split.record = kidids_split(),
                                 entropy = TRUE)

Arguments

`x.node`	numericized data frame of covariates (obtained via `x.organizer()`) from a particular node that is to be split; `x.node` should contain no `NA` or `NaN`'s.
`y.new.node`	numericized class type of each observation from a particular node that is to be split; `y.new.node` should contain no `NA` or `NaN`'s.
`split.record`	output of the `kidids_split` function from the `partykit` package that describes a particular split.
`entropy`	`TRUE` if Entropy is used as the splitting criteria; `FALSE` if Gini Index is used instead. Default is set to `TRUE`.

Value

The value of Entropy or Gini Index of a particular node after a particular split.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  library(forestRK) # load the package forestRK
  library(partykit)

  # covariates of training data set
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  # numericized class types of observations of training dataset
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new
  ## criteria.after.split.calculator() example in the implementation
  ## of the forestRK algorithm

  ent.status <- TRUE

  # number.of.columns.of.x.node
  # = total number of covariates that we consider
  number.of.columns.of.x.node <- dim(x.train)[2]
  # m.try = the randomly chosen number of covariates that we consider
  # at the time of split
  m.try <- sample(1:(number.of.columns.of.x.node),1)
  ## sample m.try number of covariates from the list of all covariates
  K <- sample(1:(number.of.columns.of.x.node), m.try)

  # split the data
  # (the choice of the type of split used here is only arbitrary)
  # for more information about kidids_split,
  # please refer to the documentation for the package 'partykit'
  sp <- partysplit(varid=K[1], breaks = x.train[1,K[1]], index = NULL,
                   right = TRUE, prob = NULL, info = NULL)
  split.record <- kidids_split(sp, data=x.train)

  # apply criteria.after.split.calculator to the kidids_split object
  criteria.after.split <- criteria.after.split.calculator(x.train,
                                    y.train, split.record, ent.status)
  criteria.after.split

Calculates Entropy or Gini Index of a particular node before (or without) a split

Description

Calculates the Entropy or Gini Index of a particular node before (or without) a split. This function is used inside the criteria.after.split.calculator method.
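
For reference, the two criteria can be sketched in base R as follows. The helper names are hypothetical, and the use of the base-2 logarithm for Entropy is an assumption; the package may use a different base.

```r
# Hypothetical helper: Entropy = -sum(p * log2(p)) over the class proportions p.
entropy_sketch <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical helper: Gini Index = 1 - sum(p^2) over the class proportions p.
gini_sketch <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

y <- c(1, 1, 2, 2)   # two equally frequent classes
entropy_sketch(y)
gini_sketch(y)
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed.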

Usage

 criteria.calculator(x.node = data.frame(), y.new.node = c(),
                     entropy = TRUE)

Arguments

`x.node`	numericized data frame of covariates of a particular node (can be obtained by applying `x.organizer`) before or without a split; `x.node` should contain no `NA` or `NaN`'s.
`y.new.node`	numericized vector of class type (`y`) of a particular node (can be obtained by applying `y.organizer`) before or without split; `y.new.node` should contain no `NA` or `NaN`'s.
`entropy`	`TRUE` if Entropy is used as the splitting criteria; `FALSE` if Gini Index is used as the splitting criteria. Default is set to `TRUE`.

Value

A list containing the following items:

`criteria`	the value of the Entropy or the Gini Index of a particular node.
`ent.status`	logical value (`TRUE` or `FALSE`) of the parameter `entropy`.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

 ## example: iris dataset
 library(forestRK) # load the package forestRK

 # covariates of training data set
 x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
 # numericized class types of observations of training dataset
 y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

 ## criteria.calculator() example
 ## calculate the Entropy of the original training dataset
 criteria.calculator(x.node = x.train, y.new.node = y.train)
 ## calculate the Gini Index of the original training dataset
 criteria.calculator(x.node = x.train, y.new.node = y.train, entropy = FALSE)

Identifies optimal cutoff point of an impure node for splitting after applying the `rk` (Random K) algorithm.

Description

Identifies the optimal cutoff point of an impure dataset for splitting after applying the rk (Random K) algorithm, in terms of Entropy or Gini Index.

For example, if the function returns a cutoff.value of 2.5, a covariate.ind of 4, and a cutoff.node of 23, then a split of the node under consideration should occur on the 4th covariate (the actual name of this covariate is the name of the 4th column of the original dataset), at the value 2.5. The left child node would then contain the observations whose 4th covariate value is less than or equal to 2.5, and the right child node would contain the observations whose 4th covariate value is greater than 2.5. This splitting point corresponds to the 23rd observation of the node.
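
A search of this kind can be sketched as an exhaustive scan over the observed values of the candidate covariates. The helper `best_cutoff_sketch` is hypothetical and uses a size-weighted Gini Index as the score, which is an assumption; in the Random K setting, `covariates` would be a randomly drawn subset of the column indices.

```r
# Hypothetical helper: Gini Index of a vector of class labels.
gini_sketch <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical helper: tries every observed value of every candidate covariate
# as a cutoff and keeps the split with the lowest weighted Gini Index.
best_cutoff_sketch <- function(x, y, covariates = seq_along(x)) {
  best <- list(criteria = Inf)
  for (j in covariates) {
    for (i in seq_len(nrow(x))) {
      v <- x[i, j]
      left <- x[[j]] <= v
      if (all(left)) next   # a split must leave observations on both sides
      after <- mean(left) * gini_sketch(y[left]) +
        mean(!left) * gini_sketch(y[!left])
      if (after < best$criteria) {
        best <- list(cutoff.value = v, cutoff.node = i,
                     covariate.ind = j, criteria = after)
      }
    }
  }
  best
}

x <- data.frame(a = c(1, 2, 3, 4), b = c(5, 5, 6, 5))
y <- c(1, 1, 2, 2)
res <- best_cutoff_sketch(x, y)
res$cutoff.value    # value at which to split
res$cutoff.node     # row index of the observation that supplies the cutoff
res$covariate.ind   # column index of the covariate to split on
```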

This function internally loads the packages partykit and rapportools; partykit is loaded to generate the object split.record.optimal, and rapportools is loaded so that the is.boolean function can validate one of the stopping criteria.

This function is run internally in the construct.treeRK function.

Usage

 cutoff.node.and.covariate.index.finder(x.node = data.frame(),
                                        y.new.node = c(), entropy = TRUE)

Arguments

`x.node`	a numericized data frame of covariates of the observations from a particular node prior to the split (can be obtained after applying `x.organizer()`); `x.node` should contain no `NA` or `NaN`'s.
`y.new.node`	a vector storing numericized class type of the observations from a particular node before the split (can be obtained after applying `y.organizer()`); `y.new.node` should contain no `NA` or `NaN`'s.
`entropy`	`TRUE` if Entropy is used as the splitting criteria; `FALSE` if Gini Index is used as the splitting criteria. Default is set to `TRUE`.

Value

A list containing the following items:

`cutoff.value`	the value at which the optimal split should take place.
`cutoff.node`	the index of the observation (observation number) at which optimal split should occur.
`covariate.ind`	numeric index of the covariate at which the optimal split should occur.
`split.record.optimal`	the `kidids_split` output of the optimal split.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # implementation of cutoff.node.and.covariate.index.finder()
  res <- cutoff.node.and.covariate.index.finder(x.train, y.train,
                                               entropy=FALSE)
  res$cutoff.value
  res$cutoff.node
  res$covariate.ind
  res$split.record.optimal

Creates an `igraph` plot of a `rktree`

Description

Creates a plot of a rktree that was built from the (training) dataset.

The package igraph is loaded internally when this function is called, to aid in generating the plot of a rktree.

DESCRIPTIONS OF THE rktree PLOT:

The resulting plot is a classical decision tree.

The rectangular nodes (or vertices) that contain "=<" symbol are used to describe the splitting criteria applied to that very node while constructing the rktree; for example, in the rktree plot generated by the code shown in the "examples" section below, the node with the label "Petal.Width =< 1.6" indicates that this node was split into a chunk that contains observations with Petal.Width <= 1.6 and a chunk that contains observations with Petal.Width greater than 1.6, in order to construct the rktree.

Any other rectangular nodes (or vertices) that do not contain the "=<" symbol indicate that we have reached an end node, and the text displayed in such node is the actual name of the class type that the rktree model assigns to the observations belonging to that node; for example, in the rktree plot generated by the code shown in the "examples" section below, the vertex with the label "setosa" indicates that the rktree assigns the class type "setosa" to all observations that belong to that particular node.

Usage

 draw.treeRK(tr = construct.treeRK(), y.factor.levels,
 font = "Times", node.colour = "white", text.colour = "dark blue",
 text.size = 0.67, tree.vertex.size = 75, tree.title = "Diagram of a Tree",
 title.colour = "dark blue")

Arguments

`tr`	a `construct.treeRK()` object (a tree).
`y.factor.levels`	a `y.organizer()$y.factor.levels` output.
`font`	font type used in the `rktree` plot; default is "Times".
`node.colour`	colour of the node used in the `rktree` plot; default is "white".
`text.colour`	colour of the text used in the `rktree` plot; default is "dark blue".
`text.size`	size of the text in the `rktree` plot; default is 0.67.
`tree.vertex.size`	size of the `rktree` plot vertices; default is 75.
`tree.title`	title of the `rktree` plot; default title is "Diagram of a Tree".
`title.colour`	colour for the title of the `rktree` plot; default title colour is "dark blue".

Value

An igraph plot of a rktree.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## Numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new
  y.factor.levels <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.factor.levels

  ## Construct a tree
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  tree.entropy <- construct.treeRK(x.train, y.train)

  # Plot the tree
  draw.treeRK(tree.entropy, y.factor.levels, font="Times",
              node.colour = "black", text.colour = "white", text.size = 0.7,
              tree.vertex.size = 100, tree.title = "Decision Tree",
              title.colour = "dark green")

Identifies numerical indices of the end nodes of a `rktree` from the matrix of hierarchical flags.

Description

Identifies numerical indices of the end nodes of a rktree by closely examining the structure of the rktree flag (obtained via construct.treeRK()$flag); the precise algorithm used is the following:

If the m-th string in the list of rktree flags is a substring of one or more of the (m + 1)-th, ..., n-th strings in the list, then the node represented by the m-th string is not an end node; otherwise, the node represented by the m-th string is an end node.
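
The rule can be sketched directly in base R. The helper `ends_index_sketch` is a hypothetical re-implementation of the description above that takes a plain character vector of flags; it is not the package's code.

```r
# Hypothetical helper: a flag is an end node unless it appears as a
# substring of at least one flag that comes later in the list.
ends_index_sketch <- function(flags) {
  n <- length(flags)
  is_end <- vapply(seq_len(n), function(m) {
    later <- flags[seq_len(n) > m]
    !any(grepl(flags[m], later, fixed = TRUE))
  }, logical(1))
  which(is_end)
}

# Made-up flag list: the root splits into "rx" and "ry", then "rx" splits again.
flags <- c("r", "rx", "ry", "rxx", "rxy")
ends_index_sketch(flags)   # indices of "ry", "rxx" and "rxy"
```

Because every flag starts with the single character "r", a substring match can only occur at the start of a later flag, so the substring test here behaves like a prefix test.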

Usage

 ends.index.finder(tr.flag = matrix())

Arguments

`tr.flag`	a `construct.treeRK()$flag` object or a similar flag matrix.

Value

A vector that lists the indices of the end nodes of a given rktree (indices that are consistent with the indices in x.node.list, y.new.node.list, and flag that are returned by the construct.treeRK function).

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  # covariates of training data set
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # Construct a tree
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  tree.entropy <- construct.treeRK(x.train, y.train)

  # Find indices of end nodes of tree.entropy
  end.node.index <- ends.index.finder(tree.entropy$flag)

Builds up a random forest RK model based on the given (training) dataset

Description

Builds up a random forest RK model on the given (training) dataset.

The functions bstrap and construct.treeRK are used inside this function. Once the call to bstrap has generated bootstrap samples of the training dataset, the function construct.treeRK is called to build a tree on each of those bootstrap datasets, and together these trees form the forest.
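
The overall flow (bootstrap, then one tree per bootstrap sample) can be sketched as below. Both `forest_sketch` and `majority_stub` are hypothetical: `fit_tree` stands in for a real tree builder such as construct.treeRK, and the stub only records the majority class of its bag so that the example stays self-contained.

```r
# Hypothetical helper: draws one bootstrap sample per bag and fits one
# "tree" on each sample using the supplied `fit_tree` function.
forest_sketch <- function(X, Y, nbags, samp.size, fit_tree) {
  lapply(seq_len(nbags), function(i) {
    idx <- sample(nrow(X), size = samp.size, replace = TRUE)
    fit_tree(X[idx, , drop = FALSE], Y[idx])
  })
}

# A trivial stand-in for a tree: the most frequent class in the bag.
majority_stub <- function(x, y) names(which.max(table(y)))

set.seed(1)
forest <- forest_sketch(iris[, 1:4], iris$Species,
                        nbags = 5, samp.size = 50,
                        fit_tree = majority_stub)
length(forest)   # one fitted object per bag
```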

Calling this function internally loads the package rapportools, so that the is.boolean function can be used to check one of the stopping criteria.

Usage

 forestRK(X = data.frame(), Y.new = c(),
          min.num.obs.end.node.tree = 5, nbags, samp.size, entropy = TRUE)

Arguments

`X`	a numericized data frame storing covariates of each observation contained in the given (training) dataset (obtained via `x.organizer()`); `X` should contain no `NA` or `NaN`'s.
`Y.new`	a vector storing the numericized class types of each observation contained in the given (training) dataset `X`; `Y.new` should contain no `NA` or `NaN`'s.
`min.num.obs.end.node.tree`	the minimum number of observations that we want each end node of our `rktree` to contain. Default is set to 5.
`nbags`	number of bootstrap samples that we want to generate in order to build the forest.
`samp.size`	number of observations that we want each of our bootstrap samples to contain.
`entropy`	`TRUE` if we use Entropy as the splitting criteria; `FALSE` if we use the Gini Index for the splitting criteria. Default is set to `TRUE`.

Value

A list containing the following items:

`X`	The original (training) dataset that was used to construct the random forest RK model.
`forest.rk.tree.list`	A list of trees (`construct.treeRK` objects) contained in the `forestRK` model.
`bootsamp.list`	A list containing data frames of bootstrap samples that were generated from the given (training) dataset `X`.
`ent.status`	The value of the parameter `entropy`.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  # covariates of training data set
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # Implement forestRK function
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # normally nbags and samp.size have to be much larger than 30 and 50
  forestRK.1 <- forestRK(x.train, y.train, nbags = 30, samp.size = 50)

  # extract the first tree in the forestRK.1 model
  forestRK.1$forest.rk.tree.list[[1]]

Extracts the structure of one or more trees in a forestRK object

Description

Extracts structure of one or more trees from a forestRK object.

Each tree in the list is named by its index in the forest; for example, if the code obj <- get.tree.forestRK(forestRK.1, tree.index=c(4,5,6)) was used to extract the structure of the 4th, 5th, and 6th trees in the forest, the user can retrieve the information that pertains to the 4th tree in the forest with obj[["4"]].
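
The naming convention means that elements are looked up by a character name rather than by position. The sketch below uses a made-up list of strings as a stand-in for extracted trees to show the difference between the two forms of indexing.

```r
# Hypothetical stand-in for a list of three extracted trees.
trees <- list("tree A", "tree B", "tree C")
names(trees) <- c(4, 5, 6)   # names are stored as the strings "4", "5", "6"

trees[["4"]]   # the element named "4" itself
trees["4"]     # a one-element list that contains the element named "4"
```

Note that positional indexing such as `trees[[4]]` would fail here, because the list only has three elements.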

Usage

 get.tree.forestRK(forestRK.object = forestRK(), tree.index=c())

Arguments

`forestRK.object`	a `forestRK` object.
`tree.index`	a vector of indices of the trees that we want to extract from the `forestRK` object.

Value

A list containing forestRK trees that have their indices specified in the function argument tree.index.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # random forest
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # normally nbags and samp.size have to be much larger than 30 and 50
  forestRK.1 <- forestRK(x.train, y.train, nbags = 30, samp.size = 50)

  # get tree
  tree.index.ex <- c(1,3,8)
  get.tree <- get.tree.forestRK(forestRK.1, tree.index = tree.index.ex)
  get.tree[["8"]] # display the 8th tree of the random forest

Calculates Gini Importance or Mean Decrease Impurity (same algorithm is used in 'scikit-learn') of each covariate that we consider in the `forestRK` model

Description

Calculates the Gini Importance of each covariate considered in the forestRK model, and lists the covariates in order from the most important to the least important.

The Gini Importance or Mean Decrease in Impurity algorithm is also used in 'scikit-learn'. Gini Importance is defined as the total decrease in node impurity averaged over all trees of the ensemble, where the decrease in node impurity is obtained after weighting by the probability for an observation to reach that node (which is approximated by the proportion of samples reaching that node).
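
The definition above can be worked through on a small table of splits. This is a sketch under stated assumptions: the data frame `splits`, its column names, and all numbers in it are hypothetical, and the code illustrates the general Mean Decrease in Impurity calculation rather than the internals of importance.forestRK.

```r
# Hypothetical record of the splits made in a forest of two trees.
# n.node, n.left, n.right: number of samples at the node and its children;
# imp.*: impurity (e.g. Gini Index or Entropy) at the node and its children.
splits <- data.frame(
  tree      = c(1, 1, 2),
  covariate = c("Petal.Length", "Petal.Width", "Petal.Length"),
  n.node    = c(100, 60, 100),
  n.left    = c(40, 30, 50),
  n.right   = c(60, 30, 50),
  imp.node  = c(0.5, 0.4, 0.5),
  imp.left  = c(0.0, 0.1, 0.2),
  imp.right = c(0.4, 0.3, 0.2)
)
n.total <- 100   # samples reaching the root of each tree
n.trees <- 2

# Decrease in impurity for each split, weighted by the proportion of
# samples reaching the node (the approximate probability of reaching it)
splits$decrease <- (splits$n.node / n.total) *
  (splits$imp.node -
     (splits$n.left / splits$n.node) * splits$imp.left -
     (splits$n.right / splits$n.node) * splits$imp.right)

# Total decrease per covariate, averaged over all trees of the ensemble
per.covariate <- tapply(splits$decrease, splits$covariate, sum) / n.trees
importance <- sort(per.covariate, decreasing = TRUE)
```

In this toy table, Petal.Length accounts for a larger average decrease than Petal.Width, so it is listed first.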

Usage

 importance.forestRK(forestRK.object = forestRK())

Arguments

`forestRK.object`	a `forestRK` object.

Value

A list containing the following items:

`importance.covariate.names`	a vector of names of the covariates of our dataset ordered from the most important to the least important.
`average.decrease.in.criteria.vec`	a vector storing the average decrease in the weighted splitting criteria by each covariate that was calculated across all trees in the forestRK object; the numbers are ordered from the highest average decrease in criteria (importance) to the lowest, so the i-th importance number from this vector pertains to the i-th covariate listed in the vector output `importance.covariate.names`.
`ent.status`	status of the parameter `entropy`; `TRUE` if Entropy is used as the splitting criterion, `FALSE` if the Gini Index is used instead.
`x.original`	a numericized data frame storing covariates of each observation from the given (training) dataset that was used to construct the `forestRK` object in the beginning of the `forestRK` function call.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # random forest
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # typically the nbags and samp.size have to be much larger than 30 and 50
  forestRK.1 <- forestRK(x.train, y.train, nbags=30, samp.size=50)
  # execute importance.forestRK function
  imp <- importance.forestRK(forestRK.1)

Generates importance `ggplot` of the covariates considered in the `forestRK` model

Description

Generates importance ggplot of the covariates considered in the forestRK model.

When the number of covariates under consideration is large, the covariate names on this plot can be difficult to read. In this case, the user can identify the name of a covariate of interest by extracting importance.covariate.names from the importance.forestRK object that was used in the function call. importance.covariate.names lists the original names of the covariates, ordered from the most important to the least important. For example, the exact name of the covariate that has the second highest importance is the second element of the vector importance.covariate.names, and so on.
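
The lookup described above can be sketched as follows. The list `imp` is a hypothetical stand-in that mimics two of the documented items of an importance.forestRK object; the importance values in it are made up for illustration.

```r
# Hypothetical stand-in for the output of importance.forestRK();
# only two of the documented items are mimicked, with made-up values.
imp <- list(
  importance.covariate.names = c("Petal.Width", "Petal.Length",
                                 "Sepal.Length", "Sepal.Width"),
  average.decrease.in.criteria.vec = c(0.40, 0.30, 0.10, 0.05)
)

# Name of the covariate with the second highest importance
second.most.important <- imp$importance.covariate.names[2]

# Rank table that pairs each name with its importance value
rank.table <- data.frame(
  rank       = seq_along(imp$importance.covariate.names),
  covariate  = imp$importance.covariate.names,
  importance = imp$average.decrease.in.criteria.vec
)
```

A table of this kind can be easier to read than the plot when there are many covariates.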

Usage

 importance.plot.forestRK(importance.forestRK.object = importance.forestRK(),
                          colour.used = "dark green", fill.colour = "dark green",
                          label.size = 10)

Arguments

`importance.forestRK.object`	an `importance.forestRK` object.
`colour.used`	colour used for the border of the importance plot; default is "dark green".
`fill.colour`	colour used to fill the bars of the importance plot; default is "dark green" (yes, I like dark green).
`label.size`	size of the labels; default is set to 10.

Value

An importance plot of the covariates considered in the forestRK model, ordered from the most important covariate to the least important covariate.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # random forest
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # typically the nbags and samp.size have to be much larger than 30 and 50
  forestRK.1 <- forestRK(x.train, y.train, nbags = 30, samp.size = 50)
  # execute importance.forestRK function
  imp <- importance.forestRK(forestRK.1)

  # generate importance plot
  importance.plot.forestRK(imp)

Makes 2D MDS (multidimensional scaling) `ggplot` of the test observations based on the predictions from a `forestRK` model.

Description

Plots a 2D MDS (Multi-Dimensional Scaling) ggplot of the test observations based on the provided forestRK model, with each test observation colour coded by its predicted class type.

The plot also has a legend that tells the user which colour pertains to which predicted class type.

The existing R functions dist and cmdscale were used in this function to compute the Multi-Dimensional Scales of the test data.
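
The two base R functions named above can be combined as in the following sketch. It assumes Euclidean distances (the default of dist) and uses the true iris species as a placeholder for the predicted classes; how mds.plot.forestRK calls these functions internally is not shown here.

```r
# Test rows of the iris data, as in the Examples of this manual
test.rows <- c(26:50, 76:100, 126:150)
x.test <- iris[test.rows, 1:4]

# Pairwise distances between test observations (Euclidean by default)
distances <- dist(x.test)

# Classical multidimensional scaling into two coordinates
coords <- cmdscale(distances, k = 2)

# Data frame ready for plotting; predicted.class is a placeholder that
# uses the true species in place of the model predictions
mds.df <- data.frame(
  first.coordinate  = coords[, 1],
  second.coordinate = coords[, 2],
  predicted.class   = iris$Species[test.rows]
)
```

Each row of `mds.df` is one test observation, so the third column can be mapped to colour in a scatter plot of the first two.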

Usage

mds.plot.forestRK(pred.forestRK.object = pred.forestRK(),
 plot.title ="MDS Plot of Test Data Colour Coded by Forest RK Model Predictions",
 xlab ="First Coordinate", ylab = "Second Coordinate",
 colour.lab = "Predictions By The Random Forest RK Model")

Arguments

`pred.forestRK.object`	a `pred.forestRK()` object.
`plot.title`	a user-specified title for the MDS plot; the default is "MDS Plot of Test Data Colour Coded by Forest RK Model Predictions".
`xlab`	label for the x-axis of the plot; the default is "First Coordinate".
`ylab`	label for the y-axis of the plot; the default is "Second Coordinate".
`colour.lab`	label title for the legend that specifies categories for each colour; the default is "Predictions By The Random Forest RK Model".

Value

A multidimensional scaling ggplot (2D) of the test observations, colour coded by their predicted class types.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  x.test <- x.organizer(iris[,1:4], encoding = "num")[c(26:50,76:100,126:150),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new
  y.factor.levels <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.factor.levels

  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # typically the nbags and samp.size have to be much larger than 30 and 50
  pred.forest.rk <- pred.forestRK(x.test = x.test,
                                  x.training = x.train, y.training = y.train,
                                  nbags = 30, samp.size = 50,
                                  y.factor.levels = y.factor.levels)

  # generate a classical mds plot of test observations
  # and colour code them by the predicted class
  mds.plot.forestRK(pred.forest.rk)

Make predictions on the test data based on the forestRK model constructed from the training data

Description

Makes predictions on the test dataset based on the forestRK model constructed from the training dataset.

Please be aware that the test data points in test.prediction.df.list, pred.for.obs.forest.rk, and num.pred.for.obs.forest.rk are re-ordered by the increasing original index number (the original rownames) of those test observations. So if you shuffled the data before separating them into a training and a test set, the order in which the data points are presented under test.prediction.df.list, pred.for.obs.forest.rk, and num.pred.for.obs.forest.rk may not be the same as the shuffled order of your original test set.
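
The effect of this re-ordering, and one way to recover the shuffled order, can be sketched in base R. The prediction vector `pred.sorted` is hypothetical, and attaching the rownames to it as names is an assumption made for this illustration, not a documented property of the pred.forestRK output.

```r
set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]   # rownames keep the original indices
x.test <- shuffled[1:5, 1:4]             # test set in shuffled order

# Re-ordering by increasing original index, as described above
sorted.order <- order(as.numeric(rownames(x.test)))
x.test.sorted <- x.test[sorted.order, ]

# Hypothetical predictions, one per row of x.test.sorted
pred.sorted <- c("a", "b", "c", "d", "e")
names(pred.sorted) <- rownames(x.test.sorted)

# Put the predictions back into the shuffled order of x.test
pred.shuffled <- pred.sorted[rownames(x.test)]
```

Matching on the original rownames, rather than on position, avoids pairing a prediction with the wrong test observation.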

Calling this function internally loads the package rapportools; this allows the use of the is.boolean method to check one of the stopping criteria at the beginning.

The basic mechanism behind pred.forestRK function is the following:

When the function is called, it first calls the forestRK function on the user-specified training data to generate the forestRK object. It then uses the pred.treeRK function to make predictions on the test observations based on each individual tree in the forestRK object. Once the individual predictions from each tree are obtained for all of the test observations, the function stores them in one large data frame and collapses the results by majority vote. For example, if the class type most frequently predicted for the m-th test observation by the rkTrees in the forest is class type 'A', then by the rule of majority vote the pred.forestRK function assigns class 'A' as the predicted class type for that m-th test observation.
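
The majority-vote step can be sketched on a small data frame of per-tree predictions. The data frame `votes` and its values are hypothetical; rows are test observations and columns are trees, following the layout described for df.of.predictions.for.all.observations. How the package itself breaks ties is not stated in this manual; the sketch simply takes the first of the tied classes.

```r
# Hypothetical numericized predictions: 4 test observations, 3 trees
votes <- data.frame(
  tree1 = c(1, 2, 3, 1),
  tree2 = c(1, 2, 2, 1),
  tree3 = c(2, 2, 3, 3)
)

# For each observation (row), pick the most frequently predicted class;
# which.max() returns the first maximum, so ties go to the smallest class
majority <- apply(votes, 1, function(v) {
  counts <- table(v)
  as.integer(names(counts)[which.max(counts)])
})
```

Here the third observation is assigned class 3 because two of the three trees voted for it.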

Usage

  pred.forestRK(x.test = data.frame(), x.training = data.frame(),
                y.training = c(), y.factor.levels,
                min.num.obs.end.node.tree = 5,
                nbags, samp.size, entropy = TRUE)

Arguments

`x.test`	a numericized data frame of covariates of the data points on which we want to make our predictions (typically the test observations); `x.test` can be obtained by applying the `x.organizer()` function. `x.test` should contain no `NA` or `NaN`'s.
`x.training`	a numericized data frame of covariates of data points from which we build our `forestRK` model (typically the training observations); `x.training` can be obtained by applying the `x.organizer()` function. `x.training` should contain no `NA` or `NaN`'s.
`y.training`	a vector that stores numericized class types of the training data points; `y.training` should contain no `NA` or `NaN`'s.
`min.num.obs.end.node.tree`	the minimum number of observations that we want each end node of our `rktree` to contain. Default is set to 5.
`nbags`	the number of bootstrap samples that we want to generate to form a `forest-RK`.
`samp.size`	the number of data points that we want each of our bootstrap samples to contain.
`y.factor.levels`	a vector of original names of all class types that the user considers in his or her study (can be obtained via `y.organizer()$y.factor.levels`)
`entropy`	`TRUE` if we use Entropy as the splitting criteria; `FALSE` if we use the Gini Index for the splitting criteria. Default is set to `TRUE`.

Value

A list containing the following items:

`x.test`	the original test dataset that we used to make predictions.
`df.of.predictions.for.all.observations`	a data frame storing the predicted (numericized) class type of each test observation from each tree in the `forestRK` model; each row of this data frame pertains to an individual test observation, and each column pertains to a specific tree.
`forest.rk`	a `forestRK` object that was generated in the beginning of the function call.
`test.prediction.df.list`	a list of data frames storing the `prediction.df`'s (the data frame that can be obtained via `pred.treeRK()$prediction.df`) of the test observations that were generated from each tree in the `forestRK` model. Note that the test data points in `test.prediction.df.list` are re-ordered by the increasing original observation index number.
`pred.for.obs.forest.rk`	a vector that stores the actual predicted class labels of the test observations instead of their numericized (integer) class types. Note that the test data points in `pred.for.obs.forest.rk` are re-ordered by the increasing original observation index number.
`num.pred.for.obs.forest.rk`	the numericized version of `pred.for.obs.forest.rk`. Note that the test data points in `num.pred.for.obs.forest.rk` are re-ordered by the increasing original observation index number.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  x.test <- x.organizer(iris[,1:4], encoding = "num")[c(26:50,76:100,126:150),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  y.factor.levels <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.factor.levels

  ## make prediction from a random forest RK model
  ## typically the nbags and samp.size have to be much larger than 30 and 50
  pred.forest.rk <- pred.forestRK(x.test = x.test, x.training = x.train,
                                  y.training = y.train,
                                  y.factor.levels,
                                  min.num.obs.end.node.tree = 6,
                                  nbags = 30, samp.size = 50, entropy = FALSE)
  pred.forest.rk$test.prediction.df.list[[10]]
  pred.forest.rk$pred.for.obs.forest.rk # etc....

Make predictions on the test observations based on a rktree model

Description

Makes predictions on the observations in the test dataset based on the rktree model constructed from the training dataset.

Please be aware that, at the end of the pred.treeRK function, the test data points in prediction.df are re-ordered by the increasing original index number (the original rownames) of those test observations. So if you shuffled the data before separating them into a training and a test set, the order in which the data points are presented in the data frame prediction.df may not be the same as the shuffled order in your original test set.

Users of this function may be interested in identifying the original name of the numericized predicted class type shown in the last column of the data frame prediction.df. This can be done by extracting the attribute y.factor.levels from the y.organizer object. For example, if the data frame prediction.df indicates that the predicted class type of the 1st test observation is "2", the actual name of the predicted class type for that observation is the 2nd element of the vector y.organizer.object$y.factor.levels obtained during the data cleaning phase.
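
The lookup described above amounts to indexing the vector of level names by the numericized predictions. In this sketch, both `prediction.df` and `y.factor.levels` are hypothetical stand-ins built by hand, and the column name `predicted` is an assumption; only the position of the prediction in the last column follows this manual.

```r
# Hypothetical stand-in for y.organizer(...)$y.factor.levels
y.factor.levels <- c("setosa", "versicolor", "virginica")

# Hypothetical stand-in for pred.treeRK(...)$prediction.df:
# covariates first, numericized predicted class in the last column
prediction.df <- data.frame(
  Sepal.Length = c(5.0, 6.1, 7.2),
  Sepal.Width  = c(3.4, 2.9, 3.0),
  predicted    = c(2, 1, 3)
)

# Take the last column and translate each class number into its name
predicted.num  <- prediction.df[[ncol(prediction.df)]]
predicted.name <- y.factor.levels[predicted.num]
```

The first observation has predicted class 2, so its name is the 2nd element of `y.factor.levels`.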

The pred.treeRK function makes use of the list of hierarchical flags generated by the construct.treeRK function; it uses this list as a guide to how it should split the test set to make predictions. The pred.treeRK function also generates its own list of hierarchical flags as it splits the test set, and at the end it tries to match the list it generated with the list from the construct.treeRK function. If the two lists match exactly, the test set was split in a manner consistent with how the training set was split when the rkTree in question was built. If there is any difference between the two, the test set was split differently from the training set; in that case, the pred.treeRK function will stop and throw an error. For more information about the hierarchical flags of a rkTree, please see the construct.treeRK section of this documentation.

Usage

 pred.treeRK(X = data.frame(), rktree = construct.treeRK())

Arguments

`X`	a numericized data frame of covariates of the test observations or the observations that we want to make predictions for (obtained via `x.organizer()`). `X` should contain no `NA` or `NaN`'s.
`rktree`	a `construct.treeRK` object.

Value

A list containing the following items:

`prediction.df`	a data frame of test observations. If `prediction.df` has n columns, the first n-1 columns will contain the numericized covariates of the test observations, and the very last n-th column will contain the predicted numericized class type for each of those test observations. Note that, at the end of the `pred.treeRK` function, the test data points in `prediction.df` are re-ordered by the increasing original observation index number.
`flag.pred`	the hierarchical flag of splits performed on the test set by applying the `rktree` model in question.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the forestRK package
  library(forestRK)

  ## numericize the data
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  x.test <- x.organizer(iris[,1:4], encoding = "num")[c(26:50,76:100,126:150),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  ## Construct a tree
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  tree.entropy <- construct.treeRK(x.train, y.train)
  tree.gini <- construct.treeRK(x.train, y.train,
                                min.num.obs.end.node.tree = 6, entropy = FALSE)

  ## Make predictions on the test set based on the constructed rktree model
  # last column of prediction.df stores predicted class on the test observations
  # based on a given rktree
  prediction.df <- pred.treeRK(X = x.test, tree.entropy)$prediction.df
  flag.pred <- pred.treeRK(X = x.test, tree.entropy)$flag.pred

Extract the list of covariates used to perform the splits to generate a particular tree(s) in a `forestRK` object

Description

Returns the list of covariates used to perform the splits that generated one or more trees in the forestRK object that the user provided.

The function extracts the names of the covariates used in the splits that constructed one or more trees from a forestRK object. The var.used.forestRK function displays the actual name of the covariate used for each split (not its numericized version), in the exact order of the splits; for instance, the 1st element of the vector covariate.used.for.split.tree[["6"]] from the example below is the covariate on which the 1st split occurred while the 6th tree in the forestRK.1 object was built.

Each vector in the list is named by the exact index of its tree; for example, if the code obj <- var.used.forestRK(forestRK.1, tree.index=c(4,5,6)) is used to extract the covariates used for splitting in the 4th, 5th, and 6th trees in the forest, the user can retrieve the information pertaining explicitly to the 6th tree by calling obj[["6"]].

Usage

 var.used.forestRK(forestRK.object = forestRK(), tree.index = c())

Arguments

`forestRK.object`	a `forestRK` object.
`tree.index`	a vector storing the indices of the trees that we want to examine.

Value

A list of vectors that stores the names of covariates on which each split was performed to construct the specific tree(s) in a forestRK model that the user provided.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  library(forestRK)

  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]
  y.train <- y.organizer(iris[c(1:25,51:75,101:125),5])$y.new

  # random forest
  # min.num.obs.end.node.tree is set to 5 by default;
  # entropy is set to TRUE by default
  # normally nbags and samp.size have to be much larger than 30 and 50
  forestRK.1 <- forestRK(x.train, y.train, nbags = 30, samp.size = 50)

  # extract the covariates used for splitting in trees 4, 5, and 6
  covariate.used.for.split.tree <- var.used.forestRK(forestRK.1,
                                                     tree.index=c(4,5,6))

  # retrieve the list of covariates used for splitting for the 'tree #6'
  covariate.used.for.split.tree[["6"]]

Numericizing a data frame of covariates from the original dataset via Binary or Numeric Encoding

Description

Takes the original data frame of covariates as an input (which may or may not be numeric), and converts it into a numericized data frame by applying either Binary or Numeric Encoding.

Binary Encoding of categorical features is recommended for tree ensembles when the cardinality of a categorical feature is >= 1000; Numeric Encoding is recommended when the cardinality is < 1000.

For more information about Binary and Numeric Encoding and their effectiveness under different cardinalities, please visit: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
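
The difference between the two encodings can be illustrated on a single categorical feature. This is a sketch of the general idea only: the variable names are hypothetical, and the exact codes and column layout produced by x.organizer may differ from what is shown here.

```r
# A categorical feature with four distinct values
colour <- c("red", "green", "blue", "green", "yellow")

# Numeric Encoding: one integer code per category
# (levels are sorted: blue = 1, green = 2, red = 3, yellow = 4)
codes <- as.integer(factor(colour))

# Binary Encoding: write each integer code in base 2 and spread the
# digits over several 0/1 columns, most significant bit first
n.bits <- ceiling(log2(max(codes) + 1))
bits <- sapply((n.bits - 1):0, function(b) (codes %/% 2^b) %% 2)
colnames(bits) <- paste0("colour.bit", n.bits:1)
```

Numeric Encoding keeps one column per feature, whereas Binary Encoding needs roughly log2 of the cardinality columns, which is why the cardinality threshold above matters.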

NOTE: In order to use other functions within the forestRK package, you must ensure that the numericized data frame of covariates (the x.organizer object) contains no missing record, that is, you have to remove any record containing NA or NaN prior to applying the x.organizer function.

Following is the summary of the data cleaning process with x.organizer():

1. remove all NA or NaN's from the data in hand.
2. split the data into a data frame that contains covariates of ALL data points (BOTH training and test observations), and a vector that contains class types of the training observations;
3. apply the x.organizer to the big data frame of covariates of all observations.
4. split the x.organizer output into a training and a test set, as needed.
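
The steps above can be sketched in base R on the iris data. The missing value is introduced artificially for illustration. Step 3 is shown as a plain assignment because, as the Arguments section notes, x.organizer returns the original dataset when all features are numeric; with categorical covariates the call x.organizer(x.all, encoding = "num") would be used instead.

```r
# Artificially introduce one missing value for illustration
dat <- iris
dat[3, 2] <- NA

# Step 1: remove every record that contains NA or NaN
dat <- dat[complete.cases(dat), ]

# Step 2: covariates of ALL observations, and class types of the
# training observations only
train.idx <- c(1:25, 51:75, 101:125)
x.all   <- dat[, 1:4]
y.train <- dat[train.idx, 5]

# Step 3: numericize the covariates of all observations; stand-in for
# x.organizer(x.all, encoding = "num"), since iris covariates are numeric
x.num <- x.all

# Step 4: split the numericized covariates into training and test sets
x.train <- x.num[train.idx, ]
x.test  <- x.num[-train.idx, ]
```

Encoding all observations before splitting keeps the training and test covariates on the same numeric coding.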

PROPER DATA CLEANING IS ABSOLUTELY NECESSARY FOR forestRK FUNCTIONS TO WORK!

Usage

 x.organizer(x.dat = data.frame(), encoding = c("num","bin"))

Arguments

`x.dat`	a data frame storing covariates of each observation (can be either numeric or non-numeric) from the original data; `x.dat` should contain no `NA` or `NaN`. The rownames of `x.dat` should be the numerical index of each observation.
`encoding`	type of encoding done for the categorical features; "`num`" stands for Numeric Encoding, and "`bin`" stands for Binary Encoding. When the data in question only has numeric features, then the user can select either one of "`num`" or "`bin`", and the `x.organizer` function will just return the original numeric dataset.

Value

A numericized data frame of the covariates from the original data obtained via either Numeric or Binary Encoding.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  library(forestRK) # load the package forestRK

  ## Basic Procedures
  ## 1. Apply x.organizer to a data frame that stores covariates of
  ## ALL observations (BOTH training and test observations)
  ## 2. Split the output from 1 into a training and a test set, as needed

  # note: iris[,1:4] are the columns of the iris dataset that stores
  # covariate values

  # covariates of training data set
  x.train <- x.organizer(iris[,1:4], encoding = "num")[c(1:25,51:75,101:125),]

Numericize the vector containing categorical class type(`y`) of the original data

Description

Numericizes a vector of categorical class type of each (training) data point.

NOTE: In order to use other functions within the forestRK package, you must ensure that the original vector of class type y contains no missing record (NA, NaN), that is, you have to remove any record containing NA or NaN prior to applying the y.organizer function.

Following is the summary of the data cleaning process with y.organizer():

1. remove all NA or NaN's from the dataset in hand.
2. split the dataset into a data frame that contains covariates of ALL observations (BOTH training and test observations), and a vector that contains class types of the training observations;
3. apply the y.organizer to the vector that contains class type of each training observation.
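
The idea of numericizing a class vector can be sketched with base R factors. This is an illustration under stated assumptions, not the implementation of y.organizer: the variables `y.new` and `y.levels` only mimic the roles of the documented items y.new and y.factor.levels, and the level ordering used by the package may differ.

```r
# Class types of the training observations, with one missing record
y <- c("versicolor", NA, "setosa", "virginica", "setosa")

# Remove the missing record first (in practice, drop the same record
# from the covariates so that the two stay aligned)
y <- y[!is.na(y)]

# Numericize: each class name is replaced by the position of its level
y.fac    <- factor(y)
y.new    <- as.integer(y.fac)   # numericized class types
y.levels <- levels(y.fac)       # original names of the class types
```

Indexing `y.levels` by `y.new` recovers the original class names, which is how numericized predictions can later be translated back.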

PROPER DATA CLEANING IS NECESSARY FOR THE forestRK FUNCTIONS TO WORK!

Usage

 y.organizer(y = c())

Arguments

`y`	a vector containing the class type of each observation from the dataset on which we want to build our rktree models (the training dataset); `y` should contain no `NA` or `NaN`.

Value

A list containing the following items:

`y.new`	a vector containing the numericized class type of each observation in the dataset from which our rktree models are generated (typically the observations from the training set).
`y.factor.levels`	a vector storing original names of the numericized class types.

Author(s)

Hyunjin Cho, [email protected] Rebecca Su, [email protected]

Examples

  ## example: iris dataset
  ## load the package forestRK
  library(forestRK)

  ## Basic Procedures:
  ## 1. Extract the portion of the data that stores class type of each
  ##    TRAINING observation, and make it as a vector
  ## 2. apply y.organizer function to the vector obtained from 1

  y.train <- y.organizer(as.vector(iris[c(1:25,51:75,101:125),5]))
  ## retrieves the original names of each class type, if the class names
  ## were originally non-numeric
  y.train$y.factor.levels
  ## retrieves the numericized vector that stores classification category
  y.train$y.new

Package 'forestRK'

Help Index

Performs bootstrap sampling of the (training) dataset
Constructs a classification tree on the (training) dataset, by implementing the RK (Random 'K') algorithm
Calculates Entropy or Gini Index of a node after a given split
Calculates Entropy or Gini Index of a particular node before (or without) a split
Identifies optimal cutoff point of an impure node for splitting after applying the `rk` (Random K) algorithm.
Creates a `igraph` plot of a `rktree`
Identifies numerical indices of the end nodes of a `rktree` from the matrix of hierarchical flags.
Builds up a random forest RK model based on the given (training) dataset
Extracts the structure of one or more trees in a forestRK object
Calculates Gini Importance or Mean Decrease Impurity (same algorithm is used in 'scikit-learn') of each covariate that we consider in the `forestRK` model
Generates importance `ggplot` of the covariates considered in the `forestRK` model
Makes 2D MDS (multidimensional scaling) `ggplot` of the test observations based on the predictions from a `forestRK` model.
Extract the list of covariates used to perform the splits to generate a particular tree(s) in a `forestRK` object
Numericize the vector containing categorical class type (`y`) of the original data