Title: | Aggregate Numeric, Date and Categorical Variables |
---|---|

Description: | Convenience functions for aggregating a data frame or data table. Currently mean, sum and variance are supported. For Date variables, the recency and duration are supported. There is also support for dummy variables in predictive contexts. Code has been completely re-written in data.table for computational speed. |

Authors: | Matthias Bogaert, Michel Ballings, Dirk Van den Poel |

Maintainer: | Matthias Bogaert <[email protected]> |

License: | GPL (>= 2) |

Version: | 0.1.1 |

Built: | 2024-09-25 03:29:07 UTC |

Source: | https://github.com/matthbogaert/aggregater |

- Aggregate numeric, Date and categorical variables
- Extraction of Categorical Values as a Preprocessing Step for Making Dummy Variables
- Fast-automatic Dummy Variable Creation with Support for Predictive Contexts

The `Aggregate` function (not to be confused with `aggregate`) prepares a data frame or data table for merging by computing the sum, mean and variance of all continuous (integer and numeric) variables by a given variable. For all categorical variables (character and factor), it creates dummies and subsequently computes the sum and the mode by a given variable. For all Date variables, it computes the recency and duration by a given variable with respect to an end date variable. For computational speed, all calculations are done with `data.table`. This function aims at maximum information extraction with a minimum amount of code.
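To make the numeric part of this description concrete, the following is a minimal base R sketch of computing the sum, mean and variance of one variable by a grouping variable. It is not the package's implementation (which uses `data.table`); the data and the output column names (`amount_sum`, `amount_mean`, `amount_var`) are hypothetical and chosen only for illustration.

```r
# Toy data: one numeric column and one grouping column (hypothetical names).
toy <- data.frame(
  ID     = c("a", "a", "b", "b", "b"),
  amount = c(10, 20, 1, 2, 3)
)

# Split the numeric column into one vector per ID.
by_id <- split(toy$amount, toy$ID)

# Compute sum, mean and variance within each ID.
summary_by_id <- data.frame(
  ID          = names(by_id),
  amount_sum  = vapply(by_id, sum,  numeric(1)),
  amount_mean = vapply(by_id, mean, numeric(1)),
  amount_var  = vapply(by_id, var,  numeric(1)),
  row.names   = NULL
)
print(summary_by_id)
```

The result has one row per ID, which is the shape needed to merge the summaries back onto an ID-level table.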

`Aggregate( x, by, end_ind = Sys.Date(), format = "%Y-%m-%d", tibble = FALSE, verbose = TRUE, object = NULL, p = "all" )`


`x` |
A data frame or data table. Categorical variables have to be of type character or factor and continuous variables have to be of type integer or numeric. Date variables should be in the Date format. |

`by` |
A character string specifying the variable on which to aggregate the results. Note that 'by' should be a variable of the table 'x'. |

`end_ind` |
A Date object, or something which can be coerced by |

`format` |
A character string. If not specified, the ISO 8601 international standard which expresses a day "%Y-%m-%d" is taken. |

`tibble` |
Should the output be a tibble, data frame or data table? By default, the function returns a data frame or data table depending on the input. To return a tibble, set `tibble = TRUE`. |

`verbose` |
Indicator used to show the progress. |

`object` |
Parameter related to the |

`p` |
Parameter related to the |

A data frame, data table or tibble with the aforementioned variables aggregated by the given ID variables. If the input is a data frame, a data frame is returned else a data table is returned.
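The recency and duration features for Date variables can be illustrated with a small base R sketch. The definitions used here are assumptions, since this page does not spell them out: recency is taken as the number of days between the most recent date and the end date, and duration as the number of days between the earliest date and the end date. The column names and `end_date` are hypothetical.

```r
# Toy data: event dates per ID (hypothetical column names).
toy <- data.frame(
  ID   = c("a", "a", "b"),
  when = as.Date(c("2024-01-01", "2024-01-11", "2024-01-21"))
)
end_date <- as.Date("2024-01-31")  # plays the role of end_ind

# Helper: whole days from a date to end_date.
days_until_end <- function(d) as.numeric(difftime(end_date, d, units = "days"))

# ASSUMPTION: recency  = days from the latest date to end_date,
#             duration = days from the earliest date to end_date.
dates_by_id <- split(toy$when, toy$ID)
date_features <- data.frame(
  ID       = names(dates_by_id),
  recency  = vapply(dates_by_id, function(d) days_until_end(max(d)), numeric(1)),
  duration = vapply(dates_by_id, function(d) days_until_end(min(d)), numeric(1)),
  row.names = NULL
)
print(date_features)
```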

Authors: Matthias Bogaert, Michel Ballings, Dirk Van den Poel, Maintainer: [email protected]

```r
# Example
# Create some data
data <- data.frame(V1=sample(as.factor(c('yes','no')), 200000, TRUE),
                   V2=sample(as.character(c(1,2,3,4,5)),200000, TRUE),
                   V3=sample(1:20000,200000, TRUE),
                   V4=sample(300:1000, 200000, TRUE),
                   V5 = sample(as.Date(as.Date('2014-12-09'):Sys.Date()-1, origin = "1970-01-01"),200000,TRUE),
                   ID=sample(x = as.character(1:4), size = 200000, replace = TRUE))

Aggregate(x=data,by='ID')

# Examples of how to use the object and p argument. See dummy and categories function for details.
# Aggregate(x=data,by='ID',object=categories(data))
# Aggregate(x=data,by='ID',p=2)
```


`categories` stores all the categorical values that are present in the factors and character vectors of a data frame. Numeric and integer vectors are ignored. It is a preprocessing step for the `dummy` function. This function is appropriate for settings in which the user only wants to compute dummies for the categorical values that were present in another data set. This is especially useful in predictive modeling, when the new (test) data has more or other categories than the training data.

`categories(x, p = "all")`


`x` |
data frame or data table containing factors or character vectors that need to be transformed to dummies. Numerics, dates and integers will be ignored. |

`p` |
Select the top p values in terms of frequency. Either "all" (all categories in all variables), an integer scalar (top p categories in all variables), or a vector of integers (number of top categories per variable in order of appearance). |

A list containing the variable names and the categories
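As an illustration of the idea (not the package's code), the sketch below collects the values of every factor or character column in base R and optionally keeps only the `p` most frequent ones. The helper `top_values` and the exact shape of the returned list are hypothetical; the object returned by `categories` may be structured differently.

```r
train <- data.frame(
  var1 = factor(c("a", "b", "b", "c")),
  var3 = c("val1", "val2", "val3", "val3"),
  num  = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Keep only factor and character columns; numeric columns are skipped.
is_cat <- vapply(train, function(col) is.factor(col) || is.character(col), logical(1))

# Hypothetical helper: values of one column, most frequent first.
# p = NULL keeps every value; otherwise only the top p are kept.
top_values <- function(col, p = NULL) {
  counts <- sort(table(col), decreasing = TRUE)
  vals <- names(counts)
  if (is.null(p)) vals else vals[seq_len(min(p, length(vals)))]
}

cats_all  <- lapply(train[is_cat], top_values)         # all categories
cats_top1 <- lapply(train[is_cat], top_values, p = 1)  # most frequent only
print(cats_top1)
```

Storing the categories from the training data in a list like this is what lets a later step build the same dummy columns for new data.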

Authors: Michel Ballings, and Dirk Van den Poel, Maintainer: [email protected]

```r
#create toy data
(traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                         var2=as.factor(c(1,1,2,3)),
                         var3=c("val1","val2","val3","val3"),
                         stringsAsFactors=FALSE))
(newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                       var2=as.factor(c(1,1,2,3,4,5)),
                       var3=c("val1","val2","val3","val3","val4","val4"),
                       stringsAsFactors=FALSE))

categories(x=traindata,p="all")
categories(x=traindata,p=2)
categories(x=traindata,p=c(2,1,3))
```


`dummy` creates dummy variables of all the factors and character vectors in a data frame or data table. It also supports settings in which the user only wants to compute dummies for the categorical values that were present in another data set. This is especially useful in the context of predictive modeling, in which the new (test) data has more or other categories than the training data. For computational speed, the code is written in `data.table`.
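The predictive setting described above can be sketched in base R for a single variable: dummy columns are created only for the values seen in the training data, so values that appear only in the new data produce no extra columns. The helper `make_dummies` and the `prefix_value` naming scheme are hypothetical and are not taken from the package.

```r
# Hypothetical helper: one 0/1 column per allowed value of a single variable.
make_dummies <- function(values, allowed, prefix) {
  out <- lapply(allowed, function(a) as.numeric(values == a))
  names(out) <- paste(prefix, allowed, sep = "_")
  as.data.frame(out)
}

train_var1 <- c("a", "b", "b", "c")
new_var1   <- c("a", "b", "b", "c", "d", "d")

# Categories seen in the training data only.
allowed <- sort(unique(train_var1))

# Dummies for the new data, restricted to the training categories.
dummies_new <- make_dummies(new_var1, allowed, prefix = "var1")
print(dummies_new)
```

Rows whose value is `"d"` are zero in every column, so the new data keeps exactly the columns a model fitted on the training data expects.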

`dummy(x, p = "all", object = NULL, num = TRUE, verbose = FALSE, ref = FALSE)`


`x` |
a data frame or data table containing at least one factor or character vector |

`p` |
Only relevant if object is NULL. Select the top p values in terms of frequency. Either "all" (all categories in all variables), an integer scalar (top p categories in all variables), or a vector of integers (number of top categories per variable in order of appearance). |

`object` |
output of the |

`num` |
should the dummies be of class numeric (TRUE) or factor (FALSE). Setting this to TRUE will speed up execution considerably. |

`verbose` |
logical. Used to show progress. Does not work when |

`ref` |
logical. Only relevant when x is a data.table. If TRUE, x will be overwritten by the dummy output (called transformed x), and a reference (i.e., not a copy) to the transformed x will be returned invisibly. If FALSE, x will be left untouched, and the output will be returned as usual. If ref=TRUE, only the transformed x survives the function. If ref=FALSE, both the original x and the output (equal in size to the transformed x) survive. The difference in memory use is hence the size of the original x, which makes ref=TRUE the more memory-efficient option. |

A data frame or data table containing dummy variables. If ref=TRUE then the output will be invisible and x will contain the output. NOTE: data.table currently has a print bug. In some cases the output does not print. Running the output object multiple times or running it once with [] appended will make it print. In either case, the output will be produced. str() also always works.
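The distinction between a reference and a copy is a general property of `data.table` and can be seen without `dummy` itself. The sketch below assumes the `data.table` package is installed and only demonstrates by-reference modification with `:=` and an explicit `copy()`; it does not reproduce what `dummy` does internally, and the column name `a_b` is hypothetical.

```r
library(data.table)

DT   <- data.table(a = c("a", "b"))
same <- DT         # no copy: both names refer to the same table
deep <- copy(DT)   # explicit, independent copy

# Add a 0/1 column by reference; no new table is allocated.
DT[, a_b := as.numeric(a == "b")]

"a_b" %in% names(same)  # TRUE: `same` sees the change
"a_b" %in% names(deep)  # FALSE: the copy is untouched
```

Keeping an independent copy alive costs roughly the memory of the original table, which is the saving that `ref=TRUE` is described as providing.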

Authors: Michel Ballings, and Dirk Van den Poel, Maintainer: [email protected]

```r
#create toy data
(traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                         var2=as.factor(c(1,1,2,3)),
                         var3=c("val1","val2","val3","val3"),
                         stringsAsFactors=FALSE))
(newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                       var2=as.factor(c(1,1,2,3,4,5)),
                       var3=c("val1","val2","val3","val3","val4","val4"),
                       stringsAsFactors=FALSE))

#create dummies of training set
(dummies_train <- dummy(x=traindata))
#create dummies of new set
(dummies_new <- dummy(x=newdata))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of new set using categories found in training set
(dummies_new <- dummy(x=newdata,object=categories(traindata,p="all")))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of training set,
#using the top 2 categories of all variables found in the training data
dummy(x=traindata,p=2)

#create dummies of training set,
#using respectively the top 2,3 and 1 categories of the three
#variables found in training data
dummy(x=traindata,p=c(2,3,1))

#create all dummies of training data
dummy(x=traindata)

## Not run:
#######################
#example ref parameter
library(data.table) #needed for data.table()

#ref=TRUE, example 1
(DT = data.table(a=c("a","b"),b=c("c","c")))
dummy(DT,ref=TRUE)
DT[] #DT has changed

#ref=TRUE, example 2
#uses exactly same amount of memory as example 1
(DT = data.table(a=c("a","b"),b=c("c","c")))
d1 <- dummy(DT,ref=TRUE)
DT[] #DT has changed
d1[] #d1 is a reference (not a copy) to DT

#ref=FALSE, example 3
#example 1 and 2 are more memory efficient than example 3
(DT = data.table(a=c("a","b"),b=c("c","c")))
d2 <- dummy(DT, ref=FALSE)
DT[] #DT has not changed
d2[] #d2 holds the dummy output
# deleting DT after dummy finishes would result in the same final
# memory footprint as example 1 and 2, except that in example 3
# memory usage is higher when dummy is being executed, and this may be
# problematic when DT is large.
## End(Not run)
```
