# LO 74.2: Explain and assess different tools and techniques for manipulating and analyzing big data.
Machine learning focuses on using big data to make predictions. When a linear relationship is present, regression may be sufficient. When a nonlinear relationship exists, machine learning might deploy tools such as classification and regression trees, cross-validation, conditional inference trees, random forests, and penalized regression.
Classification can be thought of as a binary decision tree. For example, a passenger either survived the sinking of the Titanic or did not. This outcome can be modeled with a discrete dependent variable that takes a value of either 0 or 1, which is essentially a logit regression. The output is shown in Figure 1.
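Figure 1: Logistic Regression of Titanic Survival vs. Age [2]

| Coefficient | Estimate | Standard Error | t-Stat | P-value |
|-------------|----------|----------------|--------|---------|
| Intercept   | 0.465    | 0.035          | 13.291 | 0.000   |
| Age         | -0.002   | 0.001          | -1.796 | 0.072   |
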
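As a rough illustration of how a logit model of a 0/1 outcome on age can be estimated, the following sketch fits the two coefficients by Newton's method using only the Python standard library. The function name `fit_logit` and the ten observations are hypothetical toy values, not the Titanic data, so the estimates will not match Figure 1.

```python
import math

def fit_logit(ages, survived, steps=25):
    """Fit P(survived) = 1 / (1 + exp(-(b0 + b1 * age))) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Accumulate the gradient (g) and the negative Hessian (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(ages, survived):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy sample (age in years, 1 = survived, 0 = did not survive).
ages = [2, 8, 15, 22, 30, 35, 41, 50, 58, 70]
survived = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

b0, b1 = fit_logit(ages, survived)
print(f"intercept = {b0:.3f}, age coefficient = {b1:.4f}")
```

In this toy sample the survivors are younger on average, so the fitted age coefficient is negative. Judging whether it is statistically significant would additionally require standard errors, which this sketch does not compute.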
The logit regression results in Figure 1 show that age was not a statistically significant factor in determining survival of Titanic passengers at the 5% level (the p-value is 0.072). Perlich, Provost, and Simonoff (2003) find that while logit regression can work very well for smaller datasets, larger pools of data are better handled by classification and regression tree (CART) analysis [3]. Figure 2 shows a CART for the Titanic using two factors, age and cabin class, and Figure 3 shows the rules used in developing this CART.
2. Hal R. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives 28, no. 2: 3-28, www.aeaweb.org/articles?id=10.1257/jep.28.2.3.
3. Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff, "Tree Induction vs. Logistic Regression: A Learning-Curve Analysis," Journal of Machine Learning Research (June 2003): 211-255, www.jmlr.org/papers/volume4/perlich03a/perlich03a.pdf.
2018 Kaplan, Inc.
Topic 74 Cross Reference to GARP Assigned Reading – Varian
Figure 2: A Classification Tree for Survivors of the Titanic [4]
[Figure not reproduced: a tree that splits on cabin class and age.]
Figure 3: Titanic Tree Model in Rule Form [4]

| Features                   | Predicted | Actual/Total |
|----------------------------|-----------|--------------|
| Class 3                    | Died      | 370/501      |
| Class 1-2, younger than 16 | Lived     | 34/36        |
| Class 2, older than 16     | Died      | 145/233      |
| Class 1, older than 16     | Lived     | 174/276      |

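The rules in Figure 3 can be written directly as a small function. The sketch below is only an illustration: the name `predict_survival` is hypothetical, an age of exactly 16 is assumed to fall in the "older than 16" group, the last Actual/Total entry is read as 174/276, and the Actual/Total column is assumed to count the passengers in each leaf whose actual outcome matched the prediction.

```python
def predict_survival(cabin_class, age):
    """Rule-form tree from Figure 3; returns 'Lived' or 'Died'."""
    if cabin_class == 3:
        return "Died"       # Class 3
    if age < 16:
        return "Lived"      # Class 1-2, younger than 16
    if cabin_class == 2:
        return "Died"       # Class 2, older than 16 (age 16 included by assumption)
    return "Lived"          # Class 1, older than 16

# (matching outcomes, leaf size) copied from the Actual/Total column of Figure 3.
leaf_counts = [(370, 501), (34, 36), (145, 233), (174, 276)]
correct = sum(c for c, _ in leaf_counts)
total = sum(t for _, t in leaf_counts)
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.3f}")  # prints 723/1046 = 0.691
```

Under those assumptions, the four rules classify roughly 69% of the passengers in the table correctly in-sample.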
Classification and regression trees can be very useful in explaining complex and nonlinear relationships. In the case of the Titanic, CART analysis shows that both age and cabin class were good predictors of survival rates. This can be further dissected in Figure 4, which shows the fraction of those who survived organized into age bins.
4. Hal R. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives 28, no. 2: 3-28, www.aeaweb.org/articles?id=10.1257/jep.28.2.3.
Figure 4: Titanic Survival Rates by Age Bin [5]
Figure 4 shows that those in the lowest age bracket (children) had the highest survival rates and that those in their 70s had the lowest. For passengers between these extremes, age had little effect on survival rates. Whether a person was a child or elderly mattered more than raw age. This kind of analysis lets researchers explore relationships in large datasets that are not captured by a single linear coefficient.
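The binning behind a chart like Figure 4 can be sketched in a few lines. The function name `survival_rate_by_bin`, the 10-year bin width, and the ten observations are hypothetical illustrations rather than the actual Titanic records.

```python
from collections import defaultdict

def survival_rate_by_bin(ages, survived, width=10):
    """Return {bin start age: fraction survived} using fixed-width age bins."""
    counts = defaultdict(lambda: [0, 0])  # bin start -> [survivors, passengers]
    for age, s in zip(ages, survived):
        start = int(age // width) * width
        counts[start][0] += s
        counts[start][1] += 1
    return {start: lived / n for start, (lived, n) in sorted(counts.items())}

# Hypothetical toy sample (1 = survived, 0 = did not survive).
ages = [4, 7, 9, 23, 25, 38, 41, 47, 72, 74]
survived = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

rates = survival_rate_by_bin(ages, survived)
for start, rate in rates.items():
    print(f"ages {start}-{start + 9}: {rate:.2f}")
```

Comparing the rates across bins, rather than fitting one slope on raw age, is what lets the child and elderly effects show up separately.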
One concern with using this process is that trees tend to overfit the data, meaning that out-of-sample predictions are not as reliable as those that are in-sample. One potential solution for overfitting is cross-validation. In a k-fold cross-validation, the larger dataset is broken up into k subsets (also called folds). For example, a large dataset might be broken up into 10 smaller pools of data.
This process starts with fold 1 being a testing set and folds 2-10 being training sets. Researchers would look for statistical relationships in all training sets and then use fold 1 to test the output to see if it has predictive use. They would then repeat this process k times such that each fold takes a turn being the testing set. The results are ultimately averaged from all tests to find a common relationship. In this way, researchers can test their predictions on an out-of-sample dataset that is actually a part of the larger dataset.
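The fold rotation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`, `fit_line`, `predict_line`) are hypothetical, the model is a simple one-variable least-squares line, the error measure is mean squared error, and the toy data are exactly linear. In practice the observations would usually be shuffled before being assigned to folds.

```python
def k_fold_indices(n, k):
    """Assign indices 0..n-1 to k folds of nearly equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(xs, ys, k, fit, predict):
    """Average the out-of-sample mean squared error over k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        # Train on every fold except the one currently held out for testing.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        mse = sum((predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        errors.append(mse)
    return sum(errors) / k

def fit_line(xs, ys):
    """One-variable least squares; returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x

# Hypothetical toy data: 20 observations on an exact line, split into 10 folds.
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x for x in xs]
folds = k_fold_indices(len(xs), 10)
cv_error = cross_validate(xs, ys, 10, fit_line, predict_line)
print(f"average out-of-sample MSE: {cv_error:.6f}")
```

Because every observation is used for testing exactly once, the averaged error is an out-of-sample measure even though no separate dataset was set aside.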
5. Hal R. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives 28, no. 2: 3-28, www.aeaweb.org/articles?id=10.1257/jep.28.2.3.
Another step that could be taken is to prune the tree by incorporating a tuning parameter (λ) that reduces the complexity of the tree and ultimately minimizes the out-of-sample errors. However, building a conditional inference tree (ctree) is an option that does not require pruning with tuning parameters. The ctree process involves the following steps:
1. Test if any independent variables are correlated with the dependent (response) variable, and choose the variable with the strongest correlation.
2. Split the variable (a binary split) into two data subsets.
3. Repeat this process until you have isolated the variables into enough unique components (each one is called either a node or a leaf on the ctree) that correlations have fallen below predefined levels of statistical significance.
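The three steps above can be sketched in simplified form. This is not the ctree algorithm as implemented in statistical packages, which relies on formal permutation-based tests. As a stand-in, the sketch uses the t-statistic of a Pearson correlation with a hypothetical cutoff of 2.0, and it splits the chosen variable at its mean. The function names and the toy data (feature 0 drives the outcome, feature 1 is merely correlated with feature 0) are also hypothetical.

```python
import math

def t_stat(xs, ys):
    """t-statistic of the Pearson correlation of xs and ys (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if n < 3 or sxx == 0 or syy == 0:
        return 0.0
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))

def grow(rows, ys, critical_t=2.0):
    """rows: list of feature lists; ys: list of 0/1 outcomes."""
    n_features = len(rows[0])
    # Step 1: pick the variable most strongly correlated with the response.
    stats = [abs(t_stat([r[j] for r in rows], ys)) for j in range(n_features)]
    best = max(range(n_features), key=lambda j: stats[j])
    # Step 3 (stopping rule): no variable is significant, so this node is a leaf.
    if stats[best] < critical_t:
        return {"leaf": sum(ys) / len(ys), "n": len(ys)}
    # Step 2: binary split of the chosen variable (at its mean, by assumption).
    cut = sum(r[best] for r in rows) / len(rows)
    left = [i for i, r in enumerate(rows) if r[best] <= cut]
    right = [i for i, r in enumerate(rows) if r[best] > cut]
    return {
        "feature": best,
        "cut": cut,
        "left": grow([rows[i] for i in left], [ys[i] for i in left], critical_t),
        "right": grow([rows[i] for i in right], [ys[i] for i in right], critical_t),
    }

def predict(tree, row):
    """Follow the splits down to a leaf and return its mean outcome."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

# Hypothetical toy data: the outcome equals feature 0; feature 1 only tracks feature 0.
rows = [[1, 1]] * 4 + [[1, 0]] * 2 + [[0, 1]] * 2 + [[0, 0]] * 8
ys = [1] * 6 + [0] * 10
tree = grow(rows, ys)
print(tree)
```

In this toy sample the tree splits once on feature 0 and then stops, because neither variable is significantly correlated with the outcome inside the resulting subsets.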
The main idea of a ctree is to isolate predictors into the most specific terms possible. Consider research conducted by Munnell, Tootell, Browne, and McEneaney (1996) that studies mortgage lending in Boston to test whether ethnicity plays a role in mortgage application success [6]. Their logistic regression finds a statistically significant relationship between being declined for a mortgage and being African American. When this data is analyzed using a ctree, as shown in Figure 5, it becomes apparent that the main driver of mortgage application failure in this dataset is being denied mortgage insurance (dmi in Figure 5), not simply being African American (black in Figure 5). A separate test would be useful to see if being denied mortgage insurance is correlated with ethnicity.
Figure 5: Ctree for Mortgage Application Success in Boston [7]
6. Alicia H. Munnell et al., "Mortgage Lending in Boston: Interpreting HMDA Data," The American Economic Review 86, no. 1 (March 1996): 25-53, www.jstor.org.ezaccess.libraries.psu.edu/stable/pdf/2118254.pdf.
7. Hal R. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives 28, no. 2: 3-28, www.aeaweb.org/articles?id=10.1257/jep.28.2.3.
Constructing random forests is also a way to improve predictions from large datasets. This method uses bootstrapping to grow multiple trees from a large dataset. Using random forests to average many small models produces very good out-of-sample fits even when dealing with nonlinear data. Computers have made this method much more viable, as sometimes thousands of trees can be grown in a random forest. There are four steps to creating random forests:
1. Select a bootstrapped sample (with replacement) out of the full dataset and grow a tree.
2. At each node on the tree, select a random sample of predictors for decision-making. No pruning is needed in this process.
3. Repeat this process multiple times to grow a forest of trees.
4. Use each tree to classify a new observation and choose the ultimate classification based on a majority vote from the forest.
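The four steps above can be sketched with the standard library alone. To keep the example short, each tree is limited to a single split (a stump), which is a simplification of the full method. The function names, the parameter `n_try` (the number of predictors sampled at the node), and the toy data are hypothetical.

```python
import random
from collections import Counter

def grow_stump(rows, ys, rng, n_try):
    """Grow a one-split tree using a random subset of n_try predictors (step 2)."""
    candidates = rng.sample(range(len(rows[0])), n_try)
    best = None
    for j in candidates:
        for cut in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, ys) if r[j] <= cut]
            right = [y for r, y in zip(rows, ys) if r[j] > cut]
            if not left or not right:
                continue
            l_vote = Counter(left).most_common(1)[0][0]
            r_vote = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_vote for y in left) + sum(y != r_vote for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, cut, l_vote, r_vote)
    if best is None:  # no valid split: always predict the majority class
        vote = Counter(ys).most_common(1)[0][0]
        return (0, float("inf"), vote, vote)
    return best[1:]

def grow_forest(rows, ys, n_trees=25, n_try=1, seed=0):
    """Steps 1 and 3: grow one stump per bootstrapped sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_stump([rows[i] for i in idx], [ys[i] for i in idx], rng, n_try))
    return forest

def classify(forest, row):
    """Step 4: each tree votes and the majority class wins."""
    votes = [l_vote if row[j] <= cut else r_vote for j, cut, l_vote, r_vote in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: class 0 has small feature values, class 1 has large ones.
rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = grow_forest(rows, ys)
print(classify(forest, [1, 1]), classify(forest, [9, 9]))
```

Individual stumps may be poor (for example, when a bootstrapped sample happens to contain only one class), but the majority vote across the forest is much more stable than any single tree.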
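Researchers might also use penalized regression, where a penalty term (λ) is applied to adjust the regression results. Consider a multivariate regression where we predict y as a linear function of a constant, b_0, and P predictor variables, with the coefficients chosen to minimize the sum of squared residuals plus a penalty on the sum of the absolute values of the coefficients.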
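Written out, a standard form of this penalized objective is sketched below. The notation is an assumption for illustration: N observations indexed by i, predictors x_{ip} (typically standardized), and coefficients b_1 through b_P.

```latex
\min_{b_0, b_1, \ldots, b_P} \;
\sum_{i=1}^{N} \Bigl( y_i - b_0 - \sum_{p=1}^{P} b_p x_{ip} \Bigr)^2
\; + \; \lambda \sum_{p=1}^{P} \lvert b_p \rvert
```

The first term is the usual sum of squared residuals, and the second term is the penalty, which grows with the absolute size of the coefficients and is scaled by λ.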
This form of penalized regression is known as LASSO (least absolute shrinkage and selection operator) regression. The LASSO process improves upon OLS regression by using the penalty term (λ) to limit the sum of the absolute values of the model coefficients. As lambda (λ) increases, some of the regression coefficients will be driven to zero and drop out of consideration. This penalizing process enables researchers to focus on the variables that are most likely to be strong predictors. If lambda is zero, then you just have OLS regression, but as lambda increases, model variance decreases.
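The way a larger lambda drives coefficients to zero can be seen in a small coordinate-descent sketch. It is an illustration under stated assumptions rather than a production solver: the objective is taken to be the sum of squared residuals plus lambda times the sum of absolute coefficients, the function names are hypothetical, every predictor is assumed to vary across observations, and the four observations are toy values with one strong and one weak predictor.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(X, y, lam, sweeps=200):
    """Minimize sum of squared residuals + lam * sum(|b_p|) by coordinate descent.

    X is a list of rows. Columns are centered so the intercept drops out of the
    penalized problem and is recovered at the end.
    """
    n, P = len(X), len(X[0])
    col_means = [sum(row[p] for row in X) / n for p in range(P)]
    y_mean = sum(y) / n
    Xc = [[row[p] - col_means[p] for p in range(P)] for row in X]
    yc = [v - y_mean for v in y]
    b = [0.0] * P
    for _ in range(sweeps):
        for j in range(P):
            # Fit of predictor j to the residual that leaves predictor j out.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(Xc[i][p] * b[p] for p in range(P) if p != j))
                for i in range(n)
            )
            z = sum(Xc[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, lam / 2) / z
    b0 = y_mean - sum(b[p] * col_means[p] for p in range(P))
    return b0, b

# Hypothetical toy data generated as y = 10 + 3 * x1 + 0.5 * x2.
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
y = [6.5, 7.5, 12.5, 13.5]

for lam in (0, 4, 8):
    b0, b = lasso(X, y, lam)
    print(f"lambda = {lam}: b0 = {b0:.2f}, b = {b}")
```

With lambda equal to zero the sketch reproduces the OLS coefficients (3.0 and 0.5). At lambda equal to 4 the weak predictor's coefficient is exactly zero and the strong predictor's coefficient shrinks to 2.5, and at lambda equal to 8 it shrinks further to 2.0.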
Collaboration between Econometrics and Machine Learning