LO 74.3: Examine the areas for collaboration between econometrics and machine | Ken Li, FRM

LO 74.3: Examine the areas for collaboration between econometrics and machine learning.
There are several areas where useful collaboration could exist between econometrics and machine learning. Most machine learning assumes that data are independently and identically distributed, and most datasets are cross-sectional. In practice, time series analysis may be more useful. Econometrics can use tools like Bayesian Structural Time Series models to forecast time series data.
Perhaps the most important opportunity for collaboration relates to causal inference, which can be a natural by-product of big data. Correlation does not always indicate causation. Traditionally, machine learning has been most concerned with pure prediction, but econometricians have developed numerous tools to reveal cause-and-effect relationships. Combining these tools with machine learning could prove to be a very meaningful collaboration.
Consider a basic causation-correlation example. Police precincts that have more police officers usually also have higher crime rates. There is a correlation, but having more police does not necessarily cause higher crime rates. A strong historical relationship does exist, but it is not really useful for predicting the causal outcome of adding more police to a precinct. One idea to solve this problem is to use econometrics to forecast what would have happened if no additional police were added and then contrast this with what actually did happen.[8]
2018 Kaplan, Inc.
Topic 74 Cross Reference to GARP Assigned Reading – Varian
This same concept can be applied in many different disciplines. Consider a standard problem in marketing where a firm wants to gauge the effectiveness of an advertising campaign. The firm could run the new ad campaign in one region and not in another, and then contrast the outcomes. There are two big problems with this. First, the firm may lose revenue in the control region while the test is ongoing. Second, the contrast could stem from an external factor like weather or demographic differences. To avoid these problems, the firm could use econometrics to forecast the expected sales outcome in a region without additional advertising and then run the ads and measure the contrast between the predicted and the actual outcomes. A good model for prediction can be better than a random control group.
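The predicted-versus-actual contrast described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures are invented, the forecasting model is a simple linear trend fitted by ordinary least squares (the reading does not prescribe a particular model), and the names `fit_linear_trend` and `estimate_lift` are hypothetical.

```python
from statistics import mean

def fit_linear_trend(y):
    """OLS fit of y_t = intercept + slope * t for t = 0, 1, ..., len(y) - 1."""
    t = list(range(len(y)))
    t_bar, y_bar = mean(t), mean(y)
    slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
             / sum((ti - t_bar) ** 2 for ti in t))
    intercept = y_bar - slope * t_bar
    return intercept, slope

def estimate_lift(pre_sales, campaign_sales):
    """Forecast sales without the campaign, then contrast with actual sales."""
    intercept, slope = fit_linear_trend(pre_sales)
    start = len(pre_sales)
    # Counterfactual: what the pre-campaign trend predicts for the campaign periods.
    counterfactual = [intercept + slope * (start + k)
                      for k in range(len(campaign_sales))]
    # Estimated effect: actual outcome minus predicted no-campaign outcome.
    lift = [actual - predicted
            for actual, predicted in zip(campaign_sales, counterfactual)]
    return counterfactual, lift

# Hypothetical weekly sales (assumed data, not from the reading).
pre_sales = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]   # before the ads
campaign_sales = [120.0, 123.0, 125.0]                    # while the ads run

counterfactual, lift = estimate_lift(pre_sales, campaign_sales)
print("Forecast without ads:", counterfactual)
print("Estimated lift from ads:", lift)
```

On this toy data the trend forecasts 112, 114, and 116 for the campaign weeks, so the estimated lift is 8, 9, and 9 units. The estimate is only as good as the forecast: if the no-campaign model is poor, the contrast mixes forecasting error with the campaign effect.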
8. Donald B. Rubin, "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology 66, no. 5 (1974): 688.
Key Concepts
LO 74.1 Large datasets require tools that are exponentially more advanced than simple spreadsheet analysis. Overfitting and variable selection are two ongoing challenges that big data present.
LO 74.2 To solve inherent issues like spurious correlations and overfitting, researchers have applied more creative tools to analyzing large datasets. The tools include classification and regression trees, cross-validation, conditional inference trees, random forests, and penalized regression.
LO 74.3 There are several ways in which the field of econometrics can assist the world of machine learning. One way is to apply the time series forecasting tools that are common in econometrics to big data, which has traditionally featured only cross-sectional data. Another potential collaboration is to better understand the difference between correlation and causation.
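The cross-validation tool listed under LO 74.2 can be sketched as follows. This is a minimal illustration, assuming a one-variable linear regression as the model and mean squared error as the test statistic; the data are simulated, and the function names `fit_ols` and `k_fold_mse` are hypothetical.

```python
import random
from statistics import mean

def fit_ols(xs, ys):
    """One-variable least squares: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def k_fold_mse(xs, ys, k=10, seed=0):
    """Split the data into k folds, rotate the testing fold, average the errors."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets
    fold_errors = []
    for test_idx in folds:                      # each fold gets one turn as the test set
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        intercept, slope = fit_ols([xs[i] for i in train_idx],
                                   [ys[i] for i in train_idx])
        squared_errors = [(ys[i] - (intercept + slope * xs[i])) ** 2
                          for i in test_idx]
        fold_errors.append(mean(squared_errors))
    return fold_errors, mean(fold_errors)

# Simulated data (an assumption for illustration): y = 3 + 2x plus noise.
noise = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x + noise.gauss(0.0, 1.0) for x in xs]

fold_errors, avg_mse = k_fold_mse(xs, ys, k=10)
print("Folds tested:", len(fold_errors))
print("Average held-out MSE:", avg_mse)
```

Every observation is used for testing exactly once and for training in the other nine rounds, which is why the averaged error gives a steadier guard against overfitting than a single in-sample fit.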
Concept Checkers
1. Which of the following statements is not a problem common to the contemporary world of big data?
A. A researcher might find a strong in-sample prediction that does not produce good out-of-sample results.
B. Traditional spreadsheet analysis is not robust enough to capture relationships with multiple interactions and millions of data points.
C. Access to data is difficult.
D. The periodic presence of spurious correlations requires active variable selection.
2. Which of the following statements is not involved in conducting a 10-fold cross validation?
A. Test your prediction on an out-of-sample dataset to validate accuracy.
B. Rotate which fold is the testing set.
C. Conduct at least 10 different tests and average the testing results.
D. Break a large dataset into 10 smaller subsets of data.
3. Which of the following statements most accurately describes the process of growing a random forest?
A. Select a bootstrapped sample from a large dataset and grow a tree with random variables that were selected using a lambda (λ) tuning parameter. Average the results from a large number of trees that fill out the random forest.
B. Break the full dataset into 10 identifiable subsets and build 10 different trees each having the same variables that were selected using a lambda (λ) tuning parameter.
C. Break the full dataset into a random number of small unique datasets. Grow trees and average the results.
D. Select a bootstrapped sample (with replacement) from a large dataset and grow a tree with random variables and no pruning. Average the results from a large number of trees that fill out the random forest.
4. Which of the following statements is least likely related to conditional inference trees (ctrees)?
A. A ctree can help to better understand if a relationship truly exists between variables.
B. A ctree involves creating multiple trees to test for accuracy.
C. A ctree involves splitting variables into the smallest possible factor that can be isolated for testing.
D. A ctree will isolate predictors into the most specific terms possible.
5. The fields of econometrics and machine learning have much that can be shared. Which of the following statements is incorrect concerning the collaboration between these two disciplines?
A. Collaboration can be sought to better explore the blurred lines between correlation and cross-sectional prediction.
B. More collaboration can be done to better understand time series data.
C. Collaboration can be sought to better explore the blurred lines between correlation and causation.
D. Combining econometric tools with machine learning could prove to be a very meaningful collaboration.
Concept Checker Answers
1. C Our modern world is filled with computerized commerce. This trend has created a seemingly endless stream of information that can be dissected using machine learning, so access to data is not the difficulty. Overfitting and spurious correlations are two clear issues, and traditional spreadsheet analysis is simply not robust enough to capture the interactions in very large pools of data.
2. A Cross validation is used to conduct testing within a dataset that attempts to create virtual out-of-sample subsets that are actually still in-sample. In this example, the large dataset is broken into 10 folds and then 1 fold is selected for testing. Parameters from the other training sets are tested against the testing set and the testing set is rotated so that each fold gets a turn as the testing set. Parameters from each test are then averaged to get a population parameter used for prediction.
3. D Growing a random forest involves a bootstrapped sample (with replacement) from a larger dataset. Researchers will then grow a tree from this sample. They will construct a large number of trees using computerized assistance and average the results to find the population parameters.
4. B A ctree is only one tree. A random forest is the analysis that constructs multiple trees. A ctree helps to understand relationships more deeply and it all starts with splitting variables into the smallest identifiable factor that can be isolated. The main idea of a ctree is to isolate predictors into the most specific terms possible.
5. A Current machine learning already has a fairly developed understanding of cross-sectional prediction. The most likely areas for collaboration with the field of econometrics include prediction with time series data and better understanding the blurred lines between correlation and causation. Combining econometric tools with machine learning could prove to be a very meaningful collaboration.
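The random forest procedure in answer 3 (bootstrap with replacement, random variables, average over many trees) can be sketched as below. This is a deliberately simplified illustration, not a full implementation: each "tree" is a one-split stump rather than a fully grown unpruned tree, the data are invented, and the names `fit_stump`, `grow_forest`, and `predict_forest` are hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature_ids):
    """Find the single best split (lowest squared error) among the given features."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no split possible: every sampled feature is constant
        overall = mean(targets)
        return (feature_ids[0], float("inf"), overall, overall)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def grow_forest(rows, targets, n_trees=50, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]        # bootstrap, with replacement
        feature_ids = rng.sample(range(p), n_features)       # random subset of variables
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], feature_ids))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees in the forest."""
    return mean([predict_stump(stump, row) for stump in forest])

# Invented data: the target depends on feature 0 only; feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[float(i % 10), data_rng.random()] for i in range(40)]
targets = [10.0 if row[0] >= 5 else 0.0 for row in rows]

forest = grow_forest(rows, targets)
high = predict_forest(forest, [9.0, 0.5])
low = predict_forest(forest, [1.0, 0.5])
print("Prediction for x0 = 9:", high)
print("Prediction for x0 = 1:", low)
```

Because each tree sees a different bootstrapped sample and a different random variable, individual trees can be poor, but the average separates the two cases. A real random forest would grow each tree much deeper and draw a fresh random subset of variables at every split.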
The following is a review of the Current Issues in Financial Markets principles designed to address the learning objectives set forth by GARP. This topic is also covered in:
Machine Learning: A Revolution in Risk Management and Compliance?
Topic 75
Exam Focus
Financial institutions have been increasingly looking to complement traditional and less complex regulatory systems and models with more complex models that allow them to better identify risks and risk patterns. This topic focuses on machine learning within artificial intelligence models that have been successfully used in credit risk modeling, fraud detection, and trading surveillance. For the exam, understand the various forms of models, including supervised and unsupervised machine learning, and the three broad classes of statistical problems: regression, classification, and clustering. While machine learning can provide tremendous benefits to financial institutions in combatting risks, there are considerable limitations with these highly complex models, which can be too complex to be reliably used from an audit or regulatory perspective.
The Process of Machine Learning