LO 74.1: Describe the issues unique to big datasets.
Researchers often use a spreadsheet to organize and understand datasets. However, once a dataset grows to a million or more rows, a more robust relational database is needed. Structured Query Language (SQL) databases handle the smaller of the large datasets, but the largest pools of data require customized systems that expand upon SQL. According to Sullivan (2012),¹ Google answers 100 billion search queries every month and crawls 20 billion URLs every day. A dataset of this size needs customized databases to properly capture the relationships within the data. Such a system would not run on a single computer; it would run on a large cluster of computers of the type that can be rented from vendors such as Amazon, Google, and Microsoft.
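To make the move from a spreadsheet to a relational database concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and purely illustrative; a production system at the scale described above would run on a cluster rather than an in-memory database.

```python
import sqlite3

# Hypothetical table of search queries; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searches (user_id INTEGER, query TEXT)")
rows = [(1, "frm exam"), (1, "big data"), (2, "big data"), (3, "big data")]
conn.executemany("INSERT INTO searches VALUES (?, ?)", rows)

# Let the database engine do the aggregation instead of scanning rows by hand.
counts = conn.execute(
    "SELECT query, COUNT(*) AS n FROM searches "
    "GROUP BY query ORDER BY n DESC, query"
).fetchall()
conn.close()

print(counts)  # prints [('big data', 3), ('frm exam', 1)]
```

The same GROUP BY query works unchanged whether the table holds four rows or several million, which is the practical advantage over a spreadsheet.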
Professor's Note: Using big data to make predictions is precisely what Amazon is trying to do when it makes recommendations for additional purchases based on the current product search, previous purchases from the same customer, and alternative purchases made by other customers.
1. Danny Sullivan, "Google: 100 Billion Searches per Month, Search to Integrate Gmail, Launching Enhanced Search App for iOS," Search Engine Land, August 8, 2012, https://searchengineland.com/google-search-press-129925.
2018 Kaplan, Inc.
Topic 74 Cross Reference to GARP Assigned Reading – Varian
Another potential issue in dealing with a large dataset is known as the overfitting problem. This is encountered when a model such as a linear regression captures a solid relationship within the dataset but has very poor out-of-sample predictive ability. Two common ways to address this problem are to use less complex models and to break the large dataset into small samples to test and validate whether overfitting exists.
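The idea of holding back part of the data to check for overfitting can be sketched as follows. This is an illustration under stated assumptions, not a method taken from the reading: the data are simulated from a truly linear relationship, the 15/45 split is arbitrary, and the two polynomial degrees (1 and 9) simply stand in for a "less complex" and a "more complex" model.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Simulated data (assumption): a truly linear relationship plus noise.
x = rng.uniform(0.0, 1.0, size=60)
y = 2.0 * x + rng.normal(0.0, 0.5, size=60)

# Hold out part of the data; the model never sees it during fitting.
x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted polynomial."""
    return float(np.mean((model(xs) - ys) ** 2))

# Map each polynomial degree to (in-sample MSE, out-of-sample MSE).
results = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(degree, results[degree])
```

The more complex model cannot fit the training sample worse than the simpler one, since the linear model is a special case of it. The quantity to watch is the gap between the in-sample and out-of-sample error: a small in-sample error paired with a much larger holdout error is the signature of overfitting.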
In practice, researchers work with independently distributed, cross-sectional samples of a larger dataset. This enables them to focus on summarization, prediction, and estimation with a more manageable pool of information. Basic summarization often takes the form of (linear) regression analysis, while prediction seeks to use various tools to predict a value for the dependent variable, y, given a new value of the independent variable, x. The goal is to minimize a loss function (e.g., the sum of squared residuals) evaluated on new out-of-sample observations of x.
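The prediction step described above can be sketched with ordinary least squares. All numbers below are assumed values chosen for illustration; the sketch fits the regression on a small sample and then evaluates the sum of squared residuals on new observations of x.

```python
import numpy as np

# Illustrative in-sample observations (assumed values, not from the reading).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix with an intercept column; OLS minimizes the in-sample
# sum of squared residuals.
A = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)

# New out-of-sample observations of x, with the y values later realized.
x_new = np.array([4.0, 5.0])
y_new = np.array([9.0, 11.0])

# Predict y for the new x values and score the predictions with the loss function.
y_pred = intercept + slope * x_new
loss = float(np.sum((y_new - y_pred) ** 2))

print(round(float(slope), 2), round(float(intercept), 2), round(loss, 4))
# prints 1.94 1.09 0.0666
```

The coefficients are estimated only from the original sample; the loss is then computed on observations the model has not seen, which is the quantity a predictive model is ultimately judged on.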
Methods are also being deployed to screen variables to find the ones that add the most value to the prediction process. Active variable selection can also help to mitigate spurious correlations and potentially help to decrease overfitting in a world where more and more data becomes available with every internet search and purchase at a retail store.
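One simple way to screen variables is forward stepwise selection scored on a holdout sample. The reading does not name a specific method, so this technique is swapped in purely for illustration. The data are simulated under the assumption that only two of five candidate variables (columns 0 and 2) actually drive y.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 5

# Simulated candidates (assumption): only columns 0 and 2 matter for y.
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0.0, 0.5, size=n)

train, hold = slice(0, 120), slice(120, n)

def holdout_mse(columns):
    """Fit OLS with an intercept on the training rows using only `columns`,
    then return the mean squared error on the holdout rows."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    return float(np.mean((A[hold] @ beta - y[hold]) ** 2))

# Greedy forward selection: add the variable that most reduces holdout error,
# and stop as soon as no remaining variable improves it.
selected = []
best = holdout_mse(selected)
while len(selected) < k:
    candidates = [j for j in range(k) if j not in selected]
    scores = {j: holdout_mse(selected + [j]) for j in candidates}
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best:
        break
    selected.append(j_best)
    best = scores[j_best]

print(selected, best)
```

Because each variable must earn its place by improving out-of-sample error, variables that are only spuriously correlated with y in the training sample tend to be left out. A variable with no true effect can still slip in by chance, so a fresh validation sample is a sensible final check.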
Tools and Techniques for Analyzing Big Data