Friday, 28 February 2014

Using “Small Data” to Improve the Use of “Big Data”

This post was first published on Survey Post on Feb. 3rd, 2014.
Recently, I attended two statistical events in the Washington, DC, area: one was the 23rd Morris Hansen Lecture on “Envisioning the 2030 U.S. Census”; the other was the SAMSI workshop on “Computational Methods for Censuses and Surveys.” “Big data” was a popular keyword at both events and stirred up discussions on how to use such data (for example, from administrative records and online sources) in current government statistics, especially in combination with traditional survey data.
Statisticians are exploring new ways in which big data can be used. The U.S. Census Bureau has initiated investigations into using administrative records in the 2020 Census. The National Center for Health Statistics (NCHS) has identified some research opportunities in combining multiple data sources. University-based researchers have launched studies on the use of Google Trends and other online data in small area estimation.
While big data dominated the mainstream discussion at these events, I started thinking more about “small data.” Can small data help us make better use of big data? Here are some of my thoughts.
  1. Applying a conventional sampling-based approach to big data: more and more administrative records are collected electronically. Statisticians are excited about using these records, which may contain information on the entire population, for analytic purposes. The literature of the past two decades has extensively discussed the advantages of administrative records. Processing administrative records, however, can be quite time consuming, and the sheer data volume makes analyses cumbersome to run. Especially when analysts use conventional statistical software such as SAS, Stata, and R, it becomes increasingly complex to handle, store, and analyze these data. The question is: is there a way to reduce the data volume and increase computational speed? Applying a conventional sampling-based approach (e.g. optimal sampling, calibration weighting) may make a big data set smaller and more manageable while allowing researchers to maintain decent data quality.
  2. Combining non-probability sample data with probability sample data: many big data sets, such as data collected by Google/Twitter/Facebook, are not census (population) data. We may treat them as non-probability sample data. Elements are chosen arbitrarily in these datasets, and there is no way to estimate the probability that each element in the population will be included. Also, it is not guaranteed that each element has a chance of being included, making it impossible to assess either the validity (usually measured in terms of “bias”) or the reliability (usually measured in terms of “variance”) of the data. One way to make the data more representative of the entire population is to combine them with probability sample data (e.g. survey data), which can be relatively small. This method can also assist us in estimating sampling variability and identifying potential bias in big data.
  3. Using high-quality small data for measuring and adjusting errors in big data: big data is not only non-representative of the target population, but also carries loads of measurement error, because the construct behind a particular measure in these data can differ from the construct that analysts require. To evaluate errors in big data and improve accuracy, small survey data can be collected for validation. Take the National Health Interview Survey (NHIS) as an example. This is a household interview survey with only self-reported data. To improve analyses of the NHIS self-reported data, an imputation-based strategy was implemented that uses clinical information from an examination-based health survey (the National Health and Nutrition Examination Survey, NHANES) to predict clinical values from self-reported values and covariates. Estimates of health measures based on the multiply imputed clinical values differ from those based on the NHIS self-reported data alone and have smaller estimated standard errors than those based solely on the NHANES clinical data. Similarly, we may assess potential errors in big data through a more sophisticated and accurate small survey.
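To make the first two ideas a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method used in any of the studies mentioned above. The function names (`stratified_sample`, `weighted_mean`, `poststratified_mean`), the record layout, and the synthetic numbers are all hypothetical. The first part draws a stratified simple random sample from a large file and attaches design weights (point 1). The second part reweights a non-probability data set so that its group mix matches population shares, which in practice might be estimated from a probability survey (point 2).

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, n_per_stratum, seed=0):
    """Draw a simple random sample within each stratum.

    Returns (record, weight) pairs, where the weight is the inverse of the
    record's inclusion probability (stratum size / stratum sample size).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        n = min(n_per_stratum, len(members))
        weight = len(members) / n
        for rec in rng.sample(members, n):
            sample.append((rec, weight))
    return sample


def weighted_mean(sample, value_of):
    """Weighted mean of value_of(record) over (record, weight) pairs."""
    total_weight = sum(w for _, w in sample)
    return sum(w * value_of(rec) for rec, w in sample) / total_weight


def poststratified_mean(big_data, group_of, value_of, population_shares):
    """Post-stratified mean for a non-probability data set.

    population_shares maps each group to its share of the target population.
    Assumes every group in population_shares appears in big_data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in big_data:
        group = group_of(rec)
        sums[group] += value_of(rec)
        counts[group] += 1
    return sum(share * sums[g] / counts[g]
               for g, share in population_shares.items())


if __name__ == "__main__":
    # Synthetic "administrative file": 100,000 records in two regions.
    rng = random.Random(1)
    admin = [{"region": "A" if i % 10 else "B", "cost": rng.gauss(100, 20)}
             for i in range(100_000)]
    sample = stratified_sample(admin, lambda r: r["region"], 500)
    print("full-file mean:", sum(r["cost"] for r in admin) / len(admin))
    print("weighted sample mean:", weighted_mean(sample, lambda r: r["cost"]))
```

The sample here holds at most 1,000 records instead of 100,000, and the weights keep the estimate on the scale of the full file. Post-stratification only corrects for the grouping variables it uses, so any selection bias within a group remains.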
While big data provides us with massive and timely information from various sources (e.g. social media, administrative records), small data is simple, easy to collect and process, and can be more accurate and representative. Can small data help you when dealing with your big data problems?


Dan Liao is a research statistician at RTI International. She currently works on multiple aspects of data processing and analysis for large, multistage surveys of health care in the United States, including sampling design, calibration weighting, data editing and imputation, statistical disclosure control, and the analysis of survey data. Her survey research interests include multiphase survey designs, combining survey and administrative data, domain estimation, calibration weighting, and regression diagnostics for complex survey data. Dan has a PhD in Survey Methodology from the Joint Program in Survey Methodology at the University of Maryland and has published research focusing on regression diagnostics, calibration weighting, and predictive modeling.
