Tuesday, 16 January 2018

Tapping Into Advertising Data for Studying International Migration

Ingmar Weber is the Research Director of the Social Computing Group at the Qatar Computing Research Institute (QCRI). As an undergraduate, Ingmar studied mathematics at Cambridge University before pursuing a PhD at the Max-Planck Institute for Computer Science. He subsequently held positions at the Ecole Polytechnique Fédérale de Lausanne and Yahoo Research Barcelona. In his interdisciplinary research, he applies computational methods to large amounts of online data from social media and other sources to study human behaviour at scale. Particular topics of interest include quantifying international migration using digital methods and other data for development projects. He has published over 100 peer-reviewed articles and his work is frequently featured in the popular press. Since 2016 he has been an ACM Distinguished Speaker.

International migration is one of the key drivers of demographic change. However, official statistics on “stocks of migrants”, i.e. how many people with origin country X are residing in country Y, are often unreliable. Reasons for this include the free movement of EU nationals within the EU, as well as generally inadequate census and civil registration systems for many developing countries.

Work done by Emilio Zagheni, Krishna Gummadi and me tries to address some of the shortcomings of traditional methods for creating migration statistics by tapping into a new kind of data: audience estimates provided by Facebook.

Facebook and other internet giants collect rich data on their users in order to serve more targeted and more relevant advertising. The data collected includes self-declared attributes such as age or gender; metadata such as the device or internet connection type used to access the service; third-party information such as credit card or voter registration data; and attributes such as topical interests, inferred from behavior such as "liking" posts on Facebook or visiting websites with social plugins. See https://www.cision.com/us/2017/07/how-to-improve-social-media-targeting/ for a good list of available targeting options on Facebook, Twitter, LinkedIn and Snapchat.

The detailed user profiles are generally not available to researchers outside the companies. However, aggregate and anonymized data is shared with potential advertisers in the form of audience estimates. Basically, Facebook and other social networks tell advertisers how many users match criteria X. For example, to help with planning an advertising campaign, an advertiser could ask "how many monthly active Facebook users are married, male German expats aged 30-50 living in Qatar?" Answer: 120 (as of Dec 20, 2017).
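
To make the idea of an audience estimate concrete, here is a minimal Python sketch that counts how many records in a toy in-memory table match a set of targeting criteria. It does not call Facebook's advertising API: the `User` fields, the `audience_estimate` function and all of the data are hypothetical, invented purely for illustration. The point is only that the caller gets back an aggregate count and never the matching profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical profile fields; real platforms store far more attributes.
    age: int
    gender: str
    home_country: str     # country of origin
    current_country: str  # country of residence
    relationship: str


def audience_estimate(users, *, home_country, current_country,
                      gender=None, min_age=0, max_age=200, relationship=None):
    """Return only an aggregate count of users matching the criteria."""
    count = 0
    for u in users:
        if u.home_country != home_country:
            continue
        if u.current_country != current_country:
            continue
        if gender is not None and u.gender != gender:
            continue
        if relationship is not None and u.relationship != relationship:
            continue
        if not (min_age <= u.age <= max_age):
            continue
        count += 1
    return count


# Toy data, invented for this example.
USERS = [
    User(35, "male", "DE", "QA", "married"),
    User(42, "male", "DE", "QA", "married"),
    User(28, "male", "DE", "QA", "married"),    # outside the 30-50 age range
    User(45, "female", "DE", "QA", "married"),
    User(38, "male", "NP", "QA", "single"),
    User(33, "male", "DE", "DE", "married"),    # lives in Germany, not an expat
]

married_german_men = audience_estimate(
    USERS, home_country="DE", current_country="QA",
    gender="male", min_age=30, max_age=50, relationship="married",
)
all_german_expats = audience_estimate(
    USERS, home_country="DE", current_country="QA",
)
print(married_german_men, all_german_expats)
```

Dropping a filter, as in the second call, widens the audience; this is the same drill-down logic an advertiser uses when exploring targeting options.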

This type of real-time digital census of Facebook's user base could be of value for augmenting existing population estimates, in particular for countries where official statistics are unreliable or outdated. However, due to selection biases and an estimated 13% of duplicate or fake accounts, it is clear that using this data set as a simplistic enumeration tool for the whole population will not give accurate results. See https://www.theguardian.com/technology/2017/sep/07/facebook-claims-it-can-reach-more-people-than-actually-exist-in-uk-us-and-other-countries for more indications of the data's shortcomings.

In our own research, we do not use the raw advertising audience estimates as the final answer. Rather, we treat them as one of potentially many input signals for an estimation task of the kind "how many Germans are living in Qatar today?" As long as the biases in the underlying data are either (i) uniform, e.g. 13% of duplicate or fake Facebook accounts for all countries, or (ii) systematic, e.g. Western Europeans are always less likely to be on Facebook compared to Arab nationals, an appropriately fitted model can account for and correct such biases.
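
The two kinds of bias can be illustrated with a small sketch. This is a deliberately simplified stand-in, not the model used in the paper: it assumes the uniform bias is a fixed 13% share of duplicate or fake accounts, and that the systematic bias can be summarized as one Facebook "penetration rate" per origin group, learned from places where a trusted reference count exists. The function names, group labels and all numbers other than the 13% figure are invented for illustration.

```python
DUPLICATE_SHARE = 0.13  # uniform bias: assumed share of duplicate or fake accounts


def correct_uniform(fb_estimate, duplicate_share=DUPLICATE_SHARE):
    """Remove a bias that is the same for every group."""
    return fb_estimate * (1.0 - duplicate_share)


def fit_penetration(calibration):
    """Estimate one penetration rate per origin group.

    `calibration` maps a group name to (facebook_estimate, reference_count)
    pairs from places where a trusted reference count is available.
    """
    rates = {}
    for group, pairs in calibration.items():
        fb_total = sum(correct_uniform(fb) for fb, _ in pairs)
        ref_total = sum(ref for _, ref in pairs)
        rates[group] = fb_total / ref_total
    return rates


def estimate_population(fb_estimate, group, rates):
    """Scale a de-duplicated Facebook estimate up to the whole group."""
    return correct_uniform(fb_estimate) / rates[group]


# Invented calibration data: (Facebook estimate, trusted reference count).
calibration = {
    "western_europe": [(4000, 8700), (2000, 4350)],
    "arab": [(10000, 10875)],
}
rates = fit_penetration(calibration)

# Apply the fitted rates to a place without reliable official statistics.
print(round(estimate_population(1000, "western_europe", rates)))
print(round(estimate_population(1000, "arab", rates)))
```

With these made-up numbers, the same raw Facebook estimate of 1000 users translates into a larger corrected population for the group that is less likely to be on Facebook. A bias that varied unpredictably from place to place could not be corrected this way, which is why uniformity or systematicity matters.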

In our paper “Leveraging Facebook's Advertising Platform to Monitor Stocks of Migrants”, Emilio, Krishna and I show the feasibility of this approach for deriving stocks of migrants across different US states and around the world. Concretely, we show that it is indeed possible to build models that make out-of-sample predictions of how many people from a certain origin country are residing in a particular US state. Similarly, it is possible to predict the percentage of expats out of the whole population for countries around the globe.
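
As a rough illustration of what "out-of-sample" means here, the following sketch fits a simple log-log linear regression of official counts on Facebook estimates and evaluates it with leave-one-out validation: each observation is predicted by a model that never saw it. The choice of a log-log model, the helper names and all of the numbers are assumptions made for this example; they are not taken from the paper.

```python
import math


def fit_loglog(xs, ys):
    """Ordinary least squares for log(y) = a + b * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Map a Facebook estimate x to a predicted official count."""
    return math.exp(a + b * math.log(x))


def leave_one_out(xs, ys):
    """Predict each observation from a model fitted on all the others."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_loglog(train_x, train_y)
        predictions.append(predict(a, b, xs[i]))
    return predictions


def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Invented numbers: Facebook expat estimates and official migrant stocks
# for six hypothetical (origin country, destination state) pairs.
fb_expats = [1200, 5400, 800, 15000, 3100, 9700]
official = [2500, 9800, 1900, 26000, 6100, 17500]

preds = leave_one_out(fb_expats, official)
print(f"leave-one-out error: {mean_abs_pct_error(official, preds):.1f}%")
```

A low leave-one-out error suggests that the relationship learned where official statistics exist carries over to places the model has not seen, which is the property needed to fill data gaps.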

Potentially, the Facebook audience estimates could also give estimates for stocks of migrants at the sub-national and even the sub-city level. To illustrate this, Matheus Araujo, Michael Aupetit, Yelena Mejova and I created a data visualization of the Facebook data for Doha: http://fb-doha.qcri.org.

As an example, this shows a density map of Nepali expats across Doha, with the highest density in the Industrial Area. The tool also shows that Nepali expats in Doha are predominantly male (93%) and mostly Android users (94%). Contrast this with the same map for Western expats, which shows the highest densities in West Bay and the Pearl. Western expats are more gender balanced (44% female) and more likely to own iPhones (56%). A similar visualization for New York City can be explored at http://fb-nyc.qcri.org. [Usage info for the two data visualizations: Select several filters on the left to drill down to smaller populations by nationality, gender or other criteria. Click a selection again to de-select and revert to the whole category such as all nationalities or all genders.]

Given Facebook’s global reach of 2.1B monthly active users, we believe there is a lot of potential in using this data source to support global development efforts, in particular given its easy accessibility through official APIs. At the same time, no single data source is a cure-all, and many have complementary strengths. Satellite data has truly global reach and can give estimates of population densities, but it will never reveal the nationality or gender of earthlings. Call detail records (CDR, https://en.wikipedia.org/wiki/Call_detail_record) are great for studying dynamic changes in population density, but there are limitations for monitoring international migration as people often change their SIM cards once they move.

I’m truly optimistic that as Digital Demography advances and matures as a field and as researchers start to work collaboratively, combining different data sources, we will see more and more scientific work with real impact on the creation of migration statistics. If you’re interested in how to use new data sources and methodologies to help fill data gaps around the globe, please get in touch by email at: iweber -atsignal - hbku.edu.qa.

Relevant slide decks:

Using internet advertising data for studying international migration (https://www.slideshare.net/IngmarWeber/using-internet-advertising-data-for-studying-international-migration)

Digital Demography - WWW'17 Tutorial - Part II (https://www.slideshare.net/IngmarWeber/digital-demography-www17-tutorial-part-ii)

Wrapper libraries to obtain Facebook advertising audience estimates:

Wrapper library in Python (https://github.com/maraujo/pySocialWatcher) by Matheus Araujo (https://sites.google.com/view/matheusaraujo/)

All of my publications are available at https://ingmarweber.de/publications/. Feel free to follow me at https://twitter.com/ingmarweber.