
Social media data not representative?

As always, I continue to be fascinated by the rise of social media data usage, its centrality as a data source in so many social science research projects over the past few years, and even how the huge volume of data that social media produces has challenged our existing analytical tools. The latter issue will be the topic of a future post, but for now, I would like to note a recent study by a team of McGill University and Carnegie Mellon researchers that has not yet been published in full, but only as a “commentary.” This study, briefly described here, suggests that using social media data as a proxy for a representative sample is problematic. This will come as no surprise to anyone versed in sampling theory. Even so, the work questions a core assumption of the “big data” movement: that a large enough sample approaches a “census” and therefore comes quite close to a representative sample.

The study authors note that social media has been a “bonanza” for researchers because the data has often been readily available, but although it is “fast and cheap,” it could also be ultimately misleading. Thousands of academic and industry studies have been published that rely on social media data streams as sources, but these should really be regarded as convenience samples, and not necessarily representative, regardless of how massive the number of observations in the data stream used for analysis.

“Not everything that can be labeled as ‘Big Data’ is automatically great,” Pfeffer said. He noted that many researchers think or hope that if they gather a large enough dataset they can overcome any biases or distortion that might lurk there. “But the old adage of behavioral research still applies: Know Your Data,” he maintained.
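To make the point about sample size concrete, here is a minimal simulation sketch. It is my own illustration, not something taken from the study: the population, the 20% share of "active" users, and the opinion rates (70% among active users, 40% among everyone else) are all made-up assumptions, and `simulate` is a hypothetical helper name. It compares a very large convenience sample (every active user) with a small simple random sample of the whole population.

```python
import random


def simulate(pop_size=200_000, random_n=1_000, seed=42):
    """Compare a large convenience sample with a small random sample.

    All rates below are illustrative assumptions, not figures from the study.
    """
    rng = random.Random(seed)

    # Hypothetical population: each person is (active_on_platform, opinion).
    # Assumption: active users hold the opinion more often than non-users.
    population = []
    for _ in range(pop_size):
        active = rng.random() < 0.2           # 20% use the platform
        p_yes = 0.7 if active else 0.4        # opinion rate differs by group
        opinion = 1 if rng.random() < p_yes else 0
        population.append((active, opinion))

    true_rate = sum(op for _, op in population) / pop_size

    # Convenience sample: everyone who happens to be on the platform.
    convenience = [op for active, op in population if active]

    # Probability sample: a simple random sample of the whole population.
    random_sample = [op for _, op in rng.sample(population, random_n)]

    return {
        "true_rate": true_rate,
        "convenience_n": len(convenience),
        "convenience_estimate": sum(convenience) / len(convenience),
        "random_n": random_n,
        "random_estimate": sum(random_sample) / random_n,
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value}")
```

Under these assumptions the convenience sample is tens of thousands of observations larger than the random sample, yet its estimate stays near the rate among active users rather than the rate in the population. Collecting more platform users only makes the wrong answer more precise.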

Another observation: “As anyone who has used social media can attest, not all ‘people’ on these sites are even people. Some are professional writers or public relations representatives, who post on behalf of celebrities or corporations, others are simply phantom accounts. Some ‘followers’ can be bought. The social media sites try to hunt down and eliminate such bogus accounts (half of all Twitter accounts created in 2013 have already been deleted), but a lone researcher may have difficulty detecting those accounts within a dataset.”

This debate points back to a larger issue, though. The global media and marketing research community has already moved to an online research model, using panels and abandoning probability sampling, while recognizing the data quality sacrifices involved. Now we are seeing a further transition to “using available data,” such as social media traces, as indications of behavior and preferences. The “law of large numbers” does not apply here, even if we wish it did: more observations reduce random error, but they do not remove the bias of a sample that was never drawn at random. Most fields are now struggling with how to integrate data from social media and other available sources of interaction. This study is one of the first to demonstrate how we should question these sources, and how “fast and cheap” may not be an adequate substitute for quality data collection.