
Applying Sentiment Analysis to Star Wars: The Force Awakens

Posted on Jan 20, 2016 in Blog, Data Visualization, Datamining, Geolocation and Psychogeography

KDNuggets, one of the more influential sites for data scientists, recently published a case study showing how sentiment analysis could be applied to track the reaction around a film’s early release cycle. In this case, the film was the 2015 holiday blockbuster Star Wars: The Force Awakens.

Ten million tweets were collected through the Twitter API between 12/4/15 and 12/29/15, with the release date falling on 12/17/15. About 2.5% contained geolocation data, either in the form of direct coordinates or a human-readable location (e.g. New York). The researchers said “…the first thing we looked at was the frequency of Star Wars related tweets in time. It is clearly visible that most of the tweets came from US and UK, which can be easily explained by popularity of Twitter itself in these countries. Next thing to see is the periodicity of day and night, where people tweet more at night than during the day. Also the timezone shift is clearly visible. More interestingly, we can see the build up before the release, as the number of tweets is increasing for a few days before the world premiere and sky rocketing on this day…”

Each tweet was assigned a score between -1 and +1 (-1 being highly negative, +1 highly positive). Results were plotted on a hexbin map that visualizes global sentiment, aggregating by mean within each cell. Interestingly, average sentiment shows a steady decline as time passes. There is an observable dip on the day of the world premiere, but “sentiments keep steadily low the whole time.” The researchers make several interesting observations concerning the results. Since worldwide interest in the film, at least as reported in the media, approached general hysteria, why doesn’t the Twitter analysis parallel this?

One possible explanation is the inherent sampling bias when working with social network data. After all, the data come only from those who voluntarily decide to share. These are usually the people with stronger opinions, either highly positive or highly negative, which produces a somewhat polarizing effect. In addition, sentiment analysis is constrained by the modeling methods and tools available for Natural Language Processing (NLP); one such constraint is that the algorithms require an English-language corpus. A sentiment analysis that proposes a global sampling plan will therefore necessarily have gaps in its dataset, since non-English texts are omitted from the analysis.
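The hexbin aggregation described above (mean sentiment per cell) can be sketched in a few lines. This is only an illustration, not the case study's code: the function names, the cell size of 5 degrees, and the three sample tweets are all made up, and longitude/latitude are treated as flat x/y coordinates, which distorts cell shapes away from the equator.

```python
import math
from collections import defaultdict


def hex_cell(lon, lat, size):
    """Map a point to the axial (q, r) index of a pointy-top hexagon."""
    # Fractional axial coordinates for hexagons of circumradius `size`.
    q = (math.sqrt(3) / 3 * lon - lat / 3) / size
    r = (2 / 3 * lat) / size
    s = -q - r
    # Cube rounding: round all three coordinates, then recompute the one
    # with the largest rounding error so that q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def mean_sentiment_by_cell(tweets, size=5.0):
    """Average the sentiment scores of all tweets that fall in each cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of scores, count]
    for lon, lat, score in tweets:
        cell = hex_cell(lon, lat, size)
        totals[cell][0] += score
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


# Hypothetical (lon, lat, score) triples; scores lie in [-1, +1].
tweets = [
    (-74.0, 40.7, 0.6),   # New York area
    (-73.9, 40.8, -0.2),  # New York area
    (-0.1, 51.5, 0.1),    # London area
]
for cell, mean in sorted(mean_sentiment_by_cell(tweets).items()):
    print(cell, round(mean, 3))
```

Averaging within a cell is what lets a handful of strongly worded tweets pull a sparsely populated region toward an extreme color, which is worth remembering when reading such maps.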


Anonymous data may still not be anonymous enough

Posted on Mar 15, 2015 in Blog, Datamining, Emerging Science and Technology, Technology and Privacy

It has happened several times before, and now yet another case has been reported in which individuals connected to “anonymous” or “anonymized” data were ultimately identified by researchers.

This time, data scientists analyzed credit card transactions made by 1.1 million people in thousands of stores over 90 days. The data set contained fields such as the date of the transaction, the amount charged, and the name of the store. Personal details such as names and account numbers were removed, but the “uniqueness of people’s behavior” still made them identifiable. Just four random pieces of information were enough to re-identify 90% of shoppers in the database and attach them to other identity records. Researchers at MIT Media Lab, authors of the study, concluded that “the old model of anonymity does not seem to be the right model when we are talking about large scale metadata.”

“A data set’s lack of names, home addresses, phone numbers or other obvious identifiers,” they wrote, “does not make it anonymous nor safe to release to the public and to third parties.”

The full study was published in early 2015 in Science.
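To make the re-identification idea concrete, here is a minimal sketch of how one might estimate the share of people uniquely pinned down by k known data points. It is not the study's code: the `unicity` function, the (store, day) representation, and the three toy users are assumptions made purely for illustration.

```python
import random


def unicity(traces, k, trials=500, seed=0):
    """Estimate how often k random points from one user match only that user.

    `traces` maps a user id to a set of (store, day) points.
    """
    rng = random.Random(seed)
    eligible = sorted(u for u, pts in traces.items() if len(pts) >= k)
    if not eligible:
        raise ValueError("no user has at least k points")
    unique = 0
    for _ in range(trials):
        target = rng.choice(eligible)
        # Pretend an outsider learned k of the target's transactions.
        known = set(rng.sample(sorted(traces[target]), k))
        # Every user whose trace contains all k known points is a candidate.
        matches = [u for u, pts in traces.items() if known <= pts]
        if matches == [target]:
            unique += 1
    return unique / trials


# Hypothetical toy data: no names or account numbers, only behavior.
traces = {
    "user_a": {("cafe", 1), ("grocer", 2), ("cinema", 5)},
    "user_b": {("cafe", 1), ("grocer", 2), ("bakery", 4)},
    "user_c": {("cafe", 1), ("pharmacy", 3), ("bakery", 4)},
}
for k in (1, 2, 3):
    print(k, unicity(traces, k))
```

Even in this tiny example the estimate rises as k grows, which is the intuition behind the finding: removing names does little once a few behavioral points are known.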


Social media data not representative?

Posted on Dec 3, 2014 in Data Visualization, Datamining, Media and Markets

As always, I continue to be fascinated by the rise of social media data usage, its centrality as a data source in so many social science research projects over the past few years, and even how the huge volume of data that social media produces has challenged our existing analytical tools. The latter issue will be the topic of a future post, but for now, I would like to note a recent study that has not yet been published in full, but only as a “commentary” by a team of McGill University and Carnegie Mellon researchers. This study, briefly described here, suggests that using social media data as a proxy for a representative sample is problematic. This will come as no surprise to anyone versed in sampling theory. Even so, the work questions a core assumption of the “big data” movement: that a large enough sample approaches a “census” and therefore comes quite close to a representative sampling effort.

The study authors note that social media has been a “bonanza” for researchers because the data has often been readily available, but although “fast and cheap,” it could also be ultimately misleading. Thousands of academic and industry studies have been published that rely on social media data streams as sources, but these should really be regarded as convenience samples, and not necessarily representative, regardless of how many observations the data stream contains.

“Not everything that can be labeled as ‘Big Data’ is automatically great,” Pfeffer said. He noted that many researchers think, or hope, that if they gather a large enough dataset they can overcome any biases or distortion that might lurk there. “But the old adage of behavioral research still applies: Know Your Data,” he maintained.

Another observation: “As anyone who has used social media can attest, not all ‘people’ on these sites are even people. Some are professional writers or public relations representatives, who post on behalf of celebrities or corporations, others are simply phantom accounts. Some ‘followers’ can be bought. The social media sites try to hunt down and eliminate such bogus accounts (half of all Twitter accounts created in 2013 have already been deleted), but a lone researcher may have difficulty detecting those accounts within a dataset.”

This debate points back to a larger issue, though. As much as the global media and marketing research community has moved to an online research model, using panels, and abandoning probability sampling models — and recognizing the data quality sacrifices there — we are now seeing a transition to “using available data” such as social media traces, as indications of behavior and preferences. The “law of large numbers” does not apply here, even if we wish it did. Most fields are now struggling with how to integrate data from social media and other available sources of interaction. This study is one of the first to demonstrate how we should question these sources, and how the “fast and cheap” may not be an adequate substitute for quality data collection.
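The point that sheer volume does not cure a biased sample can be illustrated with a small simulation. This is a hypothetical sketch, not anything from the study: the function name and the posting probabilities (30% for people with a positive opinion, 10% for everyone else) are invented for the example.

```python
import random


def observed_positive_share(n_people, true_share, p_post_pos, p_post_neg, seed=0):
    """Share of positive opinions among those who choose to post.

    Each simulated person holds a positive opinion with probability
    `true_share`, then posts with a probability that depends on the opinion.
    """
    rng = random.Random(seed)
    posted_pos = 0
    posted_total = 0
    for _ in range(n_people):
        positive = rng.random() < true_share
        p_post = p_post_pos if positive else p_post_neg
        if rng.random() < p_post:
            posted_total += 1
            posted_pos += positive
    if posted_total == 0:
        return float("nan")  # nobody posted, so there is nothing to observe
    return posted_pos / posted_total


# The true positive share is 50% at every sample size.
for n in (1_000, 10_000, 100_000):
    print(n, round(observed_positive_share(n, 0.5, 0.3, 0.1), 3))
```

Under these assumed rates the expected share among posters is 0.5 × 0.3 / (0.5 × 0.3 + 0.5 × 0.1) = 0.75. Collecting more posts only makes the estimate settle more tightly around that biased value; it never moves it toward the true 50%.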


Cal-Adapt: Understanding California’s Climate Change Predictions

Posted on Jun 8, 2011 in Blog, Data Visualization, Datamining



UC Berkeley’s Geospatial Innovation Facility, with support from the Public Interest Energy Research (PIER) agency, has developed a climate adaptation planning tool it is calling “Cal-Adapt.” The tool models climate change scenarios in a mapping format, including projections through 2099 for factors such as wildfire risk, sea-level rise and flood risk, temperature fluctuation, and even snow pack. It is particularly interesting to explore the many microclimates that California offers.

The entire, very extensive dataset is available for download and should keep anyone busy for a long time.


Visualizing Science Readers

Posted on Dec 8, 2010 in Blog, Data Visualization, Datamining, Geolocation and Psychogeography, Media and Markets

Curious about what scientists might be reading? Springer (noted publisher of more than 5 million scientific and academic titles) has launched a new analytics tool that reveals how its users and subscribers are downloading its content.

There are a number of interactive visualization tools at the site, including a world map illustrating the origin of download requests, an updating topical/keyword tag cloud, and displays of real-time downloads. This type of information could be used in many, many ways. On the commercial level, authors and editors can get a fascinating view of which topics are emerging and in what geographic markets. Scholarly and research applications abound as well. As a community, scientists have an outsized impact on society, and understanding trends in their work and interests could be useful.


Using the GPS in your cellphone for traffic reporting…

Posted on Feb 20, 2007 in Blog, Data Visualization, Datamining, Technology and Privacy

IntelliOne Technologies has just launched a real-world test of Need4Speed, a real-time traffic-monitoring system that tracks drivers’ cell phones. From their website: ‘Unlike any other solution available today, the IntelliOne Roadway Speed Measurement System produces live roadway speeds for all highways and surface streets where mobile phone coverage exists, accurate to within three miles per hour.’ Of course, any compulsory phone-tracking system raises privacy concerns. According to an article on LiveScience, ‘the personal identification data of users will be stripped from cell phone signals before they are processed by IntelliOne’s software.’ The cell phone companies have this data, but IntelliOne says it won’t be keeping its own copy.
