KDnuggets, one of the more influential sites for data scientists, recently published a case study showing how sentiment analysis can be applied to track the reaction around a film’s early release cycle. In this case, the film was the 2015 holiday blockbuster Star Wars: The Force Awakens.
Ten million tweets were collected through the Twitter API between 12/4/15 and 12/29/15, with the release date falling on 12/17/15. About 2.5% contained geolocation data, either in the form of direct coordinates or a human-readable location (e.g., New York). The researchers said “…the first thing we looked at was the frequency of Star Wars related tweets in time. It is clearly visible that most of the tweets came from US and UK, which can be easily explained by popularity of Twitter itself in these countries. Next thing to see is the periodicity of day and night, where people tweet more at night than during the day. Also the timezone shift is clearly visible. More interestingly, we can see the build up before the release, as the number of tweets is increasing for a few days before the world premiere and sky rocketing on this day…”
Each tweet was assigned a score between -1 and +1 (-1 being highly negative, +1 highly positive). Results were plotted in a hexbin map visualizing global sentiment, with scores aggregated by mean within each cell. Interestingly, average sentiment shows a steady decline as time passes. There is an observable dip on the day of the world premiere, but “sentiments keep steadily low the whole time.” The researchers make several interesting observations about these results. Since worldwide interest in the film, at least as reported in the media, approached general hysteria, why doesn’t the Twitter analysis parallel this?
One possible explanation is the inherent sampling bias in social network data. After all, the data is derived only from those who voluntarily decide to share, and these are usually people with stronger opinions, either highly positive or highly negative, producing a somewhat polarizing effect. Next, sentiment analysis is constrained by the modeling methods and tools available for Natural Language Processing (NLP); one such constraint is that the algorithms require a data corpus in the English language. Sentiment analysis that proposes a global sampling plan will necessarily have gaps in its dataset, since non-English texts are omitted from the analysis.
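The scoring-and-aggregation step described above can be sketched in a few lines. This is a toy illustration, not the researchers’ actual pipeline: the word lists are invented, a square grid stands in for the hexbin cells, and the tweets are made up.

```python
# Toy sketch: score each tweet in [-1, +1] with a tiny lexicon, then
# aggregate geotagged scores by mean within each grid cell (square
# cells stand in for the study's hexbins).
from collections import defaultdict

POSITIVE = {"great", "awesome", "love", "amazing"}
NEGATIVE = {"boring", "awful", "hate", "disappointing"}

def sentiment_score(text):
    """Score in [-1, +1]: (positive hits - negative hits) / total hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

def mean_sentiment_by_cell(tweets, cell_deg=5.0):
    """Aggregate (lat, lon, text) tuples into a mean score per grid cell."""
    cells = defaultdict(list)
    for lat, lon, text in tweets:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append(sentiment_score(text))
    return {k: sum(v) / len(v) for k, v in cells.items()}

tweets = [
    (40.7, -74.0, "The new Star Wars is amazing I love it"),
    (40.8, -73.9, "Honestly kind of boring"),
    (51.5, -0.1, "awful and disappointing"),
]
print(mean_sentiment_by_cell(tweets))
```

Averaging a +1.0 and a -1.0 tweet in the same cell yields 0.0, which hints at the polarization effect discussed above: strong opposing opinions can cancel out in the mean.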
It has happened several times before, and now yet another study has been released in which individuals connected to “anonymous” or “anonymized” data were ultimately identified by researchers.
This time, data scientists analyzed credit card transactions made by 1.1 million people in thousands of stores over 90 days. The dataset contained fields such as the date of the transaction, the amount charged, and the name of the store. Personal details such as names and account numbers were removed, but the “uniqueness of people’s behavior” still made them identifiable. Just four random pieces of information were enough to re-identify 90% of shoppers in the database and attach them to other identity records. Researchers at the MIT Media Lab, authors of the study, concluded that “the old model of anonymity does not seem to be the right model when we are talking about large scale metadata.”
“A data set’s lack of names, home addresses, phone numbers or other obvious identifiers,” they wrote, “does not make it anonymous nor safe to release to the public and to third parties.”
The full study was published in early 2015 in Science.
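The core idea behind the result, sometimes called “unicity,” can be illustrated with a small sketch: given a few known (shop, day) points about a person, count how many people in the dataset match all of them. Everything below is invented toy data, not the study’s actual data or code.

```python
# Toy sketch of "unicity": what fraction of people are uniquely
# pinned down by k random points drawn from their own transaction
# trace? Records map person -> set of (shop, day) points.
import random

def matches(records, known_points):
    """People whose traces contain every known point."""
    return {pid for pid, trace in records.items() if known_points <= trace}

def unicity(records, k, trials=1000, seed=0):
    """Estimate the fraction of sampled people uniquely identified
    by k random points from their own trace."""
    rng = random.Random(seed)
    people = list(records)
    unique = 0
    for _ in range(trials):
        pid = rng.choice(people)
        trace = list(records[pid])
        pts = set(rng.sample(trace, min(k, len(trace))))
        if matches(records, pts) == {pid}:
            unique += 1
    return unique / trials

records = {
    "A": {("grocer", 1), ("cafe", 2), ("gas", 3), ("cinema", 5)},
    "B": {("grocer", 1), ("cafe", 2), ("bookshop", 4), ("gas", 7)},
    "C": {("bakery", 2), ("cafe", 2), ("gas", 3), ("cinema", 6)},
}
print(unicity(records, 3, trials=200))  # → 1.0 for this toy data
```

Even in this three-person example, any three points suffice to single a person out; the study’s striking finding was that the same thing holds at the scale of 1.1 million people with just four points.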
Oxfam International has very neatly visualized the relationships between a handful of major food conglomerates and their underlying brands in this graphic. It details how only a few corporations control most of the available grocery brands, focusing on 10 of the world’s most powerful food and beverage companies: Coca-Cola, PepsiCo, Unilever, Danone, Mars, Mondelez International, Kellogg’s, General Mills, Nestle, and Associated British Foods. Oxfam calls these companies the Big 10 and details their environmental impact on a website devoted to the nonprofit’s “Behind the Brands” campaign.
Oxfam reports that the Big 10 emitted 263.7 million tons of greenhouse gas in 2013. If the companies were a nation, they would be the 25th biggest polluter in the world.
As always, I continue to be fascinated by the rise of social media data usage, its centrality as a data source in so many social science research projects over the past few years, and even how the huge volume of data that social media produces has challenged our existing analytical tools. The latter issue will be the topic of a future post, but for now, I would like to note a recent study that has not yet been published in full, but only as a “commentary,” by a team of McGill University and Carnegie Mellon researchers. The study, briefly described here, suggests that using social media data as a proxy for a representative sample is problematic. This will come as no surprise to anyone versed in sampling theory. Even so, this work questions the rise of the “big data” movement and its assumption that large samples will approach a “census” and therefore come quite close to a representative sampling effort.
The study authors note that social media has been a “bonanza” for researchers because the data is often readily available, but although “fast and cheap,” it can also be ultimately misleading. Thousands of academic and industry studies have been published that rely on social media data streams as sources, but these should really be regarded as convenience samples, not necessarily representative ones, regardless of how massive the number of observations in the data stream used for analysis.
“Not everything that can be labeled as ‘Big Data’ is automatically great,” said study co-author Jürgen Pfeffer. He noted that many researchers think, or hope, that if they gather a large enough dataset they can overcome any biases or distortions that might lurk there. “But the old adage of behavioral research still applies: Know Your Data,” he maintained.
Another observation: “As anyone who has used social media can attest, not all ‘people’ on these sites are even people. Some are professional writers or public relations representatives, who post on behalf of celebrities or corporations; others are simply phantom accounts. Some ‘followers’ can be bought. The social media sites try to hunt down and eliminate such bogus accounts — half of all Twitter accounts created in 2013 have already been deleted — but a lone researcher may have difficulty detecting those accounts within a dataset.”
This debate points back to a larger issue, though. As much as the global media and marketing research community has moved to an online research model, using panels, and abandoning probability sampling models — and recognizing the data quality sacrifices there — we are now seeing a transition to “using available data” such as social media traces, as indications of behavior and preferences. The “law of large numbers” does not apply here, even if we wish it did. Most fields are now struggling with how to integrate data from social media and other available sources of interaction. This study is one of the first to demonstrate how we should question these sources, and how the “fast and cheap” may not be an adequate substitute for quality data collection.
Brightpoint Consulting recently released a small collection of interactive visualizations based on open, publicly available data from the US government. Characterized by a rather organic graphic design style and color palette, each visualization makes a socially and politically relevant dataset easily accessible.
The custom chord diagram titled Political Influence [brightpointinc.com] highlights the monetary contributions made by the top Political Action Committees (PACs) during the 2012 congressional election cycle, for both the House of Representatives and the Senate.
The hierarchical browser 2013 Federal Budget [brightpointinc.com] reveals the major flows of spending in the US government at the federal, state, and local levels, such as the relative spending on education versus defense.
The circular flow chart United States Trade Deficit [brightpointinc.com] shows the US trade deficit over the last 11 years, by month. The United States sells goods to the countries at the top while, vice versa, the countries at the bottom sell goods to the US. The dollar amount in the middle represents the cumulative deficit over this period.
Visits [v.isits.in] automatically visualizes personal location histories, trips, and travels by aggregating one’s geotagged Flickr collection with one’s Google Maps history. Developed by Alice Thudt, Dominikus Baur, and Prof. Sheelagh Carpendale, the map runs locally in the browser, so no sensitive data is uploaded to external servers.
The timeline visualization goes beyond the classical pin representation, in which pins tend to overlap and are relatively hard to read. Instead, the data is shown as ‘map-timelines’, a combination of maps and a timeline that conveys location histories as sequences of maps: the bigger the map, the longer the stay. This way, the temporal sequence is clear, as the trip starts with the map on the left and continues towards the right.
A place slider allows the map granularity to be adjusted, ranging from street level to country level.
Read the academic research here [PDF]
Culturegraphy [culturegraphy.com], developed by “Information Model Maker” Kim Albrecht, reveals the complex relationships within over 100 years of movie references.
Movies are shown as unique nodes, while their influences are depicted as directed edges. The color gradient from blue to red that originates in the 1980s denotes the era of postmodern cinema, in which movies tend to adapt and combine references from other movies.
Although the visualizations look rather minimalistic at first sight, their interactive features are quite sophisticated and the resulting insights are genuinely interesting. Therefore, do not miss the explanatory movie below.
Via @albertocairo.
Mapping Music on Facebook [facebookstories.com] by Stamen Design for Facebook shows the dynamic characteristics of the typical listening activity across Facebook.
Inspired by the dynamic movement of a graphic equalizer, Beatquake maps the popularity of the top three most popular songs in the U.S., each day over the course of 90 days, by way of vertically moving particles.
Colored layers, each representing one song, rise and fall over geographic locations to correspond with the number of plays in that area. The texture of the map is driven by BPMs (beats per minute), and thus changes as one song overtakes another in popularity.
“Groupon is Hastening the Demise of the Newspaper Industry,” wrote a trade publication in April 2011. However, some newspapers are betting that “daily deal” offerings could reinvigorate the industry. Newspapers are turning to startups such as Shoutback and Nimble Commerce, among others offering consulting and white-label systems to power deal mechanisms. And newspapers have things many Groupon clones don’t: large local audiences that are still used to turning to newspapers for coupons, and a sales force with established local relationships.
Reportedly, The Boston Globe is offering its own Boston Deals promo after trying a partnership with BuyWithMe last year (and SCVNGR, also last year) as it moves to separate its online content from a potentially more lucrative e-commerce business. The Boston Phoenix offers deals, and the Star Tribune in the Twin Cities offers STeals.
The struggle newspapers have had in recent years to make money from their content is obvious to all. Paywalls, apps, etc., have all been attempted.
But newspapers have failed to leverage the single most important advantage they have over the emerging media types: local audiences and local sales relationships. An intelligent “Daily Deals” offering could be the key. The Globe, Phoenix, and Star Tribune have each come out with their own versions of this play. Or they could aggregate local deals from Groupon and its numerous clones, Yipit-style. (These last observations are from MIT’s Advertising Lab’s excellent blog.)
Has any newspaper actually tried to recapture the classifieds business from Craigslist? Newspapers ought to be able to offer online classifieds with more power and usability than the Craigslist version, given a little planning and proper research.
In the past year I’ve spent a lot of time talking to local businesses in southern California, trying to understand their experience with the “Groupon” concept, which many have tried and abandoned. The reason? Yes, “daily deal” promos bring business in the door, but sporadically and often at a loss to the retailer. Local businesses have not figured out how to capture the Daily Deals crowd and turn them into reliable repeat customers, and that is something newspapers will need to consider if they plan to compete in this deal space. Newspapers could step into this void and help local businesses profit from Daily Deals, thereby strengthening their own brands and relationships.
The YouTube Trends Map [youtube.com] is a visualization of the most shared and viewed videos in various regions across the United States over the last 12 to 24 hours. It accompanies the more analytical Trends Dashboard to provide a full overview of the rising videos and trends on YouTube in terms of actual views or shares, filtered by geographical location, gender, or age of the viewers.
The demographic information of viewers is solely based on the information reported by registered, logged-in users in their YouTube account profiles. Next to the geographical map, the Trends Map also includes a series of horizontal bar graphs, each representing a graphical summary of the top videos for a different demographic. Within each bar, a video is represented by a colorful segment whose colors are drawn from the video’s thumbnail. The width of a video’s segment reflects the number of regions on the map where the video is #1.
See also: YouTube View Tracking Data Visualization and YouTube Swarm Related Videos.
British company Path Intelligence is testing a shopper tracking system called Footpath in a southern California shopping mall. The technology picks up the unique IDs in shoppers’ cell phones in order to study their movements through stores and throughout the mall.
Describing the product: “Path Intelligence detects each shopper carrying a phone that enters the mall. It identifies how long they stay, which shops they visit, whether or not they have visited before and how they travel around the mall during their trip. Path Intelligence enables data-driven analysis of a mall, the retail tenancy mix, the impact of marketing events and much more. Path Intelligence specialises in digitising real-world behaviour to enable you to recognise profitable opportunities.”
Used during the 2011 holiday shopping season in the Promenade Mall in Temecula, California (north and inland of San Diego), the system is said to already be in use in some European and Australian shopping centers. It is unclear whether shoppers are alerted in any way that they will be tracked via their cellphone IDs, or how the collected data will be used. While retailers have long collected whatever information they can about shopper behavior, including how consumers tend to move about their stores, this is the first time they can uniquely identify a shopper and passively track return visits.
Even fashion models are being disintermediated. Some catalogs trade lavish description for models (notably, J. Peterman), while others keep the idea of human models but save on the pesky expense of photo shoots by computer-generating the women. At H&M, skin, hair, and eye color are changed with the click of a mouse to give the impression of many different models.
Market research has long shown that people look first and longest at other people. In catalog marketing, the exact items in the colors worn by the model nearly always sell best. Will we continue to be as fascinated by facsimile humans as we are by each other?
UC Berkeley scientists have demonstrated a method to reconstruct words that a person may be thinking by examining their brain activity. The technique, reported in PLoS Biology, relies on gathering electrical signals directly from patients’ brains via implanted electrodes. Computer models then reconstructed words and sounds from the signal patterns.
Although possible uses include helping comatose, locked-in, or speech-impaired patients to communicate, concerns have been raised that the method could be used for interrogation. fMRI has already been used by federal law enforcement agencies to detect signs of deception in detained suspects.
Brands have a psychological reality in that they provide differentiation for goods that might otherwise be seen as interchangeable commodities. I give you vodka, salt, and baking soda as pertinent examples.
The advent of 3-D printers in recent years, though, introduces the possibility that branded goods may have less utility in the future. The Pirate Bay, already a source of downloadable content of all types (some of it probably of dubious legality), has created a new category of downloadable content containing the code for producing items on 3-D printers. Once goods become “content” and can be produced by anyone, what happens to brands? The ability to easily turn digital content into a physical object changes the marketing premise. Nicknamed “physibles” over at The Pirate Bay, these digital blueprints lead the way to a nearby future in which “…you will print your spare parts for your vehicles. You will download your sneakers within 20 years.”
Daily Mail: “Translucent TV: Lumus’ PD-18-2 is a set of spectacles that can beam high-quality images directly into your eyes but allows the user to see through the images too.”