Big Data – What Is It?
Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth.
According to IDC, it is imperative that organizations and IT leaders focus on the ever-increasing volume, variety and velocity of information that forms big data.1
- Volume. Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. In the past, excessive data volume created a storage issue. But with today’s decreasing storage costs, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is relevant.
- Variety. Data today comes in all types of formats – from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions. By some estimates, 80 percent of an organization’s data is not numeric! But it still must be included in analyses and decision making.
- Velocity. According to Gartner, velocity “means both how fast data is being produced and how fast the data must be processed to meet demand.” RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time. Reacting quickly enough to deal with velocity is a challenge to most organizations.
Big data according to SAS
At SAS, we consider two other dimensions when thinking about big data:
- Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something big trending on social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal and event-triggered peak data loads can be challenging to manage – especially with social media involved.
- Complexity. When you deal with huge volumes of data, it comes from multiple sources. It is quite an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
Ultimately, regardless of the factors involved, we believe that the term big data is relative; it applies (per Gartner’s assessment) whenever an organization’s ability to handle, store and analyze data exceeds its current capacity.
Examples of big data
- RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.
- 10,000 payment card transactions are made every second around the world.2
- Walmart handles more than 1 million customer transactions an hour.3
- 340 million tweets are sent per day. That’s nearly 4,000 tweets per second.4
- Facebook has more than 901 million active users generating social interaction data.5
- More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.
Uses for big data
So the real issue is not that you are acquiring large amounts of data (because we are clearly already in the era of big data). It’s what you do with your big data that matters. The hopeful vision for big data is that organizations will be able to harness relevant data and use it to make the best decisions.
Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:
- Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
- Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
- Quickly identify customers who matter the most.
- Generate retail coupons at the point of sale based on the customer’s current and past purchases, ensuring a higher redemption rate.
- Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
- Analyze data from social media to detect new market trends and changes in demand.
- Use clickstream analysis and data mining to detect fraudulent behavior.
- Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
High-performance analytics, coupled with the ability to score every record and feed it into the system electronically, can identify fraud faster and more accurately.
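As a toy illustration of that record-level scoring idea (the rules below are invented for the sketch, not any vendor's actual model), every transaction in a stream can be given a score, with high scorers flagged as the stream is processed:

```python
# Toy illustration of record-level fraud scoring (invented rules, not any
# vendor's model): every transaction in the stream gets a score, and
# high scorers are flagged as the stream is processed.

def score_transaction(txn, history):
    """Return a fraud score in [0, 1] for one transaction."""
    score = 0.0
    # Far above this card's historical average spend: strong signal
    avg = sum(history) / len(history) if history else txn["amount"]
    if txn["amount"] > 5 * avg:
        score += 0.5
    # Round amounts are a mild card-testing signal
    if txn["amount"] % 100 == 0:
        score += 0.1
    # Small-hours activity: another mild signal
    if txn["hour"] < 5:
        score += 0.2
    return min(score, 1.0)

def flag_stream(transactions, histories, threshold=0.5):
    """Score each transaction as it arrives; yield flagged (card, score)."""
    for txn in transactions:
        s = score_transaction(txn, histories.get(txn["card"], []))
        if s >= threshold:
            yield txn["card"], s

txns = [
    {"card": "A", "amount": 40, "hour": 14},
    {"card": "A", "amount": 5000, "hour": 3},   # 125x the card's average
]
flags = list(flag_stream(txns, {"A": [35, 45, 40]}))
# only the second transaction is flagged (score ≈ 0.8)
```

A production system would replace the hand-written rules with a trained model, but the shape is the same: score every record as it arrives, act on the outliers.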
Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.
- What if your data volume gets so large and varied you don’t know how to deal with it?
- Do you store all your data?
- Do you analyze it all?
- How can you find out which data points are really important?
- How can you use it to your best advantage?
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can’t analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.
You now have two choices:
- Incorporate massive data volumes in analysis. If the answers you are seeking will be better provided by analyzing all of your data, go for it. The game-changing technologies that extract true value from big data – all of it – are here today. One approach is to apply high-performance analytics to analyze the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which big data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine data relevance based on context. This analysis can be used to determine which data should be included in analytical processes and which can be placed in low-cost storage for later availability if needed.
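The second choice, front-end relevance triage, can be sketched as a simple routing step; the relevance criterion here (SKUs currently under analysis) is a made-up example, not any vendor's logic:

```python
# Sketch of front-end relevance triage: records that pass a
# context-dependent relevance test go to the analytics store, everything
# else goes to low-cost archive storage for later retrieval if needed.
# The criterion (membership in a set of "active" SKUs) is illustrative.

def is_relevant(record, active_skus):
    """Hypothetical relevance test: keep records for SKUs under analysis."""
    return record["sku"] in active_skus

def triage(records, active_skus):
    analytics, archive = [], []
    for rec in records:
        (analytics if is_relevant(rec, active_skus) else archive).append(rec)
    return analytics, archive

records = [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}, {"sku": "A1", "qty": 5}]
hot, cold = triage(records, active_skus={"A1"})
# hot holds the two A1 records; cold holds B7 for later availability
```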
Now you can run hundreds of thousands of models at the product level – at the SKU level – because you have the big data and analytics to support those models at that level.
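That per-SKU modeling idea can be sketched with a toy trend model fitted independently for each SKU and dispatched in parallel; a real deployment would run on a grid or in-database, and the sales figures below are invented:

```python
# Toy illustration of fitting one small model per SKU in parallel.
# Each "model" is an ordinary least-squares trend line over weekly sales;
# a thread pool stands in for the grid that a real deployment would use.
from concurrent.futures import ThreadPoolExecutor

def fit_trend(series):
    """Least-squares slope of sales vs. week index for one SKU."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

sales_by_sku = {
    "A1": [10, 12, 14, 16],   # rising demand
    "B7": [9, 9, 9, 9],       # flat demand
}

with ThreadPoolExecutor() as pool:
    slopes = dict(zip(sales_by_sku,
                      pool.map(fit_trend, sales_by_sku.values())))
# slopes["A1"] is 2.0 (two extra units per week); slopes["B7"] is 0.0
```

The point is the decomposition: because each SKU's model is independent, the work parallelizes trivially across however many nodes the data volume demands.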
A number of recent technology advancements are enabling organizations to make the most of big data and big data analytics:
- Cheap, abundant storage and server processing capacity.
- Faster processors.
- Affordable large-memory capabilities and distributed processing frameworks such as Hadoop.
- New storage and processing technologies designed specifically for large data volumes, including unstructured data.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
It is very important to understand that not all of your data will be relevant or useful. But how can you find the data points that matter most? It is a problem that is widely acknowledged. “Most businesses have made slow progress in extracting value from big data. And some companies attempt to use traditional data management practices on big data, only to learn that the old rules no longer apply,” says Dan Briody, in the 2011 Economist Intelligence Unit’s publication, “Big Data: Harnessing a Game-Changing Asset.”
Big data solutions from SAS
How can you make the most of all that data, now and in the future? It is a twofold proposition. You can only optimize your success if you weave analytics into your big data solution. But you also need analytics to help you manage the big data itself.
There are several key technologies that can help you get a handle on your big data, and more important, extract meaningful value from it.
- Information management for big data. Many vendors look at big data as a discussion related to technologies such as Hadoop, NoSQL, etc. SAS takes a more comprehensive data management/data governance approach by providing a strategy and solutions that allow big data to be managed and used more effectively.
- High-performance analytics. By taking advantage of the latest parallel processing power, high-performance analytics lets you do things you never thought possible because the data volumes were just too large.
- High-performance visual analytics. High-performance visual analytics lets you explore huge volumes of data in mere seconds so you can quickly identify opportunities for further analysis. Because it’s not just that you have big data, it’s the decisions you make with the data that will create organizational gains.
- Flexible deployment options for big data. Flexible deployment models bring choice. High-performance analytics from SAS can analyze billions of variables, and those solutions can be deployed in the cloud (with SAS or another provider), on a dedicated high-performance analytics appliance or within your existing IT infrastructure, whichever best suits your organization’s requirements.
1 Source: IDC. “Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO,” September 2011.
2 Source: American Bankers Association, March 2009
3 Source: http://www.economist.com
4 Source: http://blog.twitter.com
5 Source: http://newsroom.fb.com/
What is big data?
Big data is not a precise term; rather it’s a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used.
“My belief is that data is a terrible thing to waste. Information is valuable. In running our business, we want to make sure that we’re not leaving value on the table—value that can create better experiences for customers or better financial results for the company.”
—Johann Schleier-Smith, Tagged.com
“Data growth is a factor that everybody is trying to deal with. We’re seeing tremendous growth in the size of data, year on year, as I think everyone is. Finding effective approaches for containing cost so that it doesn’t run away with your budget is an issue. Another major challenge is dealing with unstructured data. How do you manage that data effectively? How do you control its growth? And how do you actually make that data part of the information fabric that people can draw upon to make decisions or look […]”
—Rich Aducci, Boston Scientific
“Being able to look at big data can shorten your time to information, which has immediate value. For instance, if I want to know how a new product launch is going, I can analyze millions of social media conversations and know if we’re successful instantly, instead of waiting months for a customer satisfaction survey.”
—Guy Chiarello, JPMorgan Chase
“It’s more than a technology shift. There has to be a mindset shift about what you can do with data. For years, what CIOs had to deal with is managing information. It was all about managing information efficiently: how much can you compress it, de-dupe it, take snapshots and act upon it. Planning for big data extends that efficiency so you can do more with that data very quickly. The classic IT organizations are used to data warehousing, business intelligence where data is updated once, twice, maybe four times a month. Now we’re at the point where you’ve got access to everything all the time through real and uptime data.”
—Sanjay Mirchandani, EMC
Big data in the cloud
Cloud models tame big data while extracting business value from it. This delivery model gives organizations a flexible option for achieving the efficiencies, scalability, data portability and affordability they need for big data analytics. Cloud models encourage access to data and provide an elastic pool of resources to handle massive scale, solving the problem of how to store huge data volumes and how to amass the computing resources required to manipulate them. In the cloud, data is provisioned and spread across multiple sites, allowing it to sit closer to the users who require it, speeding response times and boosting productivity. And, because cloud makes IT resources more efficient and IT teams more productive, enterprise resources are freed to be […]

Cloud services specifically designed for big data analysis are starting to emerge, providing platforms and tools designed to perform analytics quickly and efficiently. Companies that recognize the importance of big data but don’t have the resources to build the required infrastructure or acquire the necessary tools to exploit it will benefit from considering these cloud services.
“Real-time data will continue to grow at a faster pace than the capability to move it. Unless we change the way we address the problem, we are going to find ourselves constantly struggling to squeeze information through very narrow tubes. I believe that we’re going to be faced with a situation where more and more we have to do the analytics where the data resides. Instead of moving the data for processing, we are going to move analytics closer to the data.”
—Dimitris Mavroyiannis, Eurobank EFG Group
“The cloud will play an important role in big data. I think it’s going to be increasingly rare that you’re going to be able to run all this [infrastructure] at home. And why would you, in some cases?”
—Deirdre Woods, The Wharton School of the University of Pennsylvania
“Because it has become so cheap and easy to store data, a lot of companies have operated under this idea of, ‘let me just store it, I’ll deal with it when I figure out how to deal with it.’ But now the velocity of growth is increasing. The amount of storage we’re using is proliferating. All of that compels us to bring a business discipline to this ecosystem that helps us understand what needs to be retained and for how long. Big data raises the stakes on why content management needs to be promptly and successfully woven into the business operation. Then it’s as simple as basic records management blocking and tackling.”
—John Chickering, Fidelity Investments
“With compute power being what it is … you don’t need to build big tables and land them on disks and keep them on disks. You build them on the fly. That has reduced data storage needs dramatically. It’s a form of, I guess, intellectual compression, not algorithmic compression. It’s just smart data modeling and using the power of what you’ve got.”
—Ian Willson, The Boeing Company
“Big data calls for a lot more creativity in how you use data. You have to be way more creative about where you look for business value: if I combined this data with this data, what could it tell me?”
—Joe Solimando, Disney Consumer Products
“Instead of waiting for big data to stop operations, we should better organize or archive our data, manage it over its lifecycle and actually get rid of it. You can move out of mitigation mode by doing a better job of managing your information up front—in other words putting the data to more efficient use.”
—David Blue, The Boeing Company
“When people here have an idea, and they see they could do something differently if we make [processes] more real time in the future, and they [make changes to the service] and the numbers go up by 10%, people get really excited. So what I want to do is create that type of energy and enthusiasm. That’s what I want to be the dominant dynamic of the workplace. We really have that here. It’s pretty fun. People are getting results on a routine basis, and it’s because we’ve created frictionless access to data.”
—Johann Schleier-Smith, Tagged.com
“Our Wharton Research Data Services experience has shown us the value of organizing data so you can look at multiple data sources to analyze and draw conclusions. We’ve seen over years—and it’s a trend that’s certainly increasing—when people use the word “data” they want to see three or four and now five, six data sets joined.”
—Deirdre Woods, The Wharton School of the University of Pennsylvania
“Every vendor wants to bundle everything and provide a one-stop shop. This makes sense, and I don’t have a problem with that, as long the vendor doesn’t lock me in into a specific solution. And I think the only way that this can be avoided is if the vendor follows industry standards. We’ve moved into a world where standards prevail, especially in big data analytics, where data originate from multiple sources. Vendors that provide standard-based big data solutions are much more likely to be preferred.”
—Dimitris Mavroyiannis, Eurobank EFG Group
[…] with business users will expand the capabilities of IT workers, bringing them closer to the strategic goal of aligning business and IT. Business workers will gain a better understanding of the capabilities and limitations of technology.
“At EMC we are developing roles for what we call the data scientist: people with a good amount of data competence who have skill sets partitioning information to make it easier to work with. The capabilities people in this role bring in the value chain of an organization are pretty tremendous. The central function is to add core business value and mask (for business users) the heavy lifting that happens behind the scenes in IT.”
—Sanjay Mirchandani, EMC
“The biggest challenge is the people. Big data requires non-traditional IT skill sets. We’re bringing in more Ph.D.s and people with expertise in outside areas to help our business users work with information.”
—Guy Chiarello, JPMorgan Chase
“A few years ago, I would’ve said the value in BI overshadowed that of big data, but now I’d say the relationship is even, if not reversed. There’s more outside information now that can be digitized—about user behaviors and external conditions—that can be layered on top of structured data. This type of analysis opens a window into not just what happened and why, but it also helps you see what’s […]”
—Guy Chiarello, JPMorgan Chase
What is big data?
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
- Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify potential fraud
- Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data – structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest
- Exploit the 80% data growth in images, video and documents to improve customer satisfaction
Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
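The velocity point above (acting on events as they arrive rather than after they land in storage) can be sketched with a per-account sliding window; the window size and threshold are invented for illustration:

```python
# Sketch of stream processing for velocity: keep a short sliding window
# of timestamps per account and flag bursts as events arrive, without
# ever storing the full stream. Window and threshold are illustrative.
from collections import defaultdict, deque

def burst_monitor(events, window=3, threshold=3):
    """Yield (account, count) whenever `threshold` events arrive within
    `window` seconds for the same account."""
    recent = defaultdict(deque)   # account -> timestamps inside the window
    for ts, account in events:
        q = recent[account]
        q.append(ts)
        while q and ts - q[0] > window:
            q.popleft()           # expire timestamps that fell out of window
        if len(q) >= threshold:
            yield account, len(q)

stream = [(0, "acct1"), (1, "acct1"), (2, "acct2"), (2, "acct1")]
alerts = list(burst_monitor(stream))
# acct1 triggers an alert: three events within two seconds
```

The same sliding-window shape underlies real streaming fraud checks; production systems swap the in-memory deque for a distributed stream processor, but the per-key windowed state is the core idea.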
Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. Until now, there was no practical way to harvest this opportunity. Today, IBM’s platform for big data uses state of the art technologies including patented advanced analytics to open the door to a world of possibilities.
“At IBM, big data is about the ‘the art of the possible.’ . . . The company is certainly a leader in this space.”
“Four Vendor Views on Big Data and Big Data Analytics: IBM”
Hurwitz & Associates, Fern Halper, January 2012
From Wikipedia, the free encyclopedia
In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10¹⁸) bytes of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. What is considered “big data” varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
Big data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, a new platform of “big data” tools has arisen to handle sensemaking over large quantities of data, as in the Apache Hadoop Big Data Platform.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this “3Vs” model for describing big data. In 2012, Gartner updated its definition as follows: “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
The Large Hadron Collider (LHC) experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering out more than 99.999% of these streams, there are 100 collisions of interest per second.
- As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
- If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with: it would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10²⁰) bytes per day, almost 200 times more than all the other sources in the world combined.
Science and research
- When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of data every five days.
- Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
- Computational social science — Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators. The authors examined Google query logs made by Internet users in 45 different countries in 2010 and calculated the ratio of the volume of searches for the coming year (‘2011’) to the volume of searches for the previous year (‘2009’), which they call the ‘future orientation index’. They compared the future orientation index to the per capita GDP of each country and found a strong tendency for countries in which Google users enquire more about the future to exhibit a higher GDP. The results hint that there may be a relationship between the economic success of a country and the information-seeking behavior of its citizens.
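Since the future orientation index described above is just a ratio of search volumes, the computation can be sketched directly; the country labels, search volumes and GDP figures below are invented placeholders, not data from the study:

```python
# Sketch of the 'future orientation index' computation: the ratio of
# search volume for the coming year to the previous year, correlated
# with per-capita GDP. All figures are invented placeholders.

def future_orientation(vol_next_year, vol_prev_year):
    return vol_next_year / vol_prev_year

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# country -> (searches for '2011', searches for '2009', GDP per capita)
countries = {
    "A": (1200, 800, 45000),
    "B": (900, 900, 30000),
    "C": (600, 1000, 12000),
}
foi = [future_orientation(nxt, prv) for nxt, prv, _ in countries.values()]
gdp = [g for _, _, g in countries.values()]
r = pearson(foi, gdp)   # a positive r mirrors the reported tendency
```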
- In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems facing the government. The initiative was composed of 84 different big data programs spread across six departments.
- The United States Federal Government owns six of the ten most powerful supercomputers in the world.
- The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster. 
- Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.
- Facebook handles 40 billion photos from its user base.
- FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
- The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
Following decades of work on the effective use of information and communication technologies for development (or ICT4D), it has been suggested that big data can make important contributions to international development. On the one hand, the advent of big data delivers the cost-effective prospect of improving decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management. On the other hand, all the well-known concerns of the big data debate, such as privacy, interoperability challenges, and the almighty power of imperfect algorithms, are aggravated in developing countries by long-standing development challenges like lacking technological infrastructure and economic and human resource scarcity. “This has the potential to result in a new kind of digital divide: a divide in data-based intelligence to inform decision-making.”
“Big data” has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics. This industry on its own is worth more than $100 billion and is growing at almost 10 percent a year, about twice as fast as the software business as a whole.
Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, meaning more people are becoming both more affluent and more literate, which in turn drives information growth. The world’s effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualisation. Multidimensional big data can also be represented as tensors, which can be handled more efficiently by tensor-based computation, such as multilinear subspace learning. Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
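To make one entry on that list concrete, here is a minimal A/B test expressed as a two-proportion z-test, using only the standard library; the conversion counts are invented:

```python
# Minimal A/B test from the techniques list: a two-proportion z-test
# comparing conversion rates of two site variants. Counts are invented.
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for rates conv/n."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

z, p_value = two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
# variant B converts at 5.2% vs. A's 4.0%; a small p-value suggests the
# difference is unlikely to be chance
```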
Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
Practitioners of big data analytics are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures (SAN and NAS) is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems, which thrive on system performance, commodity infrastructure, and low cost.
Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics, so latency is avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.
In March 2012, The White House announced a national “Big Data Initiative” that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.
The initiative included a National Science Foundation “Expeditions in Computing” grant of $10 million over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems, from predicting traffic congestion to fighting cancer.
The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department’s Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department’s supercomputers.
The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.
Critiques of the Big Data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done.
Critiques of the Big Data paradigm
Broad critiques have been leveled at Chris Anderson’s assertion that big data will spell the end of theory, focusing in particular on the notion that big data must always be contextualized in its social, economic and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, fewer than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, “big data”, no matter how comprehensive or well analyzed, needs to be complemented by “big judgment”, according to an article in the Harvard Business Review. In much the same vein, it has been pointed out that decisions based on the analysis of Big Data are inevitably “informed by the world as it was in the past, or, at best, as it currently is”. Fed by large amounts of data on past experiences, algorithms can predict future developments only if the future resembles the past. If the system dynamics of the future change, the past can say little about the future; anticipating such change would require a thorough understanding of the system dynamics, which implies theory. In response to this critique, it has been suggested that Big Data approaches be combined with computer simulations, such as agent-based models. Built on collections of mutually interdependent algorithms, these are becoming increasingly good at predicting the outcomes of complex social processes, even in unknown future scenarios.
Consumer privacy advocates are concerned about the threat to privacy represented by the increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy.
Critiques of Big Data execution
Danah Boyd has raised concerns that the use of big data in science can neglect principles such as choosing a representative sample, out of excessive concern with merely handling the huge quantities of data, an approach that may produce results biased in one way or another. Integration across heterogeneous data resources (some that might be considered “big data” and others not) presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.
See also
- Bloom filter
- Cloud computing
- Data assimilation
- Database theory
- Database-centric architecture
- Data Intensive Computing
- Data structure
- Multilinear subspace learning
- Object database
- Online database
- Real-time database
- Relational database
- Tuple space
- Unstructured data
- Very large database
- Extremely large databases
- Industrial Internet
How President Obama’s campaign used big data to rally individual voters
But for Democrats, there was bleak consolation in all this: Dan Wagner had seen it coming. When Wagner was hired as the DNC’s targeting director, in January of 2009, he became responsible for collecting voter information and analyzing it to help the committee approach individual voters by direct mail and phone. But he appreciated that the raw material he was feeding into his statistical models amounted to a series of surveys on voters’ attitudes and preferences. He asked the DNC’s technology department to develop software that could turn that information into tables, and he called the result Survey Manager.
It is yet another thing to be right five months before you’re going to lose. As the 2010 midterms approached, Wagner built statistical models for selected Senate races and 74 congressional districts. Starting in June, he began predicting the elections’ outcomes, forecasting the margins of victory with what turned out to be improbable accuracy. But he hadn’t gotten there with traditional polls. He had counted votes one by one. His first clue that the party was in trouble came from thousands of individual survey calls matched to rich statistical profiles in the DNC’s databases. Core Democratic voters were telling the DNC’s callers that they were much less likely to vote than statistical probability suggested. Wagner could also calculate how much the Democrats’ mobilization programs would do to increase turnout among supporters, and in most races he knew it wouldn’t be enough to cover the gap revealing itself in Survey Manager’s tables.
His congressional predictions were off by an average of only 2.5 percent. “That was a proof point for a lot of people who don’t understand the math behind it but understand the value of what that math produces,” says Mitch Stewart, Organizing for America’s director. “Once that first special [election] happened, his word was the gold standard at the DNC.”
The significance of Wagner’s achievement went far beyond his ability to declare winners months before Election Day. His approach amounted to a decisive break with 20th-century tools for tracking public opinion, which revolved around quarantining small samples that could be treated as representative of the whole. Wagner had emerged from a cadre of analysts who thought of voters as individuals and worked to aggregate projections about their opinions and behavior until they revealed a composite picture of everyone. His techniques marked the fulfillment of a new way of thinking, a decade in the making, in which voters were no longer trapped in old political geographies or tethered to traditional demographic categories, such as age or gender, depending on which attributes pollsters asked about or how consumer marketers classified them for commercial purposes. Instead, the electorate could be seen as a collection of individual citizens who could each be measured and assessed on their own terms. Now it was up to a candidate who wanted to lead those people to build a campaign that would interact with them the same way.
After the voters returned Obama to office for a second term, his campaign became celebrated for its use of technology—much of it developed by an unusual team of coders and engineers—that redefined how individuals could use the Web, social media, and smartphones to participate in the political process. A mobile app allowed a canvasser to download and return walk sheets without ever entering a campaign office; a Web platform called Dashboard gamified volunteer activity by ranking the most active supporters; and “targeted sharing” protocols mined an Obama backer’s Facebook network in search of friends the campaign wanted to register, mobilize, or persuade.
But underneath all that were scores describing particular voters: a new political currency that predicted the behavior of individual humans. The campaign didn’t just know who you were; it knew exactly how it could turn you into the type of person it wanted you to be.
Four years earlier, Dan Wagner had been working at a Chicago economic consultancy, using forecasting skills developed studying econometrics at the University of Chicago, when he fell for Barack Obama and decided he wanted to work on his home-state senator’s 2008 presidential campaign. Wagner, then 24, was soon in Des Moines, handling data entry for the state voter file that guided Obama to his crucial victory in the Iowa caucuses. He bounced from state to state through the long primary calendar, growing familiar with voter data and the ways of using statistical models to intelligently sort the electorate. For the general election, he was named lead targeter for the Great Lakes/Ohio River Valley region, the most intense battleground in the country.
After Obama’s victory, many of his top advisors decamped to Washington to make preparations for governing. Wagner was told to stay behind and serve on a post-election task force that would review a campaign that had looked, to the outside world, technically flawless.
In the 2008 presidential election, Obama’s targeters had assigned every voter in the country a pair of scores based on the probability that the individual would perform two distinct actions that mattered to the campaign: casting a ballot and supporting Obama. These scores were derived from an unprecedented volume of ongoing survey work. For each battleground state every week, the campaign’s call centers conducted 5,000 to 10,000 so-called short-form interviews that quickly gauged a voter’s preferences, and 1,000 interviews in a long-form version that was more like a traditional poll. To derive individual-level predictions, algorithms trawled for patterns between these opinions and the data points the campaign had assembled for every voter—as many as one thousand variables each, drawn from voter registration records, consumer data warehouses, and past campaign contacts.
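The scoring described here is, in essence, a probabilistic classifier trained on survey responses. The sketch below shows the general shape of such a model, assuming a simple logistic form; the feature names, weights, and voter record are invented for illustration, since the campaign's actual variables and coefficients are not public:

```python
import math

# Hypothetical per-voter features, normalized to [0, 1], of the kind drawn
# from registration records, consumer data, and past campaign contacts.
voter = {"age_norm": 0.34, "past_turnout": 0.75, "donor_flag": 1.0, "urban": 1.0}

# Weights as a campaign model might fit them against short-form interview
# responses (these numbers are made up, not any real model's coefficients).
weights = {"age_norm": -0.8, "past_turnout": 0.5, "donor_flag": 1.2, "urban": 0.9}
bias = -1.0

def support_score(features, weights, bias):
    """Logistic model: estimated probability the voter supports the candidate."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

score = support_score(voter, weights, bias)
print(f"support score: {score:.2f}")   # a probability between 0 and 1
```

Refitting the weights weekly against fresh interviews is what let the scores respond to events like the Palin nomination or the Lehman collapse: the features stayed put while the coefficients moved.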
This innovation was most valued in the field. There, an almost perfect cycle of microtargeting models directed volunteers to scripted conversations with specific voters at the door or over the phone. Each of those interactions produced data that streamed back into Obama’s servers to refine the models pointing volunteers toward the next door worth a knock. The efficiency and scale of that process put the Democrats well ahead when it came to profiling voters. John McCain’s campaign had, in most states, run its statistical model just once, assigning each voter to one of its microtargeting segments in the summer. McCain’s advisors were unable to recalculate the probability that those voters would support their candidate as the dynamics of the race changed. Obama’s scores, on the other hand, adjusted weekly, responding to new events like Sarah Palin’s vice-presidential nomination or the collapse of Lehman Brothers.
Within the campaign, however, the Obama data operations were understood to have shortcomings. As was typical in political information infrastructure, knowledge about people was stored separately from data about the campaign’s interactions with them, mostly because the databases built for those purposes had been developed by different consultants who had no interest in making their systems work together.
But the task force knew the next campaign wasn’t stuck with that situation. Obama would run his final race not as an insurgent against a party establishment, but as the establishment itself. For four years, the task force members knew, their team would control the Democratic Party’s apparatus. Their demands, not the offerings of consultants and vendors, would shape the marketplace. Their report recommended developing a “constituent relationship management system” that would allow staff across the campaign to look up individuals not just as voters or volunteers or donors or website users but as citizens in full. “We realized there was a problem with how our data and infrastructure interacted with the rest of the campaign, and we ought to be able to offer it to all parts of the campaign,” says Chris Wegrzyn, a database applications developer who served on the task force.
Wegrzyn became the DNC’s lead targeting developer and oversaw a series of costly acquisitions, all intended to free the party from the traditional dependence on outside vendors. The committee installed a Siemens Enterprise System phone-dialing unit that could put out 1.2 million calls a day to survey voters’ opinions. Later, party leaders signed off on a $280,000 license to use Vertica software from Hewlett-Packard that allowed their servers to access not only the party’s 180-million-person voter file but all the data about volunteers, donors, and those who had interacted with Obama online.
Many of those who went to Washington after the 2008 election in order to further the president’s political agenda returned to Chicago in the spring of 2011 to work on his reëlection. The chastening losses they had experienced in Washington separated them from those who had known only the ecstasies of 2008. “People who did ’08, but didn’t do ’10, and came back in ’11 or ’12—they had the hardest culture clash,” says Jeremy Bird, who became national field director on the reëlection campaign. But those who went to Washington and returned to Chicago developed a particular appreciation for Wagner’s methods of working with the electorate at an atomic level. It was a way of thinking that perfectly aligned with their simple theory of what it would take to win the president reëlection: get everyone who had voted for him in 2008 to do it again. At the same time, they knew they would need to succeed at registering and mobilizing new voters, especially in some of the fastest-growing demographic categories, to make up for any 2008 voters who did defect.
Obama’s campaign began the election year confident it knew the name of every one of the 69,456,897 Americans whose votes had put him in the White House. They may have cast those votes by secret ballot, but Obama’s analysts could look at the Democrats’ vote totals in each precinct and identify the people most likely to have backed him. Pundits talked in the abstract about reassembling Obama’s 2008 coalition. But within the campaign, the goal was literal. They would reassemble the coalition, one by one, through personal contacts.
But for all its reliance on data, the 2008 Obama campaign had remained insulated from the most important methodological innovation in 21st-century politics. In 1998, Yale professors Don Green and Alan Gerber conducted the first randomized controlled trial in modern political science, assigning New Haven voters to receive nonpartisan election reminders by mail, phone, or in-person visit from a canvasser and measuring which group saw the greatest increase in turnout. The subsequent wave of field experiments by Green, Gerber, and their followers focused on mobilization, testing competing modes of contact and get-out-the-vote language to see which were most successful.
The first Obama campaign used the findings of such tests to tweak call scripts and canvassing protocols, but it never fully embraced the experimental revolution itself. After Dan Wagner moved to the DNC, the party decided it would start conducting its own experiments. He hoped the committee could become “a driver of research for the Democratic Party.”
To that end, he hired the Analyst Institute, a Washington-based consortium founded under the AFL-CIO’s leadership in 2006 to coördinate field research projects across the electioneering left and distribute the findings among allies. Much of the experimental world’s research had focused on voter registration, because that was easy to measure. The breakthrough was that registration no longer had to be approached passively; organizers did not have to simply wait for the unenrolled to emerge from anonymity, sign a form, and, they hoped, vote. New techniques made it possible to intelligently profile nonvoters: commercial data warehouses sold lists of all voting-age adults, and comparing those lists with registration rolls revealed eligible candidates, each attached to a home address to which an application could be mailed. Applying microtargeting models identified which nonregistrants were most likely to be Democrats and which ones Republicans.
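The list-comparison step described above reduces to a set difference between a commercial list of voting-age adults and the registration rolls. A toy sketch, with invented names and addresses (a real system would match on fuzzier keys and then score each nonregistrant with a partisanship model):

```python
# Commercial data warehouse list of voting-age adults (invented records).
commercial_list = {
    ("Alice Smith", "12 Oak St"),
    ("Bob Jones", "34 Elm St"),
    ("Carol White", "56 Pine St"),
}

# State voter registration rolls (invented records).
registration_rolls = {
    ("Alice Smith", "12 Oak St"),
}

# Eligible adults who appear on the commercial list but not the rolls.
unregistered = commercial_list - registration_rolls
for name, address in sorted(unregistered):
    print(f"mail registration application to {name}, {address}")
```

Each hit comes attached to a home address, which is precisely what made active registration drives by mail possible.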
The Obama campaign embedded social scientists from the Analyst Institute among its staff. Party officials knew that adding new Democratic voters to the registration rolls was a crucial element in their strategy for 2012. But already the campaign had ambitions beyond merely modifying nonparticipating citizens’ behavior through registration and mobilization. It wanted to take on the most vexing problem in politics: changing voters’ minds.
The expansion of individual-level data had made possible the kind of testing that could help do that. Experimenters had typically calculated the average effect of their interventions across the entire population. But as campaigns developed deep portraits of the voters in their databases, it became possible to measure the attributes of the people who were actually moved by an experiment’s impact. A series of tests in 2006 by the women’s group Emily’s List had illustrated the potential of conducting controlled trials with microtargeting databases. When the group sent direct mail in favor of Democratic gubernatorial candidates, it barely budged those whose scores placed them in the middle of the partisan spectrum; it had a far greater impact upon those who had been profiled as soft (or nonideological) Republicans.
That test, and others that followed, demonstrated the limitations of traditional targeting. Such techniques rested on a series of long-standing assumptions—for instance, that middle-of-the-roaders were the most persuadable and that infrequent voters were the likeliest to be captured in a get-out-the-vote drive. But the experiments introduced new uncertainty. People who were identified as having a 50 percent likelihood of voting for a Democrat might in fact be torn between the two parties, or they might look like centrists only because no data attached to their records pushed a partisan prediction in one direction or another. “The scores in the middle are the people we know less about,” says Chris Wyant, a 2008 field organizer who became the campaign’s general election director in Ohio four years later. “The extent to which we were guessing about persuasion was not lost on any of us.”
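The subgroup analysis these experiments relied on can be sketched as computing the treatment-minus-control movement rate within each support-score band. The records below are invented, and the banding is a simplification of what a microtargeting database would supply:

```python
from collections import defaultdict

# Hypothetical experiment records: (support_score_band, group, moved_toward_candidate)
records = [
    ("0-20",  "treatment", 0), ("0-20",  "control", 0),
    ("20-40", "treatment", 1), ("20-40", "control", 0),
    ("20-40", "treatment", 1), ("20-40", "control", 0),
    ("40-60", "treatment", 0), ("40-60", "control", 0),
    ("40-60", "treatment", 1), ("40-60", "control", 1),
]

def lift_by_band(records):
    """Treatment-minus-control movement rate within each score band."""
    tallies = defaultdict(lambda: {"treatment": [0, 0], "control": [0, 0]})
    for band, group, moved in records:
        tallies[band][group][0] += moved   # count of voters who moved
        tallies[band][group][1] += 1       # count of voters contacted/observed
    lifts = {}
    for band, groups in tallies.items():
        t_moved, t_n = groups["treatment"]
        c_moved, c_n = groups["control"]
        lifts[band] = t_moved / t_n - c_moved / c_n
    return lifts

lifts = lift_by_band(records)
print(lifts)
```

In this toy data the lift concentrates in the 20-40 band rather than the middle, which mirrors the Emily's List finding: the persuadable voters were not where traditional targeting assumed.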
One way the campaign sought to identify the ripest targets was through a series of what the Analyst Institute called “experiment-informed programs,” or EIPs, designed to measure how effective different types of messages were at moving public opinion.
The traditional way of doing this had been to audition themes and language in focus groups and then test the winning material in polls to see which categories of voters responded positively to each approach. Any insights were distorted by the artificial settings and by the tiny samples of demographic subgroups in traditional polls. “You’re making significant resource decisions based on 160 people?” asks Mitch Stewart, director of the Democratic campaign group Organizing for America. “Isn’t that nuts? And people have been doing that for decades!”
An experimental program would use those steps to develop a range of prospective messages that could be subjected to empirical testing in the real world. Experimenters would randomly assign voters to receive varied sequences of direct mail—four pieces on the same policy theme, each making a slightly different case for Obama—and then use ongoing survey calls to isolate the attributes of those whose opinions changed as a result.
In March, the campaign used this technique to test various ways of promoting the administration’s health-care policies. One series of mailers described Obama’s regulatory reforms; another advised voters that they were now entitled to free regular check-ups and ought to schedule one. The experiment revealed how much voter response differed by age, especially among women. Older women thought more highly of the policies when they received reminders about preventive care; younger women liked them more when they were told about contraceptive coverage and new rules that prohibited insurance companies from charging women more.
When Paul Ryan was named to the Republican ticket in August, Obama’s advisors rushed out an EIP that compared different lines of attack about Medicare. The results were surprising. “The electorate [had seemed] very inelastic,” says Terry Walsh, who coördinated the campaign’s polling and paid-media spending. “In fact, when we did the Medicare EIPs, we got positive movement that was very heartening, because it was at a time when we were not seeing a lot of movement in the electorate.” But that movement came from quarters where a traditional campaign would never have gone hunting for minds it could change. The Obama team found that voters between 45 and 65 were more likely to change their views about the candidates after hearing Obama’s Medicare arguments than those over 65, who were currently eligible for the program.
A similar strategy of targeting an unexpected population emerged from a July EIP testing Obama’s messages aimed at women. The voters most responsive to the campaign’s arguments about equal-pay measures and women’s health, it found, were those whose likelihood of supporting the president was scored at merely 20 and 40 percent. Those scores suggested that they probably shared Republican attitudes; but here was one thing that could pull them to Obama. As a result, when Obama unveiled a direct-mail track addressing only women’s issues, it wasn’t to shore up interest among core parts of the Democratic coalition, but to reach over for conservatives who were at odds with their party on gender concerns. “The whole goal of the women’s track was to pick off votes for Romney,” says Walsh. “We were able to persuade people who fell low on candidate support scores if we gave them a specific message.”
At the same time, Obama’s campaign was pursuing a second, even more audacious adventure in persuasion: one-on-one interaction. Traditionally, campaigns have restricted their persuasion efforts to channels like mass media or direct mail, where they can control presentation, language, and targeting. Sending volunteers to persuade voters would mean forcing them to interact with opponents, or with voters who were undecided because they were alienated from politics on delicate issues like abortion. Campaigns have typically resisted relinquishing control of ground-level interactions with voters to risk such potentially combustible situations; they felt they didn’t know enough about their supporters or volunteers. “You can have a negative impact,” says Jeremy Bird, who served as national deputy director of Organizing for America. “You can hurt your candidate.”
In February, however, Obama volunteers attempted 500,000 conversations with the goal of winning new supporters. Voters who’d been randomly selected from a group identified as persuadable were polled after a phone conversation that began with a volunteer reading from a script. “We definitely find certain people moved more than other people,” says Bird. Analysts identified their attributes and made them the core of a persuasion model that predicted, on a scale of 0 to 10, the likelihood that a voter could be pulled in Obama’s direction after a single volunteer interaction. The experiment also taught Obama’s field department about its volunteers. Those in California, which had always had an exceptionally mature volunteer organization for a non-battleground state, turned out to be especially persuasive: voters called by Californians, no matter what state they were in themselves, were more likely to become Obama supporters.
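The article does not disclose the persuasion model’s internals. A minimal sketch of the general approach, fitting a logistic model on which contacted voters moved and rescaling the predicted probability to the 0-to-10 range, could look like this (the features and data here are entirely synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: three voter attributes per contacted voter,
# and whether a post-call poll showed movement toward the candidate.
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -0.8, 0.4])  # unknown in practice; used to simulate
moved = (rng.random(1000) < 1 / (1 + np.exp(-(X @ true_w)))).astype(float)

# Fit a logistic regression by plain full-batch gradient descent.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w += 0.1 * X.T @ (moved - p) / len(moved)

def persuasion_score(attrs):
    """Map the predicted movement probability onto the article's 0-10 scale."""
    p = 1 / (1 + np.exp(-(attrs @ w)))
    return round(10 * p, 1)
```

A voter whose attributes resemble past “movers” lands near 10; one resembling non-movers lands near 0.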
With these findings in hand, Obama’s strategists grew confident that they were no longer restricted to advertising as a channel for persuasion. They began sending trained volunteers to knock on doors or make phone calls with the objective of changing minds.
That dramatic shift in the culture of electioneering was felt on the streets, but it was possible only because of advances in analytics. Chris Wegrzyn, a database applications developer, developed a program code-named Airwolf that matched county and state lists of people who had requested mail ballots with the campaign’s list of e-mail addresses. Likely Obama supporters would get regular reminders from their local field organizers, asking them to return their ballots, and, once they had, a message thanking them and proposing other ways to be involved in the campaign. The local organizer would receive daily lists of the voters on his or her turf who had outstanding ballots so that the campaign could follow up with personal contact by phone or at the doorstep. “It is a fundamental way of tying together the online and offline worlds,” says Wagner.
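Airwolf’s matching rules are not public; a toy sketch of the core join, assuming records are keyed on a normalized name and address, shows the shape of the logic:

```python
def norm(name, addr):
    """Crude normalization key; the real system's matching rules are not public."""
    return (name.lower().strip(), addr.lower().strip())

def outstanding_ballots(ballot_requests, returned, email_list):
    """For each organizer's turf, list voters with a requested but unreturned
    mail ballot, attaching an e-mail address when one matches."""
    returned_keys = {norm(r["name"], r["addr"]) for r in returned}
    emails = {norm(e["name"], e["addr"]): e["email"] for e in email_list}
    by_turf = {}
    for req in ballot_requests:
        key = norm(req["name"], req["addr"])
        if key in returned_keys:
            continue  # ballot already came back: send a thank-you instead
        entry = dict(req, email=emails.get(key))
        by_turf.setdefault(req["turf"], []).append(entry)
    return by_turf
```

Run daily against fresh county lists, the output is exactly the per-organizer follow-up list the paragraph describes.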
Wagner, however, was turning his attention beyond the field. By June of 2011, he was chief analytics officer for the campaign and had begun making the rounds of the other units at headquarters, from fund-raising to communications, offering to help “solve their problems with data.” He imagined the analytics department—now a 54-person staff, housed in a windowless office known as the Cave—as an “in-house consultancy” with other parts of the campaign as its clients. “There’s a process of helping people learn about the tools so they can be a participant in the process,” he says. “We essentially built products for each of those various departments that were paired up with a massive database we had.”
As job notices seeking specialists in text analytics, computational advertising, and online experiments came out of the incumbent’s campaign, Mitt Romney’s advisors at the Republicans’ headquarters in Boston’s North End watched with a combination of awe and perplexity. Throughout the primaries, Romney had appeared to be the only Republican running a 21st-century campaign, methodically banking early votes in states like Florida and Ohio before his disorganized opponents could establish operations there.
But the Republican winner’s relative sophistication in the primaries belied a poverty of expertise compared with the Obama campaign. Since his first campaign for governor of Massachusetts, in 2002, Romney had relied upon TargetPoint Consulting, a Virginia firm that was then a pioneer in linking information from consumer data warehouses to voter registration records and using it to develop individual-level predictive models. It was TargetPoint’s CEO, Alexander Gage, who had coined the term “microtargeting” to describe the process, which he modeled on the corporate world’s approach to customer relationship management.
Such techniques had offered George W. Bush’s reëlection campaign a significant edge in targeting, but Republicans had done little to institutionalize that advantage in the years since. By 2006, Democrats had not only matched Republicans in adopting commercial marketing techniques; they had moved ahead by integrating methods developed in the social sciences.
Romney’s advisors knew that Obama was building innovative internal data analytics departments, but they didn’t feel a need to match those activities. “I don’t think we thought, relative to the marketplace, we could be the best at data in-house all the time,” Romney’s digital director, Zac Moffatt, said in July. “Our idea is to find the best firms to work with us.” As a result, Romney remained dependent on TargetPoint to develop voter segments, often just once, and then deliver them to the campaign’s databases. That was the structure Obama had abandoned after winning the nomination in 2008.
In May a TargetPoint vice president, Alex Lundry, took leave from his post at the firm to assemble a data science unit within Romney’s headquarters. To round out his team, Lundry brought in Tom Wood, a University of Chicago postdoctoral student in political science, and Brent McGoldrick, a veteran of Bush’s 2004 campaign who had left politics for the consulting firm Financial Dynamics (later FTI Consulting), where he helped financial-services, health-care, and energy companies communicate better. But Romney’s data science team was less than one-tenth the size of Obama’s analytics department. Without a large in-house staff to handle the massive national data sets that made it possible to test and track citizens, Romney’s data scientists never tried to deepen their understanding of individual behavior. Instead, they fixated on trying to unlock one big, persistent mystery, which Lundry framed this way: “How can we get a sense of whether this advertising is working?”
“You usually get GRPs and tracking polls,” he says, referring to the gross ratings points that are the basic unit of measuring television buys. “There’s a very large causal leap you have to make from one to the other.”
Lundry decided to focus on more manageable ways of measuring what he called the information flow. His team converted topics of political communication into discrete units they called “entities.” They initially classified 200 of them, including issues like the auto industry bailout, controversies like the one surrounding federal funding for the solar-power company Solyndra, and catchphrases like “the war on women.” When a new concept (such as Obama’s offhand remark, during a speech about our common dependence on infrastructure, that “you didn’t build that”) emerged as part of the election-year lexicon, the analysts added it to the list. They tracked each entity on the National Dialogue Monitor, TargetPoint’s system for measuring the frequency and tone with which certain topics are mentioned across all media. TargetPoint also integrated content collected from newspaper websites and closed-caption transcripts of broadcast programs. Lundry’s team aimed to examine how every entity fared over time in each of two categories: the informal sphere of social media, especially Twitter, and the journalistic product that campaigns call earned press coverage.
Ultimately, Lundry wanted to assess the impact that each type of public attention had on what mattered most to them: Romney’s position in the horse race. He turned to vector autoregression models, which equities traders use to isolate the influence of single variables on market movements. In this case, Lundry’s team looked for patterns in the relationship between the National Dialogue Monitor’s data and Romney’s numbers in Gallup’s daily tracking polls. By the end of July, they thought they had identified a three-step process they called “Wood’s Triangle.”
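A full vector autoregression models every series jointly; the single poll equation of a one-lag version, fit by least squares on synthetic data, illustrates the kind of estimate involved (the campaign’s actual specification is not public):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily series: x = media mentions of one "entity",
# y = tracking-poll number. Here y responds to yesterday's mentions.
T = 500
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * x[t - 1] + 0.01 * rng.normal()

# One-lag regression for y: today's poll on yesterday's poll and mentions.
Z = np.column_stack([y[:-1], x[:-1]])
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
# coef[1] estimates the influence of yesterday's mentions on today's poll.
```

The fitted coefficients recover the simulated dynamics, which is precisely the “isolate the influence of a single variable” use the paragraph attributes to the technique.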
Within three or four days of a new entity’s entry into the conversation, either through paid ads or through the news cycle, it was possible to make a well-informed hypothesis about whether the topic was likely to win media attention by tracking whether it generated Twitter chatter. That informal conversation among political-class elites typically led to traditional print or broadcast press coverage one to two days later, and that, in turn, might have an impact on the horse race. “We saw this process over and over again,” says Lundry.
They began to think of ads as a “shock to the system”—a way to either introduce a new topic or restore focus on an area in which elite interest had faded. If an entity didn’t gain its own energy—as when the Republicans charged over the summer that the White House had waived the work requirements in the federal welfare rules—Lundry would propose a “re-shock to the system” with another ad on the subject five to seven days later. After 12 to 14 days, Lundry found, an entity had moved through the system and exhausted its ability to move public opinion—so he would recommend to the campaign’s communications staff that they move on to something new.
Those insights offered campaign officials a theory of information flows, but they provided no guidance on how to allocate campaign resources in order to win the Electoral College. Assuming that Obama had superior ground-level data and analytics, Romney’s campaign tried to leverage its rival’s strategy to shape its own; if Democrats thought a state or media market was competitive, maybe that was evidence that Republicans should think so too. “We were necessarily reactive, because we were putting together the plane as it took off,” Lundry says. “They had an enormous head start on us.”
Romney’s political department began holding regular meetings to look at where in the country the Obama campaign was focusing resources like ad dollars and the president’s time. The goal was to try to divine the calculations behind those decisions. It was, in essence, the way Microsoft’s Bing approached Google: trying to reverse-engineer the market leader’s code by studying the visible output. “We watch where the president goes,” Dan Centinello, the Romney deputy political director who oversaw the meetings, said over the summer.
Obama’s media-buying strategy proved particularly hard to decipher. In early September, as part of his standard review, Lundry noticed that the week after the Democratic convention, Obama had aired 68 ads in Dothan, Alabama, a town near the Florida border. Dothan was one of the country’s smallest media markets, and Alabama one of the safest Republican states. Even though the area was known to savvy ad buyers as one of the places where a media market crosses state lines, Dothan TV stations reached only about 9,000 Florida voters, and around 7,000 of them had voted for John McCain in 2008. “This is a hard-core Republican media market,” Lundry says. “It’s incredibly tiny. But they were advertising there.”
Romney’s advisors might have formed a theory about the broader media environment, but whatever was sending Obama hunting for a small pocket of votes was beyond their measurement. “We could tell,” says McGoldrick, “that there was something in the algorithms that was telling them what to run.”
Carol Davidsen was working at Navic Networks, a Microsoft-owned company that wrote code for set-top cable boxes to create a record of a user’s DVR or tuner history, when she heeded Wagner’s call. One year before Election Day, she started work in the campaign’s technology department to serve as product manager for Narwhal. That was the code name, borrowed from a tusked whale, for an ambitious effort to match records from previously unconnected databases so that a user’s online interactions with the campaign could be synchronized. With Narwhal, e-mail blasts asking people to volunteer could take their past donation history into consideration, and the algorithms determining how much a supporter would be asked to contribute could be shaped by knowledge about his or her reaction to previous solicitations. This integration enriched a technique, common in website development, that Obama’s online fund-raising efforts had used to good effect in 2008: the A/B test, in which users are randomly directed to different versions of a page or message and their responses are compared. Now analysts could leverage personal data to identify the attributes of those who responded, and use that knowledge to refine subsequent appeals. “You can cite people’s other types of engagement,” says Amelia Showalter, Obama’s director of digital analytics. “We discovered that there were a lot of things that built goodwill, like signing the president’s birthday card or getting a free bumper sticker, that led them to become more engaged with the campaign in other ways.”
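The A/B mechanics described here are standard. A compact sketch of deterministic assignment plus a two-proportion z-test follows; the salt name and figures are illustrative, not the campaign’s:

```python
import hashlib
import math

def ab_assign(user_id, salt="email-test-7"):
    """Deterministically split users into variants A and B.
    (`salt` names the experiment; the value here is made up.)"""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return "A" if digest[0] < 128 else "B"

def two_prop_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two response rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

A |z| above roughly 1.96 means the gap between the two variants is unlikely to be chance at the 5 percent level; the campaign’s refinement was to then profile who responded, not just whether a variant won.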
If online communication had been the aspect of the 2008 campaign subjected to the most rigorous empirical examination—it’s easy to randomly assign e-mails in an A/B test and compare click-through rates or donation levels—mass-media strategy was among those that received the least. Television and radio ads had to be purchased by geographic zone, and the available data on who watches which channels or shows, collected by research firms like Nielsen and Scarborough, often included little more than viewer age and gender. That might be good enough to guide buys for Schick or Foot Locker, but it’s of limited value for advertisers looking to define audiences in political terms.
As campaign manager Jim Messina prepared to spend as much as half a billion dollars on mass media for Obama’s reëlection, he set out to reinvent the process for allocating resources across broadcast, cable, satellite, and online channels. “If you think about the universe of possible places for an advertiser, it’s almost infinite,” says Amy Gershkoff, who was hired as the campaign’s media-planning director on the strength of her successful negotiations, while at her firm Changing Targets in 2009, to link the information from cable systems to individual microtargeting profiles. “There are tens of millions of opportunities where a campaign can put its next dollar. You have all this great, robust voter data that doesn’t fit together with the media data. How you knit that together is a challenge.”
By the start of 2012, Wagner had deftly wrested command of media planning into his own department. As he expanded the scope of analytics, he defined his purview as “the study and practice of resource optimization for the purpose of improving programs and earning votes more efficiently.” That usually meant calculating, for any campaign activity, the number of votes gained through a given amount of contact at a given cost.
But when it came to buying media, such calculations had been simply impossible, because campaigns were unable to link what they knew about voters to what cable providers knew about their customers. Obama’s advisors decided that the data made available in the private sector had long led political advertisers to ask the wrong questions. Walsh says of the effort to reimagine the media-targeting process: “It was not to get a better understanding of what 35-plus women watch on TV. It was to find out how many of our persuadable voters were watching those dayparts.”
Davidsen, whose previous work had left her intimately familiar with the rich data sets held in set-top boxes, understood that a lot of that data was available in the form of tuner and DVR histories collected by cable providers and then aggregated by research firms. For privacy reasons, however, the information was not available at the individual level. “The hardest thing in media buying right now is the lack of information,” she says.
Davidsen began negotiating to have research firms repackage their data in a form that would permit the campaign to access the individual histories without violating the cable providers’ privacy standards. Under a $350,000 deal she worked out with one company, Rentrak, the campaign provided a list of persuadable voters and their addresses, derived from its microtargeting models, and the company looked for them in the cable providers’ billing files. When a record matched, Rentrak would issue it a unique household ID that identified viewing data from a single set-top box but masked any personally identifiable information.
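The masking step can be illustrated with a keyed hash: the research firm joins on addresses it holds but returns only an opaque household ID. This is a sketch of the general idea, not Rentrak’s actual scheme:

```python
import hashlib
import hmac

SECRET = b"held-by-the-research-firm"  # never shared with the campaign

def household_id(address):
    """Opaque, stable ID for one set-top box's household. The same address
    always yields the same ID, but without SECRET the ID cannot be
    reversed to recover the address."""
    return hmac.new(SECRET, address.lower().encode(), hashlib.sha256).hexdigest()[:16]

def match(campaign_addresses, billing_records):
    """Return viewing data keyed by masked ID, for matched households only."""
    targets = {a.lower() for a in campaign_addresses}
    return {
        household_id(rec["addr"]): rec["viewing"]
        for rec in billing_records
        if rec["addr"].lower() in targets
    }
```

The campaign gets per-household tuner histories it can aggregate, while no personally identifiable field ever crosses back over.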
The Obama campaign had created its own television ratings system, a kind of Nielsen in which the only viewers who mattered were those not yet fully committed to a presidential candidate. But Davidsen had to get the information into a practical form by early May, when Obama strategists planned to start running their anti-Romney ads. She oversaw the development of a software platform the Obama staff called the Optimizer, which broke the day into 96 quarter-hour segments and assessed which time slots across 60 channels offered the greatest number of persuadable targets per dollar. (By September, she had unlocked an even richer trove of data: a cable system in Toledo, Ohio, that tracked viewers’ tuner histories by the second.) “The revolution of media buying in this campaign,” says Walsh, “was to turn what was a broadcast medium into something that looks a lot more like a narrowcast medium.”
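The Optimizer’s internals are not public. Its core ranking idea, scoring each channel and quarter-hour slot by persuadable viewers per dollar and then buying greedily under a budget, can be sketched as follows (a real planner would also have to model audience overlap between slots, which greedy ranking ignores):

```python
def optimize_buys(slots, budget):
    """Greedy ad buy. `slots` items are tuples of
    (channel, quarter_hour_index, persuadable_viewers, cost_dollars);
    quarter_hour_index runs 0-95 across the broadcast day."""
    ranked = sorted(slots, key=lambda s: s[2] / s[3], reverse=True)
    buys, spent = [], 0.0
    for channel, qh, viewers, cost in ranked:
        if spent + cost <= budget:
            buys.append((channel, qh))
            spent += cost
    return buys, spent
```

With hypothetical numbers, a cheap fringe slot dense with persuadable viewers beats an expensive prime-time block, which is exactly the behavior that baffled Romney’s analysts.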
When the Obama campaign did use television as a mass medium, it was because the Optimizer had concluded it would be a more efficient way of reaching persuadable targets. Sometimes a national cable ad was a better bargain than a large number of local buys in the 66 media markets reaching battleground states. But the occasional national buy also had other benefits. It could boost fund-raising and motivate volunteers in states that weren’t essential to Obama’s Electoral College arithmetic. And, says Davidsen, “it helps hide some of the strategy of your buying.”
Even without that tactic, Obama’s buys perplexed the Romney analysts in Boston. They had invested in their own media-intelligence platform, called Centraforce. It used some of the same aggregated data sources that were feeding into the Optimizer, and at times both seemed to send the campaigns to the same unlikely ad blocks—for example, in reruns on TV Land. But there was a lot more to what Lundry called Obama’s “highly variable” media strategy. Many of the Democrats’ ads were placed in fringe markets, on marginal stations, and at odd times where few political candidates had ever seen value. Romney’s data scientists simply could not decode those decisions without the voter models or persuasion experiments that helped Obama pick out individual targets. “We were never able to figure out the level of advertising and what they were trying to do,” says McGoldrick. “It wasn’t worth reverse-engineering, because what are you going to do?”
Although the voter opinion tables that emerged from the Cave looked a lot like polls, the analysts who produced them were disinclined to call them polls. The campaign had plenty of those, generated by a public-opinion team of eight outside firms, and new arrivals at the Chicago headquarters were shocked by the variegated breadth of the research that arrived on their desks daily. “We believed in combining the qual, which we did more than any campaign ever, with the quant, which we [also] did more than any other campaign, to make sure all communication for every level of the campaign was informed by what they found,” says David Simas, the director of opinion research.
Simas considered himself the “air-traffic controller” for such research, which was guided by a series of voter diaries that Obama’s team commissioned as it prepared for the reëlection campaign. “We needed to do something almost divorced from politics and get to the way they’re seeing their lives,” he says. The lead pollster, Joel Benenson, had respondents write about their experiences. The entries frequently used the word “disappointment,” which helped explain attitudes toward Obama’s administration but also spoke to a broader dissatisfaction with economic conditions. “That became the foundation for our entire research program,” says Simas.
Obama’s advisors used those diaries to develop messages that contrasted Obama with Romney as a fighter for the middle class. Benenson’s national polls tested language to see which affected voters’ responses in survey experiments and direct questioning. A quartet of polling firms were assigned specific states and asked to figure out which national themes fit best with local concerns. Eventually, Obama’s media advisors created more than 500 ads and tested them before an online sample of viewers selected by focus-group director David Binder.
But the campaign had to play defense, too. When something potentially damaging popped up in the news, like Democratic consultant Hilary Rosen’s declaration that Ann Romney had “never worked a day in her life,” Simas checked in with the Community, a private online bulletin board populated by 100 undecided voters Binder had recruited. Simas would monitor Community conversations to see which news events penetrated voter consciousness. Sometimes he had Binder show its members controversial material—like a video clip of Obama’s “You didn’t build that” comment—and ask if it changed their views of the candidate. “For me, it was a very quick way to draw back and determine whether something was a problem or not a problem,” says Simas.
When Wagner started packaging his department’s research into something that campaign leadership could read like a poll, a pattern became apparent. Obama’s numbers in key battleground states were low in the analytic tables, but Romney’s were too. There were simply more undecided voters in such states—sometimes nearly twice as many as the traditional pollsters found. A basic methodological distinction explained this discrepancy: microtargeting models required interviewing a lot of unlikely voters to give shape to a profile of what a nonvoter looked like, while pollsters tracking the horse race wanted to screen more rigorously for those likely to cast a ballot. The rivalry between the two units trying to measure public opinion grew intense: the analytic polls were a threat to the pollsters’ primacy and, potentially, to their business model. “I spent a lot of time within the campaign explaining to people that the numbers we get from analytics and the numbers we get from external pollsters did not need strictly to be reconciled,” says Walsh. “They were different.”
The scope of the analytic research enabled it to pick up movements too small for traditional polls to perceive. As Simas reviewed Wagner’s analytic tables in mid-October, he was alarmed to see that what had been a Romney lead of one to two points in Green Bay, Wisconsin, had grown into an advantage of between six and nine. Green Bay was the only media market in the state to experience such a shift, and there was no obvious explanation. But it was hard to discount. Whereas a standard 800-person statewide poll might have reached 100 respondents in the Green Bay area, analytics was placing 5,000 calls in Wisconsin in each five-day cycle—and benefiting from tens of thousands of other field contacts—to produce microtargeting scores. Analytics was talking to as many people in the Green Bay media market as traditional pollsters were talking to across Wisconsin every week. “We could have the confidence level to say, ‘This isn’t noise,’” says Simas. So the campaign’s media buyers aired an ad attacking Romney on outsourcing and beseeched Messina to send former president Bill Clinton and Obama himself to rallies there. (In the end, Romney took the county 50.3 to 48.5 percent.)
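The “this isn’t noise” judgment is, at bottom, a sampling-error comparison. Under a simple normal approximation, a six-point shift sits outside the margin of error at 5,000 interviews but well inside it at 100:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an estimated proportion p from n interviews."""
    return z * math.sqrt(p * (1 - p) / n)

# A 100-respondent media-market subsample of a statewide poll versus the
# analytics operation's call volume in the same market (worst case p = 0.5).
moe_poll = margin_of_error(0.5, 100)        # about +/- 9.8 points
moe_analytics = margin_of_error(0.5, 5000)  # about +/- 1.4 points
```

At the poll’s sample size, a six-point swing in Green Bay is indistinguishable from noise; at the analytics department’s, it is a clear signal.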
For the most part, however, the analytic tables demonstrated how stable the electorate was, and how predictable individual voters could be. Polls from the media and academic institutions may have fluctuated by the hour, but drawing on hundreds of data points to judge whether someone was a likely voter proved more reliable than using a seven-question battery like Gallup’s to do the same. “When you see this Pogo stick happening with the public data—the electorate is just not that volatile,” says Mitch Stewart, director of the Democratic campaign group Organizing for America. The analytic data offered a source of calm.
Romney’s advisors were similarly sanguine, but they were losing. They, too, believed it possible to project the composition of the electorate, relying on a method similar to Gallup’s: pollster Neil Newhouse asked respondents how likely they were to cast a ballot. Those who answered that question with a seven or below on a 10-point scale were disregarded as not inclined to vote. But that ignored the experimental methods that made it possible to measure individual behavior and the impact that a campaign itself could have on a citizen’s motivation. As a result, the Republicans failed to account for voters that the Obama campaign could be mobilizing even if they looked to Election Day without enthusiasm or intensity.
On the last day of the race, Wagner and his analytics staff left the Cave and rode the elevator up one floor in the campaign’s Chicago skyscraper to join members of other departments in a boiler room established to help track votes as they came in. Already, for over a month, Obama’s analysts had been counting ballots from states that allowed citizens to vote early. Each day, the campaign overlaid the lists of early voters released by election authorities with its modeling scores to project how many votes they could claim as their own.
By Election Day, Wagner’s analytic tables turned into predictions. Before the polls opened in Ohio, authorities in Hamilton County, the state’s third-largest and home to Cincinnati, released the names of 103,508 voters who had cast early ballots over the previous month. Wagner sorted them by microtargeting projections and found that 58,379 had individual support scores over 50.1—that is, the campaign’s models predicted that they were more likely than not to have voted for Obama. That amounted to 56.4 percent of the county’s votes, or a raw lead of 13,249 votes over Romney. Early ballots were the first to be counted after Ohio’s polls closed, and Obama’s senior staff gathered around screens in the boiler room to see the initial tally. The numbers settled almost exactly where Wagner had said they would: Obama got 56.6 percent of the votes in Hamilton County. In Florida he was just as close to the mark: Obama’s margin there was off by only two-tenths of a percentage point. “After those first two numbers, we knew,” says Bird. “It was dead-on.”
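Using the counts in this paragraph, the projection step reduces to simple arithmetic (a sketch; the campaign’s actual tabulation pipeline is not public):

```python
# Counts reported for Hamilton County's released early ballots.
total_early = 103_508    # early voters named by county authorities
over_threshold = 58_379  # support score above 50.1: modeled Obama voters

projected_share = over_threshold / total_early  # projected Obama share
print(f"{projected_share:.1%}")
```

That ratio is the 56.4 percent figure the boiler room compared against the first returns.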
When Obama was reëlected, and by a far larger Electoral College margin than most outsiders had anticipated, his staff was exhilarated but not surprised. The next morning, Mitch Stewart sat in the boiler room, alone, monitoring the lagging votes as they came into Obama’s servers from election authorities in Florida, the last state to name a winner. The presidency was no longer at stake; the only thing that still hung in the balance was the accuracy of the analytics department’s predictions.
A few days after the election, as Florida authorities continued to count provisional ballots, a few staff members were directed, as four years before, to remain in Chicago. Their instructions were to produce another post-mortem report summing up the lessons of the past year and a half. The undertaking was called the Legacy Project, a grandiose title inspired by the idea that the innovations of Obama 2012 should be translated not only to the campaign of the next Democratic candidate for president but also to governance. Obama had succeeded in convincing some citizens that a modest adjustment to their behavior would affect, however marginally, the result of an election. Could he make them feel the same way about Congress?
Simas, who had served in the White House before joining the team, marveled at the intimacy of the campaign. Perhaps more than anyone else at headquarters, he appreciated the human aspect of politics. This had been his first presidential election, but before he became a political operative, Simas had been a politician himself, serving on the city council and school board in his hometown of Taunton, Massachusetts. He ran for office by knocking on doors and interacting individually with constituents (or those he hoped would become constituents), trying to track their moods and expectations.
In many respects, analytics had made it possible for the Obama campaign to recapture that style of politics. Though the old guard may have viewed such techniques as a disruptive force in campaigns, they enabled a presidential candidate to view the electorate the way local candidates do: as a collection of people who make up a more perfect union, each of them approachable on his or her terms, their changing levels of support and enthusiasm open to measurement and, thus, to respect. “What that gave us was the ability to run a national presidential campaign the way you’d do a local ward campaign,” Simas says. “You know the people on your block. People have relationships with one another, and you leverage them so you know the way they talk about issues, what they’re discussing at the coffee shop.”
Few events in American life other than a presidential election touch 126 million adults, or even a significant fraction of that many, on a single day. Certainly no corporation, no civic institution, and very few government agencies ever do. Obama did so by reducing every American to a series of numbers. Yet those numbers somehow captured the individuality of each voter; they were not demographic classifications. The scores measured the ability of people to change politics—and to be changed by it.
This story was updated on December 18 to correct the description of Neil Newhouse’s poll for the Republicans.