Information is more abundant than ever. Day after day, the flood of data grows at an exponential rate. Barely ten years ago, the main issue for policymakers and industry was to keep a firm grip on this daunting explosion. Today, the challenge is to take advantage of massive swaths of data in real time and to turn them into value.
The digital data deluge, discussed earlier in ParisTech Review in an interview with George Day and David Reibstein, affects more than the marketing sector. Its consequences reach all production organizations and, beyond competitiveness issues, national economies. Those capable of using this data will improve their decision-making processes and stay one step ahead in understanding public opinion and cultural trends, but also in understanding what is at stake in their own organization. Of course, they first need to give themselves the appropriate means to do so. That is the most difficult task faced by those who confront the “Big Data” challenge, which is both promise and peril. The challenge is technical, but also intellectual: the analytical capabilities of the computers that sift through these tremendous amounts of data answer only part of the problem.
The information era
A debate first arose in academic circles when Peter Lyman and Hal R. Varian, from the University of California, Berkeley, undertook to measure the amount of information produced and stored on media, and especially on digital media. A first report, How Much Information, was released in 2000 and updated in 2003. The report highlighted and confirmed what had long been assumed: not only does the amount of information double periodically, but the doubling cycles also grow shorter each time. Analysts put forward several reasons. The first is the proliferation of digital content, due to the creation and digitizing of documents and, especially, of images. The creation of electronic databases in many firms, which digitized all their physical data, also contributed to this trend. Last but not least, campaigns to digitize the world’s largest public libraries have been under way since the beginning of the 1990s.
Lyman and Varian also hinted at the explosion of online exchanges in Web 2.0, in which virtually anybody can become a publisher. The rise of social networking in recent years has accelerated this trend.
In this context, search engines such as Google played a leading role and also started to create information themselves: metadata (such as rankings, tags and indexes) is information in its own right. Gigantic databases have emerged, and their use has produced new information.
Metadata has grown out of raw data and now plays a larger part in the data flow. Raw data can be anything from your banking information to a picture you post on a sharing platform. Metadata would be your banking profile, information about yourself or about the people who have seen your picture or left a comment about it, as well as the digital record of the people who access your picture.
Except perhaps for a few Amazonian Indians living deep in the forest, all human beings leave some kind of trail or footprint in the digital environment. Inhabitants of first-world countries leave many, from blog posts to online transactions to geolocation data from their smartphones. Several players quickly recognized the potential value of these digital trails and have learned to use them. This is especially the case for Google and Facebook, which use the information to tailor the advertisements that appear on your screen. Other players, such as insurance companies, might provide personal data to their actuaries, depending on the country’s law.
Metadata is constantly renewed, and information can be seen as an ever-changing world of transitory flows. These flows feed storage systems and databases, but they could also be filtered in real time, if only they were treated as information in motion rather than as a dead bulk of data. That is exactly why Big Data now draws so much attention.
A computer revolution
Computers have been around for decades, of course, but until now they processed stable, closed and relatively small databases. What is new is the growing scale and the constant renewal of information, which lead to gigantic flows of data pouring in and out of these “open” databases, not to mention the growing sophistication of formats and the interwoven nature of databases. All these new features make traditional management tools completely obsolete.
Admittedly, storage costs tend to fall almost as fast as storage volumes increase. Besides, new tools such as supercomputers can handle these gigantic databases more easily.
But hardware issues aside, it is the software side of analysis tools that is challenged today. Traditional decision-making tools, for instance, are completely overwhelmed by the mass of data and its fragmentation. Big Data information is not wholly contained in databases: it lies, above all, outside them. The database is a virtual entity, so to speak.
The growth of the Internet and of mass services has proven a great challenge for database management systems. The idea of the relational database (where information is stored in tables and relations) was shattered by the transitory nature of streaming data. The Structured Query Language (SQL), which was designed to operate in closed systems (its main tasks are to define and query data), was pushed aside, as it no longer works efficiently in open environments.
The new database management systems (DBMS) were forced to give up some features to increase their computing power. New tools have emerged: column-oriented rather than row-oriented DBMS, and in-memory databases, which rely on main memory rather than hard drive storage. In-memory databases are faster than disk-optimized databases because their data access and internal optimization algorithms are simpler, which results in faster read performance.
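The difference between row-oriented and column-oriented storage can be sketched in a few lines. The example below is only an illustration held entirely in memory: the records, field names and helper functions are hypothetical, and a real DBMS adds compression, indexing and much more.

```python
# Hypothetical sales records; names and values are illustrative only.
rows = [
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "south", "amount": 75.5},
    {"id": 3, "region": "north", "amount": 42.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_row_store(records, field):
    """Row layout: every record is visited to read a single field."""
    return sum(record[field] for record in records)


def total_column_store(cols, field):
    """Column layout: only the one relevant column is read."""
    return sum(cols[field])


columns = to_columns(rows)
row_total = total_row_store(rows, "amount")
column_total = total_column_store(columns, "amount")
```

Both layouts give the same total. The point of the column layout is that an aggregate over one field touches a single contiguous list instead of every record, which is why it suits analytical queries over very large tables.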
A real breakthrough was achieved with the arrival of real-time tools, fueled by incoming data streams and cloud computing facilities. This is the case for StreamBase, or for Hadoop, a free software platform that processes data in parallel on large hardware clusters. The processing divides into two categories of operations: mapping, which consists in processing subsets of the data, and reducing, which consists in collecting and combining the results of the mappers.
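The map and reduce operations described above can be illustrated with a word count, the customary toy example. This is a single-process sketch in plain Python, not Hadoop's API: the function names and sample data are hypothetical, and the intermediate grouping step (often called the shuffle) is added here because reducers need all values for a key gathered together.

```python
from collections import defaultdict


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for each word in one data subset."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all the values emitted for one key."""
    return key, sum(values)


def map_reduce(chunks):
    """Run the mappers over every chunk, then reduce the grouped output."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, split into three subsets.
chunks = ["big data big promise", "big peril", "data flows"]
counts = map_reduce(chunks)
```

Because each mapper only sees its own chunk and each reducer only sees one key, both phases can be spread over many machines, which is what makes the approach suitable for large clusters.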
This cloud computing technique was adopted by the large social networks with the ambition of distributing data processing ever further: every active end-user represents an amount of data and also a virtual computing capability.
What use can be made of this data? Graphs count among the most innovative tools when it comes to mapping the interactions between the different players in a network. As explained by Henri Verdier, Google+, the new social networking service offered by Google, relies almost entirely on “circles”. Circles are managed by end-users, but they offer Google an unprecedented knowledge of social dynamics, whether global (trends, opinions, etc.) or personal (habits, hobbies, affinities). Graphs showing small group dynamics are rendered in real time and automatically used for targeted advertising. They can also be aggregated to dig deep into trends and mass opinion and to reveal new patterns in website traffic and commercial activity. Google not only has precise ideas about consumer habits, but also holds extremely relevant information about its business partners, a knowledge which proves a powerful asset when negotiating contracts.
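To give a rough idea of how user-managed circles can be turned into a graph, here is a minimal sketch. It does not describe Google's actual method: the users, circle names and helper function are all hypothetical, and the two measures computed (how often a person is placed in someone's circle, and which ties are reciprocal) are only simple examples of what such a graph can reveal.

```python
from collections import Counter

# Hypothetical data: each user maps circle names to the members placed in them.
circles = {
    "alice": {"friends": {"bob", "carol"}, "work": {"dave"}},
    "bob": {"friends": {"alice"}},
    "carol": {"family": {"alice", "bob"}},
    "dave": {},
}


def build_edges(user_circles):
    """Turn circles into directed edges (owner -> member of one of their circles)."""
    edges = set()
    for owner, owned in user_circles.items():
        for members in owned.values():
            for member in members:
                edges.add((owner, member))
    return edges


edges = build_edges(circles)

# Aggregate view: how many users placed each person in at least one circle.
in_degree = Counter(target for _, target in edges)

# Personal view: pairs of users who placed each other in a circle.
mutual = {frozenset(edge) for edge in edges if (edge[1], edge[0]) in edges}
```

In this toy dataset, alice and bob are each listed by two other users, and the reciprocal ties are alice with bob and alice with carol.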
Competitiveness at stake
The reasons for the giant firms’ interest in these new technologies are obvious. However, the issue also concerns more modest firms and public players. Big Data still represents a huge untapped mine. The challenge consists in being able to analyze it. Part of the solution is technical; the other part relies on the resources and capabilities needed to invent tools that sift through the data and distill relevant information.
A study by McKinsey tried to gauge the potential economic value of this new technological frontier. The results were very promising. According to McKinsey, all sectors of the economy and most public administrations should be able to take advantage of this new resource.
This seems rather obvious for sectors such as marketing and inventory management, for giant retailers for example. Increased capabilities in these domains would have a direct and favorable impact on their net profits. Major administrations handle information on tens of thousands of citizens and could improve their management procedures by predicting trends, especially cost deviations, and by targeting anomalies more effectively (and thus preventing potential fraud). More generally, they would gain from a better understanding of customs and practices. McKinsey also points to productivity gains in the industrial world.
All this presupposes skills and, therefore, a particular effort on training, within the firms concerned but also in the academic world. Building up this pool of skills is a long and difficult road, on which much of tomorrow’s competitiveness is at stake.
Towards a new science?
Beyond the technological impact, Big Data is deeply changing the way scientists work. As explained by Jannis Kallinikos, professor of information systems at the London School of Economics, “the development of knowledge (and more generally, the making of sense) is driven by permutations and commutations performed on huge masses of data. This is a well-known tendency in the field of social sciences, but it extends now to all domains.”
The conditions in which data is captured and aggregated far exceed the capacities of even the best human experts in terms of memory and concentration. Commenting on an article published by Wired magazine, Jannis Kallinikos takes the example of a researcher at the University of California who is trying to understand bone aging processes. To do so, the scientist scans super-high-resolution x-ray pictures and combines those images into a three-dimensional structure. He then aggregates the results. As pointed out by Jannis Kallinikos, the researcher’s aim is not to provide evidence to the experts. The medical knowledge that will eventually emerge from this mass of information will be the result of correlations between terabytes of data obtained through the aggregation of x-ray scans. We no longer witness the confrontation of theory and reality but an entirely new process: the pattern, if there is one, emerges from bottom-up processes of handling data through statistics.
Chris Anderson, the well-known guru of the Internet era, has predicted the end of theory and of science as we have known it: a deductive conceptual construction based on empirical facts. Knowledge will eventually be derived inductively from correlations computed on great datasets. That is a debatable position, for sure, and it will certainly give food for thought and fuel debates.
- “How Much Information” (Peter Lyman and Hal R. Varian, 2000, updated 2003)
- “Big Data: The Next Frontier for Innovation, Competition, and Productivity” (McKinsey Global Institute, 2011)
- “The End of Theory” (Chris Anderson, 2008)