The nature of computing has changed dramatically over the last decade, and more innovation is needed to weather the gathering data storm.
When subatomic particles smash together at the Large Hadron Collider in Switzerland, they create showers of new particles whose signatures are recorded by four detectors. The LHC captures 5 trillion bits of data — more information than all of the world’s libraries combined — every second. After the judicious application of filtering algorithms, more than 99 percent of those data are discarded, but the four experiments still produce a whopping 25 petabytes (25×10E15 bytes) of data per year that must be stored and analyzed. That is a scale far beyond the computing resources of any single facility, so the LHC scientists rely on a vast computing grid of 160 data centers around the world, a distributed network that is capable of transferring as much as 10 gigabytes per second at peak performance.
Google’s Alon Halevy believes that the real breakthroughs in big data analysis are likely to come from integration — specifically, integrating across very different data sets. “No matter how much you speed up the computers or the way you put computers together, the real issues are at the data level,” he said. For example, a raw data set could include thousands of different tables scattered around the Web, each one listing crime rates in New York, but each may use different terminology and column headers, known as “schema.” A header of “New York” can describe the state, the five boroughs of New York City, or just Manhattan. You must understand the relationship between the schemas before the data in all those tables can be integrated.
That, in turn, requires breakthroughs in techniques to analyze the semantics of natural language. It is one of the toughest problems in artificial intelligence — if your machine-learning algorithm aspires to perfect understanding of nearly every word. But what if your algorithm needs to understand only enough of the surrounding text to determine whether, for example, a table includes data on coffee production in various countries so that it can then integrate the table with other, similar tables into one common data set? According to Halevy, a researcher could first use a coarse-grained algorithm to parse the underlying semantics of the data as best it could and then adopt a crowd-sourcing approach like a Mechanical Turk to refine the model further through human input. “The humans are training the system without realizing it, and then the system can answer many more questions based on what it has learned,” he said.
Chris Mattmann, a senior computer scientist at NASA’s Jet Propulsion Laboratory and director at the Apache Software Foundation, faces just such a complicated scenario with a research project that seeks to integrate two different sources of climate information: remote-sensing observations of the Earth made by satellite instrumentation and computer-simulated climate model outputs. The Intergovernmental Panel on Climate Change would like to be able to compare the various climate models against the hard remote-sensing data to determine which models provide the best fit. But each of those sources stores data in different formats, and there are many different versions of those formats.
Many researchers emphasize the need to develop a broad spectrum of flexible tools that can deal with many different kinds of data. For example, many users are shifting from traditional highly structured relational databases, broadly known as SQL, which represent data in a conventional tabular format, to a more flexible format dubbed NoSQL. “It can be as structured or unstructured as you need it to be,” said Matt LeMay, a product and communications consultant and the former head of consumer products at URL shortening and bookmarking service Bitly, which uses both SQL and NoSQL formats for data storage, depending on the application.