Simon DeDeo, a research fellow in applied mathematics and complex systems at the Santa Fe Institute, had a problem. He was collaborating on a new project analyzing 300 years’ worth of data from the archives of London’s Old Bailey, the central criminal court of England and Wales. Granted, there was clean data in the usual straightforward Excel spreadsheet format, including such variables as indictment, verdict, and sentence for each case. But there were also full court transcripts, containing some 10 million words recorded during just under 200,000 trials.
“How the hell do you analyze that data?” DeDeo wondered. It wasn’t the size of the data set that was daunting; by big data standards, the size was quite manageable. It was the sheer complexity and lack of formal structure that posed a problem. This “big data” looked nothing like the kinds of traditional data sets the former physicist would have encountered earlier in his career, when the research paradigm involved forming a hypothesis, deciding precisely what one wished to measure, then building an apparatus to make that measurement as accurately as possible.
“In physics, you typically have one kind of data and you know the system really well,” said DeDeo. “Now we have this new multimodal data [gleaned] from biological systems and human social systems, and the data is gathered before we even have a hypothesis.” The data is there in all its messy, multi-dimensional glory, waiting to be queried, but how does one know which questions to ask when the scientific method has been turned on its head?
Via Ashish Umre