From recognizing speech to identifying unusual stars, new discoveries often begin with comparing data streams to find connections and spot outliers. But simply feeding raw data into a data-analysis algorithm is unlikely to produce meaningful results, say the authors of a new Cornell study. That's because most data-comparison algorithms today share one major weakness: somewhere, they rely on a human expert to specify which aspects of the data are relevant for comparison, and which aren't.
But these experts can't keep up with the growing volume and complexity of big data. So the Cornell computing researchers have come up with a new principle they call "data smashing" for estimating the similarities between streams of arbitrary data without human intervention, and even without access to the data sources.
Data smashing is based on a new way to compare data streams. The process involves two steps.
- The data streams are algorithmically "smashed" to "annihilate" each other's information.
- The process then measures what information remains after the collision. The more information remains, the less likely it is that the streams originated in the same source.
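The two steps above can be sketched in code. The sketch below is only a toy illustration of the smash-and-measure idea for streams of independent quantized symbols, not the authors' published annihilation circuits (which operate model-free on general stochastic processes): one stream is "inverted" so that its symbol probabilities are inversely proportional to its empirical frequencies, the collision keeps a symbol only where stream and anti-stream agree, and the residue is the deviation of what survives from flat noise. All names (`invert`, `smash`, `residue`) and the alphabet are assumptions of this sketch.

```python
import random
from collections import Counter

ALPHABET = [0, 1, 2]  # quantized symbols (assumed for this toy example)

def empirical_dist(stream):
    """Empirical symbol frequencies over the fixed alphabet."""
    counts = Counter(stream)
    n = len(stream)
    return {s: counts.get(s, 0) / n for s in ALPHABET}

def invert(stream, rng):
    """Generate an 'anti-stream' whose symbol probabilities are
    inversely proportional to the input's empirical frequencies.
    (The published method does this without first estimating a
    model; estimating frequencies is a simplification here.)"""
    p = empirical_dist(stream)
    w = {s: 1.0 / max(p[s], 1e-9) for s in ALPHABET}
    total = sum(w.values())
    return rng.choices(ALPHABET,
                       weights=[w[s] / total for s in ALPHABET],
                       k=len(stream))

def smash(a, b_inv):
    """'Collide' a stream with an anti-stream: keep a symbol only
    at positions where the two streams agree, discard the rest."""
    return [x for x, y in zip(a, b_inv) if x == y]

def residue(collided):
    """Information left after the collision, measured as the
    total-variation distance of the surviving symbols from flat
    noise. Near zero => the streams likely share a source."""
    p = empirical_dist(collided)
    u = 1.0 / len(ALPHABET)
    return 0.5 * sum(abs(p[s] - u) for s in ALPHABET)

rng = random.Random(0)
same = rng.choices(ALPHABET, weights=[0.6, 0.3, 0.1], k=20000)
also = rng.choices(ALPHABET, weights=[0.6, 0.3, 0.1], k=20000)
diff = rng.choices(ALPHABET, weights=[0.1, 0.3, 0.6], k=20000)

r_same = residue(smash(same, invert(also, rng)))  # small residue
r_diff = residue(smash(same, invert(diff, rng)))  # large residue
print(r_same, r_diff)
```

In this simplified setting the collision works because the kept symbols occur with probability proportional to the product of the two streams' frequencies: a stream multiplied against its own inverse flattens to uniform noise, while mismatched streams leave a detectably skewed residue.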
Data-smashing principles could open the door to understanding increasingly complex observations, especially when experts don't know what to look for, according to the researchers. The researchers (Hod Lipson, associate professor of mechanical engineering and of computing and information science, and Ishanu Chattopadhyay, a former postdoctoral associate with Lipson, now at the University of Chicago) demonstrated this idea with data from real-world problems, including detection of anomalous cardiac activity from heart recordings and classification of astronomical objects from raw photometry.
In all cases, and without access to the original domain knowledge, the general algorithms performed on par with specialized algorithms and heuristics tweaked by experts to work.