The torrents of data flowing out of cancer research and treatment are yielding fresh insight into the disease.
In 2013, geneticist Stephen Elledge answered a question that had puzzled cancer researchers for nearly 100 years. In 1914, German biologist Theodor Boveri suggested that the abnormal number of chromosomes — called aneuploidy — seen in cancers might drive the growth of tumors. For most of the next century, researchers made little progress on the matter. They knew that cancers often have extra or missing chromosomes or pieces of chromosomes, but they did not know whether this was important or simply a by-product of tumor growth — and they had no way of finding out.
Elledge found that where aneuploidy had resulted in missing tumor-suppressor genes, or extra copies of the oncogenes that promote cancer, tumors grew more aggressively (T. Davoli et al. Cell 155, 948–962; 2013). His insight — that aneuploidy is not merely an odd feature of tumors, but an engine of their growth — came from mining voluminous amounts of cellular data. And, says Elledge, it shows how the ability of computers to sift through ever-growing troves of information can help us to deepen our understanding of cancer and open the door to discoveries.
Modern cancer care has the potential to generate huge amounts of data. When a patient is diagnosed, the tumor's genome might be sequenced to see if it is likely to respond to a particular drug. The sequencing might be repeated as treatment progresses to detect changes. The patient might have his or her normal tissue sequenced as well, a practice that is likely to grow as costs come down. The doctor will record the patient's test results and medical history, including dietary and smoking habits, in an electronic health record. The patient may also have computed tomography (CT) and magnetic resonance imaging (MRI) scans to determine the stage of the disease. Multiply all that by the nearly 1.7 million people diagnosed with cancer in 2013 in the United States alone and it becomes clear that oncology is going to generate even more data than it does now. Computers can mine the data for patterns that may advance the understanding of cancer biology and suggest targets for therapy.
Elledge's discovery was the result of a computational method that he and his colleagues developed, called the Tumor Suppressor and Oncogene Explorer. They used it to mine large data sets, including the Cancer Genome Atlas, maintained by the US National Cancer Institute, based in Bethesda, Maryland, and the Catalogue of Somatic Mutations in Cancer, run by the Wellcome Trust Sanger Institute in Hinxton, UK. The databases contained roughly 1.2 million mutations from 8,207 tissue samples of more than 20 types of tumor.
Analyzing the genomes of 8,200 tumors is just a start. Researchers are “trying to figure out how we can bring together and analyze, over the next few years, a million genomes”, says Robert Grossman, who directs the Initiative in Data Intensive Science at the University of Chicago in Illinois. This is an immense undertaking; the combined cancer genome and normal genome from a single patient constitutes about 1 terabyte (10^12 bytes) of data, so a million genomes would generate an exabyte (10^18 bytes). Storing and analyzing this much data could cost US$100 million a year, Grossman says.
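The scale of that estimate is easy to verify with back-of-envelope arithmetic (a sketch; the ~1-terabyte-per-patient figure is the article's rough estimate, not a precise measurement):

```python
# Back-of-envelope check of the storage arithmetic above.
# Assumes the article's rough figure of ~1 terabyte (10**12 bytes)
# per patient: tumor genome plus matched normal genome.
bytes_per_patient = 10**12   # 1 terabyte
patients = 10**6             # the hoped-for million genomes

total_bytes = bytes_per_patient * patients
print(total_bytes == 10**18)  # True: one exabyte
```

A million terabyte-scale records really do add up to an exabyte, which is why storage and analysis costs land in the tens of millions of dollars per year.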
But it is the new technologies that are creating an information boom. “We can collect data faster than we can physically do anything with them,” says Manish Parashar, a computer scientist and head of the Rutgers Discovery Informatics Institute in Piscataway, New Jersey, who collaborates with Foran to find ways of handling the information. “There are some fundamental challenges being caused by our ability to capture so much data,” he says.
A major problem with data sets at the terabyte-and-beyond level is figuring out how to manipulate all the data. A single high-resolution medical image can take up tens of gigabytes, and a researcher might want the computer to compare tens of thousands of such images. Breaking down just one image in the Rutgers project into sets of pixels that the computer can identify takes about 15 minutes, and moving that much information from where it is stored to where it can be processed is difficult. “Already we have people walking around with disk drives because you can't effectively use the network,” Parashar says.
Informatics researchers are developing algorithms to split data into smaller packets for parallel processing on separate processors, and to compress files without omitting any relevant information. And they are relying on advances in computer science to speed up processing and communications in general.
Foran emphasizes that the understanding and treatment of cancer have undergone a dramatic shift as oncology has moved from one-size-fits-all attacks on tumors towards personalized medicine. But cancers are complex diseases controlled by many genes and other factors. “It's not as if you're going to solve cancer,” he says. But big data can provide new, better-targeted ways of grappling with the disease. “You're going to come up with probably a whole new set of blueprints for how to treat patients.”