During the recent Kalido webinar on data science, I was asked a number of questions, which have since been published as a Kalido Expert View. Here's my take on the first question: Q: In your opinion, what is a data scientist?
One year and seven months after beginning construction, Facebook has brought its first datacenter on foreign soil online. That soil is in Lulea, a town of 75,000 people on northern Sweden’s east coast, just miles south of the boundary separating the Arctic Circle from the somewhat-less-frigid land below it.
A single Narus ITA is capable of processing the full contents of 1.5 gigabytes of packet data per second. That's 5,400 gigabytes per hour, or 129.6 terabytes per day, for each 10-gigabit network tap. All that data gets shoveled off, using a proprietary messaging protocol, to a set of logic servers. These servers process and reassemble the contents of the packets, turning petabytes per day into gigabytes of tabular data about traffic (the metadata of the packets passing through the box) and captured application data.
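The per-tap figures above follow directly from the 1.5 gigabytes per second rate. A minimal sketch of that arithmetic, assuming decimal units (1 terabyte = 1,000 gigabytes); the function and constant names are illustrative and not taken from any Narus documentation:

```python
# Rate quoted in the text for a single tap.
GB_PER_SECOND = 1.5
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24
GB_PER_TB = 1000  # assumption: decimal (SI) units


def tap_volume(gb_per_second=GB_PER_SECOND):
    """Return (gigabytes per hour, terabytes per day) for one tap."""
    gb_per_hour = gb_per_second * SECONDS_PER_HOUR
    tb_per_day = gb_per_hour * HOURS_PER_DAY / GB_PER_TB
    return gb_per_hour, tb_per_day


gb_per_hour, tb_per_day = tap_volume()
print(f"{gb_per_hour:.0f} GB/hour, {tb_per_day:.1f} TB/day")
```

Passing a different rate to `tap_volume` shows how the daily volume scales with the capture speed.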
The NSA operates many of these network taps, both in the US and around the world. But that's a massive fire hose of data to try to digest in any meaningful way, and in the early days of packet capture the NSA faced a few major problems with that vast stream of data: storing it, indexing it, and analyzing it in volume required technology beyond what was generally available commercially. Considering that, according to Cisco, the total world Internet traffic for 2012 was 1.1 exabytes per day, it is physically impossible, let alone practical, for the NSA to capture and retain even a fraction of the world's Internet traffic on a daily basis.
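To put the two quoted figures side by side, here is a rough sketch comparing one tap's daily volume (129.6 terabytes) with Cisco's 1.1 exabytes per day. It assumes decimal units (1 exabyte = 1,000,000 terabytes), and `taps_equivalent` is a purely hypothetical scale measure, not a claim about how many taps exist:

```python
TB_PER_EB = 1_000_000  # assumption: decimal (SI) units

tap_tb_per_day = 129.6    # per-tap figure quoted in the text
world_eb_per_day = 1.1    # Cisco's 2012 figure quoted in the text

world_tb_per_day = world_eb_per_day * TB_PER_EB

# Share of world traffic that one tap at full rate would represent.
share = tap_tb_per_day / world_tb_per_day

# Hypothetical: how many full-rate taps would add up to world traffic.
taps_equivalent = world_tb_per_day / tap_tb_per_day

print(f"one tap: {share:.4%} of daily world traffic")
print(f"full-rate taps to match world traffic: {taps_equivalent:,.0f}")
```

Under these assumptions a single tap running flat out sees roughly one hundredth of one percent of daily world traffic, which gives a sense of the scale gap the paragraph describes.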