Image: ifindkarma/Flickr

The era of Big Data is not “coming soon.” It’s here today, and it has brought both painful changes and unprecedented opportunities.
Data scientists frame the challenge in terms of the three classic V’s:
• Volume – The costs of compute, storage, and connectivity resources are plunging, and new technologies like scanners, smartphones, ubiquitous video, and other data-collectors mean we are awash in volumes of data that dwarf what was available even five to 10 years ago. We capture every mouse click, phone call, text message, Web search, transaction, and more. As the volume of data grows, we can learn more – but only if we uncover the meaningful relationships and patterns.
• Variety – From endless streams of social networking text and geolocation data to structured wallet-share and demographic records, companies are capturing a more diverse set of data than ever. Bringing it together is no small task.
• Velocity – It’s a truism that the pace of business is inexorably accelerating. The volume and variety of Big Data alone would be daunting enough. But now, that data is coming faster than ever, and for some applications its shelf life is short. Speed kills your competitors if you tame these waves of data – or it kills your organization if they overwhelm you.
IBM has coined a worthy V – “veracity” – that addresses the inherent trustworthiness of data. Uncertainty about the consistency or completeness of data, and other ambiguities, can become major obstacles. As a result, basic principles such as data quality, data cleansing, master data management, and data governance remain critical disciplines when working with Big Data.
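What those disciplines look like in practice is easier to see with a toy example. Below is a minimal data-cleansing sketch in Python with pandas – one plausible reading of the point above, not a prescribed method. The column names (customer_id, email, revenue) and the rules are hypothetical, invented for illustration.

```python
import pandas as pd

# Hypothetical raw extract; all column names and values are illustrative.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, None],
    "email": ["a@x.com", "a@x.com", " B@X.COM", None, "c@x.com"],
    "revenue": [120.0, 120.0, -5.0, 300.0, 75.0],
})

clean = (
    raw
    .dropna(subset=["customer_id"])  # records without a key are unusable
    .drop_duplicates()               # remove exact duplicate rows
    # Normalize a free-text field so "a@x.com" and " A@X.COM" match.
    .assign(email=lambda df: df["email"].str.strip().str.lower())
)

# Governance usually means flagging implausible values for review,
# not silently discarding them.
suspect = clean[clean["revenue"] < 0]
print(clean)
print("Rows needing review:", len(suspect))
```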
It wasn’t very long ago that a terabyte was considered large. But now, that seems like a rounding error. We create 2.5 quintillion bytes of data every day. In fact, we’re creating so much data so quickly that 90 percent of the data in the world today has been created in the last two years alone. Clearly, traditional ways of managing data must change.
In response, IT organizations have rethought their infrastructures and made tremendous progress in designing sophisticated architectures to tackle these extraordinary challenges. Data scientists have harnessed technologies such as grid computing, cloud computing, and in-database processing to bring pragmatic feasibility to problems that were once inconceivable.
The Fourth V: Viability
But we need more than shiny plumbing to analyze massive data sets in real time; that infrastructure is merely a great start. What can we do with it? Where do we begin? The first place to look is the metadata. We want to carefully select the attributes and factors most likely to predict the outcomes that matter most to businesses. With Big Data, we’re not simply collecting a large number of records; we’re collecting multidimensional data that spans a broadening array of variables. The secret is uncovering the latent, hidden relationships among these variables.
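The article doesn’t prescribe a technique for finding those relationships, but here is a minimal sketch of the selection step, assuming Python and scikit-learn. It ranks a few invented customer attributes by mutual information with a churn outcome; unlike a simple correlation coefficient, mutual information can surface the nonlinear, hidden dependencies described above. Every feature name and threshold below is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical attributes; only the first two actually drive the outcome.
features = {
    "visits_per_week": rng.poisson(3, n),
    "avg_basket_size": rng.normal(50, 10, n),
    "days_since_signup": rng.integers(1, 730, n),
}
X = np.column_stack(list(features.values()))
churned = ((features["visits_per_week"] < 2)
           & (features["avg_basket_size"] < 45)).astype(int)

# Mutual information scores the dependence between each attribute and
# the outcome; higher means more predictive signal.
scores = mutual_info_classif(X, churned, random_state=0)
for name, score in sorted(zip(features, scores), key=lambda t: -t[1]):
    print(f"{name:>18}: {score:.3f}")
```

On real data, a low-scoring attribute (days_since_signup in this toy setup) would be a candidate to drop before modeling – which is exactly the viability test this section describes.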