“[I]t seems that for many people in the tech field, ‘data’ has become nearly synonymous with ‘Big Data.’ That kind of development usually indicates a fad. The reality is that, in practice, many data sets are ‘small,’ and in particular many relevant data sets are small.”

That was Philipp Janert, writing in 2010 – and, oh, what a fast five years it’s been. Janert goes on to say that classical statistics was built to perform inductive operations: start with a subset of a mess of information and draw conclusions about the mess. Big Data puts the whole mess in our midst, which is a mixed blessing. As Janert says: “Big Data makes it easy to forget the basics.”

But there’s no avoiding it now: the fad is not fading. The rush of outrage that greeted my recent summary post about AdWeek in NYC, titled “’Big Data Is a Big Distraction’: Notes from #AdWeekXII,” put me on notice. Never mind that I was quoting someone else (i.e., not myself) and was simply reporting the ad industry’s reaction against last year’s Big Data hysteria – a reaction against hype, not substance.

Let us admit to ourselves the obvious: we need to walk into the light, amigos. Big Data is a big reality.

So what do marketers need to know about it? What follows is a primer on the topic for the interested beginner. It’s based on a recent research report I published called “Understand Big Data Basics for Marketing” (Gartner subscribers can enjoy it here). It’s not – I mean not – for the white-coated, square-eyed crowd down there in the clean room.

So: Big Data Basics for Marketers.

What is Big Data?

Let’s keep it simple. Big Data is data that is so big it won’t fit on a single machine; it has to be spread over many machines. It can come from anywhere, so it might arrive in strange and exotic formats. And it’s coming fast. These ideas of size, speed and format are captured in the often-quoted concept of the “three V’s”: volume, velocity and variety.

Big Data Ecosystem

Big Data is not a single technology or a short list of vendors. Rather, it is a loose collection of evolving tools, techniques and talent. These fall into three key categories: (1) storage, (2) processing and (3) analytics. Storage aligns with the volume component of the “three V’s,” processing aligns with velocity, and variety spans both. Analytics refers to the methods used to gain insights from all this stored and/or processed information.

Storage

Enterprise data is traditionally stored in relational databases and managed by a database management system. Relational means that the database is structured in tables that can reference other tables in a carefully organized way. A fancy term for these structures is schema. Big Data storage differs from relational databases in that it often stores data that has not been mapped to a particular schema – rather, a schema can be imposed later (this is called schema-on-read). All this looseness means data is available more rapidly for use.
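To make that concrete, here is a minimal schema-on-read sketch in plain Python (the records and field names are invented for illustration). The raw records are kept exactly as they arrived – no tables, no upfront schema – and a “schema,” here just the two columns we happen to care about today, is imposed only when we read the data back:

```python
import json

# Raw events landed as-is, "data lake" style: nothing enforced a schema
# at write time, and the records don't all share the same shape.
# (These records and field names are made up for illustration.)
raw_events = [
    '{"user": "ann", "action": "click", "ts": "2015-06-01T10:00:00"}',
    '{"user": "bob", "action": "view", "page": "/home"}',
    '{"visitor_id": 42, "event": "purchase", "amount": 19.99}',
]

def read_with_schema(line):
    """Impose a schema at read time: map whatever fields a record
    happens to have onto the columns we currently care about."""
    rec = json.loads(line)
    return {
        "user": rec.get("user") or rec.get("visitor_id"),
        "action": rec.get("action") or rec.get("event"),
    }

rows = [read_with_schema(line) for line in raw_events]
print(rows)
# [{'user': 'ann', 'action': 'click'},
#  {'user': 'bob', 'action': 'view'},
#  {'user': 42, 'action': 'purchase'}]
```

A relational database would have forced a decision about those columns before a single record was written; schema-on-read lets you decide – and change your mind – after the fact.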
So what is Hadoop? No doubt you have heard of this thing, and we’ve described it at some length in the past. In fact, like many Big Data terms, Hadoop is an umbrella: it is applied to different technologies that have three characteristics in common:

- Distributed data – data is spread over a number of different hardware locations, called nodes, increasing storage space and potentially controlling cost
- Cluster computing – processing is handled by clusters of computers whose nodes are linked together by software, so that they act like a single system
- Massively parallel processing – data is processed simultaneously within the clusters, greatly increasing speed

Hadoop shares two other characteristics with many of the more energetic Big Data technologies. First, it was developed by engineers at digital media companies: in this case, Yahoo and Google (which built a critical precursor called MapReduce; there’s a toy sketch of the idea below). Why? Because things like search engines and social networks have to handle more data than any companies have ever had to handle before; so, just to stay in business, they have had to build things that didn’t exist before.

Second, like most Big Data things, Hadoop was released into the open source world, curated by the Apache Software Foundation. Why? There are many reasons, best explained by psychologists and sociologists (and lawyers), but ultimately open source technologies tend to get tested, improved, scrutinized and updated more rapidly, and in more intensely practical ways, than many closed technologies. It’s hard for any but the largest companies to employ enough engineers and rigor to hone a closed piece of code; and meanwhile, there is plenty of money to be made selling products built on open source modules, or wrapping the pieces together and making them look pretty.

I am willing to guess that you – yes, you – would be shocked if you really understood to what extent that whizzy piece of expensive cloud software you’re using actually (deep, deep in its soul) runs on absolutely free, not-developed-here, open source technology that you – yes, you – could probably bang into something almost as useful, if you only knew how.

Hadoop is only part of the Big Data story. It usually exists within an ecosystem of other components that fill in its blanks and provide the services we need: things like processing data on the move, speeding up the reading and writing of data, giving users ways to write queries to access the data, and so on.
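Since MapReduce comes up whenever Hadoop does, here is a toy sketch of the idea in plain Python, using the canonical example, a word count. It runs on a single machine, so it is purely illustrative – the whole point of Hadoop is that the map and reduce phases are farmed out in parallel across a cluster’s nodes – but the shape of the computation is the same:

```python
from collections import defaultdict

documents = [
    "big data is big",
    "data about data",
]

# Map phase: turn each input record into (key, value) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group the values by key. In a real framework this step
# also routes each key's pairs to the node that will reduce them.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Notice that the map and reduce steps never need to see more than one record, or one key’s values, at a time – which is exactly what makes it possible to split the work across many nodes.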
Just for fun, here’s a picture of a typical Big Data ecosystem:

[Image: a typical Big Data ecosystem, via massimo facchinetti]

Processing