TL - The Strategist
Follow
Find
732 views | +0 today
 
Rescooped by ThierryLabro from Analytics & Social media impact on Healthcare
onto TL - The Strategist
Scoop.it!

The value of any data is only as valuable as the information and insights we can extract from it.

The value of any data is only as valuable as the information and insights we can extract from it. | TL - The Strategist | Scoop.it

Editor’s note: Dr. Michael Wu is the Principal Scientist of Analytics at Lithium where he is currently applying data-driven methodologies to investigate and understand the complex dynamics of the social Web.

The value of any data is only as valuable as the information and insights we can extract from it. It is the information and insights that will help us make better decisions and give us a competitive edge. The promise of big data is that one could glean lots of information and gain many valuable insights. However, people often don’t realize that data and information are not the same. Even if you are able to extract information from your big data, not all of it will be insightful and valuable.

Data ≠ Information

Many people speak of data and information as if they are synonymous, but the difference between the two is quite subtle. Data is simply a record of events that took place. It is the raw data that describes what happened when, where, and how and who’s involved. Well, isn’t that informative? Yes, it is!

While data does give you information, the fallacy of big data is that more data doesn’t mean you will get “proportionately” more information. In fact, the more data you have, the less information you gain as a proportion of the data. That means the information you can extract from any big data asymptotically diminishes as your data volume increases. This does seem counterintuitive, but it is true. Let’s clarify this with a few examples.

Example 1: Data backups and copies. If you look inside your computer, you will find thousands of files you’ve created over the years. Whether they are pictures you took, emails you sent, or blogs you wrote, they contain a certain amount of information. These files are stored as data in your hard drive, which takes up a certain amount of space.

Now, if you are as paranoid as I am, you will probably back up of your hard drive regularly. Think about what happens when you back up your hard drive for the first time. In terms of data, you’ve just doubled the amount of data you have. If you had 50 GB of data in your hard drive, you would have 100 GB after the back up. But will you have twice the information after the back up? Certainly not! In fact, you gain no additional information from this operation, because the information in the backup is exactly the same as the information in the original drive.

Although our personal data is not big data by any means, this example illustrates the subtle difference between data and information, and they are definitely not the same animal. Now let’s look at another example involving bigger data.

Example 2: Airport surveillance video logs. First, video files are already pretty big. Second, closed-circuit monitoring systems (CCTV) in an airport are on 24/7, and high-definition (HD) devices will only increase the data volume further. Moreover, there are hundreds and probably thousands of security cameras all over the airport. So as you can see, the video logs created by all these surveillance cameras would probably qualify as big data.

Now, what happens when you double the number of camera installations? In terms of data volume, you will again get about 2x the data. But will you get 2x the information? Probably not. Many of the cameras are probably seeing the same thing, perhaps from a slightly different angle, sweeping different areas at slightly different times. In terms of information content, we almost never get 2x. Furthermore, as the number of cameras continues to increase, the chance of information overlap also increases. That is why as data volume increases, information will always have a diminishing return, because more and more of it will be redundant.

A simple inequality characterizes this property: information ≤ data. So information is not data, it’s only the non-redundant portions of the data. That is why when we copy data, we don’t gain any information, even when the data volume increases, because the copied data is redundant.

Example 3: Updates on multiple social channels. What about social big data, such as tweets, updates, and/or shares? If we tweet twice as often, Twitter is definitely getting 2x more data from us. But will Twitter get 2x the information? That depends on what we tweet. If there is absolutely zero redundancy among all our tweets, then Twitter will have 2x the information. But that typically never happens. Let’s think about why.

First of all, we retweet each other. Consequently, many tweets are redundant due to retweeting. Even if we exclude retweets, the chance that we are coincidentally tweeting about the same content is actually quite high, because there are so many tweeters out there. Although the precise wording of each tweet may not be exactly the same, the redundancies among all the tweets containing the same Web content (whether it’s a blog post, a cool video, or breaking news) is very high. Finally, our interest and taste in good content remains fairly consistent over time. Since our tweets tend to reflect our interests and tastes, even apparently unrelated tweets from the same user will have some redundancies, because the tweeter is tweeting similar content.

Clearly, even if we tweet twice as often, Twitter is not going to get 2x the information because there is so much redundancy among our tweets (likewise with updates and shares on other social channels). Furthermore, we often co-syndicate content across multiple social channels. Since this is merely duplicate content across multiple social channels, it doesn’t give us any extra information about the user.

Although data does give rise to information, data ≠ information. Information is only the non-redundant parts of the data. Since most data, regardless of how it is generated, has lots of built-in redundancy, the information we can extract from any data set is typically a tiny fraction of the data’s sheer volume.

I refer to this property as the data-information inequality: information ≤ data. And in nearly all realistic data sets (especially big data), the amount of information one can extract from the data is always much less than the data volume (see figure below): information << data. Since the naïve assumption that big data leads to a lot of information is not true, the value of big data is hugely exaggerated.

 

opinion

Comment15

inShare368

The Big Data Fallacy And Why We Need To Collect Even Bigger Data

Dr. Michael Wu

posted yesterday

15 Comments

Editor’s note: Dr. Michael Wu is the Principal Scientist of Analytics at Lithium where he is currently applying data-driven methodologies to investigate and understand the complex dynamics of the social Web.

The value of any data is only as valuable as the information and insights we can extract from it. It is the information and insights that will help us make better decisions and give us a competitive edge. The promise of big data is that one could glean lots of information and gain many valuable insights. However, people often don’t realize that data and information are not the same. Even if you are able to extract information from your big data, not all of it will be insightful and valuable.

Data ≠ Information

Many people speak of data and information as if they are synonymous, but the difference between the two is quite subtle. Data is simply a record of events that took place. It is the raw data that describes what happened when, where, and how and who’s involved. Well, isn’t that informative? Yes, it is!

While data does give you information, the fallacy of big data is that more data doesn’t mean you will get “proportionately” more information. In fact, the more data you have, the less information you gain as a proportion of the data. That means the information you can extract from any big data asymptotically diminishes as your data volume increases. This does seem counterintuitive, but it is true. Let’s clarify this with a few examples.

Example 1: Data backups and copies. If you look inside your computer, you will find thousands of files you’ve created over the years. Whether they are pictures you took, emails you sent, or blogs you wrote, they contain a certain amount of information. These files are stored as data in your hard drive, which takes up a certain amount of space.

Now, if you are as paranoid as I am, you will probably back up of your hard drive regularly. Think about what happens when you back up your hard drive for the first time. In terms of data, you’ve just doubled the amount of data you have. If you had 50 GB of data in your hard drive, you would have 100 GB after the back up. But will you have twice the information after the back up? Certainly not! In fact, you gain no additional information from this operation, because the information in the backup is exactly the same as the information in the original drive.

Although our personal data is not big data by any means, this example illustrates the subtle difference between data and information, and they are definitely not the same animal. Now let’s look at another example involving bigger data.

Example 2: Airport surveillance video logs. First, video files are already pretty big. Second, closed-circuit monitoring systems (CCTV) in an airport are on 24/7, and high-definition (HD) devices will only increase the data volume further. Moreover, there are hundreds and probably thousands of security cameras all over the airport. So as you can see, the video logs created by all these surveillance cameras would probably qualify as big data.

Now, what happens when you double the number of camera installations? In terms of data volume, you will again get about 2x the data. But will you get 2x the information? Probably not. Many of the cameras are probably seeing the same thing, perhaps from a slightly different angle, sweeping different areas at slightly different times. In terms of information content, we almost never get 2x. Furthermore, as the number of cameras continues to increase, the chance of information overlap also increases. That is why as data volume increases, information will always have a diminishing return, because more and more of it will be redundant.

A simple inequality characterizes this property: information ≤ data. So information is not data, it’s only the non-redundant portions of the data. That is why when we copy data, we don’t gain any information, even when the data volume increases, because the copied data is redundant.

Example 3: Updates on multiple social channels. What about social big data, such as tweets, updates, and/or shares? If we tweet twice as often, Twitter is definitely getting 2x more data from us. But will Twitter get 2x the information? That depends on what we tweet. If there is absolutely zero redundancy among all our tweets, then Twitter will have 2x the information. But that typically never happens. Let’s think about why.

First of all, we retweet each other. Consequently, many tweets are redundant due to retweeting. Even if we exclude retweets, the chance that we are coincidentally tweeting about the same content is actually quite high, because there are so many tweeters out there. Although the precise wording of each tweet may not be exactly the same, the redundancies among all the tweets containing the same Web content (whether it’s a blog post, a cool video, or breaking news) is very high. Finally, our interest and taste in good content remains fairly consistent over time. Since our tweets tend to reflect our interests and tastes, even apparently unrelated tweets from the same user will have some redundancies, because the tweeter is tweeting similar content.

Clearly, even if we tweet twice as often, Twitter is not going to get 2x the information because there is so much redundancy among our tweets (likewise with updates and shares on other social channels). Furthermore, we often co-syndicate content across multiple social channels. Since this is merely duplicate content across multiple social channels, it doesn’t give us any extra information about the user.

Although data does give rise to information, data ≠ information. Information is only the non-redundant parts of the data. Since most data, regardless of how it is generated, has lots of built-in redundancy, the information we can extract from any data set is typically a tiny fraction of the data’s sheer volume.

I refer to this property as the data-information inequality: information ≤ data. And in nearly all realistic data sets (especially big data), the amount of information one can extract from the data is always much less than the data volume (see figure below): information << data. Since the naïve assumption that big data leads to a lot of information is not true, the value of big data is hugely exaggerated.

Information ≠ Insights

Although the amount of information we can extract from big data may be overrated, the insights we can derive from big data may still be extremely valuable. So what is the relationship between information and insights? All insights are information, but not all information provides insights. There are three criteria for information to provide valuable insights:

1. Interpretability. Since big data contains so much unstructured data and different media as well as data types, there is actually a substantial amount of data and information that is not interpretable.

For example, consider this sequence of numbers: 123, 243, 187, 89, and 156. What do these numbers mean? It could be the number of likes on the past five articles you read on TechCrunch, or it could be the luminance level of five pixels in a black and white image. Without more information and meta-data, there is no way to interpret what these numbers mean. Since data and information that are not interpretable won’t offer you any insights, insights must lie within the interpretable parts of the extractable information.

2. Relevance. Information must be relevant to be useful and valuable. Relevant information is also known as the signal, so irrelevant information is often referred to as noise. But relevance is subjective. Information that is relevant to me may be completely irrelevant to you, and vice versa. This is what Edward Ng, a renowned mathematician, means when he says “One man’s signal is another man’s noise.”

Furthermore, relevance is not only subjective, it is also contextual. What is relevant to a person may change from one context to another. If I’m visiting NYC next week, then NYC traffic will suddenly become very relevant to me. But after I return to SF, the same information will become irrelevant again. Therefore, insights are an even smaller subset within the relevant information (i.e. signals), which is already a tiny subset of the interpretable information.

3. Novelty. Information must be novel to be insightful. That means it must provide some new knowledge that you don’t already have.

Clearly this criterion is also subjective. Because what I know is very different from what you know, what is insightful to me may be old information to you, and vice versa. Part of this subjectivity is inherited from the subjectivity of relevance. If some information is irrelevant to you, then most likely you won’t know about it, so when you learn it, it will be new. But you probably wouldn’t care because it’s irrelevant. Even if it is novel, it’s of no value to you.

However, once an insight is found, it’s no longer new and insightful the next time you have it. Therefore as we learn and accumulate knowledge from big data, insights become harder to discover. The valuable insight that everyone wants is a tiny and shrinking subset of the relevant information (i.e. the signal).

If the information fails any one of these criteria, then it wouldn’t be a valuable insight. So these three criteria will successively restrict insights to an even tinier subset of the extractable information from big data (see figure). So the big data fallacy can be summarized by a simple inequality: insight << information << data.

The value of big data is hugely exaggerated, because insight (the most valuable aspect of big data) is typically a few orders of magnitude less than the extractable information, which is again several orders of magnitude smaller than the sheer volume of your big data. I’m not saying big data is not valuable, it’s just overrated, because even with big data, the probability for finding valuable insights from it will still be abysmally tiny.

The big data fallacy may sound disappointing, but it is actually a strong argument for why we need even bigger data. Because the amount of valuable insights we can derive from big data is so very tiny, we need to collect even more data and use more powerful analytics to increase our chance of finding them. Although big data cannot guarantee the revelation of many valuable insights, increasing the data volume does increase the odds of finding them.


Via Chaturika Jayadewa
more...
No comment yet.
TL - The Strategist
Elaborate a social media strategy
Curated by ThierryLabro
Your new post is loading...
Your new post is loading...
Scooped by ThierryLabro
Scoop.it!

7 Hot Trends Defining Your Content Strategy

7 Hot Trends Defining Your Content Strategy | TL - The Strategist | Scoop.it
Content marketing is an essential component in buyer engagement, social media, lead generation and...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Les mots les plus "viraux" à utiliser dans les headlines

ThierryLabro's insight:

Retrouvez les mots qui sont les plus souvent dans les headlines qui deviennent virales

more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

News biz revenue has shrunk by a third since 2006

News biz revenue has shrunk by a third since 2006 | TL - The Strategist | Scoop.it
Pew The news business had about $30 billion more revenue in 2006 than it does today, Pew finds.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Charting the years-long decline of local news reporting

Charting the years-long decline of local news reporting | TL - The Strategist | Scoop.it
Charting the years-long decline of local news reporting Washington Post The report found that just 30 national and international news sites — Vice, Huffington Post, Politico, Buzzfeed, Mashable, among others — accounted for about 60 percent of all...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Data journalism: Making it real

Data journalism: Making it real | TL - The Strategist | Scoop.it
The following is an edited version of a chapter I contributed to  a new book Data Journalism: Mapping the Future, published  by Abramis academic publishing. The fact that I'm in it aside, I can heartily recommend it as a great mix of practical and contextual information. Go and buy one. Go on! During the 2008 summer…
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Why Patch Dried Up: Is Hyperlocal Over, Or Was It AOL?

Why Patch Dried Up: Is Hyperlocal Over, Or Was It AOL? | TL - The Strategist | Scoop.it
Patch is dying, but can hyperlocal journalism survive?
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Nouveau plaidoyer pour un journalisme de qualité [en allemand]

Rede von Veit Dengler an der Dreikönigstagung 2014
Twitter, Facebook und Co.: Ende der journalistischen Qualität oder Beginn einer pluralistischen Medienwelt?
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Google Webmaster Tools indique désormais précisément les taux de clics

Google Webmaster Tools indique désormais précisément les taux de clics | TL - The Strategist | Scoop.it
L'outil de Google permet désormais d'analyser plus précisément les requêtes, le trafic qu'elles représentent et leur potentiel.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Colorado paper finds ways to stay relevant with local readers

Colorado paper finds ways to stay relevant with local readers | TL - The Strategist | Scoop.it
Colorado paper finds ways to stay relevant with local readers Inland Press “When he came here, we had a different strategy for citizen journalism and neighborhood news that wasn't effective,” she said.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

On the restless currents of hyperlocal journalism, EveryBlock returns as Patch ... - Chicago Reader

On the restless currents of hyperlocal journalism, EveryBlock returns as Patch ... - Chicago Reader | TL - The Strategist | Scoop.it
Chicago Reader On the restless currents of hyperlocal journalism, EveryBlock returns as Patch ...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Understanding ‘how information is shared’ can be as valuable as ‘traditional reporting talent’

Adweek | Bloomberg News | Dave Weigel BuzzFeed CEO Jonah Peretti explains to Adweek what his publication looks for in employees: “People who really understand how information is shared on Twitter and Facebook and Instagram and other emerging platforms,...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Is viral content the next bubble?

The Wire | PandoDaily | The Wall Street Journal
The website Viral Nova emulates sites like BuzzFeed and Upworthy, and was in October “already nearly half the size of the sites that inspired it,” Alex Litel writes.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

3 Reasons Why Relying on Social-Media Marketing Is a Losing Strategy

3 Reasons Why Relying on Social-Media Marketing Is a Losing Strategy | TL - The Strategist | Scoop.it
Organic reach has plummeted. You need to think of social media as just one facet of a diverse marketing plan.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

3 takeaways from the ‘death of the homepage’ and the New York Times innovation report

3 takeaways from the ‘death of the homepage’ and the New York Times innovation report | TL - The Strategist | Scoop.it
Last week, Quartz and The Atlantic declared the homepage dead — killed by social media platforms, the “side doors” that deliver readers directly to stories.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

51 Startup Failure Post-Mortems

51 Startup Failure Post-Mortems | TL - The Strategist | Scoop.it
A compilation of startup failure post-mortems by founders and investors. No survivorship bias here.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Should social media be regulated during election campaigns?

“It would be very easy to launch a smear campaign over social media without any accountability at all,” he said.
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

The Times 'moving towards profit' since paywall launch

The Times 'moving towards profit' since paywall launch | TL - The Strategist | Scoop.it
Journalism.co.uk (blog) The Times 'moving towards profit' since paywall launch Journalism.co.uk (blog) Since introducing a digital subscription model in 2010, The Times and Sunday Times have amassed around 375,000 monthly subscribers and...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Curiosity Journalism: The Newest Trend In Content Marketing

Here's a look at how you chan employ them in your marketing strategy. Creating a Curiosity.
more...
No comment yet.
Rescooped by ThierryLabro from SciSocMed
Scoop.it!

How newsrooms can use social analytics to guide editorial strategy

How newsrooms can use social analytics to guide editorial strategy | TL - The Strategist | Scoop.it
Social media experts share pointers on how to find relevant data on how stories are consumed, shared and discovered, and what to do with the results

Via KateAG
more...
No comment yet.
Rescooped by ThierryLabro from Innovative Marketing and Crowdfunding
Scoop.it!

The No-Nonsense Guide To Make Money From Content Curation

The No-Nonsense Guide To Make Money From Content Curation | TL - The Strategist | Scoop.it
Content curation is all about collecting best of the stuff from the web. A content curation blog is one easy way to make money and add passive income.

Via Marty Koenig
more...
Lori Wilk's curator insight, December 26, 2013 10:37 PM

There's no easy paycheck in content curating or blogging for that matter. It takes time even if you have talent.

Marty Koenig's comment, December 26, 2013 11:53 PM
Thanks everyone for your comments, rescoops, tweets, and FB posts!!
Dave Phillipson's curator insight, December 29, 2013 4:50 PM

I typically don't recommend such information because most "info sales" people have nothing to offer and I absolutely HATE furthering the process.    There is someone that I would like to see this, however and I'm marking this for his attention.

Scooped by ThierryLabro
Scoop.it!

Your E-Reader Is Tracking You - Slate Magazine (blog)

Your E-Reader Is Tracking You - Slate Magazine (blog) | TL - The Strategist | Scoop.it
Slate Magazine (blog) Your E-Reader Is Tracking You Slate Magazine (blog) The biggest difference between Web journalism and print journalism is that on the Web both publishers and advertisers have some idea about what readers are actually doing,...
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Why you shouldn’t post when your fans are online

Why you shouldn’t post when your fans are online | TL - The Strategist | Scoop.it
We looked at the new Facebook Insights metric
more...
No comment yet.
Scooped by ThierryLabro
Scoop.it!

Colorado paper finds ways to stay relevant with local readers - Inland Press

Colorado paper finds ways to stay relevant with local readers - Inland Press | TL - The Strategist | Scoop.it
Colorado paper finds ways to stay relevant with local readers
Inland Press
“When he came here, we had a different strategy for citizen journalism and neighborhood news that wasn't effective,” she said.
more...
No comment yet.