True believers may be guilty of hype, but there is no denying that big data presents opportunities for businesses of every stripe. That potential, however, is vulnerable to pollution from data bias, and so it calls for preventive processes.
Data bias comes in many forms. It can stem from poorly defined business objectives, or from opting to gather data that are easy to collect rather than data that are most informative. Data scientists can also inherit data that have already been biased by incorrect assumptions made by domain experts. (As a footnote, the recent austerity-economics Excel scandal shows how a minute data error can have cascading and devastating effects.)
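The "easy to collect" trap can be made concrete with a toy sketch. The numbers and segment names below are invented for illustration only: imagine customer spend where one segment (online) is logged automatically while another (in-store) requires manual surveys. Sampling only the convenient segment biases the estimate.

```python
import statistics

# Hypothetical figures: two customer segments with different spend levels.
online = [30, 35, 40]       # easy to collect (logged automatically)
in_store = [80, 90, 100]    # harder to collect (manual surveys)
population = online + in_store

# Estimating from the convenient subset alone badly misses the true mean.
convenient_estimate = statistics.mean(online)   # 35.0
true_mean = statistics.mean(population)         # 62.5
```

The convenient estimate is off by nearly half, not because the data are wrong, but because the sample was chosen for ease of collection rather than coverage.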
Likewise, data scientists themselves are not immune to bias. Some run afoul of their own preconceived notions about the business domain: too much knowledge can lead one to filter out data that might actually be helpful. Scientists with deep experience in a particular data set may come to rely too heavily on pre-existing algorithms without re-examining their validity for each new use case.
Finally, data quantity is a common problem. Intelligent learning requires abundant data, and often the data available are not sufficient to draw accurate conclusions, a problem known as data sparsity. This may sound unbelievable given that data volume is doubling every two years according to an EMC study, but there is a difference between a dense data set populated by similar data points and the far more diverse sets of user data points we find in the real world. In these cases, the gaps in the data are filled by machine learning algorithms that may themselves be biased, based on assumptions the data scientist made when designing them. The trick is to find the right balance between unbiased data exploration and data exploitation.
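The exploration/exploitation balance the paragraph closes on is commonly illustrated with an epsilon-greedy strategy, which is one standard approach rather than anything prescribed by the original text. A minimal sketch, with illustrative names and parameters:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1, rng=random):
    """With probability epsilon, explore a random option;
    otherwise exploit the option with the best current estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))   # explore: pick any option
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit

# epsilon=0.0 always exploits the current best estimate;
# epsilon=1.0 always explores at random, regardless of the estimates.
best = epsilon_greedy([0.2, 0.9, 0.4], epsilon=0.0)  # index 1
```

Setting epsilon too low bakes the designer's initial assumptions into every decision; setting it too high wastes data on options already known to be poor. Tuning that dial is exactly the balance between unbiased exploration and exploitation.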