With public datasets becoming available, could the re-identification of individuals pose a real threat to the use of big data?
Re-identification is the process of matching anonymized records against public datasets to work out who the individuals behind them are. To re-identify individuals in large datasets, all you need is a laptop, an internet connection and a few public datasets, and you can start digging for personally identifiable information (PII) hidden in the data. It sounds simple. In practice it is difficult, but not impossible, as researchers from the Whitehead Institute recently showed: they were able to re-identify 50 individuals who had submitted personal DNA information to genomic studies such as the 1000 Genomes Project.
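The basic idea of linking an anonymized dataset to a public one can be sketched in a few lines of Python. This is only a toy illustration, not the method the Whitehead researchers used: the records, the field names (`birth_year`, `zip3`, `sex`) and the `link_records` helper are all made up for the example, and real linkage work involves far messier data.

```python
# Toy data, invented purely for illustration.
# "De-identified" records: no names, but some descriptive attributes remain.
deidentified = [
    {"record_id": "A1", "birth_year": 1975, "zip3": "021", "sex": "M"},
    {"record_id": "A2", "birth_year": 1982, "zip3": "100", "sex": "F"},
    {"record_id": "A3", "birth_year": 1990, "zip3": "606", "sex": "M"},
]

# A hypothetical public dataset that carries names and the same attributes.
public = [
    {"name": "John Example", "birth_year": 1975, "zip3": "021", "sex": "M"},
    {"name": "Jane Sample", "birth_year": 1982, "zip3": "100", "sex": "F"},
    {"name": "Joan Sample", "birth_year": 1982, "zip3": "100", "sex": "F"},
]

# Attributes shared by both datasets (an assumption of this sketch).
QUASI_IDENTIFIERS = ("birth_year", "zip3", "sex")


def link_records(deidentified, public, keys=QUASI_IDENTIFIERS):
    """Return {record_id: name} for records that match exactly one person."""
    # Index the public dataset by the combination of shared attributes.
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for record in deidentified:
        signature = tuple(record[k] for k in keys)
        candidates = index.get(signature, [])
        # Only a unique match counts as a re-identification.
        if len(candidates) == 1:
            matches[record["record_id"]] = candidates[0]
    return matches


reidentified = link_records(deidentified, public)
print(reidentified)  # {'A1': 'John Example'}
```

In this toy run only record `A1` is re-identified. `A2` matches two people and stays ambiguous, and `A3` has no counterpart in the public data, which hints at why the task is difficult but not impossible.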
Like a surname, the Y chromosome is passed on from father to son. Using this link, the researchers started analysing public databases that house Y-STR data (short tandem repeat markers on the Y chromosome) together with surnames. They then linked these public datasets to the dataset collected by the Center for the Study of Human Polymorphisms (CEPH) and identified 50 men and women in data that had been de-identified. With more and more public datasets becoming available, could the re-identification of individuals pose a real threat to the use of big data and open datasets?