A recent article suggests that open science may be irreconcilable with anonymous data, requiring a reconsideration of how we protect privacy in educational data.
The short version: many people have called for making science more open and transparent by posting data publicly. This allows researchers to check each other's work and to aggregate smaller datasets into larger ones. One saying I'm fond of: "the best use of your dataset is something that someone else will come up with." The problem is that, increasingly, all of this data is about us. In education, it's about our demographics, our learning behavior, and our performance. Across the social sciences, it's about our health, our beliefs, and our social connections. Sharing and merging data raises the risk that those data will be disclosed.
The article shares a case study of our efforts to strike a balance between anonymity and open science: we de-identified a dataset of learner data from HarvardX and released it to the public. To de-identify the data to a standard that we thought was reasonably resistant to re-identification efforts, we had to delete some records and blur some variables: if a learner's combination of identifying variables was too unique, we either deleted the record or scrubbed the data to make it look less unique. The result was suitable for release (in our view), but as we looked more closely at the released dataset, it wasn't suitable for science. We had scrubbed the data to the point where it was problematically dissimilar from the original dataset. If you do research using our data, you can't be sure whether your findings are legitimate or an artifact of de-identification.
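To make the trade-off concrete, here is a minimal sketch of the general approach described above, in the style of k-anonymity: generalize ("blur") quasi-identifying variables, then suppress any record whose combination of quasi-identifiers is still too rare. The field names, sample values, and threshold are hypothetical illustrations, not the actual HarvardX variables or procedure.

```python
from collections import Counter

K = 2  # every released quasi-identifier combination must appear at least K times

# Hypothetical learner records; "country" and "yob" stand in for quasi-identifiers
records = [
    {"country": "US", "yob": 1990, "grade": 0.80},
    {"country": "US", "yob": 1994, "grade": 0.55},
    {"country": "FR", "yob": 1972, "grade": 0.91},  # unique combination
]

def generalize(record):
    """Blur year of birth to a decade, e.g. 1990 -> '1990s'."""
    blurred = dict(record)
    blurred["yob"] = f"{record['yob'] // 10 * 10}s"
    return blurred

def k_anonymize(rows, quasi_ids, k):
    """Suppress any record whose quasi-identifier combination appears < k times."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return [r for r in rows if combos[tuple(r[q] for q in quasi_ids)] >= k]

blurred = [generalize(r) for r in records]              # blurring step
released = k_anonymize(blurred, ["country", "yob"], K)  # deletion step
```

The example also shows why the released data can drift from the original: the two US learners survive only because blurring made their birth years indistinguishable, while the French learner's record is deleted outright, so any analysis of the released set systematically loses the rarest, most distinctive learners.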