Daniel Paul Professor of the Practice of Government and Technology, HKS and FAS


Researchers are increasingly asked to share research data as part of publication and funding processes and to maximize the benefits of publicly funded research. The Safe Harbor provision of the U.S. Health Insurance Portability and Accountability Act (HIPAA) offers guidance to researchers by prescribing how to redact data for public sharing. For example, the provision requires removing explicit identifiers (such as name, address and other personally identifiable information), reporting dates in years, and removing some or all digits of a postal (ZIP) code. Is this sufficient? Can research participants still be re-identified in research data that adhere to the HIPAA Safe Harbor standard? In 2006, researchers collected air and dust samples and interviewed residents of 50 homes from Bolinas and Richmond (Atchison Village and Liberty Village), California, to analyze the residents’ exposure to pollutants. The study, known as the Northern California Household Exposure Study [1], led to publications that have been cited hundreds of times. We conducted experiments with separate “attacker” and “scorer” teams to see whether we could identify study participants from two versions of the data redacted beyond the HIPAA standard, one in which all dates were reported in ranges of 10 or 20 years and another in which a study participant’s birth year was reported exactly. The attackers were blinded to the names and addresses of the participants, and the scorers were blinded to the strategy.
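As a rough illustration of the three Safe Harbor redactions named above (dropping explicit identifiers, reporting dates as years, and truncating ZIP codes), the following minimal Python sketch may help. It is not the procedure used in the study. The field names, the `redact_record` function, and the `low_population_prefixes` parameter are all hypothetical, and the sketch covers only these three steps rather than the full set of 18 identifier categories that Safe Harbor lists.

```python
from datetime import date

# Hypothetical field names; a real dataset will have its own schema.
EXPLICIT_IDENTIFIERS = {"name", "address", "phone", "email"}


def redact_record(record, low_population_prefixes=frozenset()):
    """Return a copy of `record` with three Safe Harbor-style redactions.

    `low_population_prefixes` is an assumed, caller-supplied set of
    three-digit ZIP prefixes covering too few people to be released;
    those prefixes are replaced with "000".
    """
    redacted = {}
    for key, value in record.items():
        if key in EXPLICIT_IDENTIFIERS:
            continue  # drop explicit identifiers entirely
        if isinstance(value, date):
            redacted[key] = value.year  # keep only the year of any date
        elif key == "zip":
            prefix = str(value)[:3]  # keep only the first three digits
            redacted[key] = "000" if prefix in low_population_prefixes else prefix
        else:
            redacted[key] = value  # leave other fields unchanged
    return redacted


# Illustrative record with made-up values.
example = {
    "name": "Jane Doe",
    "address": "1 Main St",
    "zip": "94801",
    "birth_date": date(1960, 5, 17),
    "dust_sample_id": 12,
}
redacted_example = redact_record(example)
```

Fields such as `dust_sample_id` pass through untouched in this sketch. That is the situation the study probes: whether the data that remain after redaction can still be linked back to individuals.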


Sweeney, Latanya, Ji Su Yoo, Laura Perovich, Katherine E. Boronow, Phil Brown, and Julia Green Brody. "Re-identification Risks in HIPAA Safe Harbor Data: A study of data from one environmental health study." Technology Science (August 2017).