246 — Fast and simple comparison of semi-structured data, with emphasis on electronic health records

Robinson et al. (doi: 10.1101/293183)

Read on 23 April 2018
#electronic-health-records  #EHR  #fingerprinting  #anonymization  #data-processing  #medical-data  #FHIR  #HL7  #SyntheticMass  #ePHI 

There are many reasons why a researcher might want to compare datasets, or individual rows within a dataset, and just as many reasons why a researcher might be prohibited from doing so. One prevalent reason in the medical domain is HIPAA compliance, which prohibits the sharing of healthcare data (even with algorithms or other non-human entities) without first anonymizing it.

HIPAA compliance is a lot of fun.

If you can’t compare the dataset or row itself, what is the next best thing?

Robinson et al. have developed a method to fingerprint the data, providing a simple, irreversible (and FHIR-compliant) way of determining whether one object is the same as another.
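To make the idea of a fingerprint concrete, here is a minimal sketch that turns a nested, JSON-like record into a fixed-length vector using plain feature hashing. This is an illustration only, not necessarily the authors' algorithm: the `flatten` and `fingerprint` helpers, the choice of SHA-256, the vector length of 64, and the toy patient record are all assumptions made for the example.

```python
import hashlib
import math


def flatten(obj, path=""):
    """Yield a 'path=value' string for every leaf of a nested dict/list."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], f"{path}/{key}")
    elif isinstance(obj, list):
        for item in obj:
            # The list index is deliberately left out of the path.
            yield from flatten(item, f"{path}[]")
    else:
        yield f"{path}={obj}"


def fingerprint(record, n=64):
    """Hash each leaf into one of n buckets, then scale to unit length."""
    vec = [0.0] * n
    for token in flatten(record):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Hypothetical, hand-written record loosely shaped like a FHIR Patient.
patient = {
    "resourceType": "Patient",
    "gender": "female",
    "birthDate": "1970-01-01",
    "name": [{"family": "Example", "given": ["Alex"]}],
}

fp = fingerprint(patient)
print(len(fp))
```

Because many different leaves can land in the same bucket, the vector cannot simply be read back into the original record, which is the property that makes this kind of representation attractive for sensitive data.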

For example, using the FHIR-based SyntheticMass synthetic patient database, a patient compared to itself yields a similarity of 1.00, while the same patient compared to a very similar patient yields only 0.997.
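A comparison like this can be sketched with cosine similarity between two fingerprint vectors. The summary here does not say which similarity measure the paper uses, so cosine similarity is an assumption, and the two short vectors below are made-up stand-ins for real fingerprints.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for an all-zero vector.
    return dot / (norm_a * norm_b)


# Toy fingerprints: patient_b differs from patient_a in one component.
patient_a = [3.0, 1.0, 0.0, 2.0]
patient_b = [3.0, 1.0, 0.5, 2.0]

same = cosine_similarity(patient_a, patient_a)
similar = cosine_similarity(patient_a, patient_b)
print(f"self: {same:.3f}, similar patient: {similar:.3f}")
```

A record compared with itself scores 1 (up to floating-point rounding), and any difference between two records pulls the score below 1, with small differences producing values close to it.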

The fingerprints are vectors of length n, and by treating each fingerprint as an embedding in n-dimensional space, it is possible to segment populations in interesting ways. For example, patients undergoing chemotherapy have very different-looking fingerprints from those who are not.
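One simple way to picture this kind of segmentation is a nearest-centroid assignment in the embedding space. The sketch below is not taken from the paper: the three-dimensional vectors, the group labels, and the `centroid`, `euclidean`, and `nearest_group` helpers are all hypothetical and exist only to show the mechanics.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_group(fp, centroids):
    """Return the label whose centroid lies closest to the fingerprint."""
    return min(centroids, key=lambda label: euclidean(fp, centroids[label]))


# Made-up fingerprints for two groups of patients.
chemo = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9]]
no_chemo = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]

centroids = {"chemo": centroid(chemo), "no_chemo": centroid(no_chemo)}

new_patient = [0.85, 0.15, 0.7]
label = nearest_group(new_patient, centroids)
print(label)  # chemo
```

With real fingerprints the groups would rarely be this cleanly separated, so a clustering or classification method suited to the data would take the place of this two-centroid toy.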

This informs synthetic patient generation by providing a better similarity metric: for a synthetic population to closely match the original, its fingerprint distribution must be very similar or identical to that of the original population.