365Papers

246 — Fast and simple comparison of semi-structured data, with emphasis on electronic health records

Read on 23 April 2018
#electronic-health-records #EHR #fingerprinting #anonymization #data-processing #medical-data #FHIR #HL7 #SyntheticMass #ePHI

There are a variety of reasons why a researcher might want to compare datasets or individual rows between a dataset — and there are a variety of reasons why a researcher might be prohibited from doing so. One prevalant reason in the medical domain is HIPAA compliance, which prohibits the sharing of healthcare data (even sharing with algorithms or non-human entities) without first anonymizing it.

HIPAA compliance is a lot of fun.

If you can’t compare the dataset or row itself, what is the next best thing?

Robinson et al have developed a method to fingerprint the data, providing a simple, irreversible (and FHIR-compliant) way of determining if an object is the same as another.

For example, using the FHIR-based SyntheticMass synthetic patient database, the same patient, compared to itself, results in a value of 1.00. The same patient, compared to a very similar patient, only outputs a similarity of 0.997.

The fingerprints are n-length vector arrays, and by using the fingerprints as a vector embedding in nD space, it is possible to segment populations in interesting ways: For example, patients undergoing chemotherapy have a very different-looking fingerprint from those who are not.

This informs synthetic patient generation by improving the metrics for similarity-measurement; synthetic patients must have very similar or identical fingerprint distributions across a population in order to closely match the original population.

Comments?

@j6m8