246 — Fast and simple comparison of semi-structured data, with emphasis on electronic health records
Robinson et al (10.1101/293183)
Read on 23 April 2018
There are a variety of reasons why a researcher might want to compare datasets, or individual rows within a dataset, and there are a variety of reasons why a researcher might be prohibited from doing so. One prevalent reason in the medical domain is HIPAA compliance, which prohibits the sharing of healthcare data (even sharing with algorithms or non-human entities) without first anonymizing it.
HIPAA compliance is a lot of fun.
If you can’t compare the dataset or row itself, what is the next best thing?
Robinson et al have developed a method to fingerprint the data, providing a simple, irreversible (and FHIR-compliant) way of determining whether one object is the same as, or similar to, another.
For example, using the FHIR-based SyntheticMass synthetic patient database, a patient compared to itself results in a similarity of 1.00. The same patient compared to a very similar patient yields a slightly lower similarity of 0.997.
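To make the idea of comparing fingerprints concrete, here is a minimal Python sketch. It is not the authors' algorithm: the hashing scheme, the vector length of 64, the use of cosine similarity, and the names `fingerprint` and `cosine_similarity` are all assumptions chosen for illustration, and the two patient records are made up. It only shows how a nested, JSON-like record could be reduced to a fixed-length vector and scored against another record.

```python
import hashlib
import math


def _bucket(text, n):
    """Map a string to one of n buckets using a stable hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def fingerprint(obj, n=64, path=""):
    """Hypothetical fingerprint: count hashed (path, value) pairs in n buckets.

    This is an illustrative stand-in, not the method from the paper.
    """
    vec = [0.0] * n
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = fingerprint(value, n, f"{path}/{key}")
            vec = [a + b for a, b in zip(vec, child)]
    elif isinstance(obj, list):
        for item in obj:
            child = fingerprint(item, n, f"{path}[]")
            vec = [a + b for a, b in zip(vec, child)]
    else:
        # Leaf value: only a bucket count is kept, not the value itself.
        vec[_bucket(f"{path}={obj!r}", n)] += 1.0
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two made-up, loosely FHIR-shaped records that differ in one field.
patient_a = {
    "resourceType": "Patient",
    "gender": "female",
    "birthDate": "1970-01-01",
    "address": [{"city": "Boston", "state": "MA"}],
}
patient_b = dict(patient_a, address=[{"city": "Cambridge", "state": "MA"}])

same = cosine_similarity(fingerprint(patient_a), fingerprint(patient_a))
similar = cosine_similarity(fingerprint(patient_a), fingerprint(patient_b))
print(same, similar)
```

A record compared with itself should score 1.0 up to floating-point rounding, and a near-identical record should score close to but generally below that. The exact values depend entirely on the assumed hashing scheme, so they will not reproduce the 0.997 reported above.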
The fingerprints are vectors of length n, and by treating each fingerprint as a point in n-dimensional space, it is possible to segment populations in interesting ways: for example, patients undergoing chemotherapy have a very different-looking fingerprint from those who are not.
This informs synthetic patient generation by providing a better metric for measuring similarity: for a synthetic population to closely match the original, its fingerprint distribution must be very similar or identical to that of the original population.
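As a rough illustration of comparing two populations by their fingerprints, the sketch below measures the distance between the average fingerprint of each group. This is an assumption on my part, not the paper's procedure: matching means is a much weaker check than matching full distributions, and the function names and toy vectors are hypothetical.

```python
import math
from statistics import fmean


def centroid(fingerprints):
    """Per-dimension mean of a list of equal-length fingerprint vectors."""
    return [fmean(dim) for dim in zip(*fingerprints)]


def centroid_distance(pop_a, pop_b):
    """Euclidean distance between the centroids of two populations."""
    ca, cb = centroid(pop_a), centroid(pop_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ca, cb)))


# Toy 2-dimensional fingerprints, purely for illustration.
real_population = [[1.0, 0.0], [3.0, 0.0]]
synthetic_population = [[2.0, 0.0], [2.0, 4.0]]

print(centroid_distance(real_population, real_population))       # 0.0
print(centroid_distance(real_population, synthetic_population))  # 2.0
```

A distance near zero is necessary but not sufficient for the populations to match; a fuller check would also compare the spread and shape of the fingerprints in each dimension.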