62 — Data-driven approach for creating synthetic electronic medical records

Buczak et al. (10.1186/1472-6947-10-59)

Read on 22 October 2017
#PHI  #EHR  #medical-records  #statistics  #model 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2972239/pdf/1472-6947-10-59.pdf

This work aims to generate fully synthetic electronic patient records from data-driven analyses, so that the records closely emulate real patient data without exposing (and therefore compromising) real patients.

This project, part of a larger effort called EMERGE, is designed to simulate the spread of a tularemia outbreak in a clinical facility, using both fake patient records and fake outbreak characteristics. The patient simulation is similar to prior work such as the Realistic But Not Real (RBNR) project and Archimedes, which models a patient longitudinally over a disease progression timeline. Other projects have aimed to anonymize clinician notes instead of generating patients from scratch.

This paper describes synthetic patient data generation as a two-step process: first, generating positive-diagnosis patients (“victims”) separately from a “background” control population; and second, validating these records.

The background population was generated by perturbing real patient records using RBNR. For example, a 5-year-old male with an appointment on August 15 might be given a new name and birthday, and instead be represented by a 7-year-old male with an appointment on September 1 (±2 weeks).
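To make the perturbation idea concrete, here is a minimal sketch in Python. It is not RBNR's actual procedure, which these notes do not spell out: the `Record` fields, the `FAKE_NAMES` pool, and the shift ranges (14 days for the visit, roughly 2 years for the birth date) are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    # Hypothetical minimal patient record; real records carry far more fields.
    name: str
    sex: str
    birth_date: date
    visit_date: date


# Hypothetical pool of replacement names (an assumption, not from the paper).
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Jo Poe"]


def perturb(record, rng, max_visit_shift_days=14, max_birth_shift_days=730):
    """Return a copy of the record with a new name and jittered dates.

    Sex is kept as-is; the shift ranges are illustrative defaults.
    """
    visit_shift = rng.randint(-max_visit_shift_days, max_visit_shift_days)
    birth_shift = rng.randint(-max_birth_shift_days, max_birth_shift_days)
    new_visit = record.visit_date + timedelta(days=visit_shift)
    # Clamp so the perturbed patient is never born after the visit.
    new_birth = min(record.birth_date + timedelta(days=birth_shift), new_visit)
    return replace(
        record,
        name=rng.choice(FAKE_NAMES),
        birth_date=new_birth,
        visit_date=new_visit,
    )


rng = random.Random(0)  # seeded so the sketch is repeatable
original = Record("John Smith", "M", date(2005, 3, 1), date(2010, 8, 15))
synthetic = perturb(original, rng)
```

The perturbed record keeps the shape of the original visit (same sex, similar age, nearby date) while dropping the identifying name and exact dates, which is the property the background population relies on.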

Patients with similar care and clinical experiences (for instance, strep instead of tularemia) were used as proxies for the victims of the study. The outbreak can then be modeled by watching the spread of the disease across the clinical population. This was done by defining a distance metric $dist(I,P)$ between a patient’s care $P$ and the synthetically generated “inject” $I$, the outbreak simulation for that patient. By minimizing this metric, it’s possible to determine which patients in the population are most likely to be victims of the outbreak.
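The victim-selection step can be sketched as a nearest-neighbor search. The actual form of the metric is not recorded in these notes, so the sketch below swaps in a Jaccard distance over sets of care events as a stand-in. The event labels, the patient IDs, and the `closest_patient` helper are hypothetical.

```python
def dist(inject, care):
    """Jaccard distance between two sets of care events.

    This is a stand-in metric (an assumption), not the paper's definition.
    0.0 means identical sets; 1.0 means no events in common.
    """
    union = inject | care
    if not union:
        return 0.0
    return 1.0 - len(inject & care) / len(union)


def closest_patient(inject, population):
    """Return the ID of the patient whose care minimizes dist(inject, care)."""
    return min(population, key=lambda pid: dist(inject, population[pid]))


# Hypothetical inject: the care events the simulated outbreak case should show.
inject = {"fever", "chest_xray", "antibiotics"}

# Hypothetical background population, keyed by patient ID.
population = {
    "p1": {"fever", "antibiotics", "throat_culture"},
    "p2": {"fracture", "arm_xray"},
    "p3": {"fever", "chest_xray", "antibiotics", "cbc"},
}

victim = closest_patient(inject, population)
```

Here `p3` shares three of four events with the inject, so it has the smallest distance and is picked as the proxy victim. Any metric that scores similarity of care would slot into `dist` the same way.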

This means that this synthetic-patient-generation model and the corresponding RBNR/Archimedes projects are, in some sense, simple diagnostic models (where the diagnosis is already known and it’s a matter of trying to pick the right victim). This is an interesting inversion of the usual diagnostic process. I don’t think it has clinical relevance, but it does give a protocol for generating a statistically feasible outbreak/diagnosis scenario.