137 — An Unsupervised Homogenization Pipeline for Clustering Similar Patients using Electronic Health Record Data

Ulloa et al. (arXiv:1801.00065)

Read on 04 January 2018
#EHR  #medicine  #clustering  #clinical 

The ability to diagnose a patient and track their health is often a function of how well a physician can pair existing textbook knowledge with past experience to tailor a treatment plan to an individual. This paper explores the concept of patient "clustering" based on the constituent data in an EHR or other medical record system, which ideally allows a machine learning platform to contribute to this tailoring process. In short, this is an embedding problem: a "patient2vec" approach for the clinic.

The researchers identify several of the many challenges associated with parsing an EHR: data are incomplete and inconsistent; features mix discrete and continuous types; and measurements may be redundant, or may even conflict with one another. All of these increase the difficulty of designing an automated clustering algorithm.

Because working with actual patient data is challenging for a variety of reasons, the authors chose to mimic EHR-like datasets by generating random data with distributions and noise similar to those of real patient records. They then tested three imputation approaches for resolving missing datapoints: median imputation, k-nearest neighbors (KNN), and Multiple Imputation by Chained Equations (MICE), which iteratively fits a regression model for each variable conditioned on the others and fills in missing entries from those models.
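A minimal sketch of this comparison, not the authors' pipeline: it generates a small correlated "EHR-like" matrix, masks entries at random, and applies the three imputation strategies via scikit-learn (IterativeImputer is scikit-learn's MICE-style chained-equations imputer). The data dimensions, noise levels, and missingness rate are illustrative assumptions.

```python
# Sketch: compare median, KNN, and MICE-style imputation on toy EHR-like data.
# Assumptions: synthetic data stands in for the paper's simulated records.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)

# Simulate 200 "patients" x 5 correlated measurements.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Knock out ~20% of entries at random to mimic incomplete records.
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(max_iter=10, random_state=0),
}

for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    # Error only on the entries that were masked out.
    rmse = np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2))
    print(f"{name:>6}: RMSE on masked entries = {rmse:.3f}")
```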

They found that the choice of feature-reduction algorithm (a deep autoencoder, DAE, versus a denoising autoencoder, DnAE) and of normalization method (Z-score, MinMax, or whitening) dramatically changed the clustering results on the simulated data.
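To make the sensitivity concrete, here is a small sketch (assumptions: scikit-learn scalers and KMeans stand in for the paper's pipeline, and no autoencoder is included) showing how the normalization choice alone can shift cluster assignments on the same data.

```python
# Sketch: cluster the same toy data under Z-score, MinMax, and whitening,
# then measure how much the resulting labelings disagree.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Two groups whose features live on very different scales,
# loosely mimicking mixed clinical measurements.
a = rng.normal([0, 0], [1, 100], size=(150, 2))
b = rng.normal([3, 50], [1, 100], size=(150, 2))
X = np.vstack([a, b])

transforms = {
    "zscore": StandardScaler().fit_transform(X),
    "minmax": MinMaxScaler().fit_transform(X),
    "whiten": PCA(whiten=True).fit_transform(X),
}

labels = {
    name: KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xt)
    for name, Xt in transforms.items()
}

# Pairwise agreement between cluster assignments under each normalization;
# an adjusted Rand index well below 1 means the partitions differ.
names = list(labels)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        ari = adjusted_rand_score(labels[names[i]], labels[names[j]])
        print(f"ARI({names[i]}, {names[j]}) = {ari:.2f}")
```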

This matters because the dataset and the algorithms used can affect diagnoses based on the resulting clusters. It suggests that an arbitrarily human-selected method is the wrong way to build a normalized, sanitized dataset, and that a more robust, scientifically defensible approach to normalizing medical data is needed.