Background: The HIV care continuum in the US still falls short of the UNAIDS target. Learning predictors of HIV outcomes may facilitate interventions that improve patient care. We aimed to learn predictors of viral suppression, retention, linkage to care, and antiretroviral therapy (ART) adherence, and to discuss 1) the application of machine learning methods to predict HIV patient outcomes and 2) opportunities for HIV care improvement derived from HIV outcome predictors.
Methods: We selected HIV patient cohorts from electronic health record (EHR) data and administrative claims data in Optum's de-identified Integrated Claims-EHR dataset (2007-2016). We measured patient viral suppression, retention, linkage to care, and adherence, and constructed a separate model for each. Cohorts for the four outcome measures ranged from 2,062 to 22,649 patients (TABLE-1). Outcome measures were determined from medication refills, diagnostic codes, lab results, and patient encounters. We extracted demographic, medical (comorbidity, medications, provider specialty), and encounter features from patient data. We used natural language processing methods to learn clinical text features, which include phrases recorded in provider-written patient notes. To further explore data derived from clinical text, we built a word embedding that visualizes co-occurrence patterns in clinical text (FIGURE-1). We applied supervised machine learning, specifically penalized logistic regression, to predict outcome measures.
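To make the modeling step concrete, the following is a minimal sketch of penalized logistic regression over binary bag-of-words features taken from notes. It is an illustration under stated assumptions, not the study's pipeline: the abstract does not specify the penalty type, the feature construction, or any hyperparameters, so the L2 penalty, the tokenizer, the learning rate, and the four synthetic notes below are all invented for the example.

```python
import math
import re

def tokenize(note):
    """Lowercase a note and split it into tokens (keeps '/' so 'n/v' stays whole)."""
    return re.findall(r"[a-z0-9/]+", note.lower())

def build_vocab(notes):
    """Map every token seen in the training notes to a column index."""
    vocab = {}
    for note in notes:
        for tok in tokenize(note):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(note, vocab):
    """Binary bag-of-words vector: 1.0 where a vocabulary token occurs in the note."""
    x = [0.0] * len(vocab)
    for tok in tokenize(note):
        if tok in vocab:
            x[vocab[tok]] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_l2_logistic(X, y, penalty=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log-loss + (penalty / 2) * ||w||^2.

    The bias term b is left unpenalized. All defaults are assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * (gj / n + penalty * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic notes and labels, invented purely for illustration (not study data).
notes = [
    "refill on time labs reviewed",
    "labs reviewed refill on time follow up",
    "missed visit no refill",
    "no refill missed visit rescheduled",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(notes)
X = [vectorize(note, vocab) for note in notes]
w, b = fit_l2_logistic(X, labels)

# Rank tokens by absolute weight to see which ones drive the prediction.
ranked = sorted(vocab, key=lambda t: abs(w[vocab[t]]), reverse=True)
print(ranked[:3])
```

The penalty shrinks weights toward zero, which matters when the number of text features (over a thousand in TABLE-1) is large relative to the cohort size. An L1 penalty would additionally drive some weights to exactly zero; which variant the authors used is not stated.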
Results: Our best model predicts HIV suppression with a 0.84 AUC (TABLE-1). We compared model AUC values generated using bootstrapped samples from a held-out test set. Models perform best with the complete ensemble of demographic, medical, and encounter features. The most significant features for predicting outcomes are the clinical text features derived through natural language processing of the EHR data. Examples of highly predictive clinical terms for HIV suppression include “migraine”, “n/v”, “negative anxiety”, and “verruca”.
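The bootstrapped AUC comparison can be sketched as follows. This is one plausible reading of the procedure, not the authors' code: the number of resamples, the percentile interval, the seed, and the synthetic labels and scores are all assumptions made for the example. AUC is computed directly as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
import random

def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(labels, scores, n_boot=1000, seed=0):
    """Resample test cases with replacement; return one AUC per resample.

    Resamples containing a single class are skipped. Because skipping depends
    only on the labels, two models evaluated with the same seed see the same
    resamples, so their AUC lists are paired element by element.
    """
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            aucs.append(auc(ys, [scores[i] for i in idx]))
    return aucs

def percentile_interval(values, level=0.95):
    """Percentile interval of the bootstrap distribution."""
    vals = sorted(values)
    lo = int((1 - level) / 2 * len(vals))
    hi = int((1 + level) / 2 * len(vals)) - 1
    return vals[lo], vals[hi]

# Synthetic held-out labels and scores from two hypothetical models.
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
scores_a = [0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.5, 0.55, 0.1, 0.65, 0.45]
scores_b = [0.6, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.55, 0.45, 0.2, 0.6, 0.5]

boot_a = bootstrap_auc(y_test, scores_a)
boot_b = bootstrap_auc(y_test, scores_b)
diffs = [a - b for a, b in zip(boot_a, boot_b)]

print("model A interval:", percentile_interval(boot_a))
print("model B interval:", percentile_interval(boot_b))
print("A minus B interval:", percentile_interval(diffs))
```

An interval for the paired difference that excludes zero would suggest that one feature set outperforms the other on the held-out data, which is the kind of comparison the feature-set rows in TABLE-1 invite.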
Conclusions: This study achieved moderately accurate prediction of HIV outcomes by applying natural language processing and machine learning to EHR and claims data. We suggest that the information needed for reliable prediction of HIV outcomes may reside in unstructured patient notes and can be extracted with natural language processing and machine learning. We also demonstrate the validity of secondary use of these real-world data.

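Models | Features | AUC | F1 Score | Feature Count | Patient Count
Retained in Care | Demographic-only | 0.65 | 0.66 | 36 | 2,062
Retained in Care | Clinical text-only | 0.73 | 0.69 | 1,230 | 2,062
Retained in Care | All features | 0.75 | 0.70 | 1,572 | 2,062
Engaged in Care | Demographic-only | 0.63 | 0.63 | 36 | 22,649
Engaged in Care | Clinical text-only | 0.73 | 0.68 | 1,242 | 22,649
Engaged in Care | All features | 0.75 | 0.69 | 1,572 | 22,649
Viral Load Suppression | Demographic-only | 0.75 | 0.78 | 36 | 15,552
Viral Load Suppression | Clinical text-only | 0.83 | 0.83 | 814 | 15,552
Viral Load Suppression | All features | 0.83 | 0.83 | 1,126 | 15,552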
[TABLE-1 Model Results]

[FIGURE-1 Viral Load Suppression Clinical Text Embedding]
