Final Project Suggestions

Students will produce a final project using a health insurance claims database such as IBM MarketScan, Optum, Centers for Medicare & Medicaid Services claims data, or state All-Payer Claims Databases.

Below we describe two types of projects: those focused more on answering a specific clinical question and those focused more on developing novel ML methods. That said, we imagine that many projects may involve a bit of each, and that’s OK!

MIMIC Projects

Clinical Projects

MIMIC-III provides a wealth of data to tackle a variety of clinical tasks in the ICU. Here are a few examples of potential clinical questions:

  • Identify organ failure
    • In the ICU, we are concerned about organ failure in the heart, kidney, liver, and lung. What predictors could be useful for organ failure?
    • To be useful, a prediction must come early enough to act on. This requires sufficient data from earlier in the stay, such as lab tests (e.g. liver enzymes, creatinine, and BUN) and vital signs (e.g. blood pressure).
  • Effects of interventions
    • Which interventions should we do and at what time?
    • Potential interventions include vasopressors, mechanical ventilation, dialysis, and cardiac assist devices.
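
One way to frame the "early enough" requirement for organ failure prediction is to build features only from an observation window that ends some gap before the onset of the event. The sketch below illustrates that framing on synthetic data. The function names, the tuple layout, and the default gap and window lengths are assumptions made for illustration; they are not part of MIMIC-III or of any course code.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (hours since ICU admission, measurement name, value).
Measurement = Tuple[float, str, float]


def observation_window(
    measurements: Sequence[Measurement],
    onset_hour: float,
    gap_hours: float = 6.0,
    window_hours: float = 24.0,
) -> List[Measurement]:
    """Keep measurements in [onset - gap - window, onset - gap)."""
    end = onset_hour - gap_hours
    start = end - window_hours
    return [m for m in measurements if start <= m[0] < end]


def has_enough_data(
    window: Sequence[Measurement],
    required_names: Sequence[str],
    min_count: int = 2,
) -> bool:
    """Check that every required measurement appears at least min_count times."""
    counts = {name: 0 for name in required_names}
    for _, name, _ in window:
        if name in counts:
            counts[name] += 1
    return all(c >= min_count for c in counts.values())


# Synthetic example values, not taken from any real patient.
measurements = [
    (2.0, "creatinine", 0.9),
    (12.0, "creatinine", 1.1),
    (20.0, "creatinine", 1.4),
    (30.0, "bun", 22.0),
    (36.0, "creatinine", 2.3),
]

# Onset at hour 40 with a 6-hour gap and 24-hour window keeps hours [10, 34).
window = observation_window(measurements, onset_hour=40.0)
kept_hours = [m[0] for m in window]
creatinine_ok = has_enough_data(window, ["creatinine"])
both_ok = has_enough_data(window, ["creatinine", "bun"])
```

A larger gap gives clinicians more time to act but leaves less recent data for the model, so the gap length is itself a design choice worth studying.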

Below are some examples of research papers that used the MIMIC-III dataset:

Machine Learning Methodology

Similar to above, you can also explore how to extend known machine learning methodology given the challenges of clinical data:

  • Clinical natural language processing (NLP)
    • How can we better extract entities from the clinical notes such as diseases, symptoms, and treatments? We saw in problem set 3 how existing methods can be flawed. Once we do have these entities, how can we identify relations between the extracted clinical concepts?
    • Can we construct a timeline from the clinical notes? How would these extracted entities relate to the coded events in the patient chart?
  • Time series
    • How can we better predict a patient’s progression? Challenges could include missing data, unknown alignment of patients, and heterogeneity of conditions.
    • How can we interpret a patient’s progression? Clinicians may be interested in how a patient progresses through known concepts (e.g. ICD-9 codes) and also what the specific stages of progression might be.
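
As a starting point for the missing-data challenge in the time series item above, a common baseline is to carry the last observed value forward while also recording which entries were actually observed and how long ago the last observation was. The sketch below is one minimal version of that idea. The function name and the choice of default fill value are assumptions for illustration, and this is a baseline rather than a recommended method.

```python
from typing import List, Optional, Sequence, Tuple


def forward_fill_with_mask(
    values: Sequence[Optional[float]],
    default: float = 0.0,
) -> Tuple[List[float], List[int], List[int]]:
    """Forward-fill a regularly sampled series that uses None for missing values.

    Returns three lists of equal length:
      filled: the series with gaps filled by the last observed value
              (or `default` before the first observation),
      mask:   1 where the value was observed, 0 where it was filled in,
      delta:  number of steps since the last observation (counted from the
              start of the series if nothing has been observed yet).
    """
    filled: List[float] = []
    mask: List[int] = []
    delta: List[int] = []
    last: Optional[float] = None
    steps_since = 0
    for v in values:
        if v is None:
            steps_since += 1
            mask.append(0)
            filled.append(default if last is None else last)
        else:
            steps_since = 0
            last = v
            mask.append(1)
            filled.append(v)
        delta.append(steps_since)
    return filled, mask, delta


# Synthetic hourly lab series with gaps.
series = [None, 1.0, None, None, 2.0]
filled, mask, delta = forward_fill_with_mask(series)
```

Passing the mask and delta to a model alongside the filled values lets it distinguish a stable measurement from one that simply has not been repeated.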

Projects Using IBM MarketScan Data

Although the IBM MarketScan research database is not publicly available, students were provided access to it in the 2019 version of the course. These ideas can easily be extended to other research databases such as Optum, Centers for Medicare & Medicaid Services claims data, or state All-Payer Claims Databases. Because the MarketScan research database covers a patient’s full longitudinal clinical trajectory, it is a gold mine for research, and thousands of papers have already been published using it. Students should expect to have access to basic demographics, diagnoses, procedures, outpatient prescription orders, laboratory test orders, and enrollment information.

Clinical Projects

This type of MarketScan project will focus on studying a new clinical question using machine learning and causal inference methods studied in class.

  • Early detection of a medical condition, e.g. rheumatoid arthritis or Sjogren’s syndrome
  • Causal inference of the effect of an intervention
  • Subtyping or clustering to better understand a population, e.g. individuals rehospitalized within 30 days
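
To study a population such as individuals rehospitalized within 30 days, the cohort first has to be defined from admission records. The sketch below shows one simple definition: an admission is flagged if the same patient's next admission begins within 30 days of discharge. The record layout and function name are assumptions for illustration and do not reflect the MarketScan schema. A real analysis would also need to handle transfers, planned readmissions, and gaps in enrollment.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: (patient id, admit date, discharge date).
Admission = Tuple[str, date, date]


def flag_30_day_readmissions(
    admissions: Sequence[Admission],
    window_days: int = 30,
) -> Dict[Tuple[str, date], bool]:
    """Map (patient id, admit date) to True if the next stay starts within the window."""
    by_patient: Dict[str, List[Tuple[date, date]]] = defaultdict(list)
    for patient_id, admit, discharge in admissions:
        by_patient[patient_id].append((admit, discharge))

    flags: Dict[Tuple[str, date], bool] = {}
    for patient_id, stays in by_patient.items():
        stays.sort()  # chronological order by admit date
        for i, (admit, discharge) in enumerate(stays):
            readmitted = False
            if i + 1 < len(stays):
                gap_days = (stays[i + 1][0] - discharge).days
                readmitted = 0 <= gap_days <= window_days
            flags[(patient_id, admit)] = readmitted
    return flags


# Synthetic admissions for two made-up patients.
admissions = [
    ("A", date(2019, 1, 1), date(2019, 1, 5)),
    ("A", date(2019, 1, 20), date(2019, 1, 22)),
    ("A", date(2019, 6, 1), date(2019, 6, 3)),
    ("B", date(2019, 3, 1), date(2019, 3, 4)),
]
flags = flag_30_day_readmissions(admissions)
```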

Below are examples of papers that have explored some of the above questions. We encourage students to use these for inspiration, but to tackle new clinical questions that weren’t already answered in these (e.g., studying a different disease or a different outcome).

Machine Learning Methodology

The projects mentioned here focus on developing new machine learning methods:

  • Develop better deep learning algorithms for claims data.
  • Develop synthetic data generation methods that produce realistic data but have provable privacy guarantees (this is a good project for a group with a theoretical computer science bent).
    • Recent work has been interested in using deep learning for this, e.g. Choi et al., 2017 and Hyland et al., 2017. However, there is a natural trade-off between truly capturing the data density (not just being able to reproduce aggregate statistics from synthetic data) and not leaking private information from the training data. How does one formalize this mathematically?
  • Learn models of what a "normal" treatment policy is for a disease (note: it may be sequential).
    • How do these differ across populations, either geographic or based on patient characteristics? Can you automatically discover different treatment strategies for the same condition? Can you identify patients that are outliers or are receiving abnormal treatment (these may be due to fraud, improper treatment, medical errors, etc.)?
  • Develop algorithms for inferring causality (e.g., one condition causes another, or a medication causes a side-effect) from longitudinal claims data, e.g. by using ideas from Granger causality, the recently developed entropic causality, causal Bayesian networks, or Hawkes processes.
  • Develop algorithms to identify and explain non-stationarity, e.g. through changepoint detection and interpretable ML algorithms.
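
As a concrete baseline for the changepoint detection item above, the sketch below finds the single split point that best divides a series into two segments with different means, by minimizing the total within-segment squared error. This is a deliberately simple illustration, not a method proposed by the course. The function name, the minimum segment length, and the synthetic monthly counts are all assumptions.

```python
from typing import Sequence, Tuple


def best_mean_shift(series: Sequence[float], min_segment: int = 2) -> Tuple[int, float]:
    """Return (k, cost), where series[:k] and series[k:] are the best two-segment fit.

    The cost is the sum of squared deviations from each segment's own mean.
    """
    n = len(series)
    if n < 2 * min_segment:
        raise ValueError("series is too short for the requested minimum segment length")

    def sse(segment: Sequence[float]) -> float:
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_index = min_segment
    best_cost = float("inf")
    for k in range(min_segment, n - min_segment + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Synthetic monthly counts of a billing code with a level shift after month 5.
monthly_counts = [10, 11, 10, 11, 10, 20, 21, 20, 21, 20]
change_index, change_cost = best_mean_shift(monthly_counts)
```

This exhaustive search is quadratic in the series length and detects only one shift in the mean. Explaining why the shift occurred, for example a coding change versus a real change in practice, is the harder part of the project.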

Other Datasets