6.S897 | Spring 2019 | Graduate

Machine Learning for Healthcare


Projects will include a proposal, poster presentation, and final report. See the list of final project suggestions.

Collaboration: Students should be in groups of three registered students. Doing something related to your research is fine, but your class project should be distinct and you should be able to isolate your contributions to the project from those of any collaborators outside of the class.

Relationship to other classes: You must ask instructors for permission before submitting a project proposal if you wish to use the same project for our class and another class (it should also be stated clearly in the proposal itself). If it is one project for two classes you must:

  • Produce a project that is twice as large in depth and content as would have been required for either class individually
  • Obtain permission from the instructor of the other class
  • Moreover, all students in the project should be enrolled in both classes

Project Components

  • Project proposals (one per group)
  • Project poster presentations
  • Project report (one per group)


Project proposals should be at most three pages, with one proposal per group. Clearly state the following:

  • Problem you wish to tackle
  • Description of data you plan to use
  • Proposed approach and methods
  • Evaluation plan
  • Timeline
  • What each student in the group will do

We understand that much of this would be preliminary at this stage, but these details are important for us to ensure that you are on the right track.

Poster Guidelines

  • Posters should be 36” x 48”. 
  • Final project groups will be assigned to one of two hour-long shifts. When you are not presenting, you should be learning about your classmates’ final projects.

Write-Up Guidelines

  • You are expected to turn in a PDF of your write-up. We strongly encourage you to open-source your code and submit a link to it (e.g. a GitHub repository) as part of your submission. You should include a README file with instructions on how to reproduce your results, as well as all of the data pre-processing and analysis code. Please do not include any proprietary data in your submission.
  • Each team is expected to turn in a single project report of at most 2n+2 pages, where n is the number of students on the team (for example, 8 pages for a team of three). References do not count toward the page limit.
  • You are required to include a section that clearly outlines the contributions of each team member.
  • You may use any template you want. If you are looking for a clear template, we recommend the MLHC template (ZIP). This file contains 1 .jpeg, 1 .tex, 1 .pdf, 1 .sty, and 1 .cls file.
  • We encourage you to include the following sections in your write-up:
    1. Introduction: This section should include a brief explanation of your problem and its clinical importance. You should briefly explain your basic approach and your main conclusions. A figure is often helpful to motivate the work.
    2. Related work: This section should highlight previous work related to your problem and should put your work in a broader context. It may also include a comparison of why previous approaches could not be used to solve your particular problem.
    3. Methods: Here you should formally define your problem, and describe the method you implemented in detail. Include any simplifying assumptions that you make about your data or the general problem. You should enumerate any modelling choices that you had to make and justify your choices. A main figure illustrating the overall methodology often adds a lot.
    4. Data and experiment setup: Include details about your data, what variables you have access to, your cohort selection criteria, and your preprocessing choices. You might find it useful to include a table with population characteristics, or an example of the data available for a specific individual, both before (i.e. the original data) and after any pre-processing (i.e. feature construction), to make the discussion concrete. Describe your benchmarks.
    5. Results: Report the quantitative results of your analyses. You may present graphs or tables; what matters is that they summarize the relevant results of your analysis. Comment on these results: Are they statistically significant? Are there interesting trends? Do you do significantly better than your benchmarks? Is there a significant treatment effect? You may also present qualitative results, such as an in-depth analysis of what the approach would do for a few randomly chosen patients.
    6. Discussion: Highlight how your results relate to your original question formulation. Do they support your hypothesis? Do they reveal interesting insights about existing medical practices, global health outcomes, the nature of diseases, etc.? Discuss the limitations of your analyses and how they might motivate future research directions.

Students will produce a final project using a health insurance claims database such as IBM MarketScan, Optum, Centers for Medicare & Medicaid Services claims data, or state All-Payer Claims Databases.

Below we describe two types of projects: those focused more on answering a specific clinical question and those focused more on developing novel ML methods. That said, we imagine that many projects may involve a bit of each, and that’s OK!

MIMIC Projects

Clinical Projects

MIMIC-III provides a wealth of data to tackle a variety of clinical tasks in the ICU. Here are a few examples of potential clinical questions:

  • Identify organ failure
    • In the ICU, we are concerned about organ failure in the heart, kidney, liver, and lung. What predictors could be useful for organ failure?
    • The prediction must come early enough to be actionable, and the earlier data must contain sufficient measurements (e.g. liver enzymes, creatinine and BUN, blood pressure) to support it.
  • Effects of interventions
    • Which interventions should we do and at what time?
    • Potential interventions include vasopressors, mechanical ventilation, dialysis, and cardiac assist devices.
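One way to make the "early enough" requirement concrete is to build features only from an observation window that ends a fixed gap before the event of interest. The sketch below is a minimal illustration under stated assumptions: the function name `window_features`, the 6-hour gap, the 24-hour window, and the creatinine values are all hypothetical and are not taken from MIMIC-III or from the course materials.

```python
from statistics import mean


def window_features(measurements, onset_hour, gap_hours=6.0, window_hours=24.0):
    """Summarize measurements from an observation window that ends
    `gap_hours` before the event onset.

    measurements: list of (hour_since_admission, value) pairs.
    Returns None when the window contains no measurements.
    """
    window_end = onset_hour - gap_hours
    window_start = window_end - window_hours
    # Keep only values measured inside [window_start, window_end), in time order.
    values = [v for t, v in sorted(measurements) if window_start <= t < window_end]
    if not values:
        return None
    return {
        "n": len(values),
        "mean": mean(values),
        "last": values[-1],
        "max": max(values),
    }


# Hypothetical creatinine values (mg/dL) for one ICU stay; illustration only.
creatinine = [(2.0, 0.9), (14.0, 1.1), (26.0, 1.6), (38.0, 2.4)]
features = window_features(creatinine, onset_hour=40.0)
print(features)
```

Here the measurement at hour 38 is excluded because it falls inside the gap before onset. A `None` result signals that the patient does not have enough earlier data for this choice of window.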

Below are some examples of research papers that used the MIMIC-III dataset:

Machine Learning Methodology

As with the clinical questions above, you can also explore how to extend known machine learning methodology to address the challenges of clinical data:

  • Clinical natural language processing (NLP)
    • How can we better extract entities such as diseases, symptoms, and treatments from the clinical notes? We saw in problem set 3 how existing methods can be flawed. Once we do have these entities, how can we identify relations between the extracted clinical concepts?
    • Can we construct a timeline from the clinical notes? How would these extracted entities relate to the coded events in the patient chart?
  • Time series
    • How can we better predict a patient’s progression? Challenges could include missing data, unknown alignment of patients, and heterogeneity of conditions.
    • How can we interpret a patient’s progression? Clinicians may be interested in how a patient progresses through known concepts (e.g. ICD-9 codes) and also what the specific stages of progression might be.
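As a starting point for the missing-data challenge mentioned above, a common baseline is to forward-fill each variable while also recording which values were actually measured and how long ago. The sketch below assumes a regularly sampled series with `None` marking missing entries; the function name `forward_fill_with_mask` and the heart-rate values are hypothetical, and this is one simple baseline rather than a recommended method.

```python
def forward_fill_with_mask(series, default=0.0):
    """Forward-fill a regularly sampled series that uses None for missing values.

    Returns three equal-length lists:
      filled      - values with gaps replaced by the last observed value
                    (or `default` before the first observation)
      mask        - 1 where a value was measured, 0 where it was imputed
      steps_since - number of steps since the last measurement
                    (counted from the start of the series if none yet)
    """
    filled, mask, steps_since = [], [], []
    last_value, gap = default, 0
    for x in series:
        if x is not None:
            last_value, gap = x, 0
            mask.append(1)
        else:
            gap += 1
            mask.append(0)
        filled.append(last_value)
        steps_since.append(gap)
    return filled, mask, steps_since


# Hypothetical hourly heart-rate readings with gaps; illustration only.
heart_rate = [None, 82, None, None, 90]
filled, mask, steps_since = forward_fill_with_mask(heart_rate)
print(filled, mask, steps_since)
```

Passing the mask and the time since the last measurement to a model alongside the filled values lets it distinguish real observations from imputed ones, which matters when the pattern of missingness is itself informative.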

Projects Using IBM MarketScan Data

Although the data are not publicly available, students were provided access to the IBM MarketScan research database in the 2019 version of the course. These ideas can easily be extended to other research databases such as Optum, Centers for Medicare & Medicaid Services claims data, or state All-Payer Claims Databases. Because the MarketScan research database covers a patient’s full longitudinal clinical trajectory, it is a gold mine for research, and thousands of papers have already been published using it. Students should expect to have access to data including basic demographics, diagnoses, procedures, outpatient prescription orders, laboratory test orders, and enrollment information.

Clinical Projects

This type of MarketScan project will focus on studying a new clinical question using machine learning and causal inference methods studied in class.

  • Early detection of a medical condition, e.g. rheumatoid arthritis or Sjogren’s syndrome
  • Causal inference of the effect of an intervention
  • Subtyping or clustering to better understand a population, e.g. individuals rehospitalized within 30 days
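For cohort definitions such as the rehospitalization example above, the first step is usually to derive the outcome label from admission records. The sketch below flags stays followed by another admission within 30 days of discharge. It assumes one patient's stays are given as (admit date, discharge date) pairs; the function name and the dates are hypothetical, and a real claims analysis would also need to handle transfers, overlapping stays, and gaps in enrollment.

```python
from datetime import date


def flag_30_day_readmissions(stays, window_days=30):
    """Flag each stay that is followed by a new admission within
    `window_days` of its discharge.

    stays: list of (admit_date, discharge_date) pairs for one patient.
    Returns one boolean per stay, in chronological order.
    """
    ordered = sorted(stays)
    if not ordered:
        return []
    flags = []
    for current, following in zip(ordered, ordered[1:]):
        days_out = (following[0] - current[1]).days
        flags.append(0 <= days_out <= window_days)
    # The final stay has no observed follow-up admission.
    flags.append(False)
    return flags


# Hypothetical stays for one patient; illustration only.
stays = [
    (date(2019, 1, 1), date(2019, 1, 5)),
    (date(2019, 1, 20), date(2019, 1, 25)),
    (date(2019, 4, 1), date(2019, 4, 3)),
]
print(flag_30_day_readmissions(stays))
```

The first stay is flagged because the next admission begins 15 days after discharge; the second is not, since the following admission comes more than two months later.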

Below are examples of papers that have explored some of the above questions. We encourage students to use these for inspiration, but to tackle new clinical questions that these papers have not already answered (e.g., studying a different disease or a different outcome).

Machine Learning Methodology

The projects mentioned here focus on developing new machine learning methods:

  • Develop better deep learning algorithms for claims data.
  • Develop synthetic data generation methods that produce realistic data but have provable privacy guarantees (this is a good project for a group with a theoretical computer science bent).
    • Recent work has explored using deep learning for this, e.g. Choi et al., 2017 and Hyland et al., 2017. However, there is a natural trade-off between truly capturing the data density (not just being able to reproduce aggregate statistics from synthetic data) and not leaking private information from the training data. How does one formalize this mathematically?
  • Learn models of what a “normal” treatment policy is for a disease (note: it may be sequential).
    • How do these differ across populations, either geographic or based on patient characteristics? Can you automatically discover different treatment strategies for the same condition? Can you identify patients that are outliers or are receiving abnormal treatment (these may be because of fraud, improper treatment, medical errors, etc.)?
  • Develop algorithms for inferring causality (e.g., one condition causes another, or a medication causes a side effect) from longitudinal claims data, e.g. by using ideas from Granger causality, the recently developed entropic causality, causal Bayesian networks, or Hawkes processes.
  • Develop algorithms to identify and explain non-stationarity, e.g. through changepoint detection and interpretable ML algorithms.
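To give a sense of what changepoint detection involves, the sketch below scans for a single shift in the mean of a series by choosing the split that minimizes the total within-segment squared error. This is a deliberately simple least-squares baseline, not a method prescribed by the course; the function name `best_mean_shift` and the monthly rates are hypothetical.

```python
def best_mean_shift(values, min_segment=2):
    """Find the single split that best separates `values` into two
    segments with different means.

    Returns (index, cost), where `index` is the first position of the
    second segment and `cost` is the summed squared error around each
    segment's own mean. Returns None if the series is too short.
    """
    n = len(values)
    if n < 2 * min_segment:
        return None

    def sse(segment):
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)

    best = None
    for k in range(min_segment, n - min_segment + 1):
        cost = sse(values[:k]) + sse(values[k:])
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


# Hypothetical monthly prescription rates with a shift after month 4.
monthly_rate = [10, 11, 9, 10, 20, 21, 19, 20]
print(best_mean_shift(monthly_rate))
```

The scan always returns some split, even for a series with no real change, so a project would still need a criterion for deciding whether the detected shift is meaningful, as well as a way to handle multiple changepoints.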

Other Datasets
