Module 4: Case Studies with Data

Case Study on Natural Language Processing: Identifying and Mitigating Unintended Demographic Bias in Machine Learning for NLP slides (PDF - 1.3MB)

Learning Objectives

  • Explore a case study on bias in NLP.
  • Demonstrate techniques to mitigate word embedding bias.


Natural language processing is used across multiple domains, including education, employment, social media, and marketing. There are many sources of unintended demographic bias in the NLP pipeline. The NLP pipeline is the collection of steps from collecting data to making decisions based on the model results.

Unintended demographic bias

Key definitions for this course:

  • Unintended: bias has an adverse side effect, but it is not deliberately learned.
  • Demographic: the bias is some form of inequality between demographic groups.
  • Bias: artifact of the NLP pipeline that causes unfairness.

There are two types of unintended demographic bias, sentiment bias and toxicity bias. Sentiment bias refers to an artifact of the ML pipeline that causes unfairness in sentiment analysis algorithms. Toxicity bias refers to an artifact of the ML pipeline that causes unfairness in toxicity prediction algorithms.

Whether a phrase is classified as toxic or non-toxic can hinge on a single word, and specific nationalities or groups are often marginalized. For example, “I am American” may be classified as non-toxic, whereas “I am Mexican” may be classified as toxic.

Bias introduction

Bias introduction can occur at many phases in the NLP pipeline, including the word corpus, word embedding, dataset, ML algorithm, and decision steps. Unfairness can then arise when these results are applied to society.

Measuring word embedding bias

Word embeddings encode text into vector spaces where the distance between words carries important semantic meaning. This allows for analogies such as man is to woman as king is to queen. However, research shows that word embeddings trained on Google News articles will complete the analogy man is to computer scientist as woman is to homemaker.

A method to quantify word embedding bias is demonstrated in the slides. A set of sentiment-labeled words, which is itself unbiased, is represented using the (potentially biased) word embeddings. A logistic regression classifier is trained on this dataset and then used to predict negative sentiment for a set of identity terms, for example “American” or “Canadian.”

The probabilities for negative sentiment can be compared to a uniform distribution to generate a relative negative sentiment bias (RNSB) score.
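The RNSB computation can be sketched as follows. This is a minimal illustration on synthetic stand-in vectors, not the authors' implementation: the embeddings, word lists, and the artificial tilt given to one identity term are all fabricated for the demo, where a real computation would use pretrained vectors such as word2vec or GloVe and curated sentiment lexicons.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 20

# Toy stand-in embeddings: a shared "sentiment" direction plus noise.
sentiment_axis = rng.normal(size=dim)

def toy_vec(tilt):
    """Random vector shifted `tilt` units along the sentiment axis."""
    return tilt * sentiment_axis + rng.normal(size=dim)

pos_words = [toy_vec(+1.0) for _ in range(30)]   # stand-ins for "good", "happy", ...
neg_words = [toy_vec(-1.0) for _ in range(30)]   # stand-ins for "bad", "awful", ...

# Identity terms; "Mexican" is artificially tilted toward the negative
# subspace to mimic an embedding bias.
identities = {"American": toy_vec(0.0), "Canadian": toy_vec(0.0),
              "Mexican": toy_vec(-0.5)}

# 1) Train a sentiment classifier on the labeled sentiment words.
X = np.array(pos_words + neg_words)
y = np.array([0] * 30 + [1] * 30)                # 1 = negative sentiment
clf = LogisticRegression().fit(X, y)

# 2) Predict P(negative) for each identity term and normalize to a distribution.
probs = clf.predict_proba(np.array(list(identities.values())))[:, 1]
dist = probs / probs.sum()

# 3) RNSB is the KL divergence between this distribution and the uniform one.
uniform = 1.0 / len(dist)
rnsb = float(np.sum(dist * np.log(dist / uniform)))
print(dict(zip(identities, dist.round(3))), "RNSB:", round(rnsb, 4))
```

An unbiased embedding would spread negative sentiment uniformly across identity terms, giving an RNSB of zero; any skew increases the score.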

Mitigating word embedding bias

Adversarial learning can be used to debias word embeddings. Different identity terms can be more or less correlated with negative or positive sentiment. In datasets, “American,” “Mexican,” or “German” can have more correlations with negative sentiment subspaces, which can cause downstream unfairness. We would ideally have balanced correlations between positive and negative sentiment subspaces for each group to prevent any effects of bias. Adversarial learning algorithms can be used to mitigate these biases.
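The adversarial setup can be illustrated with a minimal numpy sketch on synthetic data. This is not the authors' algorithm; it is a simplified version of the common scheme in which a predictor descends its own loss while ascending the loss of an adversary that tries to recover the protected attribute from the predictor's output. All data, dimensions, and learning rates below are fabricated for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 10

# Synthetic "embeddings" X with sentiment labels y and a protected-group
# indicator z that is mildly correlated with sentiment.
z = rng.integers(0, 2, n).astype(float)
y = np.where(rng.random(n) < 0.65, z, 1 - z)
dir_y, dir_z = rng.normal(size=d), rng.normal(size=d)
X = ((y - 0.5)[:, None] * dir_y + (z - 0.5)[:, None] * dir_z
     + 0.5 * rng.normal(size=(n, d)))

sigmoid = lambda t: 1 / (1 + np.exp(-np.clip(t, -30, 30)))
w = np.zeros(d)          # predictor weights (sentiment from embedding)
u, b = 0.0, 0.0          # adversary weights (protected group from prediction)
lr, alpha = 0.2, 1.0     # alpha scales the adversarial penalty

for _ in range(500):
    p = sigmoid(X @ w)                     # predictor output
    a = sigmoid(u * p + b)                 # adversary output
    # Predictor: descend its own loss, ascend the adversary's loss so that
    # p carries less information about z.
    g_pred = X.T @ (p - y) / n
    g_adv = X.T @ (u * (a - z) * p * (1 - p)) / n
    w -= lr * (g_pred - alpha * g_adv)
    # Adversary: descend its own loss.
    u -= lr * np.mean((a - z) * p)
    b -= lr * np.mean(a - z)

acc = float((((sigmoid(X @ w)) > 0.5) == y).mean())
print("predictor accuracy:", round(acc, 3))
```

The adversarial term pushes the predictor weights away from the subspace that encodes group membership, trading a little accuracy for reduced correlation with the protected attribute.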

Key Takeaways

  • There is no silver bullet for removing bias from NLP applications.
  • Bias can come from all stages of the ML pipeline and implementing mitigation strategies at each stage is essential to addressing bias.


Content presented by Audace Nakeshimana (MIT).

The research for this project was conducted in collaboration with Christopher Sweeney and Maryam Najafian (MIT). The content for this presentation was created by Audace Nakeshimana, Christopher Sweeney, and Maryam Najafian (MIT).

Case Study on Pulmonary Health slides (PDF - 1.2MB)

Learning Objectives

  • Present a case study on pulmonary health diagnostics.
  • Explore the influence of representative data on accuracy.



Pulmonary diseases, including asthma, COPD, and allergic rhinitis, can have significant detrimental health impacts if undetected. In remote areas with limited access to healthcare, they can often go undiagnosed and untreated. The motivation for this work was to develop a screening tool for community healthcare workers to determine whether patients who presented symptoms of pulmonary disease actually had pulmonary disease.

To develop the tool, data was collected from 303 patients who sought medical care at health clinics in Pune, India, between 2015 and 2018. Each patient completed two exams: one with a mobile diagnostic kit developed by Dr. Fletcher’s group and one consisting of a set of measurements taken in a Pulmonary Function Test lab by the research team. Health diagnoses were performed by medical staff, with a focus on asthma, allergic rhinitis, and COPD.

Study design

This exploration of the effect of representative sampling on accuracy was conducted across two protected variables: gender and income. For income, patients were categorized as either low-income or high-income. The overall approach to the bias study was to divide the dataset into a larger training superset and a test dataset. A logistic regression model with L2 regularization was used to predict disease status.

To train the model, training subsets that intentionally introduced imbalances along the protected variables were randomly sampled from the superset. With regard to income, for example, training subsets ranged from a 50%/50% high-/low-income split to an 87.5%/12.5% split. To account for stochastic error, this process was run 1000 times for each test. The area under the receiver operating characteristic curve (AUC) was used as the accuracy metric.
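The sampling-and-evaluation loop can be sketched as follows. The clinic data is not public, so this uses a synthetic stand-in dataset, fewer repetitions than the study's 1000, and hypothetical subset sizes; only the structure (imbalanced resampling, L2-regularized logistic regression, AUC averaged over runs) mirrors the protocol described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the clinic data: features, disease label, and a
# protected attribute (0 = low-income, 1 = high-income).
n, d = 303, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
group = rng.integers(0, 2, n)

# Hold out a fixed test set; the remainder is the training superset.
test = np.arange(n) < 60
train = ~test

def sample_imbalanced(frac_high, size):
    """Draw a training subset with the requested high-income fraction."""
    hi = np.where(train & (group == 1))[0]
    lo = np.where(train & (group == 0))[0]
    n_hi = int(size * frac_high)
    return np.concatenate([rng.choice(hi, n_hi, replace=True),
                           rng.choice(lo, size - n_hi, replace=True)])

def mean_auc(frac_high, runs=100, size=120):
    aucs = []
    for _ in range(runs):   # repeat to average out sampling noise
        idx = sample_imbalanced(frac_high, size)
        clf = LogisticRegression(penalty="l2", C=1.0).fit(X[idx], y[idx])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(aucs))

print("balanced 50/50 AUC:  ", round(mean_auc(0.5), 3))
print("skewed 87.5/12.5 AUC:", round(mean_auc(0.875), 3))
```

Comparing the averaged AUC across the balanced and skewed splits is exactly the comparison the plots in the slides visualize.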


The results for predictive accuracy for AR, Asthma, and COPD are shown below. The data shows no significant decrease in algorithm accuracy as gender imbalances are introduced in the data. It is important to note that protected variables do not necessarily affect outcome variables and that lack of representativeness may not always introduce bias or unfairness into models.


In our dataset, we found that smoking was strongly correlated with gender: 55% of men reported that they were non-smokers, whereas 100% of women reported that they were non-smokers. As a result, the population of women was more homogeneous, allowing for higher predictive accuracy.

The results for predictive accuracy for AR, Asthma, and COPD are shown below. Again, we see little difference in accuracy as we change representativeness within the sample. COPD is the most sensitive to socio-economic status with a 4% difference in model accuracy for high-income and low-income populations. Asthma and allergic rhinitis show no difference in performance.



Content presented by Amit Gandhi (MIT).

The research and content for this module was conducted by Rich Fletcher and Olasubomi Olubeko (MIT) in collaboration with the Chest Research Foundation. Additional funding for the project was provided by NSF and the Vodafone Americas Foundation.

Mitigating Gender Bias slides (PDF - 1.6MB)

Learning Objectives

  • Explore the steps and principles involved in building less-biased machine learning models.
  • Explore two classes of techniques for mitigating bias in machine learning: data-based and model-based.
  • Apply these techniques to the UCI adult dataset.


The repository for this module can be found at Github - ML Bias Fairness.

Defining algorithmic/model bias

Bias or algorithmic bias will be defined as systematic errors in an algorithm/model that can lead to potentially unfair outcomes.

Bias can be quantified by looking at discrepancies in the model error rate for different populations.

UCI adult dataset

The UCI adult dataset is a widely cited dataset used for machine learning modeling. It includes over 48,000 data points extracted from the 1994 census data in the United States. Each data point has 15 features, including age, education, occupation, sex, race, and salary, among others.

The dataset has twice as many men as women. Additionally, the data shows income discrepancies across genders. Approximately 1 in 3 men are reported to make over $50K, whereas only 1 in 5 women are reported to make the same amount. For high salaries, the number of data points in the male population is significantly higher than in the female population.

Preparing data

In order to prepare the data for machine learning, we will explore different steps involved in transforming data from raw representation to appropriate numerical or categorical representation. One example is converting native country to binary, representing individuals whose native country was the US as 0 and individuals whose native country was not the US as 1. Similar representations need to be made for other attributes such as sex and salary. One-hot encoding can be used for attributes where more than two choices are possible.

Binary coding was chosen for simplicity, but this decision must be made on a case-by-case basis. Converting features like work class can be problematic if individuals from different categories have systematically different levels of income. However, not doing this can also be problematic if one category has a population that is too small to generalize from.
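These encoding steps can be sketched on a tiny in-memory sample. The rows below are fabricated for illustration; only the column names follow the UCI adult data.

```python
import pandas as pd

# Tiny illustrative sample with the same column names as the UCI adult data.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "native-country": ["United-States", "Mexico", "United-States", "India"],
    "workclass": ["Private", "Self-emp-not-inc", "Private", "State-gov"],
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
})

# Binary encodings: US-born -> 0, otherwise 1; similarly for sex and salary.
df["native-country"] = (df["native-country"] != "United-States").astype(int)
df["sex"] = (df["sex"] == "Female").astype(int)
df["salary"] = (df["salary"] == ">50K").astype(int)

# One-hot encode a multi-valued attribute such as workclass.
df = pd.get_dummies(df, columns=["workclass"])
print(df.head())
```

`get_dummies` expands each distinct workclass value into its own indicator column, which sidesteps the problem of forcing systematically different categories into one binary feature.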

Illustrating gender bias

We apply the standard ML approach to the UCI adult dataset. The steps are 1) splitting the dataset into training and test data, 2) selecting a model (an MLPClassifier in this case), 3) fitting the model on the training data, and 4) using the model to make predictions on the test data. For this application, we define the positive category to mean high income (>$50K/year) and the negative category to mean low income (<=$50K/year).

The model results show that the positive and true positive rates are higher for the male demographic, while the negative and true negative rates are higher for the female demographic. This shows consistent disparities in the error rates between the two demographics, which we will define as gender bias.
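The pipeline and the per-group rate comparison can be sketched as follows. Synthetic data stands in for the preprocessed adult dataset (the real notebook is in the linked repository); the group coding and feature layout here are assumptions made for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed adult data: last column is the
# binary sex attribute; label is 1 for income >$50K, correlated with sex.
n = 1000
sex = rng.integers(0, 2, n)
X = np.column_stack([rng.normal(size=(n, 5)), sex])
y = (X[:, 0] + 0.4 * sex + 0.3 * rng.normal(size=n) > 0.5).astype(int)

# 1) split, 2) select model, 3) fit, 4) predict
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-group positive rate and true positive rate expose the disparity.
for g in (0, 1):
    mask = X_te[:, -1] == g
    pos_rate = pred[mask].mean()
    tp_mask = mask & (y_te == 1)
    tpr = pred[tp_mask].mean() if tp_mask.any() else 0.0
    print(f"group {g}: positive rate {pos_rate:.2f}, TPR {tpr:.2f}")
```

Comparing these per-group rates, rather than overall accuracy alone, is what reveals the bias pattern described above.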

Exploring data-based debiasing techniques

We hypothesize that gender bias could come from unequal representation of male and female demographics. We attempt to re-calibrate and augment the dataset to equalize the gender representation in our training data. We will explore the following techniques and their outcomes. We will compare the results after describing each approach.

Debiasing by unawareness: we drop the gender attribute from the model so that the algorithm is unaware of an individual’s gender. Although the discrepancy in overall accuracy does not change, the positive, negative, true positive, and true negative rates are much closer for the male and female demographics.
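Dropping the attribute is a one-line operation. A minimal sketch, assuming the gender attribute occupies the last column of a synthetic feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 ordinary features plus a binary gender column.
X_full = np.column_stack([rng.normal(size=(8, 4)), rng.integers(0, 2, 8)])

# Debiasing by unawareness: remove the gender column before training.
X_unaware = np.delete(X_full, -1, axis=1)
print(X_full.shape, "->", X_unaware.shape)
```

Note that unawareness alone does not remove bias carried by correlated proxy features, which is why the rate gaps narrow but do not vanish.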

Equalizing the number of data points: we attempt different approaches to equalizing the representation. The different equalization criteria are #male = #female, #high income male = #high income female, #high income male/#low income male = #high income female/#low income female. One of the disadvantages of equalizing the number of data points is that the dataset size is limited by the size of the smallest demographic. Equalizing the ratio can overcome this limitation.
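The count-equalization criterion can be sketched as downsampling the majority group; the proportions below are fabricated to mimic the adult data's imbalance, and the ratio-based variant would instead match #high-income/#low-income within each gender.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic gender labels (1 = female), imbalanced like the adult data.
n = 1000
sex = (rng.random(n) < 1 / 3).astype(int)
idx = np.arange(n)

def equalize_counts(sex):
    """#male = #female: downsample the larger group to the smaller one."""
    male, female = idx[sex == 0], idx[sex == 1]
    k = min(len(male), len(female))
    return np.concatenate([rng.choice(male, k, replace=False),
                           rng.choice(female, k, replace=False)])

balanced = equalize_counts(sex)
print("male/female after equalizing:",
      (sex[balanced] == 0).sum(), (sex[balanced] == 1).sum())
```

The `k = min(...)` line is exactly the limitation noted above: the balanced dataset can be no larger than twice the smallest demographic.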

Augment data with counterfactuals: for each data point Xi with a given gender, we generate a new data point Yi that only differs with Xi at the gender attribute and add it to our dataset. The gaps between male and female demographics are significantly reduced through this approach.
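Counterfactual augmentation can be sketched as follows, assuming the gender attribute occupies the last column of a synthetic feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 ordinary features plus a binary gender column.
X = np.column_stack([rng.normal(size=(5, 3)), rng.integers(0, 2, 5)])
y = rng.integers(0, 2, 5)

# For each point X_i, add a copy Y_i that differs only in the gender
# attribute, keeping the same label.
X_cf = X.copy()
X_cf[:, -1] = 1 - X_cf[:, -1]
X_aug = np.vstack([X, X_cf])
y_aug = np.concatenate([y, y])
print(X.shape, "->", X_aug.shape)
```

After augmentation the two gender values appear in exactly equal proportion, so the model cannot associate gender with the label.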

The different approaches yield varying accuracy for the male and female demographics, as shown in the plot below. The counterfactual approach is shown to be the best at reducing gender bias. We see similar behavior for the positive rates and negative rates as well as the true positive and true negative rates.

Model-based debiasing techniques

Different ML models show different levels of bias. By changing the model type and architecture, we can observe which ones are less biased for this application. We examine single- and multi-model architectures. The models considered are support vector, random forest, KNN, logistic regression, and MLP classifiers. Multi-model architectures involve training a group of different models that make a final prediction based on consensus. Two approaches can be used for consensus: hard voting, where the final prediction is the majority prediction among the models, and soft voting, where the final prediction is the average prediction. The following plots show the differences in overall accuracy and the discrepancies between accuracy across gender.
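Both consensus schemes are available in scikit-learn's `VotingClassifier`; a minimal sketch on synthetic data (the real comparison in the slides uses the full adult dataset and also includes an MLP):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

estimators = [
    ("svc", SVC(probability=True)),   # probability=True is needed for soft voting
    ("rf", RandomForestClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression()),
]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # averaged probabilities
print("hard:", hard.score(X, y), "soft:", soft.score(X, y))
```

Soft voting weights each model by its confidence, which often smooths out the idiosyncratic errors of any single model; the bias comparison then proceeds by computing the per-gender rates for each ensemble.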

It is also important to compare the results of the models across multiple training sessions. For each model type, five instances of the model were trained and compared. Results are shown in the plot below. We can see that different models have different variability in performance for different metrics of interest.


References

Dua, D., and C. Graff. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. ISBN: 9780387310732.

Hardt, Moritz. (2016, October 7). “Equality of opportunity in machine learning.” Google AI Blog.

Zhong, Ziyuan. (2018, October 21). “A tutorial on fairness in machine learning.” Towards Data Science.

Kun, Jeremy. (2015, October 19). “One definition of algorithmic fairness: statistical parity.”

Olteanu, Alex. (2018, January 3). “Tutorial: Learning curves for machine learning in Python.” DataQuest.

Garg, Sahaj, et al. (2018). “Counterfactual fairness in text classification through robustness.” arXiv preprint arXiv:1809.10610.

Wikipedia contributors. (2019, September 6). “Algorithmic bias.” In Wikipedia, The Free Encyclopedia.


Content presented by Audace Nakeshimana (MIT).

This content was created by Audace Nakeshimana and Maryam Najafian (MIT).