6.864 Quantitative Methods for Natural Language Processing

> Related Topics: AI and Algorithms, Ethical Computing and Practice

Authors: Jacob Andreas, Catherine D’Ignazio, Harini Suresh

Keywords: data annotation; natural language processing; machine learning; content moderation

Topics addressed:

- Critically question how and by whom the data was created
- Determine what its limitations might be
- Discuss what the data should and should not be used for

Resources:

Part 0: Short-answer reflections on a few hypothetical scenarios.

Dataset Creation Part 0 (PDF) (DOCX)

Part 1: This part is done in groups. You will have the role of a researcher in charge of creating a dataset. You’ll receive a hypothetical task, make decisions about what labels you want to collect, and write instructions to a group of annotators.

Dataset Creation Part 1 (PDF) (DOCX)
- Task A: Comment Moderation (PDF) (DOCX)
- Task B: Credibility Evaluation (PDF) (DOCX)

Part 2: This part is done individually. You’ll now be an annotator. First, take the instructions you wrote and annotate a new set of examples according to them. Then, you will receive instructions from a different group, for a different task, and will be asked to annotate a set of examples by following their instructions.
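Once two annotators have labeled the same examples, one common way to compare their labels is to compute raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The assignment itself does not prescribe this analysis; the following is a minimal sketch, and the label names (`"remove"`, `"keep"`) and function names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of examples on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of examples.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labeled independently according to their own label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for a comment-moderation task (illustrative only).
annotator_1 = ["remove", "keep", "keep", "remove", "keep", "keep"]
annotator_2 = ["remove", "keep", "remove", "remove", "keep", "keep"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

A low kappa between annotators who followed the same instructions can point to ambiguous label definitions, which is one of the things Part 2 is meant to surface.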

Part 3: Choose one of the listed readings and write a response to it. See the assignment description for titles.

Additional Reading:

Paullada, Amandalynne, Inioluwa Deborah Raji, et al. “Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research.” arXiv preprint arXiv:2012.05345 (2020).

D’Ignazio, Catherine and Lauren Klein. “What Gets Counted Counts.” Chapter 4 in Data Feminism. March 16, 2020.

Gebru, Timnit, Jamie Morgenstern, et al. “Datasheets for Datasets.” (PDF - 2.1MB) arXiv preprint arXiv:1803.09010 (2018).

Bhuiyan, M. Momen, Amy X. Zhang, Connie Moon Sehat, and Tanushree Mitra. “Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria.” (PDF) Proceedings of the ACM on Human-Computer Interaction 4, no. CSCW2 (2020): 1-26.

Metz, Cade. “A.I. is Learning From Humans. Many Humans.” The New York Times. Aug. 16, 2019.

Kaye, Kate. “These Companies Claim to Provide ‘Fair-Trade’ Data Work. Do They?” Technology Review. Aug. 7, 2019.

Social and Ethical Responsibilities of Computing (SERC)

Resources:

Assignment “Dataset Creation” Description (PDF) (DOCX)
