Authors: Jacob Andreas, Catherine D’Ignazio, Harini Suresh
Keywords: data annotation; natural language processing; machine learning; content moderation
- Critically question how and by whom the data was created
- Determine what its limitations might be
- Discuss what the data should and should not be used for
Part 0: Short-answer reflections on a few hypothetical scenarios.
Part 1: This part is done in groups. You will have the role of a researcher in charge of creating a dataset. You’ll receive a hypothetical task, make decisions about what labels you want to collect, and write instructions to a group of annotators.
- Dataset Creation Part 1 (PDF) (DOCX)
Part 2: This part is done individually. You’ll now be an annotator. First, take the instructions you wrote and annotate a new set of examples according to them. Then, you will receive instructions from a different group, for a different task, and will be asked to annotate a set of examples by following their instructions.
Part 3: Pick one of the listed readings to read and respond to. See the assignment description for titles.
Paullada, Amandalynne, Inioluwa Deborah Raji, et al. “Data and Its (Dis)Contents: A Survey of Dataset Development and Use in Machine Learning Research.” arXiv preprint arXiv:2012.05345 (2020).
D’Ignazio, Catherine and Lauren Klein. “What Gets Counted Counts.” Chapter 4 in Data Feminism. March 16, 2020.
Gebru, Timnit, Jamie Morgenstern, et al. “Datasheets for Datasets (PDF - 2.1MB).” arXiv preprint arXiv:1803.09010 (2018).
Bhuiyan, M. Momen, Amy X. Zhang, Connie Moon Sehat, and Tanushree Mitra. “Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria (PDF).” Proceedings of the ACM on Human-Computer Interaction 4, no. CSCW2 (2020): 1-26.
Metz, Cade. “A.I. is Learning From Humans. Many Humans.” The New York Times. Aug. 16, 2019.
Kaye, Kate. “These Companies Claim to Provide ‘Fair-Trade’ Data Work. Do They?” MIT Technology Review. Aug. 7, 2019.