6.864 | Fall 2005 | Graduate

Advanced Natural Language Processing


Homework 1 (PDF)  
Homework 2 (PDF) counts.gz (GZ - 3.2 MB) (The GZ file contains: counts.txt.)
theirthere.test (TXT)
Homework 3 (PDF) data.gz (GZ) (The GZ file contains: data.txt.)
synrev (TXT)

Development Data

Verb pairs and their associated cosine similarity scores. Note that sim.in is the same file as synrev.
sim.in (TXT)
sim.out (TXT)
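
For reference, the cosine similarity behind the scores in sim.out can be sketched as follows. This is a minimal sketch, not the reference implementation: it assumes each verb is represented as a sparse vector of context counts stored in a dict, and the two example vectors are made up for illustration (they are not taken from counts.txt or data.txt).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (an assumption)
    return dot / (norm_u * norm_v)

# Hypothetical context-count vectors for two verbs.
verb_a = {"x": 3.0, "y": 4.0}
verb_b = {"x": 4.0, "y": 3.0}
score = cosine_similarity(verb_a, verb_b)  # dot = 24, norms = 5 and 5
```

Here the score is 24 / 25 = 0.96. Two vectors with no shared keys score 0, and a vector compared with any positive multiple of itself scores 1.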

Result of complete-link clustering to the 2-cluster level.
cluster1 (TXT)
cluster2 (TXT)
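
Complete-link clustering down to a fixed number of clusters can be sketched as below. This is an illustrative sketch under stated assumptions: the function name complete_link, the similarity callback sim, and the toy data are hypothetical, and ties are broken by taking the first best pair found, which may differ from the reference implementation.

```python
def complete_link(items, sim, k=2):
    """Agglomerative complete-link clustering, stopping at k clusters."""
    clusters = [[x] for x in items]  # start with one cluster per item
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: a pair of clusters is only as similar
                # as its LEAST similar pair of members.
                score = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the most similar pair of clusters (j > i, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy example: numbers on a line, with similarity = negative distance.
result = complete_link([1, 2, 10, 11], lambda a, b: -abs(a - b))
```

On the toy data the two clusters are [1, 2] and [10, 11]. For the homework data, sim would look up the cosine similarity of a verb pair instead.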

Homework 4 (PDF) poscounts.gz (GZ) (The GZ file contains: poscounts.txt.)
wsj.19-21.test (TXT)

Extra Materials

A package containing the scripts that were used to generate the poscounts.gz corpus. We are providing this code in case you are curious about the data generation. For the purposes of the problem set, however, please use the poscounts.gz training corpus to ensure that your results comply with the reference implementation.
ft.tar.tar (TAR - 2.5 MB)

Development data for testing your tag-trigram probabilities; tritest contains tag trigrams, while tritest.probs contains the corresponding probabilities.
tritest (TXT)
tritest.probs (TXT)

Development data for testing your Viterbi tag assignments. The simplesents file contains about 530 simple sentences that admit relatively few possible tag assignments. The simplesents.bf_tagged file contains optimal tag assignments and log-probabilities as discovered by brute-force enumeration. The first element in every line of simplesents.bf_tagged gives the log-probability of the best tagging, and the rest of the line gives the tag assignment itself.
simplesents (TXT)
simplesents.bf_tagged (TXT)
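
A small checker for comparing your Viterbi output against simplesents.bf_tagged might look like the following. It is a sketch that assumes each line is whitespace-separated, with the log-probability first and the tag assignment after it, as described above. The function names and the tolerance are hypothetical, and the sample line is made up rather than copied from the file. The logarithm base is not stated here, so check it before comparing numbers.

```python
import math

def parse_bf_tagged_line(line):
    """Split one reference line into (log-probability, list of tag tokens)."""
    fields = line.split()
    return float(fields[0]), fields[1:]

def matches_reference(my_logprob, my_tags, ref_line, tol=1e-6):
    """True if both the tagging and its log-probability agree with the reference."""
    ref_logprob, ref_tags = parse_bf_tagged_line(ref_line)
    same_tags = my_tags == ref_tags
    same_score = math.isclose(my_logprob, ref_logprob, abs_tol=tol)
    return same_tags and same_score

# Made-up reference line in the assumed format.
example_line = "-12.5 DT NN VBZ\n"
ok = matches_reference(-12.5, ["DT", "NN", "VBZ"], example_line)
```

If two taggings tie for the best score, the tags may differ while the log-probabilities agree, so it can be useful to report the two comparisons separately.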

Homework 5 (PDF) Resources

BoosTexter - The download page for the BoosTexter binaries. If you get an error message, try reloading the link. BoosTexter is UNIX®-based, so if you want to run it in Windows, you will need to get a UNIX® shell such as Cygwin or U/Win.

Penn Treebank Tagset - Descriptions of the Penn part-of-speech tags. NB: When you are determining the plurality of a noun phrase, you will find that the last tag is not always a noun-type tag. Use the following rule to determine the plurality of these other parts of speech:
Plural: CD, JJP, SYM
Singular: DT, JJ, RB, VBG, WDT
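
The rule above can be encoded as a simple lookup on the last tag of a noun phrase. This is a sketch: the function name np_plurality is hypothetical, and the treatment of the noun-type tags (NNS and NNPS as plural, NN and NNP as singular) is an assumption based on the standard Penn Treebank tag definitions, not something stated in the rule itself. Tags covered by neither list are reported as "unknown".

```python
# Tags from the rule above, plus the noun-type tags (an assumption, see lead-in).
PLURAL_TAGS = {"NNS", "NNPS", "CD", "JJP", "SYM"}
SINGULAR_TAGS = {"NN", "NNP", "DT", "JJ", "RB", "VBG", "WDT"}

def np_plurality(tags):
    """Classify a noun phrase by the last part-of-speech tag it contains."""
    last = tags[-1]
    if last in PLURAL_TAGS:
        return "plural"
    if last in SINGULAR_TAGS:
        return "singular"
    return "unknown"  # tag not covered by the rule or the noun-type assumption

examples = [np_plurality(["DT", "NNS"]), np_plurality(["DT", "JJ"]), np_plurality(["CD"])]
```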


Sentence pairs with coreference annotations.
coref_samples.train (TXT)
coref_samples.test (TXT)

BoosTexter .names template for feature generation. Please adhere to this template to ensure that your features conform to the reference results.
coref.names (TXT)

Reference features for the first 30 sentence pairs in coref_samples.train. The feature vectors were generated in a left-to-right postorder traversal of the noun phrases in a given sentence pair. In the example below, each noun phrase is bracketed and numbered in traversal order:
[[ [[ 1 ]] 2 ]] [[ 3 ]] [[ 4 ]] [[ [[ 5 ]] [[ [[ 6 ]] 7 ]] 8 ]]
first30.data (TXT)
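
In a left-to-right postorder traversal, a noun phrase is visited only after all the noun phrases nested inside it, which is the order in which the closing brackets appear. The sketch below recovers that order with a stack. It assumes brackets are whitespace-separated "[[" and "]]" tokens as in the example above; the actual markup in coref_samples.train may differ, and the function name is hypothetical.

```python
def postorder_spans(text):
    """Return one (start, end) token span per [[ ... ]] bracket, in postorder."""
    stack = []    # start positions of brackets that are still open
    spans = []    # completed spans, appended in the order brackets close
    position = 0  # index of the next non-bracket token
    for token in text.split():
        if token == "[[":
            stack.append(position)
        elif token == "]]":
            # A bracket closes only after everything nested in it has closed,
            # so appending here yields postorder.
            spans.append((stack.pop(), position))
        else:
            position += 1
    return spans

example = "[[ [[ 1 ]] 2 ]] [[ 3 ]] [[ 4 ]] [[ [[ 5 ]] [[ [[ 6 ]] 7 ]] 8 ]]"
spans = postorder_spans(example)
```

For the example string, the k-th span returned ends right after the token labeled k, which matches the numbering shown above.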

Homework 6 (PDF) corpus.de.gz (GZ) (The GZ file contains: corpus.de.txt.)
corpus.en.gz (GZ) (The GZ file contains: corpus.en.txt.)


A set of words and their associated translation probabilities. The output file is formatted as a series of lines, where each line contains a number of (German word, translation probability) pairs, all tokens separated by spaces.
devwords (TXT)
devwords.out (TXT)
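
Reading and writing lines in this format can be sketched as follows. The sketch assumes that a line consists only of alternating word and probability tokens. Whether a line also begins with the source word is not stated here, so check devwords.out before relying on this. The function names, the sample words, and the number formatting are hypothetical.

```python
def parse_translation_line(line):
    """Parse 'word prob word prob ...' into a list of (word, probability) pairs."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating word/probability tokens")
    return [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]

def format_translation_line(pairs):
    """Inverse of parse_translation_line: all tokens separated by single spaces."""
    return " ".join(f"{word} {prob}" for word, prob in pairs)

# Made-up line in the assumed format.
pairs = parse_translation_line("haus 0.7 heim 0.2")
line = format_translation_line(pairs)
```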

A set of words for which you must provide output probabilities. Please provide a file testwords.out with the same format as devwords.out above.
testwords (TXT)
