This problem set asks you to run a relatively simple machine learning algorithm, designed to work well on text classification problems, against some of the textual data that you encountered in homework assignment 1. The algorithm, called BoosTexter, is described in the following paper:
Schapire, Robert E., and Yoram Singer. "BoosTexter: A Boosting-based System for Text Categorization." Machine Learning 39.2-3 (2000): 135-168.
The program that implements these algorithms is available for download and should run on all common operating systems, from the following URL: http://www.research.att.com/~gsf/cgi-bin/download.cgi?action=list&name=boostexter.
This gives you the option of selecting implementations compiled for various flavors of Unix, Windows (32-bit), and Mac OS X. The site has an odd method for agreeing to the license: when a dialog box asks for a username and password, click "Cancel" so that you can see the license agreement and retrieve the username and password you need. If you then refresh your browser window, it will again prompt for the username/password that you just obtained. For me, this failed when using Safari on a Mac, but worked in Firefox. Your mileage may vary.
Running BoosTexter on Windows poses an additional challenge because it really is a Unix program. According to its documentation, it should be possible to run the program under Cygwin, a common Unix-like shell environment that is often installed on Windows machines. Unfortunately, our experience actually trying to make this work has been dismal. Jacinda Shelly, a former student in the class, helped greatly and figured out that the program will run under AT&T's uwin-base (aka ksh) environment. Here are her instructions for making this work on a Windows XP installation:
I hope this helps! I'm glad it's working now. The only annoying part about uwin is that up and down arrow keys will only let you reuse a command, not edit it (at least on my machine).
$ boostexter.exe -n 10 -W 1 -N ngram -S sample -V
Copyright 2001 AT&T. All rights reserved.
Weak Learner parameters:
Window = 1
Classes = all
Expert = NGRAM
C0: -1.199 0.168 0.168
C1: 0.549 -0.549 -0.549
rnd 1: wh-err= 0.724633 th-err= 0.724633 test= 1.0000000 train= 0.3333333
If at all feasible, I would encourage you to run under some Unix-like OS, such as Linux or Mac OS X.
Once you have downloaded the program, look at the README file to see how to run it and how to interpret its outputs, and at the "sample.*" files for examples of the input formats the program needs. Note that the input texts to BoosTexter must not contain commas, periods, or line breaks, because those characters are part of the formatting of the input files. In addition, I have discovered, by the sad experience of having the program go into infinite loops, that other symbols listed in *text-replacements* (in the Lisp code) also cause problems: colons and percent signs. These have all been substituted by upper-case symbols in the texts above.
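If you need to sanitize additional text yourself, the substitution described above can be sketched as follows. This is only an illustration: the placeholder tokens (COMMA, PERIOD, etc.) are hypothetical, and the authoritative mapping is the *text-replacements* table in the Lisp code.

```python
# Characters that conflict with BoosTexter's input format, mapped to
# hypothetical upper-case placeholder tokens (the real tokens are defined
# in *text-replacements* in the course's Lisp code).
REPLACEMENTS = {
    ",": " COMMA ",
    ".": " PERIOD ",
    ":": " COLON ",
    "%": " PERCENT ",
    "\n": " ",
}

def sanitize(text: str) -> str:
    """Replace characters that BoosTexter treats as formatting."""
    for ch, token in REPLACEMENTS.items():
        text = text.replace(ch, token)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())

print(sanitize("Temp: 98.6%, stable.\nNo change."))
```

Collapsing whitespace at the end keeps the token stream clean for the n-gram features, since runs of spaces carry no information for the learner.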
Using the dmss data, try building models using BoosTexter to classify the cases based on text unigrams, bigrams, trigrams, etc., appearing in the data. Note that the parameters -W and -N control the types of features used in learning, and the -n option controls how many rounds of boosting are performed (roughly, how many features are selected). The -l and -z parameters select variations on the algorithm, as described in the paper. If you use the -V option, you can see the features being selected as the program runs. In any case, you can examine the file dmss.shyp after running boostexter to see the model that was generated. The distributed README file explains how to understand these. That document also shows how to run a trained model against the dmss.test data, to see how well it performs on data other than the set it was trained on.
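When you run with -V, the per-round progress lines (like the `rnd 1: ...` line in the sample session above) can be scraped to see how train and test error evolve with the number of boosting rounds. A minimal parser sketch follows; the field names are taken from the sample output shown earlier, so check that they match your version's exact format.

```python
import re

# Matches per-round progress lines such as:
#   rnd 1: wh-err= 0.724633 th-err= 0.724633 test= 1.0000000 train= 0.3333333
ROUND_RE = re.compile(r"rnd\s+(\d+):.*?test=\s*([\d.]+)\s+train=\s*([\d.]+)")

def parse_rounds(output: str):
    """Return a list of (round, test_error, train_error) tuples."""
    rows = []
    for line in output.splitlines():
        m = ROUND_RE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows

sample = "rnd 1: wh-err= 0.724633 th-err= 0.724633 test= 1.0000000 train= 0.3333333"
print(parse_rounds(sample))
```

Plotting test error against the round number is a quick way to choose a sensible value for -n, since test error typically stops improving (or starts rising) well before training error does.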
Note: If the goal of cross-validation is not to optimize parameters but to explore the robustness of a single chosen method, it is possible to use the entire training + test set in cross-validation. However, it is important, and tricky, to ensure that you do not "corrupt" the final evaluation results when allowing the test data set to participate in any of the training tasks.
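One way to make this caution concrete is to generate the fold assignments once, up front, so that every example is validated exactly once and no example appears in both the training and validation side of the same fold. A minimal sketch of k-fold index generation (plain Python; the function names are illustrative, not part of BoosTexter):

```python
import random

def kfold_indices(n_examples: int, k: int, seed: int = 0):
    """Split indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)  # fixed seed makes folds reproducible
    return [idx[i::k] for i in range(k)]

def kfold_splits(n_examples: int, k: int, seed: int = 0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = kfold_indices(n_examples, k, seed)
    for i, val in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, val

# Example: 10 examples, 5 folds -- train and validation never overlap.
for train, val in kfold_splits(10, 5):
    assert set(train).isdisjoint(val)
```

If you do fold in the test data this way, the key discipline is that any decision you make by looking at the folds (feature choices, number of rounds, -l/-z settings) must not then be "re-scored" on examples that informed it.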
As described above, the hw2.* files contain notes with (possibly) multiple labels, drawn from a list of 21 relatively common problems in the CWS. Even so, the most common of these occurs only about 50 times in the data set, and the least common only 4 times. The resulting classification problem is therefore harder.