Document Distance: Program Version 5

The get_words_from_string routine is character-oriented: it must do some processing on each character of the input file(s). Thus, although the running time of this routine is linear in the size of the input, it is nonetheless expensive, because there are many more characters in the file than there are words in the file.

One of the nice things about Python is that it has extensive libraries of built-in routines that are efficiently implemented. This is especially true of string-processing routines; the string module, for example, contains many fast and very useful routines. We'll now use some of these to implement a much faster version of get_words_from_string.

Our strategy is simple:

  • Using the string.translate routine, convert all non-alphanumeric characters to blanks, while simultaneously converting all upper-case letters to lower-case. For example, when tab is an appropriate "translation table":
      string.translate("(Hi) David. What's up?", tab) ==> " hi  david  what s up "

  • Using the routine string.split, split the text line into its constituent words. Applying string.split to the previous string yields: ["hi","david","what","s","up"]. (A short demonstration of both steps follows this list.)
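
To make the two steps concrete, here is a minimal sketch (in Python 2, which the code on this page targets) that builds such a translation table and applies both steps to the example string:

import string

# build a table mapping punctuation to spaces and upper case to lower case
tab = string.maketrans(string.punctuation + string.uppercase,
                       " " * len(string.punctuation) + string.lowercase)

s = string.translate("(Hi) David. What's up?", tab)
print s                # -> " hi  david  what s up "
print string.split(s)  # -> ['hi', 'david', 'what', 's', 'up']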

Here is the modified routine:

import string

# global variable needed for fast parsing:
# translation table maps upper case to lower case and punctuation to spaces
translation_table = string.maketrans(string.punctuation+string.uppercase,
                                     " "*len(string.punctuation)+string.lowercase)

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list
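
Note that string.maketrans, string.uppercase, and string.lowercase are Python 2 interfaces. If you are running Python 3, where they no longer exist, a rough equivalent (a sketch only, not part of the course's docdist5.py) is:

import string

# Python 3 sketch: str.maketrans builds the table directly, and
# ascii_uppercase / ascii_lowercase replace the removed
# string.uppercase / string.lowercase constants.
translation_table = str.maketrans(
    string.ascii_uppercase + string.punctuation,
    string.ascii_lowercase + " " * len(string.punctuation))

def get_words_from_string(line):
    """Return the list of lower-cased words in the given string."""
    return line.translate(translation_table).split()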

The modified document distance routine is docdist5.py.

Running this on our standard example gives the following output:

docdist5.py t2.bobsey.txt t3.lewis.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
         366048 function calls in 13.859 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
    22663    0.089    0.000    0.089    0.000 :0(extend)
   232140    0.751    0.000    0.751    0.000 :0(has_key)
        2    0.020    0.010    0.020    0.010 :0(items)
    43228    0.143    0.000    0.143    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.013    0.007    0.013    0.007 :0(readlines)
        1    0.005    0.005    0.005    0.005 :0(setprofile)
    22663    0.144    0.000    0.144    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
    22663    0.077    0.000    0.077    0.000 :0(translate)
        1    0.003    0.003   13.854   13.854 <string>:1(<module>)
        2    0.895    0.447    1.665    0.833 docdist5.py:105(count_frequency)
        2   11.125    5.562   11.125    5.563 docdist5.py:120(insertion_sort)
        2    0.001    0.000   13.518    6.759 docdist5.py:142(word_frequencies_for_file)
        3    0.179    0.060    0.321    0.107 docdist5.py:160(inner_product)
        1    0.000    0.000    0.322    0.322 docdist5.py:186(vector_angle)
        1    0.011    0.011   13.851   13.851 docdist5.py:196(main)
        2    0.000    0.000    0.014    0.007 docdist5.py:55(read_file)
        2    0.176    0.088    0.713    0.356 docdist5.py:71(get_words_from_line_list)
    22663    0.226    0.000    0.447    0.000 docdist5.py:89(get_words_from_string)
        1    0.000    0.000   13.859   13.859 profile:0(main())
        0    0.000             0.000          profile:0(profiler)

Excellent! Now the only "nail left to hit" is insertion_sort, which takes Θ(n²) time in the worst case. To make this program work well on larger inputs (e.g., the complete works of Shakespeare), we need to replace insertion_sort with something faster.
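
For concreteness, here is one O(n log n) candidate, a simple merge sort; this is only an illustrative sketch, not necessarily the code the next version will use:

def merge_sort(A):
    """Return a new sorted list with the elements of A (O(n log n))."""
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    return merge(merge_sort(A[:mid]), merge_sort(A[mid:]))

def merge(L, R):
    """Merge two sorted lists L and R into one sorted list."""
    answer = []
    i = j = 0
    while i < len(L) and j < len(R):
        if L[i] <= R[j]:
            answer.append(L[i])
            i += 1
        else:
            answer.append(R[j])
            j += 1
    answer.extend(L[i:])  # at most one of these two
    answer.extend(R[j:])  # slices is non-empty
    return answer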