Document Distance: Program Version 2

In order to figure out why our initial program is so slow, we now "instrument" the program so that Python will tell us where the running time is going. This is very simple in Python; we simply use the profile module to print out the desired statistics. (See the Python documentation for more details on the profile module.)

(You can find out about profilers for other languages from Wikipedia's article on profiling.)

More precisely, the end of the program is changed from:

if __name__ == "__main__":
    main()

to

if __name__ == "__main__":
    import profile
    profile.run("main()")

(We also rename the program as docdist2.py.)
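As an aside, recent versions of Python also provide cProfile, a lower-overhead profiler with the same interface as profile; the only change would be the import. This is just an alternative sketch; the runs shown on this page use profile:

if __name__ == "__main__":
    import cProfile
    cProfile.run("main()")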

If we now try the command

docdist2.py t2.bobsey.txt t3.lewis.txt

we get the output:

File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
         3816334 function calls in 194.823 CPU seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
  1241849    4.839    0.000    4.839    0.000 :0(append)
  1277585    4.913    0.000    4.913    0.000 :0(isalnum)
   232140    0.977    0.000    0.977    0.000 :0(join)
   345651    1.309    0.000    1.309    0.000 :0(len)
   232140    0.813    0.000    0.813    0.000 :0(lower)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.013    0.007    0.013    0.007 :0(readlines)
        1    0.006    0.006    0.006    0.006 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
        1    0.006    0.006  194.816  194.816 <string>:1(<module>)
        2   44.988   22.494   45.030   22.515 docdist2.py:105(count_frequency)
        2   12.253    6.126   12.253    6.127 docdist2.py:122(insertion_sort)
        2    0.000    0.000  194.469   97.235 docdist2.py:144(word_frequencies_for_file)
        3    0.183    0.061    0.326    0.109 docdist2.py:162(inner_product)
        1    0.000    0.000    0.327    0.327 docdist2.py:188(vector_angle)
        1    0.015    0.015  194.811  194.811 docdist2.py:198(main)
        2    0.000    0.000    0.014    0.007 docdist2.py:49(read_file)
        2  107.335   53.668  137.172   68.586 docdist2.py:65(get_words_from_line_list)
    22663   13.938    0.001   29.837    0.001 docdist2.py:77(get_words_from_string)
        1    0.000    0.000  194.823  194.823 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
   232140    1.606    0.000    2.419    0.000 string.py:218(lower)
   232140    1.627    0.000    2.604    0.000 string.py:306(join)

tottime is the most interesting column; it gives the total CPU time spent in each routine, exclusive of time spent in calls to other subroutines.
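By default the profiler orders its report by function name (that is what "Ordered by: standard name" means above). If you would rather see the most expensive functions listed first, one option (a sketch only; the output filename below is made up) is to save the raw statistics to a file and re-sort them with the pstats module:

if __name__ == "__main__":
    import profile, pstats
    profile.run("main()", "docdist2.prof")    # save raw statistics to a file
    p = pstats.Stats("docdist2.prof")
    p.sort_stats("time").print_stats(10)      # "time" sorts by the tottime column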

Ha! I bet you thought the culprit was insertion_sort! That took only 12 seconds. The real culprit seems to be get_words_from_line_list, which took 107 seconds (exclusive of its calls to get_words_from_string, which took only 14 seconds). What is wrong with get_words_from_line_list? Here it is again:

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """

    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list

Looks pretty simple -- what is wrong???
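One way to investigate, before reading further, is to time get_words_from_line_list by itself on synthetic inputs of increasing size and watch how the running time grows as the number of lines doubles. Here is a rough sketch of such an experiment; the stand-in get_words_from_string and the repeated made-up line are just for illustration and are not part of docdist2.py:

import time

def get_words_from_string(line):
    # Stand-in for the real routine: just split on whitespace.
    return line.lower().split()

def get_words_from_line_list(L):
    # Same body as the routine shown above.
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list

for n in [2000, 4000, 8000, 16000]:
    lines = ["the quick brown fox jumps over the lazy dog"] * n
    start = time.time()
    get_words_from_line_list(lines)
    print("%d lines: %.3f seconds" % (n, time.time() - start))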