Document Distance: Program Version 2
In order to figure out why our initial program is so slow, we now "instrument" the program so that Python will tell us where the running time is going. This is very simple in Python: we just use the profile module to print out the desired statistics. (See the Python library documentation for more details on profile.)
(You can find out about profilers for other languages from Wikipedia's article.)
More precisely, the end of the program is changed from:
if __name__ == "__main__": main()
to
if __name__ == "__main__": import profile profile.run("main()")
(We also rename the program docdist2.py.)
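As an aside, modern Python installations also ship cProfile, a faster C implementation of the same profiler interface, so the act of profiling distorts the measured times less. A minimal sketch, assuming the entry point is still called main:

if __name__ == "__main__":
    import cProfile
    # Same interface as profile.run, but with much lower overhead.
    cProfile.run("main()")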
If we now try the command
docdist2 t2.bobsey.txt t3.lewis.txt
we get the output:
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
         3816334 function calls in 194.823 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
  1241849    4.839    0.000    4.839    0.000 :0(append)
  1277585    4.913    0.000    4.913    0.000 :0(isalnum)
   232140    0.977    0.000    0.977    0.000 :0(join)
   345651    1.309    0.000    1.309    0.000 :0(len)
   232140    0.813    0.000    0.813    0.000 :0(lower)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.013    0.007    0.013    0.007 :0(readlines)
        1    0.006    0.006    0.006    0.006 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
        1    0.006    0.006  194.816  194.816 <string>:1(<module>)
        2   44.988   22.494   45.030   22.515 docdist2.py:105(count_frequency)
        2   12.253    6.126   12.253    6.127 docdist2.py:122(insertion_sort)
        2    0.000    0.000  194.469   97.235 docdist2.py:144(word_frequencies_for_file)
        3    0.183    0.061    0.326    0.109 docdist2.py:162(inner_product)
        1    0.000    0.000    0.327    0.327 docdist2.py:188(vector_angle)
        1    0.015    0.015  194.811  194.811 docdist2.py:198(main)
        2    0.000    0.000    0.014    0.007 docdist2.py:49(read_file)
        2  107.335   53.668  137.172   68.586 docdist2.py:65(get_words_from_line_list)
    22663   13.938    0.001   29.837    0.001 docdist2.py:77(get_words_from_string)
        1    0.000    0.000  194.823  194.823 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
   232140    1.606    0.000    2.419    0.000 string.py:218(lower)
   232140    1.627    0.000    2.604    0.000 string.py:306(join)
tottime is the most interesting column; it gives the total CPU time spent in each routine, exclusive of time spent in calls to subroutines. (The cumtime column, by contrast, includes the time spent in subroutines.)
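If you would rather have the report ordered by this column than by function name, profile.run accepts an optional sort argument that it passes through to pstats; a sketch ("time" is the pstats sort key corresponding to the tottime column):

if __name__ == "__main__":
    import profile
    # Print the report sorted by tottime, largest first.
    profile.run("main()", sort="time")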
Ha! I bet you thought the culprit was insertion_sort! That took only 12 seconds. The real culprit seems to be get_words_from_line_list, which took 107 seconds (exclusive of its calls to get_words_from_string, which took only 14 seconds). What is wrong with get_words_from_line_list? Here it is again:
def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list
Looks pretty simple -- what is wrong???