Document Distance: Program Version 6

To improve the running time of our document distance program, we replace insertion_sort with merge_sort, which runs in time Θ(n lg n) rather than the Θ(n²) time of insertion_sort.
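
Why is merge_sort Θ(n lg n)? It splits the list in half, sorts each half recursively, and merges the two sorted halves in linear time, so its running time satisfies the recurrence

    T(n) = 2 T(n/2) + Θ(n)

which solves to T(n) = Θ(n lg n).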

Here is code for merge_sort:

def merge_sort(A):
    """
    Sort list A into order, and return result.
    """
    n = len(A)
    if n<=1:               # an empty or one-element list is already sorted
        return A
    mid = n//2     # floor division
    L = merge_sort(A[:mid])
    R = merge_sort(A[mid:])
    return merge(L,R)

def merge(L,R):
    """
    Given two sorted sequences L and R, return their merge.
    """
    i = 0
    j = 0
    answer = []
    while i<len(L) and j<len(R):
        if L[i]<R[j]:
            answer.append(L[i])
            i += 1
        else:
            answer.append(R[j])
            j += 1
    # at most one of L and R can have elements remaining; append them
    if i<len(L):
        answer.extend(L[i:])
    if j<len(R):
        answer.extend(R[j:])
    return answer
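
As in the earlier versions, the sort is applied to each document's list of (word, frequency) pairs so that inner_product can later walk two sorted lists together. A small interactive check of merge_sort (hypothetical example data, not part of docdist6.py) shows the effect -- the pairs come back ordered by word, since tuples compare lexicographically:

pairs = [("the", 4), ("a", 2), ("document", 1), ("distance", 1)]
print(merge_sort(pairs))
# -> [('a', 2), ('distance', 1), ('document', 1), ('the', 4)]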

The revised document distance program is now docdist6.py (PY).

Running this on our standard inputs, we obtain:

>docdist6.py t2.bobsey.txt t3.lewis.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
         885435 function calls (861671 primitive calls) in 6.630 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   135633    0.486    0.000    0.486    0.000 :0(append)
    34545    0.132    0.000    0.132    0.000 :0(extend)
   232140    0.760    0.000    0.760    0.000 :0(has_key)
        2    0.018    0.009    0.018    0.009 :0(items)
   379456    1.327    0.000    1.327    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.014    0.007    0.014    0.007 :0(readlines)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
    22663    0.146    0.000    0.146    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
    22663    0.079    0.000    0.079    0.000 :0(translate)
        1    0.003    0.003    6.626    6.626 <string>:1(<module>)
        2    0.889    0.445    1.667    0.833 docdist6.py:107(count_frequency)
  23766/2    0.335    0.000    3.900    1.950 docdist6.py:122(merge_sort)
    11882    1.849    0.000    3.481    0.000 docdist6.py:134(merge)
        2    0.001    0.001    6.297    3.148 docdist6.py:176(word_frequencies_for_file)
        3    0.177    0.059    0.316    0.105 docdist6.py:194(inner_product)
        1    0.000    0.000    0.316    0.316 docdist6.py:220(vector_angle)
        1    0.011    0.011    6.624    6.624 docdist6.py:230(main)
        2    0.000    0.000    0.014    0.007 docdist6.py:57(read_file)
        2    0.176    0.088    0.714    0.357 docdist6.py:73(get_words_from_line_list)
    22663    0.223    0.000    0.448    0.000 docdist6.py:91(get_words_from_string)
        1    0.000    0.000    6.630    6.630 profile:0(main())
        0    0.000             0.000          profile:0(profiler)

Excellent! We have reduced the overall running time from over three minutes to under seven seconds -- well over an order of magnitude improvement in running time -- and the running time should now scale nearly linearly with the size of the input(s) (counting Θ(n lg n) as "nearly linear").
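
One rough way to check the scaling of the new sort itself -- a hypothetical sketch, not part of docdist6.py, and it assumes merge_sort from above is available in the same file or session -- is to time it on randomly ordered lists of doubling size; each timing should be slightly more than double the one before it:

import random
import timeit

# Time merge_sort (defined above) on random lists of doubling size.
# Theta(n lg n) growth means each timing should be a bit more than
# twice the previous one.
for n in [25000, 50000, 100000, 200000]:
    data = [random.random() for _ in range(n)]
    seconds = timeit.timeit(lambda: merge_sort(data), number=3) / 3
    print(n, seconds)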

We can now attempt to run this on our large inputs, such as comparing the complete works of Shakespeare with the complete works of Winston Churchill:

>docdist6.py t5.churchill.txt t8.shakespeare.txt 
File t5.churchill.txt : 189685 lines, 1717247 words, 32544 distinct words
File t8.shakespeare.txt : 124456 lines, 929462 words, 23881 distinct words
The distance between the documents is: 0.462095 (radians)
         6926117 function calls (6813271 primitive calls) in 52.886 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   763122    2.828    0.000    2.828    0.000 :0(append)
   370564    1.565    0.000    1.565    0.000 :0(extend)
  2646709    8.770    0.000    8.770    0.000 :0(has_key)
        2    0.280    0.140    0.280    0.140 :0(items)
  2034004    7.302    0.000    7.302    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.178    0.089    0.178    0.089 :0(readlines)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
   314141    1.862    0.000    1.862    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
   314141    1.108    0.000    1.108    0.000 :0(translate)
        1    0.025    0.025   52.882   52.882 <string>:1(<module>)
        2   10.369    5.184   19.418    9.709 docdist6.py:107(count_frequency)
 112848/2    1.643    0.000   21.717   10.858 docdist6.py:122(merge_sort)
    56423   10.385    0.000   19.662    0.000 docdist6.py:134(merge)
        2    0.011    0.006   51.227   25.614 docdist6.py:176(word_frequencies_for_file)
        3    0.836    0.279    1.491    0.497 docdist6.py:194(inner_product)
        1    0.000    0.000    1.492    1.492 docdist6.py:220(vector_angle)
        1    0.138    0.138   52.856   52.856 docdist6.py:230(main)
        2    0.000    0.000    0.179    0.089 docdist6.py:57(read_file)
        2    2.451    1.225    9.902    4.951 docdist6.py:73(get_words_from_line_list)
   314141    3.131    0.000    6.101    0.000 docdist6.py:91(get_words_from_string)
        1    0.000    0.000   52.886   52.886 profile:0(main())
        0    0.000             0.000          profile:0(profiler)

Very nice! These are large files (many megabytes), and yet we are able to compute the distance between them fairly efficiently (under a minute). If we were to continue searching for efficiency improvements, we would continue to look at sorting and at the data structures involved in count_frequency.

Exercise: Can you eliminate the need for sorting altogether in this program? Re-code the program to do so.
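
One possible direction, sketched here only as a hint (this is not the course's solution, and inner_product_dict is a hypothetical name): if each document's word frequencies are kept in a dictionary mapping word to count, rather than in a sorted list of pairs, the inner product can be computed by direct lookups, and no sorted order is ever needed.

def inner_product_dict(D1, D2):
    """
    Hypothetical sketch: inner product of two word-frequency dictionaries.
    D1 and D2 map word -> count, so no sorting is required; each word of
    D1 is simply looked up in D2.
    """
    total = 0.0
    for word in D1:
        if word in D2:
            total += D1[word] * D2[word]
    return total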