Document Distance: Program Using Dictionaries

Problem Definition | Data Sets | Programs: v1 - v2 - v3 - v4 - v5 - v6 | Program Using Dictionaries

Here is one more version of the document distance code, one which uses hashing more thoroughly. It achieves an expected running time of Θ(n). This linear time bound is optimal, because any solution must at least read the entire input.


This version makes three changes compared to Document Distance: Program Version 6.

First, count_frequency no longer converts the dictionary it computes into a list of items. This will be useful for computing the inner products later. The only changed line is the final return, which is now return D instead of return D.items().
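The changed routine might look like the following sketch. (The word-counting loop here follows the pattern of the earlier program versions, but the exact body is an assumption; only the final return statement is quoted from the source.)

```python
def count_frequency(word_list):
    """Return a dictionary mapping each word in word_list to its frequency."""
    D = {}
    for word in word_list:
        if word in D:
            D[word] += 1
        else:
            D[word] = 1
    return D  # previously: return D.items()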

Second, word_frequencies_for_file no longer calls merge_sort on the frequency mapping. Because the frequencies are stored in a dictionary, we no longer need to sort them.

The third and main change is the new version of inner_product, which works directly with dictionaries instead of merging sorted lists:

def inner_product(D1, D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word, freq) pairs.

    Example: inner_product({"and":3, "of":2, "the":5},
                           {"and":4, "in":1, "of":1, "this":2}) = 14.0
    """
    sum = 0.0
    for key in D1:
        if key in D2:
            sum += D1[key] * D2[key]
    return sum

The code is actually simpler now, because we no longer need the code for sorting.
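As in the earlier versions, the inner product feeds into the angle between the two frequency vectors, which is the document distance. The sketch below shows that final step, with inner_product repeated so the example is self-contained; the name vector_angle follows the earlier program versions, but its exact body here is an assumption.

```python
import math

def inner_product(D1, D2):
    """Inner product of two word-frequency dictionaries."""
    sum = 0.0
    for key in D1:
        if key in D2:
            sum += D1[key] * D2[key]
    return sum

def vector_angle(D1, D2):
    """Document distance: the angle between the two frequency vectors."""
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)
```

Because inner_product iterates over one dictionary and does constant-time lookups in the other, each of the three inner products runs in linear time, which preserves the overall Θ(n) bound.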


This version runs about three times faster on our canonical example, comparing t2.bobsey.txt and t3.lewis.txt.