Document Distance: Program Using Dictionaries


Here is one more version of the document distance code, which uses hashing more thoroughly. Assuming each dictionary operation takes constant expected time, it achieves a running time of Θ(n) in the size n of the input. This linear time bound is optimal because any solution must at least look at the input.

Changes

This version makes three changes compared to Document Distance: Program Version 6.

First, count_frequency no longer converts the dictionary it computes into a list of items. This will be useful for computing the inner products later. The only changed line is the final return, which is now return D instead of return D.items().
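As a rough illustration of this change, here is a minimal sketch of what count_frequency might look like once it returns the dictionary itself. Only the final return line is described in the text; the loop body and the parameter name word_list are assumptions made for the example.

```python
def count_frequency(word_list):
    """Return a dictionary mapping each word to its number of occurrences."""
    D = {}
    for word in word_list:
        # D.get(word, 0) yields 0 the first time a word is seen
        # (assumed loop body; the text only describes the return line)
        D[word] = D.get(word, 0) + 1
    return D  # previously: return D.items()
```

Returning the dictionary keeps constant expected-time lookups by word available to the caller, which is what the new inner_product relies on.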

Second, word_frequencies_for_file no longer calls merge_sort on the frequency mapping. Because the frequencies are stored in a dictionary, we no longer need to sort them.
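The reason sorting becomes unnecessary can be seen in a small sketch: a dictionary is looked up by key, so the order in which words were inserted does not affect the result. The variable names below are illustrative only.

```python
# Two frequency mappings with the same contents, built in different orders.
unsorted_freq = {"the": 5, "and": 3, "of": 2}
alphabetical_freq = {"and": 3, "of": 2, "the": 5}

# Dictionary equality ignores insertion order.
same_contents = (unsorted_freq == alphabetical_freq)

# Lookups go straight to the key; no ordering is needed to find a word.
of_count = unsorted_freq["of"]
```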

The third and main change is the new version of inner_product, which works directly with dictionaries instead of merging sorted lists:

def inner_product(D1,D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word,freq) pairs.

    Example: inner_product({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0
    """
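    total = 0.0  # float accumulator, so the result is a float (e.g. 14.0)
    for key in D1:
        # only words that occur in both documents contribute
        if key in D2:
            total += D1[key] * D2[key]
    return total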

The code is actually simpler now, because we no longer need the code for sorting.
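To show how the dictionary-based inner product can feed into a distance computation, here is a hedged sketch that computes the angle between two frequency dictionaries. The function name vector_angle and the clamping step are assumptions for this example, not necessarily what the program itself does; inner_product is repeated so that the block runs on its own.

```python
import math

def inner_product(D1, D2):
    """Inner product of two word-frequency dictionaries."""
    total = 0.0
    for key in D1:
        if key in D2:
            total += D1[key] * D2[key]
    return total

def vector_angle(D1, D2):
    """Angle in radians between two non-empty frequency dictionaries.

    Illustrative sketch; the name and the clamping are assumptions.
    """
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, numerator / denominator))
    return math.acos(cosine)
```

Each call to inner_product makes one pass over its first argument, so the whole angle computation takes expected time linear in the sizes of the two dictionaries.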

Performance

This version runs about three times faster on our canonical example, the file pair t2.bobsey.txt and t3.lewis.txt.