# Document Distance: Program Using Dictionaries


Here is one more version of the document distance code, which uses hashing more thoroughly. It achieves a running time of Θ(*n*), where *n* is the total size of the input. This linear time bound is optimal, since any correct solution must at least read the input.

## Changes

This version makes three changes compared to Document Distance: Program Version 6.

First, `count_frequency` no longer converts the dictionary it computes into a list of items. This will be useful for computing the inner products later. The only changed line is the final return, which is now `return D` instead of `return D.items()`.
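A minimal sketch of `count_frequency` after this change (the exact helper code in the course files may differ in detail):

```python
def count_frequency(word_list):
    """
    Return a dictionary mapping each word in word_list to the
    number of times it occurs.
    """
    D = {}
    for word in word_list:
        if word in D:
            D[word] += 1
        else:
            D[word] = 1
    return D  # previously: return D.items()
```

Each dictionary lookup and update takes expected constant time, so counting all the words takes expected linear time.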

Second, `word_frequencies_for_file` no longer calls `merge_sort` on the frequency mapping. Because the frequencies are stored in a dictionary, we no longer need to sort them.

The third and main change is the new version of `inner_product`, which works directly with dictionaries instead of merging sorted lists:

```python
def inner_product(D1, D2):
    """
    Inner product between two vectors, where vectors are represented
    as dictionaries of (word, freq) pairs.

    Example:
        inner_product({"and": 3, "of": 2, "the": 5},
                      {"and": 4, "in": 1, "of": 1, "this": 2}) = 14.0
    """
    sum = 0.0
    for key in D1:
        if key in D2:
            sum += D1[key] * D2[key]
    return sum
```
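For context, the document distance itself is the angle between the two frequency vectors, built on top of `inner_product`. A sketch of that final step (the earlier versions call this `vector_angle`; details of the course code may differ), with the inner product restated inline so the sketch is self-contained:

```python
import math

def vector_angle(D1, D2):
    """
    Angle (in radians) between the frequency vectors D1 and D2:
    acos(D1 . D2 / (|D1| * |D2|)).
    """
    # Inline restatement of the dictionary-based inner product above.
    def inner_product(A, B):
        return sum(A[k] * B[k] for k in A if k in B)

    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)
```

Identical documents give an angle of 0, while documents with no words in common give an angle of π/2.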

The code is actually simpler now, because we no longer need the code for sorting.

## Performance

This version runs about three times as fast as version 6 on our canonical example, comparing `t2.bobsey.txt` with `t3.lewis.txt`.