Document Distance: Program Using Dictionaries
Here is one more version of the document distance code which uses hashing more thoroughly (PY). It achieves a running time of Θ(n). This linear time bound is optimal, because any correct solution must at least read all of the input.
This version makes three changes compared to Document Distance: Program Version 6.
First, count_frequency no longer converts the dictionary it computes into a list of items. This will be useful for computing the inner products later. The only changed line is the final return, which is now return D instead of return D.items().
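A minimal sketch of what count_frequency looks like after this change, assuming word_list is a list of word strings (the function name and behavior come from the description above; the exact body is an illustration, not the author's verbatim code):

```python
def count_frequency(word_list):
    """Return a dictionary mapping each word to the number of
    times it occurs in word_list.

    Sketch of the revised version: the dictionary D is returned
    directly, rather than being converted with D.items().
    """
    D = {}
    for word in word_list:
        if word in D:
            D[word] += 1
        else:
            D[word] = 1
    return D
```

Returning the dictionary itself is what makes the later inner-product computation possible without any sorting or merging.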
Second, word_frequencies_for_file no longer calls merge_sort on the frequency mapping. Because the frequencies are stored in a dictionary, we no longer need to sort them.
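A sketch of the simplified word_frequencies_for_file, with the merge_sort call removed. The helper word_frequencies_for_text and the tokenization rule (maximal lowercase alphanumeric runs) are assumptions made here for a self-contained example:

```python
import re

def word_frequencies_for_text(text):
    # Hypothetical helper: count word frequencies in a string,
    # treating words as maximal alphanumeric runs, lowercased.
    D = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        D[word] = D.get(word, 0) + 1
    return D

def word_frequencies_for_file(filename):
    # Read the whole file and count its words. No sorting step is
    # needed, because the result is a dictionary keyed by word.
    with open(filename) as f:
        return word_frequencies_for_text(f.read())
```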
The third and main change is the new version of inner_product, which works directly with dictionaries instead of merging sorted lists.
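A sketch of the dictionary-based inner_product, consistent with the description above (the vector_angle helper is included for context and is an assumption about how inner_product is used in the distance computation):

```python
import math

def inner_product(D1, D2):
    """Inner product of two frequency dictionaries, treating each
    as a sparse vector indexed by word.

    Iterating over D1 and probing D2 by key takes expected time
    linear in len(D1); words missing from D2 contribute zero.
    """
    total = 0.0
    for word, count in D1.items():
        if word in D2:
            total += count * D2[word]
    return total

def vector_angle(D1, D2):
    # Angle in radians between the two frequency vectors:
    # arccos of the normalized inner product.
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)
```

Note that no merging of sorted lists is needed: a dictionary lookup replaces the merge step entirely.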
The code is actually simpler now, because the sorting routines are no longer needed.
This version runs about three times faster on our canonical example, comparing t2.bobsey.txt with t3.lewis.txt.