Document Distance: Program Version 1



This is the initial version of our program for computing the distance between two documents (docdist1.py).

This program seems to give correct results. Here is some output:

>docdist1.py t1.verne.txt t2.bobsey.txt 
File t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)

>docdist1.py t2.bobsey.txt t2.bobsey.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.000000 (radians)

>docdist1.py t2.bobsey.txt t3.lewis.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
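The distances above are reported in radians, which is consistent with measuring the angle between the two documents' word-frequency vectors. The following is a minimal sketch of that computation, not the code of docdist1.py itself. It assumes that a word is a run of alphanumeric characters, that case is ignored, and that both documents are non-empty. The names word_frequencies and vector_angle are chosen here for illustration only.

```python
import math
from collections import Counter

def word_frequencies(text):
    """Count lowercase alphanumeric words; every other character separates words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())

def vector_angle(freq1, freq2):
    """Angle in radians between two word-frequency vectors (assumes both are non-empty)."""
    # Inner product over the words of the first document; a Counter
    # returns 0 for words that are missing from the second document.
    dot = sum(count * freq2[word] for word, count in freq1.items())
    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    # Clamp so that rounding error cannot push the cosine just past 1.0.
    cosine = min(1.0, dot / (norm1 * norm2))
    return math.acos(cosine)
```

Under these assumptions, a document compared with itself gives an angle of 0, matching the second run above, and two documents with no words in common give pi/2, the largest possible distance.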

However, this program becomes very slow as the inputs get large.

The last example above took approximately three minutes to run!

There seems to be no hope of comparing all of Shakespeare's works to all of Churchill's in a reasonable amount of time...

What is wrong with the efficiency of this program?

Can you figure it out?
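One way to start looking for the answer is to measure where the time goes rather than guess. The sketch below uses Python's standard cProfile and pstats modules to run a function and report which calls take the most cumulative time. The helper profile_call and the function placeholder_workload are hypothetical names used for illustration; placeholder_workload merely stands in for whatever routine you want to examine and is not part of docdist1.py.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()

def placeholder_workload(n):
    """Hypothetical stand-in for the routine being examined."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    result, report = profile_call(placeholder_workload, 100000)
    print(report)
```

The same kind of report is available for a whole script from the command line, for example: python -m cProfile -s cumulative docdist1.py t2.bobsey.txt t3.lewis.txt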