Document Distance: Data Sets

Here are nine sample text files, mostly from Project Gutenberg, for use as input to the document distance problem:

  • t1.verne.txt: Verne's In the Year 2889 (TXT)
  • t2.bobsey.txt: Hope's The Bobbsey Twins on Blueberry Island (TXT)
  • t3.lewis.txt: Lewis and Clark's History of the Expedition under the Command of Captains Lewis and Clark (Vol. I) (TXT - 1MB)
  • t4.arabian.txt: Anon's The Arabian Nights Entertainments Complete (TXT - 2.9MB)
  • t5.churchill.txt: Churchill's The Complete Works of Winston Churchill (TXT - 9.1MB)
  • t6.onemillion.txt: List of one million integers (from 000000 to 999999) (TXT - 7.6MB)
  • t7.tenmillion.txt: List of ten million integers (from 0000000 to 9999999) (TXT - 86MB)
  • t8.shakespeare.txt: The Complete Works of William Shakespeare (TXT - 5.3MB)
  • t9.bacon.txt: Essays by Francis Bacon (TXT)
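
The two integer lists (t6 and t7) are synthetic, so files like them can be regenerated locally rather than downloaded. The following is a minimal sketch, not the script used to produce the originals: the function names integer_lines and write_integer_list are hypothetical, and the one-integer-per-line layout with "\n" line endings is an assumption, since the exact layout of the original files is not stated here.

```python
import os
import tempfile


def integer_lines(count, width):
    """Yield the integers 0..count-1 as zero-padded strings of fixed width."""
    for n in range(count):
        yield f"{n:0{width}d}"


def write_integer_list(path, count, width):
    """Write the padded integers to `path`, one per line.

    The one-per-line layout is an assumption, not taken from the originals.
    """
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for line in integer_lines(count, width):
            f.write(line + "\n")


if __name__ == "__main__":
    # Small demonstration in a temporary directory: 000 through 999.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.txt")
        write_integer_list(demo, 1000, 3)
        print(os.path.getsize(demo), "bytes written")
```

Under these assumptions, calling write_integer_list with count=10**6 and width=6 would produce a list running from 000000 to 999999, and count=10**7 with width=7 would produce 0000000 to 9999999. The resulting file sizes may differ from those listed above if the originals use a different separator or line ending.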