English Word Frequencies

Each txt file contains lines in the format word,frequency. All words are in lowercase. The frequency is an integer that represents how common the word is. For more details, see the script process.py, which shows exactly how the original data was converted to this format. The uncompressed directory contains the raw text files for each letter and for all of the letters combined. The compressed directory contains the same files compressed using gzip.
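As a rough illustration of reading the word,frequency format described above, here is a minimal Python sketch. The function names parse_frequencies and load_frequencies are hypothetical and not part of this repository, and the sketch assumes the files are UTF-8 text and that gzip-compressed files end in .gz.

```python
import gzip
from pathlib import Path


def parse_frequencies(lines):
    """Parse an iterable of 'word,frequency' lines into a dict of word -> int."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma: the word comes first, the integer count last.
        word, _, count = line.rpartition(",")
        freqs[word] = int(count)
    return freqs


def load_frequencies(path):
    """Load a plain text file or a gzip file (assumed to end in .gz)."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return parse_frequencies(handle)
```

Calling load_frequencies on a file from either directory should return the same dictionary, since the compressed files are described as gzip copies of the uncompressed ones.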

This dataset was created using information from the Google Ngram Viewer. Each of the txt files was created by running process.py on the 1-gram file for each letter. All of the 1-grams with non-alphabetic characters have been removed, so the words listed here contain only the letters a-z.
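The filtering rule described above can be sketched in Python as follows. This is not the code in process.py, only an illustration of the stated rule. The helper name keep_alphabetic is hypothetical, and lowercasing each word before the check is an assumption about the order of operations.

```python
import re

# Matches strings made up entirely of the letters a-z.
ALPHABETIC = re.compile(r"[a-z]+")


def keep_alphabetic(words):
    """Lowercase each word (assumed step) and keep only those made of a-z."""
    kept = []
    for word in words:
        lowered = word.lower()
        if ALPHABETIC.fullmatch(lowered):
            kept.append(lowered)
    return kept
```

Under this rule, entries such as don't or x2 would be dropped, while hello would be kept.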

License

The source data for this dataset is licensed under the Creative Commons Attribution 3.0 Unported License. Apart from that, I really don't care what you use this for.