English Word Frequencies

Each txt file contains lines in the format word,frequency. All words are lowercase. The frequency is an integer that represents how common the word is. To see exactly how the original data was converted to this format, look at the script process.py. The uncompressed directory contains the raw text files for each letter and for all of the letters combined. The compressed directory contains the same files compressed with gzip.
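As a rough illustration of reading these files, here is a minimal Python sketch. It assumes only the word,frequency line format described above. The helper names parse_line and load_frequencies are hypothetical and do not come from this repository, and any file path you pass in is up to you, since the exact file names are not listed here.

```python
import gzip
from pathlib import Path


def parse_line(line):
    """Split one 'word,frequency' line into a (word, int) pair."""
    # Split on the last comma so the frequency is always the final field.
    word, freq = line.strip().rsplit(",", 1)
    return word, int(freq)


def load_frequencies(path):
    """Read a frequency file into a dict mapping word -> frequency.

    Files ending in .gz are decompressed on the fly; anything else is
    read as plain text.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    freqs = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                word, freq = parse_line(line)
                freqs[word] = freq
    return freqs
```

Calling load_frequencies on a file from either the uncompressed or the compressed directory should give the same dictionary, since the compressed files are described as gzip copies of the raw ones.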

This dataset was created using information from the Google Ngram Viewer. Each of the txt files was created by running process.py on the 1-gram file for each letter. All 1-grams containing non-alphabetic characters have been removed, so the words listed here only include the letters a-z.
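The filtering rule described above can be sketched in a few lines of Python. This is not the code from process.py. It is a guess at the kind of check involved, and the function name normalize is hypothetical. It also assumes that a word is lowercased after the alphabetic check, which this README does not confirm.

```python
import re

# Matches tokens made up only of the ASCII letters a-z in either case.
_LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def normalize(token):
    """Return the lowercase form of token, or None if it should be dropped.

    Assumption: anything containing a digit, punctuation mark, space, or
    non-ASCII character counts as non-alphabetic and is removed.
    """
    if _LETTERS_ONLY.fullmatch(token):
        return token.lower()
    return None
```

Under this sketch, a token such as "Hello" would be kept as "hello", while tokens such as "don't" or "abc123" would be dropped.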

License

The source data for this dataset is licensed under the Creative Commons Attribution 3.0 Unported License. Apart from that, I really don't care what you use this for.

PAndaContron/EnglishWordFrequencies

Folders and files

Latest commit

History

Repository files navigation

English Word Frequencies

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages