PDF Word Extraction

This tool extracts meaningful words from a collection of PDF documents, processes them, and counts their frequencies. The frequency data can feed text analysis and visualization tasks, such as generating word clouds or identifying common themes across the collection.

The tool builds on a modern Python toolchain for text data:

pypdf: for reading PDFs.
ftfy: for text cleaning.
spaCy: for natural language processing such as tokenization, lemmatization, and stop-word removal.
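
To make the division of labor among the three libraries concrete, here is a minimal sketch of how they could be chained. It is not the code in pdf_word_extraction.py: the function names extract_text, lemmatize, and count_words are hypothetical, and the third-party imports are deferred so the helpers can be defined even before the packages are installed.

```python
from collections import Counter
from pathlib import Path


def extract_text(pdf_path):
    """Read every page of one PDF with pypdf, then repair the text with ftfy."""
    # Third-party imports, deferred until a PDF is actually read.
    import ftfy
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to "" in case a page yields no extractable text.
    raw = "\n".join(page.extract_text() or "" for page in reader.pages)
    return ftfy.fix_text(raw)


def lemmatize(text, nlp):
    """Return lowercase lemmas of alphabetic tokens that are not stop words.

    `nlp` is assumed to be a loaded spaCy pipeline (a callable that yields
    tokens with `lemma_`, `is_alpha`, and `is_stop` attributes).
    """
    return [tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha and not tok.is_stop]


def count_words(pdf_dir, nlp):
    """Count lemma frequencies across all *.pdf files in a directory."""
    counts = Counter()
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        counts.update(lemmatize(extract_text(pdf_path), nlp))
    return counts
```

Assuming spaCy and an English model are installed (en_core_web_sm is used here only as an example), `count_words("pdf", spacy.load("en_core_web_sm")).most_common(20)` would list the twenty most frequent lemmas.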

The tool also provides customizable features such as the ability to specify words for removal or replacement.
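
As an illustration of what removal and replacement could look like, the sketch below applies two rule lists to a list of already extracted words before counting them. The names WORDS_TO_REMOVE, WORDS_TO_REPLACE, and apply_rules, along with the sample words, are hypothetical; the actual lists are defined in pdf_word_extraction.py and may be structured differently.

```python
from collections import Counter

# Hypothetical rule lists, for illustration only.
WORDS_TO_REMOVE = {"figure", "table"}
WORDS_TO_REPLACE = {"datum": "data", "modelling": "modeling"}


def apply_rules(words, remove=WORDS_TO_REMOVE, replace=WORDS_TO_REPLACE):
    """Apply replacements first, then drop any word in the removal set."""
    replaced = (replace.get(word, word) for word in words)
    return [word for word in replaced if word not in remove]


words = ["datum", "figure", "model", "modelling", "model", "table"]
counts = Counter(apply_rules(words))
print(counts.most_common())
# → [('model', 2), ('data', 1), ('modeling', 1)]
```

Replacing before removing means a word can be mapped onto a removed word and then dropped; the opposite order would be an equally reasonable design choice.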

Workflow

Clone the repository:

git clone https://github.com/nanxstats/pdf-word-extraction.git

Create a virtual environment inside the cloned repository, activate it, and install the required Python packages into it:

cd pdf-word-extraction
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Put the PDF files under pdf/, then run:

python3 pdf_word_extraction.py

If you use VS Code, open the project and select the recommended "venv" Python interpreter. To customize the results, edit the lists of words to remove and replace in pdf_word_extraction.py, save the file, and run it again in the terminal.
