GitHub - abhishek7997/cite-extract-python: CiteExtract: PDF Citation Sentence Extractor

CiteExtract

A Python tool that extracts and parses text from PDF files, specifically research papers. It identifies citations together with the paragraphs they appear in. It also collects adjectives from the text and categorizes them as positive or negative using a predefined word list.
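As a rough illustration of the citation step, the sketch below finds the sentences in a paragraph that contain a citation. It is not the repository's implementation: the function name `find_citation_sentences`, the two regular expressions, and the citation styles they cover (numeric brackets and author-year parentheses) are all assumptions made for the example.

```python
import re

# Assumed citation styles (not confirmed by the project):
#   numeric brackets such as [3], [1, 2] or [4-6]
#   author-year parentheses such as (Smith et al., 2020)
CITATION_RE = re.compile(
    r"\[\d+(?:\s*[-,]\s*\d+)*\]"
    r"|\([A-Z][^()\d]*,?\s*\d{4}[a-z]?\)"
)

# Naive sentence splitter: break after '.', '!' or '?' when the next
# non-space character is an uppercase letter.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def find_citation_sentences(paragraph):
    """Return (sentence, citations) pairs for sentences that cite something."""
    results = []
    for sentence in SENTENCE_SPLIT_RE.split(paragraph.strip()):
        citations = CITATION_RE.findall(sentence)
        if citations:
            results.append((sentence, citations))
    return results


if __name__ == "__main__":
    sample = (
        "Deep models work well [3]. This is our own claim. "
        "Prior work (Smith et al., 2020) disagrees."
    )
    for sentence, citations in find_citation_sentences(sample):
        print(citations, "->", sentence)
```

A regex-based splitter is fragile around abbreviations; a tokenizer from nltk would likely be more robust, at the cost of downloading its data files.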

Features

Extracts sentences from PDF files
Finds citations in the text
Collects adjectives from the text and categorizes each one as positive or negative based on a pre-built word list
Uses the pdfminer.six library
Removes special characters, formatting tags, unnecessary whitespace, and other artifacts introduced during PDF parsing
Exports the data as a CSV file (see image) and a TXT file
Produces cleaned output that is easier to process and analyze
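The cleanup and adjective-categorization features above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: `clean_text`, `load_wordlist`, and `categorize_adjectives` are hypothetical names, the cleanup rules are guesses at typical PDF artifacts, and the word-list format (one word per line, lines starting with `;` treated as comments) is assumed.

```python
import re


def clean_text(raw):
    """Normalize text pulled out of a PDF (assumed cleanup rules)."""
    text = raw.replace("\x0c", " ")                      # form feeds between pages
    text = re.sub(r"-\n(?=[a-z])", "", text)             # rejoin words hyphenated across lines
    text = re.sub(r"<[^>]+>", " ", text)                 # leftover formatting tags
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # control characters
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace runs
    return text.strip()


def load_wordlist(path):
    """Read a word list, assuming one word per line and ';' comment lines."""
    words = set()
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith(";"):
                words.add(word)
    return words


def categorize_adjectives(adjectives, positive_words, negative_words):
    """Sort adjectives into positive and negative; others are dropped."""
    result = {"positive": [], "negative": []}
    for adjective in adjectives:
        word = adjective.lower()
        if word in positive_words:
            result["positive"].append(word)
        elif word in negative_words:
            result["negative"].append(word)
    return result
```

With the repository's word lists, a call might look like `categorize_adjectives(adjectives, load_wordlist("positive-words.txt"), load_wordlist("negative-words.txt"))`, where `adjectives` is a list produced by an earlier tagging step.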

Usage

Store the PDF files inside the "papers" folder, located in the same directory as the program
Add a key-value pair for each file to the PAPERS dictionary in the constants.py file, in the following format:

"textN": "./papers/research-paper-name.pdf"

Run the citations_text_extractor.py file
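For illustration, the steps above might translate into a PAPERS dictionary like the one below. The two entries are placeholder file names, and `missing_papers` is a hypothetical helper (not part of the repository) that checks the configured paths before the extractor is run.

```python
from pathlib import Path

# Hypothetical contents of the PAPERS dictionary; keys follow the "textN"
# pattern and the file names are placeholders.
PAPERS = {
    "text1": "./papers/first-paper.pdf",
    "text2": "./papers/second-paper.pdf",
}


def missing_papers(papers, base_dir="."):
    """Return the keys whose PDF path does not exist under base_dir."""
    return [
        key
        for key, relative_path in papers.items()
        if not (Path(base_dir) / relative_path).is_file()
    ]


if __name__ == "__main__":
    missing = missing_papers(PAPERS)
    if missing:
        print("Missing PDF files for:", ", ".join(missing))
    else:
        print("All configured papers were found.")
```

Running a check like this first gives a clearer error than a failed PDF parse when a path in the dictionary is mistyped.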

Libraries

nltk
pdfminer.six
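The sketch below shows one plausible way the two libraries listed above could be combined: pdfminer.six to pull raw text out of a PDF, and nltk to tag parts of speech so adjectives can be picked out. The function names are hypothetical and the project may call these libraries differently. The nltk calls need tokenizer and tagger data downloaded beforehand with `nltk.download`; the exact resource names depend on the nltk version.

```python
# Penn Treebank tags for adjectives: base, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def pdf_to_text(pdf_path):
    """Extract the raw text of a PDF using pdfminer.six."""
    from pdfminer.high_level import extract_text  # imported lazily

    return extract_text(pdf_path)


def tag_tokens(text):
    """Tokenize text and attach part-of-speech tags using nltk."""
    import nltk  # imported lazily; requires downloaded tokenizer and tagger data

    return nltk.pos_tag(nltk.word_tokenize(text))


def adjectives_from_tags(tagged_tokens):
    """Keep the lowercased words whose tag marks them as adjectives."""
    return [word.lower() for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]
```

A typical call chain would be `adjectives_from_tags(tag_tokens(pdf_to_text("./papers/research-paper-name.pdf")))`. The imports are deferred into the functions so the tag-filtering helper can be used and tested without either library installed.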

Folders and files

.gitignore
README.md
citation_utils.py
citations-data.csv
citations_text_extractor.py
constants.py
file_utils.py
negative-words.txt
positive-words.txt
results_abhishek.txt
text_utils.py
