GitHub - maruscia/autosum: Summarize Publications Automatically

AutoSum: Summarize Publications Automatically

AutoSum exploits the labor scholars have already expended in summarizing articles. It scrapes the text next to citations across all openly available research citing a publication and collates the output. The result is a useful summary, plus data in a format that makes it easy to spot potential miscitations.


Get the Data
Scrapes all openly accessible research citing a particular publication, using links provided by Google Scholar. Note: Google monitors scraping on Google Scholar.
Parse the Data
Iterates through a directory containing the articles that cite a particular research article and uses regular expressions to pick up sentences near each citation.
Example from Social Science

Get the Data

To search Google Scholar for openly accessible PDFs citing the original research article, use scholar.py.

Input: URL to Google Scholar Page of an article.
What the script does:
- Goes to 'Cited By..'
- Downloads a user-specified number of publicly available papers (PDFs only for now) that cite the paper to a user-specified directory.
- Creates a CSV that tracks basic characteristics of each downloaded paper: title, URL, author names, journal, etc. It also records the relative path to the downloaded file.
Sample output
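
The tracking CSV described above could be written along these lines. This is a minimal sketch, not the code in scholar.py: the column names, the `write_tracking_csv` helper, and the placeholder record are all assumptions made for illustration.

```python
import csv
import io
import os

# Hypothetical column names; the real CSV produced by scholar.py may differ.
FIELDS = ["title", "url", "authors", "journal", "pdf_path"]


def write_tracking_csv(records, handle, base_dir="."):
    """Write one row per downloaded paper, storing the PDF path relative to base_dir."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        row = dict(record)
        row["pdf_path"] = os.path.relpath(row["pdf_path"], start=base_dir)
        writer.writerow(row)


# Placeholder record, not real data.
records = [{
    "title": "Example citing paper",
    "url": "https://example.org/paper.pdf",
    "authors": "A. Author; B. Author",
    "journal": "Example Journal",
    "pdf_path": os.path.join("pdfs", "paper-001.pdf"),
}]

buffer = io.StringIO()
write_tracking_csv(records, buffer)
print(buffer.getvalue())
```

Keeping the path relative means the CSV and the PDF directory can be moved together without breaking the link between a row and its file.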

Usage

usage: scholar.py [-h] [-u USER] [-p PASSWORD] [-a AUTHOR] [-d DIR]
                  [-o OUTPUT] [-n N_CITES] [-v] [--version]
                  keyword [keyword ...]

positional arguments:
  keyword               Keyword to be searched

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  Google account e-mail
  -p PASSWORD, --password PASSWORD
                        Google account password
  -a AUTHOR, --author AUTHOR
                        Author to be filtered
  -d DIR, --dir DIR     Output directory for PDF files
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -n N_CITES, --n-cites N_CITES
                        Number of cites to be download
  -v, --verbose
  --version             show program's version number and exit

Example

python scholar.py -v -d pdfs -o output.csv -n 100 -a "A Einstein" \
"Can quantum-mechanical description of physical reality be considered complete?"

Parse the Data

To scrape the text next to the relevant citations within the PDFs, use autosumpdf.py:

The script iterates through the PDFs using the CSV generated above.
Using citation information or a custom regex, it extracts the matching text and writes it to the same CSV. If multiple regexes match, all matches are concatenated, separated by a line break.
Sample output
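
The concatenation step might look roughly like the sketch below. It assumes the PDF text has already been extracted to a string and that matches are joined with a newline. The `collect_matches` helper and the sample text are hypothetical and are not taken from the script.

```python
import re


def collect_matches(text, patterns):
    """Return every match of every pattern in text, one match per line."""
    snippets = []
    for pattern in patterns:
        compiled = re.compile(pattern)
        for match in compiled.finditer(text):
            # Prefer the first capture group when the pattern defines one.
            snippets.append(match.group(1) if compiled.groups else match.group(0))
    return "\n".join(snippets)


# Made-up text standing in for the contents of one citing article.
sample_text = (
    "Framing is discussed in Iyengar 1991. "
    "Others build on Iyengar and Kinder 1987 instead."
)
sample_patterns = [r"Iyengar \d{4}", r"Iyengar and Kinder \d{4}"]

combined = collect_matches(sample_text, sample_patterns)
print(combined)
```

The combined string would then be stored in the row for that article, so each row carries everything found in one PDF.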

usage: searchpdf.py [-h] [-i INPUT] [-o OUTPUT] [-v] [--version]
                    regex [regex ...]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        CSV input filename
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -t TXT_DIR, --text TXT_DIR
                        extract to specific directory
  -f, --force           force extract text file if exists
  -v, --verbose
  -a1 AUTHOR1, --author-1-lastname AUTHOR1
                        1st author of citation
  -a2 AUTHOR2, --author-2-lastname AUTHOR2
                        2nd author of citation
  -y YEAR, --year YEAR  Year of publication
  --version             show program's version number and exit
  -r REGEX, --regex REGEX
                        specify custom regex to filter citations.

Example

python searchpdf.py -v -i output.csv -o search-output.csv -r "\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])"

The custom regular expression (-r switch) matches a period and whitespace, then captures a sentence fragment (5 to 100 characters) followed by the author name "Einstein", any text (2 to 30 characters), and a number with a closing bracket or parenthesis at the end.
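
Written in verbose form, the same expression can be annotated piece by piece. The sketch below uses Python's `re.VERBOSE` flag, and the sample sentence is made up for illustration. It shows how the pattern behaves on plain text, not how the script applies it internally.

```python
import re

CITATION_RE = re.compile(
    r"""
    \.\s            # end of the previous sentence: a period and one whitespace character
    (               # start of the captured text
      .{5,100}      # 5 to 100 characters of the citing sentence
      [\[\(]?       # optional opening bracket or parenthesis
      Einstein      # the cited author's name
      .{2,30}       # 2 to 30 characters (co-authors, punctuation, part of the year)
      \d+           # one or more digits
      [\]\)]        # closing bracket or parenthesis
    )               # end of the captured text
    """,
    re.VERBOSE,
)

# Made-up sample text, not taken from a real article.
sample = (
    "Quantum theory was debated early. "
    "The completeness of the wave function was questioned (Einstein et al., 1935). "
    "Later work followed."
)

found = CITATION_RE.findall(sample)
print(found)
```

Because the pattern requires a period and whitespace before the captured text, a citing sentence at the very start of a document would not be matched.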

Depending on the command line arguments (-a1, -a2, -y) the following citation patterns will be automatically used for finding matching sentences:

Author1_Last_Name Year
Author1_Last_Name et al.
Author1_Last_Name et al. Year
Author1_Last_Name et al., Year
Author1_Last_Name and Author2_Last_Name
Author1_Last_Name and Author2_Last_Name Year
Author1_Last_Name, and Author2_Last_Name Year
Author1_Last_Name and Author2_Last_Name, Year
Author1_Last_Name & Author2_Last_Name Year
Author1_Last_Name & Author2_Last_Name, Year
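
One way to generate these patterns from the command line arguments is sketched below. It is an illustration under assumptions, not the script's actual implementation: the `TEMPLATES` list and the `build_citation_patterns` helper are hypothetical, and the rule for skipping forms whose arguments are missing is a guess at the intended behavior.

```python
import re

# The ten citation forms listed above, written as templates.
TEMPLATES = [
    "{a1} {y}",
    "{a1} et al.",
    "{a1} et al. {y}",
    "{a1} et al., {y}",
    "{a1} and {a2}",
    "{a1} and {a2} {y}",
    "{a1}, and {a2} {y}",
    "{a1} and {a2}, {y}",
    "{a1} & {a2} {y}",
    "{a1} & {a2}, {y}",
]


def build_citation_patterns(author1, author2=None, year=None):
    """Compile a regex for each citation form whose arguments were supplied."""
    values = {"a1": author1, "a2": author2, "y": year}
    patterns = []
    for template in TEMPLATES:
        # Skip forms that need an argument the caller did not supply.
        if any("{%s}" % name in template and value is None
               for name, value in values.items()):
            continue
        # Escape each word, then allow any run of whitespace between words.
        words = [re.escape(word.format(**values)) for word in template.split()]
        patterns.append(re.compile(r"\s+".join(words)))
    return patterns


patterns = build_citation_patterns("Iyengar", year=2012)
sentence = "See the discussion in Iyengar et al., 2012 for details."
print([p.pattern for p in patterns if p.search(sentence)])
```

Joining the words with `\s+` lets a citation match even when PDF text extraction has inserted extra spaces or a line break between the author and the year.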

Example from Social Science

What to search for?

Example with Google Scholar
Download 500 articles from Google Scholar:

python scholar.py -v -d pdfs -o iyengar-output.csv -n 500 -a "S Iyengar" "Is anyone responsible?: How television frames political issues."

Searching in the Test Data
- Sample input data
- Use autosumpdf.py to filter citations to Iyengar et al. 2012:
```
python autosumpdf.py -v -i testdata.csv -o search-testdata-new.csv -a1 "Iyengar" -y "2012"
```
Miscitations
Social scientists hold that few truths are self-evident. But some truths become obvious to all social scientists after some years of experience, including: a) peer review is a mess, b) faculty hiring is idiosyncratic, and c) research is often miscited. Here we quantify the last of these.

License

Released under the MIT License

