maruscia/autosum

AutoSum: Summarize Publications Automatically

The tool exploits the labor scholars have already expended in summarizing articles. It scrapes the words next to citations across all openly available research citing a publication and collates the output. The result is a summary of how the publication is described, along with data in a format that makes potential miscitations easy to discover.


Table of Contents

  • Get the Data
    Scrapes all openly accessible research citing a particular publication, using links provided by Google Scholar. Note: Google monitors scraping on Google Scholar.

  • Parse the Data
    Iterates through a directory containing all the articles that cite a particular research article and, using regular expressions, picks up sentences near a citation.

  • Example from Social Science


Get the Data

To search for openly accessible PDFs citing the original research article on Google Scholar, use scholar.py.

  1. Input: URL to Google Scholar Page of an article.
  2. What the script does:
    • Goes to 'Cited By..'
    • Downloads a user-specified number of publicly available papers (PDFs only for now) that cite the paper to a user-specified directory.
    • Creates a CSV that tracks basic characteristics of each downloaded paper: title, URL, author names, journal, etc. It also records the relative path to the downloaded file.
  3. Sample output
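
As a rough illustration of how the tracking CSV described in step 2 might be inspected before parsing, the sketch below loads a CSV and keeps only the rows that have a downloaded file. The column names (`title`, `url`, `authors`, `journal`, `path`) and the sample rows are assumptions made up for this example; check the header of the file scholar.py actually writes.

```python
import csv
import io

# Made-up stand-in for the CSV written by scholar.py.
# The column names here are hypothetical, not the script's real header.
sample_csv = io.StringIO(
    "title,url,authors,journal,path\n"
    "Paper A,http://example.org/a.pdf,J Doe,Journal X,pdfs/a.pdf\n"
    "Paper B,http://example.org/b,R Roe,Journal Y,\n"
)

rows = list(csv.DictReader(sample_csv))

# Keep only papers for which a PDF path was recorded.
downloaded = [row for row in rows if row["path"]]

for row in downloaded:
    print(row["title"], "->", row["path"])
```

To use it on a real file, replace `sample_csv` with `open("output.csv", newline="")` and adjust the column names.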
Usage
usage: scholar.py [-h] [-u USER] [-p PASSWORD] [-a AUTHOR] [-d DIR]
                  [-o OUTPUT] [-n N_CITES] [-v] [--version]
                  keyword [keyword ...]

positional arguments:
  keyword               Keyword to be searched

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  Google account e-mail
  -p PASSWORD, --password PASSWORD
                        Google account password
  -a AUTHOR, --author AUTHOR
                        Author to be filtered
  -d DIR, --dir DIR     Output directory for PDF files
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -n N_CITES, --n-cites N_CITES
                        Number of cites to be download
  -v, --verbose
  --version             show program's version number and exit

Example

python scholar.py -v -d pdfs -o output.csv -n 100 -a "A Einstein" \
"Can quantum-mechanical description of physical reality be considered complete?"

Parse the Data

To scrape the text next to the relevant citations within the PDFs, use autosumpdf.py:

  1. The script iterates through the PDFs using the CSV generated above.
  2. Using citation information or a custom regular expression, it extracts the matching text and writes it to the same CSV. If multiple regular expressions match, the results are concatenated, separated by a line break.
  3. Sample output
usage: searchpdf.py [-h] [-i INPUT] [-o OUTPUT] [-v] [--version]
                    regex [regex ...]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        CSV input filename
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -t TXT_DIR, --text TXT_DIR
                        extract to specific directory
  -f, --force           force extract text file if exists
  -v, --verbose
  -a1 AUTHOR1, --author-1-lastname AUTHOR1
                        1st author of citation
  -a2 AUTHOR2, --author-2-lastname AUTHOR2
                        2nd author of citation
  -y YEAR, --year YEAR  Year of publication
  --version             show program's version number and exit
  -r REGEX, --regex REGEX
                        specify custom regex to filter citations.

Example

python searchpdf.py -v -i output.csv -o search-output.csv -r "\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])"

The custom regular expression (-r switch) matches the start of a sentence (5 to 100 characters), optionally followed by an opening bracket or parenthesis, then the author name "Einstein", 2 to 30 further characters, and a number with a closing bracket or parenthesis at the end.
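
To see what this pattern captures, the sketch below applies it to a short passage using Python's `re` module. The passage is made up for illustration, and how the script itself applies the pattern to extracted PDF text is not shown here.

```python
import re

# The custom pattern from the example command above.
pattern = re.compile(r"\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])")

# Made-up text standing in for text extracted from a citing PDF.
text = (
    "Entanglement puzzled early theorists. "
    "The completeness of quantum mechanics was questioned (Einstein et al., 1935). "
    "Later work revisited the argument."
)

# findall returns the contents of the single capture group for each match.
matches = pattern.findall(text)
for m in matches:
    print(m)
```

Because the pattern starts with `\.\s`, a citing sentence at the very beginning of the text is not matched, and because `.` does not match newlines by default, a sentence broken across lines is missed as well.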

Depending on the command-line arguments (-a1, -a2, -y), the following citation patterns are used automatically to find matching sentences:

  • Author1_Last_Name Year
  • Author1_Last_Name et al.
  • Author1_Last_Name et al. Year
  • Author1_Last_Name et al., Year
  • Author1_Last_Name and Author2_Last_Name
  • Author1_Last_Name and Author2_Last_Name Year
  • Author1_Last_Name, and Author2_Last_Name Year
  • Author1_Last_Name and Author2_Last_Name, Year
  • Author1_Last_Name & Author2_Last_Name Year
  • Author1_Last_Name & Author2_Last_Name, Year
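
The patterns listed above could be generated from the three arguments roughly as follows. This is a minimal sketch, not the script's actual code: the function names `build_citation_patterns` and `find_citing_sentences`, the naive sentence splitter, and the sample text are all assumptions made for illustration.

```python
import re

def build_citation_patterns(author1, author2=None, year=None):
    """Build regex strings for the citation styles listed above (hypothetical helper)."""
    a1 = re.escape(author1)
    patterns = [rf"{a1} et al\."]                      # Author1 et al.
    if year:
        y = re.escape(str(year))
        patterns.append(rf"{a1} {y}")                  # Author1 Year
        patterns.append(rf"{a1} et al\.,? {y}")        # Author1 et al. Year / et al., Year
    if author2:
        a2 = re.escape(author2)
        patterns.append(rf"{a1} and {a2}")             # Author1 and Author2
        if year:
            patterns.append(rf"{a1},? and {a2},? {y}") # "and" variants with a year
            patterns.append(rf"{a1} & {a2},? {y}")     # "&" variants with a year
    return patterns

def find_citing_sentences(text, patterns):
    """Return sentences that match any pattern, using a naive sentence splitter."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [s for s in sentences if combined.search(s)]

# Made-up text and author names for illustration only.
text = (
    "Framing shapes attribution (Smith et al., 2001). "
    "Turnout is a separate question. "
    "Smith & Jones, 2001 report similar results."
)
hits = find_citing_sentences(text, build_citation_patterns("Smith", "Jones", 2001))
print("\n".join(hits))
```

The splitter is deliberately simple and will cut a sentence at "et al." when a capitalized word follows, so treat it as a starting point only.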

Example from Social Science

  • What to search for?

    • Example with Google Scholar
      Download 500 articles from Google Scholar:
      python scholar.py -v -d pdfs -o iyengar-output.csv -n 500 -a "S Iyengar" "Is anyone responsible?: How television frames political issues."
      
  • Searching in the Test Data

    • Sample input data
    • Use autosumpdf.py to filter citations to Iyengar et al. 2012:
      python autosumpdf.py -v -i testdata.csv -o search-testdata-new.csv -a1 "Iyengar" -y "2012"
      
  • Miscitations
    Social scientists hold that few truths are self-evident. But some truths become obvious to all social scientists after some years of experience, including: a) Peer review is a mess, b) Faculty hiring is idiosyncratic, and c) Research is often miscited. Here we quantify the last of these.

License

Released under the MIT License
