A simple tool to process plain text output from Lexis Nexis news searches

snauhaus/lexisparse

lexisparse

A simple tool to process plain text output from Lexis Nexis news searches

Requirements

  • Python 3.0 or higher

A backward-compatible version (for Python 2) could be written if necessary.

Features

  • Extract individual articles
  • Pull metadata from articles, e.g., lines beginning with [WORD]:
  • Let users define article start and stop boundaries
  • Pull information from Copyright line
  • Pull information from Date line
  • Extract CSV containing metadata
  • Read in from a single Lexis Nexis text file, multiple files, or a directory
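
As an illustration of the metadata feature, here is a minimal sketch of pulling values from lines that begin with [WORD]:. It is not taken from parse.py; the function name extract_metadata, the regular expression, and the sample text are assumptions made for this example.

```python
import re

# Assumed field pattern: an upper-case label (letters and hyphens) at the
# very start of a line, followed by a colon and the value.
FIELD_RE = re.compile(r"^([A-Z][A-Z-]*):\s*(.*)$")


def extract_metadata(article_text, wanted):
    """Return {field: value} for each wanted field that starts a line.

    Hypothetical helper: fields that are absent are simply left out,
    and only the first occurrence of each field is kept.
    """
    found = {}
    for line in article_text.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group(1) in wanted and match.group(1) not in found:
            found[match.group(1)] = match.group(2).strip()
    return found


sample = """LENGTH: 512 words
Some body text mentioning NOTE: which does not start the line.
LOAD-DATE: March 3, 2014
LANGUAGE: ENGLISH
"""

# SECTION is requested but missing from the sample, so it is omitted.
meta = extract_metadata(sample, {"LOAD-DATE", "LENGTH", "SECTION"})
print(meta)
```

Missing fields are skipped rather than treated as errors, which matches the note that not every document needs to have every metadata field.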

Example Call

python parse.py -d docs/ -b LENGTH None -c tjtest.csv -o docs/ -m LOAD-DATE LENGTH PUBLICATION-TYPE LANGUAGE SECTION

This asks the script to:

Read in documents from the docs directory and write the extracted articles back to the docs directory. Write an index file named tjtest.csv that includes the requested metadata. Treat each document body as beginning after the "LENGTH:" line, with no consistent ending string.
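
The boundary behaviour in this call can be sketched roughly as follows. This is an illustration, not the code in parse.py: the helper extract_body and the sample lines are hypothetical, and the sketch uses Python's None where the command line passes the word None.

```python
def extract_body(lines, start=None, end=None):
    """Return the lines between a start marker and an end marker.

    start and end are field names such as "LENGTH"; None means "no
    boundary on that side". If a marker is not found, nothing is
    trimmed on that side, so the header or footer stays in the result.
    """
    first, last = 0, len(lines)
    if start is not None:
        for i, line in enumerate(lines):
            if line.startswith(start + ":"):
                first = i + 1  # body begins after the marker line
                break
    if end is not None:
        for i in range(first, len(lines)):
            if lines[i].startswith(end + ":"):
                last = i  # body stops before the marker line
                break
    return lines[first:last]


article = [
    "Example Headline",
    "BYLINE: A. Writer",
    "LENGTH: 3 words",
    "First paragraph.",
    "Second paragraph.",
    "LOAD-DATE: March 3, 2014",
]

# Equivalent in spirit to "-b LENGTH None": start after LENGTH:, no end marker.
body = extract_body(article, start="LENGTH", end=None)
print(body)
```

With end=None the trailing LOAD-DATE line stays in the body, which is the trade-off of not having a consistent ending string.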

Full Options

  • -d is where the script will look for your Lexis Nexis text files
  • -b specifies the boundaries above and below the article text. None can be used to include everything above or below. At a minimum, documents are bounded by the [Number] of [Other Number] DOCUMENTS line, which is used as the boundary if a value is left blank or set to None. For example, -b LENGTH None specifies that the article body begins after a line that starts with LENGTH: and ends at the end of the document. The boundaries may differ depending on which files you have exported from Lexis Nexis. If a boundary is not found, the article header and footer are included.
  • -c specifies the name of the CSV file to write output to
  • -o specifies the location to write the output text files (one per article)
  • -m specifies the metadata to look for in a document. Not every document needs to have this metadata! It can only find metadata that is followed by a colon at the beginning of a line. Copyright can also be included as a metadata field, and is not expected to begin a line or be followed by a colon.
  • -dmy is a flag to search for a line containing only a date, and include it in the 'Date' field of the resulting file.
  • -t is a flag to extract the title from the file and add it to the CSV. This feature is still experimental.
  • -v is a verbose flag: print output for each file scanned. Without it, a progress bar is printed instead.
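
The default document boundary described under -b can be illustrated with a short sketch. It assumes the delimiter appears alone on a line in the form "3 of 120 DOCUMENTS"; the pattern, the function name split_articles, and the sample text are assumptions for this example rather than code from parse.py.

```python
import re

# Assumed delimiter format: "<number> of <number> DOCUMENTS" on its own
# line, optionally indented.
DOC_DELIM = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", re.MULTILINE)


def split_articles(raw_text):
    """Split one exported text file into a list of article strings."""
    parts = DOC_DELIM.split(raw_text)
    # Anything before the first delimiter is export preamble, not an article.
    return [part.strip() for part in parts[1:] if part.strip()]


raw = """Export preamble
1 of 2 DOCUMENTS

First article text.
LENGTH: 3 words

        2 of 2 DOCUMENTS

Second article text.
"""

articles = split_articles(raw)
print(len(articles))
```

Each returned string still contains its own header and footer lines, so a further pass is needed to apply the -b boundaries and collect the -m fields.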
