A web services layer to allow frontend web apps to make use of core ContentMine backends and tools such as norma (and in future other ContentMine tools such as ami, getpapers and quickscrape).
Note: The initial version is at proof-of-concept status and the API is subject to change.
Target node.js version: node v8.0.0.
Created with npm v5.0.0.
In the CMServices directory, run npm start to start CMServices as a server.
The port defaults to 3000. Set the PORT environment variable to override this. For instance:
$ PORT=3002 npm start
The server configuration uses default.json.
As well as host, the following can be configured:
fileStorageCM
The directory used to store all files uploaded and generated by the ContentMine tools. If the path is relative it is interpreted relative to the directory the server is running in.
normaJar
The path to the jar file for norma, ContentMine's text and data mining application. An example jar is bundled with this project and no installation is needed for initial use.
Note: This API is subject to change
/api/createCorpus
Create a new corpus containing the uploaded PDF document.
HTTP verb: POST
Form data: multipart
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus to create. This will be used as a directory name
docName
The document name to use for the uploaded PDF (e.g., a DOI). This will be used as a directory name and should be unique within the corpus
fileupload
The PDF file to upload
Example usage:
curl --form userWorkspace="user1" --form corpusName="corpus1" --form docName="doc1" --form "fileupload=@testpdf.pdf" http://localhost:3002/api/createCorpus
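The upload can also be wrapped in a small shell function that checks its arguments before calling curl. This is a minimal sketch, not part of CMServices: cm_base_url and cm_create_corpus are hypothetical helper names, the server is assumed to run on localhost, and the request goes to /api/createCorpus, the path given in the section heading.

```shell
#!/bin/sh
# Sketch only: cm_base_url and cm_create_corpus are hypothetical helpers.

# Base URL of the server; falls back to the documented default port 3000.
cm_base_url() {
  printf 'http://localhost:%s\n' "${PORT:-3000}"
}

# Usage: cm_create_corpus WORKSPACE CORPUS DOC PDF_PATH
cm_create_corpus() {
  if [ "$#" -ne 4 ]; then
    echo "usage: cm_create_corpus WORKSPACE CORPUS DOC PDF_PATH" >&2
    return 2
  fi
  # Refuse to call the server when the PDF does not exist locally.
  if [ ! -f "$4" ]; then
    echo "cm_create_corpus: no such file: $4" >&2
    return 1
  fi
  # --fail makes curl exit non-zero when the server answers with an HTTP error.
  curl --fail \
    --form "userWorkspace=$1" \
    --form "corpusName=$2" \
    --form "docName=$3" \
    --form "fileupload=@$4" \
    "$(cm_base_url)/api/createCorpus"
}
```

With the server started on port 3002, `PORT=3002; cm_create_corpus user1 corpus1 doc1 testpdf.pdf` sends the same request as the curl example.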
/api/transformPDF2SVG
For all PDF documents in the corpus, generate an SVG file for each page. This converts pages into the intermediate SVG format used for data extraction and analysis by norma.
HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus.
Example usage:
curl -d "corpusName=corpus1&userWorkspace=user1" http://localhost:3002/api/transformPDF2SVG
/api/cropbox
Crop document according to coordinates, dimensions and page number to select a specific area for data extraction using norma. Assumes a single-document corpus.
HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus.
x0
The x coordinate of the top-left corner of the table
y0
The y coordinate of the top-left corner of the table
width
The width of the table in mm
height
The height of the table in mm
pageNumber
The number of the page containing the table (numbering is relative to the PDF document, so page numbers start at 1).
The coordinate system defaults to ydown, with y coordinates increasing down the page, and the units to mm.
Example usage:
curl -d userWorkspace=user1 -d corpusName=corpus1 -d x0=17.5 -d y0=26 -d width=178.5 -d height=97.5 -d pageNumber=5 http://localhost:3002/api/cropbox
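Because a mistyped coordinate or page number only shows up after the server has run norma, it can help to validate the values on the client first. The sketch below builds the urlencoded request body for /api/cropbox and rejects non-numeric coordinates and page numbers below 1. cm_cropbox_body is a hypothetical helper name, and the workspace and corpus names are assumed to contain only URL-safe characters, since no escaping is applied.

```shell
#!/bin/sh
# Sketch only: cm_cropbox_body is a hypothetical helper, not part of CMServices.

# Usage: cm_cropbox_body WORKSPACE CORPUS X0 Y0 WIDTH HEIGHT PAGE
cm_cropbox_body() {
  if [ "$#" -ne 7 ]; then
    echo "usage: cm_cropbox_body WORKSPACE CORPUS X0 Y0 WIDTH HEIGHT PAGE" >&2
    return 2
  fi
  # x0, y0, width and height are non-negative decimals in mm, e.g. 17.5 or 26.
  for value in "$3" "$4" "$5" "$6"; do
    case "$value" in
      ''|.|*[!0-9.]*|*.*.*)
        echo "cm_cropbox_body: not a number: $value" >&2
        return 1 ;;
    esac
  done
  # Page numbers start at 1; leading zeros and non-digits are rejected.
  case "$7" in
    ''|0*|*[!0-9]*)
      echo "cm_cropbox_body: bad page number: $7" >&2
      return 1 ;;
  esac
  printf 'userWorkspace=%s&corpusName=%s&x0=%s&y0=%s&width=%s&height=%s&pageNumber=%s\n' "$@"
}
```

The result can be passed straight to curl, for example `curl -d "$(cm_cropbox_body user1 corpus1 17.5 26 178.5 97.5 5)" http://localhost:3002/api/cropbox`.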
/api/transformSVGTABLE2HTML
Convert a table in SVG format into semantically structured HTML using norma transform svgtable2html. Assumes a single-document corpus.
HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus.
Example usage:
curl -d userWorkspace=user1 -d corpusName=corpus1 http://localhost:3002/api/transformSVGTABLE2HTML
/api/transformSVGTABLE2CSV
Convert a table in SVG format into CSV using norma transform svgtable2csv. Assumes a single-document corpus.
HTTP verb: POST
Form data: x-www-form-urlencoded
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus.
Example usage:
curl -d userWorkspace=user1 -d corpusName=corpus1 http://localhost:3002/api/transformSVGTABLE2CSV
/api/getTableHTML/userWorkspace/corpusName/docName
Retrieve the semantically structured data for the previously converted table in HTML format. Assumes transformSVGTABLE2HTML has already been run.
HTTP verb: GET
URL parameters:
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus
docName
The name of the document within the corpus at upload time
Example usage:
curl http://localhost:3002/api/getTableHTML/user1/corpus1/doc1/
Returns HTML data to stdout.
/api/getTableCSV/userWorkspace/corpusName/docName
Retrieve the structured data for the previously converted table in CSV format. Assumes transformSVGTABLE2CSV has already been run.
HTTP verb: GET
URL parameters:
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus
docName
The name of the document within the corpus at upload time
Example usage:
curl http://localhost:3002/api/getTableCSV/user1/corpus1/doc1/
Returns CSV data to stdout.
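Both retrieval endpoints take the same three path segments, so one helper can build either URL. This is a sketch under assumptions: cm_table_url is a hypothetical name, the server is assumed to run on localhost, and the names are inserted into the path unescaped, so a docName containing a slash (as many DOIs do) would presumably need escaping first.

```shell
#!/bin/sh
# Sketch only: cm_table_url is a hypothetical helper, not part of CMServices.

# Usage: cm_table_url FORMAT WORKSPACE CORPUS DOC   (FORMAT is html or csv)
cm_table_url() {
  if [ "$#" -ne 4 ]; then
    echo "usage: cm_table_url FORMAT WORKSPACE CORPUS DOC" >&2
    return 2
  fi
  # Map the requested format onto the matching endpoint name.
  case "$1" in
    html) endpoint=getTableHTML ;;
    csv)  endpoint=getTableCSV ;;
    *)
      echo "cm_table_url: format must be html or csv" >&2
      return 1 ;;
  esac
  # The port falls back to the documented default of 3000.
  printf 'http://localhost:%s/api/%s/%s/%s/%s/\n' \
    "${PORT:-3000}" "$endpoint" "$2" "$3" "$4"
}
```

For example, `curl "$(cm_table_url csv user1 corpus1 doc1)" > table.csv` saves the converted table to a file.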
/api/extractTableToHTML
Extract the table data from the specified page and area of the uploaded PDF document and return the results as semantically structured HTML.
HTTP verb: POST
Form data: multipart
Fields
userWorkspace
A relative directory name in which all this user's files are stored
corpusName
The name of the corpus to create. This will be used as a directory name.
docName
The document name to use for the uploaded PDF (e.g., a DOI). This will be used as a directory name and should be unique within the corpus.
x0
The x coordinate of the top-left corner of the table
y0
The y coordinate of the top-left corner of the table
width
The width of the table in mm
height
The height of the table in mm
pageNumber
The number of the page containing the table (numbering is relative to the PDF document, so page numbers start at 1).
fileupload
The PDF file to upload
Example usage:
curl --form userWorkspace="user1" --form corpusName="corpus1" --form docName="doc1" --form "fileupload=@testpdf.pdf" --form x0=17.5 --form y0=26 --form width=178.5 --form height=97.5 --form pageNumber=5 http://localhost:3002/api/extractTableToHTML
Use /api/extractTableToHTML with form fields as above.
- Upload the PDF document: /api/createCorpus
- Convert it to SVG (ContentMine norma intermediate format): /api/transformPDF2SVG
- Crop the specified area of the specified page to leave only the table in SVG: /api/cropbox
- Extract data and semantic structure from the table SVG. Output as HTML or CSV: /api/transformSVGTABLE2HTML, /api/transformSVGTABLE2CSV
- Retrieve extracted/structured data results after conversion: /api/getTableHTML, /api/getTableCSV
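The steps above can be chained in a single shell function that stops at the first failing request. This is a sketch under assumptions, not a tested client: cm_extract_csv and the CM_CURL variable are hypothetical names, the server is assumed to run on localhost, and the CSV branch is chosen arbitrarily (swap in transformSVGTABLE2HTML and getTableHTML for HTML output).

```shell
#!/bin/sh
# Sketch only: cm_extract_csv and CM_CURL are hypothetical, not part of CMServices.

# Usage: cm_extract_csv WORKSPACE CORPUS DOC PDF_PATH X0 Y0 WIDTH HEIGHT PAGE
cm_extract_csv() {
  if [ "$#" -ne 9 ]; then
    echo "usage: cm_extract_csv WORKSPACE CORPUS DOC PDF_PATH X0 Y0 WIDTH HEIGHT PAGE" >&2
    return 2
  fi
  ws=$1; corpus=$2; doc=$3; pdf=$4
  base="http://localhost:${PORT:-3000}/api"
  # Set CM_CURL=echo for a dry run that prints each request instead of sending it.
  http="${CM_CURL:-curl}"

  # 1. Upload the PDF document.
  $http --fail --form "userWorkspace=$ws" --form "corpusName=$corpus" \
    --form "docName=$doc" --form "fileupload=@$pdf" "$base/createCorpus" || return
  # 2. Convert every page to SVG.
  $http --fail -d "userWorkspace=$ws" -d "corpusName=$corpus" \
    "$base/transformPDF2SVG" || return
  # 3. Crop the page down to the table area (coordinates in mm, ydown).
  $http --fail -d "userWorkspace=$ws" -d "corpusName=$corpus" \
    -d "x0=$5" -d "y0=$6" -d "width=$7" -d "height=$8" -d "pageNumber=$9" \
    "$base/cropbox" || return
  # 4. Extract the table structure as CSV.
  $http --fail -d "userWorkspace=$ws" -d "corpusName=$corpus" \
    "$base/transformSVGTABLE2CSV" || return
  # 5. Retrieve the converted table.
  $http --fail "$base/getTableCSV/$ws/$corpus/$doc/"
}
```

Running `CM_CURL=echo; cm_extract_csv user1 corpus1 doc1 testpdf.pdf 17.5 26 178.5 97.5 5` prints the five requests in order, which is a convenient way to check the arguments before contacting the server.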