pdfx v1.9
Fully-automated PDF-to-XML conversion of scientific articles
dx.doi.org/10.1145/2494266.2494271
Overview
PDFX is a fully-automated PDF-to-XML converter for scientific articles.
It takes a full-text PDF article as input (example) and outputs the hierarchy of its distinct logical elements in an XML format.
The elements that PDFX can currently extract are:
- Front Matter
  - title, abstract, author, author footnote
- Body Matter
  - body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)
- Extras
  - header, footer, side note, page number, email, URI
Note: This system has been designed for processing scientific articles.
While virtually any PDF file is acceptable input, output quality may suffer
and processing time may increase for other kinds of documents, e.g. books, slide presentations or spreadsheets/strictly tabular data.
Usage
There are two ways in which you can use PDFX:
- via a web browser
- via any other HTTP client, such as the curl command-line tool
- The Web Interface
Allows submission of single PDF articles. Once you click the "Submit" button, the article will
be processed on-the-fly. Depending on the size/complexity of the article, processing may take a while, so
please be patient. A typical 10-page article will normally take ~15-20 seconds to process.
Once processing is complete, you will be redirected to the job details page, which
provides three options for interacting with the output:
- access/retrieve the generated XML version directly.
- view a reconstruction of the article in HTML form, using the generated XML. The HTML presents the core content of the original article as a single-column stream of text, free from elements such as headers, footers or side notes.
- download an archive containing the entire output (including rendered images) for offline viewing.
Because no authentication is required at this time, the input and output files for each processing
job are stored for 24 hours from the time of submission, under randomly-generated job IDs.
Each job ID is used to construct the URL paths to the output options mentioned above,
as follows:
- pdfx.cs.man.ac.uk/job_id for the job details page
- pdfx.cs.man.ac.uk/job_id.xml for the XML
- pdfx.cs.man.ac.uk/job_id.html for the HTML
- pdfx.cs.man.ac.uk/job_id.tar.gz for the archive
Additionally, you can access
- pdfx.cs.man.ac.uk/job_id.pdf for a back-reference to the original PDF
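The URL scheme above is simple enough to script. As a sketch, a small shell helper (the function name is our own, not part of PDFX) that prints every output URL for a given job ID:

```shell
# Print the output URLs for a PDFX job ID, following the URL scheme
# documented above. The host and path pattern come from the docs;
# the helper itself is a convenience of ours.
pdfx_urls() {
    local base="pdfx.cs.man.ac.uk" id="$1"
    printf '%s\n' \
        "$base/$id" \
        "$base/$id.xml" \
        "$base/$id.html" \
        "$base/$id.tar.gz" \
        "$base/$id.pdf"
}

pdfx_urls "abc123"
```

Each printed URL can then be fetched with `curl -L` as shown in the next section.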
- The Command-line
A typical curl request from a Unix shell is as follows:
curl --data-binary @"/path/to/my.pdf" \
     -H "Content-Type: application/pdf" \
     -L "pdfx.cs.man.ac.uk"
Provided the submitted file is a valid PDF document no larger than 5MB and no longer than 100 pages, it will be processed on-the-fly.
The response will be its XML version.
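Since oversized or overlong files are rejected, it can be worth checking a PDF against these limits before uploading. A minimal sketch (the function name is our own; the page count relies on pdfinfo from poppler-utils and is skipped if that tool is absent):

```shell
# Check a file against PDFX's stated limits (5MB, 100 pages)
# before submitting it to the service.
# usage: pdfx_precheck /path/to/my.pdf
pdfx_precheck() {
    local file="$1"
    local max_bytes=$((5 * 1024 * 1024))
    local size
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max_bytes" ]; then
        echo "reject: $size bytes exceeds 5MB"
        return 1
    fi
    # Page count is optional: only performed when pdfinfo is installed.
    if command -v pdfinfo >/dev/null 2>&1; then
        local pages
        pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ {print $2}')
        if [ -n "$pages" ] && [ "$pages" -gt 100 ]; then
            echo "reject: $pages pages exceeds 100"
            return 1
        fi
    fi
    echo "ok"
}
```

A file that passes this check can still be rejected by the server (e.g. if it is not a valid PDF), but the two cheap checks catch the documented limits locally.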
Batch Processing
A simple extension to the above example processes all the documents in a collection in one go,
by looping over the PDF files in a directory and passing them to PDFX one by one:
find /path/to/my/collection/ -name "*.pdf" |
while IFS= read -r file; do
    curl --data-binary @"$file" \
         -H "Content-Type: application/pdf" \
         -L "pdfx.cs.man.ac.uk" > "${file}x.xml"
done
The above command sequentially saves the output for each .pdf file in the
directory /path/to/my/collection/ to a corresponding .pdfx.xml file,
in the same directory.
Please note that the shell redirection symbol '>' will overwrite any pre-existing
.pdfx.xml files without notice.
Note: You are advised to get in touch for jobs exceeding 1000 PDFs; we will likely be able to help speed up the process while keeping the server load down.
Service abuse, whether in number of submissions or frequency of requests, will result in automatic blacklisting.
- Other Clients
Various other HTTP clients may be used to invoke the service in a similar fashion to the above. PDFX will respond to a valid request directly with the XML output.
- Offline Viewing
Each job_id.tar.gz archive made available through the web interface has the following contents:
- a short_id/ directory with the original PDF, the created XML and any rendered images:
- short_id.pdf
- short_id.pdfx.xml
- short_id.page_XXX.image_XX.png
- short_id.page_XXX.image_YY.png
- short_id.html
Additional files are needed for properly viewing the HTML version offline. These can be found in the
static/ folder of the static.zip archive, available for download.
This folder should
reside at the same level as short_id/ and short_id.html.
For convenience, it is recommended that you place a single static/ folder along with all
(short_id/, short_id.html) pairs into the same parent folder.
E.g., for the jobs with short_ids 1, 2 and 30, the folder structure should be:
- my_dir/
- my_dir/1/
- my_dir/2/
- my_dir/30/
- my_dir/static/
- my_dir/1.html
- my_dir/2.html
- my_dir/30.html
static.zip - one-time download for offline viewing
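The layout above can be assembled from downloaded archives with a couple of shell commands. A sketch (the helper name and paths are our own; each job archive is assumed to unpack to a short_id/ directory and a short_id.html file, as described above):

```shell
# Unpack one or more job archives into a common parent folder,
# producing the short_id/ directories and short_id.html files
# side by side, as in the recommended layout.
unpack_jobs() {
    local dest="$1"; shift
    mkdir -p "$dest"
    local archive
    for archive in "$@"; do
        tar -xzf "$archive" -C "$dest"
    done
}

# usage (paths are placeholders):
#   unpack_jobs my_dir ~/Downloads/*.tar.gz
#   unzip -d my_dir ~/Downloads/static.zip   # adds the shared static/ folder
```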
- Notes on Web Browsers
Offline viewing is not possible with Google Chrome, due to the browser's built-in security constraints.
Older versions of Opera might prompt to save JSON responses rather than interpret them. If this happens, go to Preferences >
Advanced > Downloads and set the MIME type "application/json" to be opened with Opera.
- XML Schema
PDFX's XML format closely follows the schema of the Journal Archiving and Interchange Tag Set of the JATS standard.
Most elements can therefore be transformed in compliance with the respective JATS/NLM DTDs.
- PDFX to NLM 3.0 XSL (courtesy of PKP)
Note: The transformation is not guaranteed to validate fully against the DTDs and may omit PDFX elements in the attempt to do so. Contributions towards more elaborate XSL handling are welcome.
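For instance, assuming the PKP stylesheet linked above has been saved locally, the transformation could be run with xsltproc. A sketch (the wrapper name and all filenames are placeholders of ours):

```shell
# Apply an XSL stylesheet to a PDFX XML output file: a thin wrapper
# around xsltproc. The stylesheet would be the PKP-provided XSL
# saved locally; filenames here are placeholders.
pdfx_transform() {
    local xsl="$1" in="$2" out="$3"
    xsltproc "$xsl" "$in" > "$out"
}

# usage (filenames are placeholders):
#   pdfx_transform pdfx2nlm.xsl article.pdfx.xml article.nlm.xml
```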
Links to Remember
- Submit new PDF: /
- Job Details: /job_id
- HTML Version: /job_id.html
- XML Version: /job_id.xml
- Archive Version: /job_id.tar.gz
- Original PDF: /job_id.pdf