pdfx v1.9
Fully-automated PDF-to-XML conversion of scientific articles
dx.doi.org/10.1145/2494266.2494271
Overview
PDFX is a fully-automated PDF-to-XML converter for scientific articles.
It takes a full-text PDF article as input (example) and outputs the hierarchy of its distinct logical elements in an XML format.
The elements that PDFX can currently extract are:
- Front Matter
  - title, abstract, author, author footnote
- Body Matter
  - body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)
- Extras
  - header, footer, side note, page number, email, URI
Note: This system has been designed for processing scientific articles.
While virtually any PDF file is acceptable input, output quality may suffer
and processing time may increase for other kinds of documents, e.g. books, slide presentations or spreadsheets/strictly tabular data.
Usage
There are two ways in which you can use PDFX:
- via a web browser
- via any other HTTP client, such as the curl command-line tool
- The Web Interface
Allows submission of single PDF articles. Once you click the "Submit" button, the article will
be processed on-the-fly. Depending on the size/complexity of the article, processing may take a while, so
please be patient. A typical 10-page article will normally take ~15-20 seconds to process.
Once processing is complete, you will be redirected to the job details page, which
provides three options for interacting with the output:
- access/retrieve the generated XML version directly.
- view a reconstruction of the article in HTML form, using the generated XML. The HTML presents the core content of the original article as a single-column stream of text, free from elements such as headers, footers or side notes.
- download an archive containing the entire output (including rendered images) for offline viewing.
Because no authentication is required at this time, the input and output files for each processing
job are stored for 24 hours from the time of submission, under randomly-generated job IDs.
Each job ID is used to construct the URL paths to the output options mentioned above,
as follows:
- pdfx.cs.man.ac.uk/job_id for the job details page
- pdfx.cs.man.ac.uk/job_id.xml for the XML
- pdfx.cs.man.ac.uk/job_id.html for the HTML
- pdfx.cs.man.ac.uk/job_id.tar.gz for the archive
Additionally, you can access
- pdfx.cs.man.ac.uk/job_id.pdf for a back-reference to the original PDF
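The URL scheme above is simple enough to script. As a sketch, a small shell helper (the function name is our own, not part of PDFX) that prints every output URL for a given job ID:

```shell
# Print the output URLs for a PDFX job ID, following the URL scheme
# documented above. The host and path pattern come from the docs;
# the helper itself is a convenience of ours.
pdfx_urls() {
    local base="pdfx.cs.man.ac.uk" id="$1"
    printf '%s\n' \
        "$base/$id" \
        "$base/$id.xml" \
        "$base/$id.html" \
        "$base/$id.tar.gz" \
        "$base/$id.pdf"
}

pdfx_urls "abc123"
```

Each printed URL can then be fetched with `curl -L` as shown in the next section.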
- The Command-line
A typical curl request from a Unix shell is as follows:
curl --data-binary @"/path/to/my.pdf" \
     -H "Content-Type: application/pdf" \
     -L "pdfx.cs.man.ac.uk"
Provided the submitted file is a valid PDF document no larger than 5MB and no longer than 100 pages, it will be processed on-the-fly.
The response will be its XML version.
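Since oversized or overlong files are rejected, it can be worth checking a PDF against these limits before uploading. A minimal sketch (the function name is our own; the page count relies on pdfinfo from poppler-utils and is skipped if that tool is absent):

```shell
# Check a file against PDFX's stated limits (5MB, 100 pages)
# before submitting it to the service.
# usage: pdfx_precheck /path/to/my.pdf
pdfx_precheck() {
    local file="$1"
    local max_bytes=$((5 * 1024 * 1024))
    local size
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max_bytes" ]; then
        echo "reject: $size bytes exceeds 5MB"
        return 1
    fi
    # Page count is optional: only performed when pdfinfo is installed.
    if command -v pdfinfo >/dev/null 2>&1; then
        local pages
        pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ {print $2}')
        if [ -n "$pages" ] && [ "$pages" -gt 100 ]; then
            echo "reject: $pages pages exceeds 100"
            return 1
        fi
    fi
    echo "ok"
}
```

A file that passes this check can still be rejected by the server (e.g. if it is not a valid PDF), but the two cheap checks catch the documented limits locally.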
Batch Processing
A simple extension to the above example processes all the documents in a collection in one go,
by looping over the PDF files in a directory and passing them to PDFX one by one:
find /path/to/my/collection/ -name "*.pdf" |
while IFS= read -r file; do
    curl --data-binary @"$file" \
         -H "Content-Type: application/pdf" \
         -L "pdfx.cs.man.ac.uk" > "${file}x.xml"
done
The above command sequentially saves the output for each .pdf file in the
directory /path/to/my/collection/ to a corresponding .pdfx.xml file,
in the same directory.
Please note that the shell redirection symbol '>' will overwrite any pre-existing
.pdfx.xml files without notice.
Note: You are advised to get in touch for jobs exceeding 1000 PDFs; we will likely be able to help speed up the process while keeping the server load down.
Service abuse, whether in number of submissions or frequency of requests, will result in automatic blacklisting.
- Other Clients
Various other HTTP clients may be used to invoke the service in a similar fashion to the above. PDFX will respond to a valid request directly with the XML output.
- Offline Viewing
Each job_id.tar.gz archive made available through the web interface has the following contents:
- a short_id/ directory with the original PDF, the created XML and any rendered images:
- short_id.pdf
- short_id.pdfx.xml
- short_id.page_XXX.image_XX.png
- short_id.page_XXX.image_YY.png
- short_id.html
Additional files are needed for properly viewing the HTML version offline. These can be found in the
static/ folder of the static.zip archive, available for download.
This folder should
reside at the same level as short_id/ and short_id.html.
For convenience, it is recommended that you place a single static/ folder along with all
(short_id/, short_id.html) pairs into the same parent folder.
E.g., for the jobs with short_ids 1, 2 and 30, the folder structure should be:
- my_dir/
- my_dir/1/
- my_dir/2/
- my_dir/30/
- my_dir/static/
- my_dir/1.html
- my_dir/2.html
- my_dir/30.html
static.zip - one-time download for offline viewing
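The layout above can be assembled from downloaded archives with a couple of shell commands. A sketch (the helper name and paths are our own; each job archive is assumed to unpack to a short_id/ directory and a short_id.html file, as described above):

```shell
# Unpack one or more job archives into a common parent folder,
# producing the short_id/ directories and short_id.html files
# side by side, as in the recommended layout.
unpack_jobs() {
    local dest="$1"; shift
    mkdir -p "$dest"
    local archive
    for archive in "$@"; do
        tar -xzf "$archive" -C "$dest"
    done
}

# usage (paths are placeholders):
#   unpack_jobs my_dir ~/Downloads/*.tar.gz
#   unzip -d my_dir ~/Downloads/static.zip   # adds the shared static/ folder
```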
- Notes on Web Browsers
Offline viewing is not possible with Google Chrome, due to the browser's built-in security constraints.
Older versions of Opera might prompt to save JSON responses rather than interpret them. If this happens, go to Preferences >
Advanced > Downloads and set the MIME type "application/json" to be opened with Opera.
- XML Schema
PDFX's XML format closely follows the schema of the Journal Archiving and Interchange Tag Set of the JATS standard.
Most elements can therefore be transformed in compliance with the respective JATS/NLM DTDs.
- PDFX to NLM 3.0 XSL (courtesy of PKP)
Note: The transformation is not guaranteed to validate fully against the DTDs and may omit PDFX elements in the attempt to do so. Contributions towards more elaborate XSL handling are welcome.
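For instance, assuming the PKP stylesheet linked above has been saved locally, the transformation could be run with xsltproc. A sketch (the wrapper name and all filenames are placeholders of ours):

```shell
# Apply an XSL stylesheet to a PDFX XML output file: a thin wrapper
# around xsltproc. The stylesheet would be the PKP-provided XSL
# saved locally; filenames here are placeholders.
pdfx_transform() {
    local xsl="$1" in="$2" out="$3"
    xsltproc "$xsl" "$in" > "$out"
}

# usage (filenames are placeholders):
#   pdfx_transform pdfx2nlm.xsl article.pdfx.xml article.nlm.xml
```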
Links to Remember
- Submit new PDF: /
- Job Details: /job_id
- HTML Version: /job_id.html
- XML Version: /job_id.xml
- Archive Version: /job_id.tar.gz
- Original PDF: /job_id.pdf