codyfrehr/xpdf-api

Xpdf API

Xpdf API is a collection of Java APIs for Xpdf, the open source library for operating on PDF files. Xpdf is an invaluable PDF toolkit, and this project aims to make it more accessible to the Java community.

Our primary goals are:

  • Provide full programmatic access to Xpdf.

  • Act as a pure, unobscured interface to Xpdf.

APIs are available for the following Xpdf functions (with more to come):

  • pdftotext - convert PDF files to text

Getting Started

Requirements

  • JDK 8 or later

  • Windows, Linux, or macOS

Dependencies

 <dependency>
     <groupId>io.xpdf</groupId>
     <artifactId>pdf-text-api</artifactId>
     <version>1.0.1</version>
 </dependency>

…or a Spring Boot starter, for fancy developers!

 <dependency>
     <groupId>io.xpdf</groupId>
     <artifactId>pdf-text-api-spring-boot-starter</artifactId>
     <version>1.0.1</version>
 </dependency>

Documentation

We strongly recommend downloading sources in your IDE so that you have tooltip access to our JavaDocs. We made an extra effort to provide you with all the help you need, directly from your editor.

We also strongly encourage you to read the Xpdf source documentation for a complete overview of each function and the options available to customize its execution. Documentation can be found alongside the executable files in the package resources, or can be downloaded from Xpdf directly.

PdfText API

PdfText API is an API for pdftotext, a function that converts a PDF file into a text file.

  • It WILL extract text from a PDF file that has embedded text.

  • It WILL NOT extract text from a PDF file that is a scanned image of a document.

Examples

Just convert my PDF file into a text file - who cares how it’s configured!

// initialize the tool
PdfTextTool pdfTextTool = PdfTextTool.builder().build();

// build a request with a PDF file
PdfTextRequest request = PdfTextRequest.builder()
        .pdfFile(new File("~/docs/some.pdf"))
        .build();

// convert the PDF into a text file
PdfTextResponse response = pdfTextTool.process(request);
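
A caveat on the paths used in these examples: java.io.File does not expand a leading "~" the way a shell does, so "~/docs/some.pdf" is resolved relative to the working directory rather than your home directory. Below is a minimal sketch of resolving such a path against the user.home system property. HomePaths and expandHome are hypothetical names for illustration and are not part of Xpdf API.

```java
import java.io.File;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Xpdf API.
final class HomePaths {

    // Replaces a leading "~" or "~/" with the value of the user.home system property.
    // Any other path is returned unchanged.
    static File expandHome(String path) {
        String home = System.getProperty("user.home");
        if (path.equals("~")) {
            return new File(home);
        }
        if (path.startsWith("~/")) {
            return Paths.get(home, path.substring(2)).toFile();
        }
        return new File(path);
    }

    public static void main(String[] args) {
        // Prints the absolute path of docs/some.pdf under the current user's home directory
        System.out.println(expandHome("~/docs/some.pdf").getAbsolutePath());
    }
}
```

The resulting File can then be passed to pdfFile(...) or textFile(...) in place of the literal "~/..." paths.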

Convert my PDF file into a text file, but let me customize it - I care about the data!

PdfTextTool pdfTextTool = PdfTextTool.builder().build();

// add some options to customize your request
PdfTextOptions options = PdfTextOptions.builder()
        .format(PdfTextFormat.TABLE)
        .encoding(PdfTextEncoding.UTF_8)
        .ownerPassword("Secret123")
        // Collections.singletonMap works on JDK 8; Map.ofEntries requires Java 9 or later
        .nativeOptions(Collections.singletonMap("-cfg", "~/configs/xpdfrc"))
        .build();

// build a request with options, and specify an output text file
PdfTextRequest request = PdfTextRequest.builder()
        .pdfFile(new File("~/docs/some.pdf"))
        .textFile(new File("~/docs/some.txt"))
        .options(options)
        .build();

PdfTextResponse response = pdfTextTool.process(request);

PdfTextTool

PdfTextTool represents the Xpdf pdftotext command line tool. It is a simple service that allows you to programmatically execute shell commands against pdftotext, which is included with this project in an executable format for Windows, Linux, and Mac operating systems.

By default, PdfTextTool uses the executable provided in the package resources, with a 30-second timeout on each invocation.

PdfTextTool.builder().build();

If you want to use your own installation of pdftotext, you can download it from Xpdf directly. The timeout can also be configured, but unless you are working with truly massive PDF files, most executions finish in under a second.

PdfTextTool.builder()
        .executableFile(new File("~/libs/pdftotext"))
        .timeoutSeconds(60)
        .build();
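
To make the timeout concrete, here is a sketch of one common way to bound an external process in plain Java, using Process.waitFor with a time limit. It is an illustration only: TimedCommand is a hypothetical name, and whether PdfTextTool enforces its timeout this way internally is an assumption.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of Xpdf API.
final class TimedCommand {

    // Runs the command and returns its exit code.
    // Throws IOException if the process is still running after timeoutSeconds.
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward output so the child cannot block on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // kill the child that ran past the limit
            throw new IOException("Timed out after " + timeoutSeconds + " seconds");
        }
        return process.exitValue();
    }
}
```

A call such as TimedCommand.run(Arrays.asList("/path/to/pdftotext", "in.pdf", "out.txt"), 60) would mirror the 60-second configuration shown above (the paths are placeholders).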

If you are using our Spring Boot starter, then use the following properties to configure the PdfTextTool bean.

io.xpdf.api.pdf-text:
      executable-path: "~/libs/pdftotext"
      timeout-seconds: 60

PdfTextRequest

PdfTextRequest represents an individual shell command to invoke pdftotext.

A shell command to invoke pdftotext requires an input PDF file and an output text file. Here is a side-by-side comparison of a PdfTextRequest and the corresponding shell command it represents.

PdfTextRequest.builder()
        .pdfFile(new File("~/docs/some.pdf"))
        .textFile(new File("~/docs/some.txt"))
        .build();

$ ./pdftotext "~/docs/some.pdf" "~/docs/some.txt"

If you plan to read the output text file at runtime and do not care about saving the text file, then you may exclude this field from your PdfTextRequest. A text file will be automatically initialized for you in your Java temp directory and deleted when your JVM terminates.

PdfTextRequest.builder()
        .pdfFile(new File("~/docs/some.pdf"))
        .build();

$ ./pdftotext "~/docs/some.pdf" "/tmp/03cb3e01-f281-4cd1-8ae3-210ae6076afa.txt"
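
The temp-file behavior described above can be approximated with the standard library alone. The sketch below assumes the file is named with a random UUID (as the example path suggests) and registered with File.deleteOnExit. TempTextFile is a hypothetical name, and this is not necessarily how Xpdf API implements it.

```java
import java.io.File;
import java.util.UUID;

// Hypothetical helper for illustration; not part of Xpdf API.
final class TempTextFile {

    // Returns a handle to a uniquely named .txt file in the JVM temp directory
    // and registers it for deletion when the JVM exits normally.
    // The file itself is not created here; the external tool would write it.
    static File create() {
        String tempDir = System.getProperty("java.io.tmpdir");
        File file = new File(tempDir, UUID.randomUUID() + ".txt");
        file.deleteOnExit();
        return file;
    }
}
```

Note that deleteOnExit only takes effect on normal JVM termination, so a file registered this way can be left behind if the process is killed.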

PdfTextOptions

PdfTextOptions represents a set of command options accepted by pdftotext that will customize its execution.

Suppose you have a PDF file with tabulated data and you want the output text encoded as UTF-8. The output encoding is something you should definitely tell pdftotext about. How the output text should be laid out, however, is more a matter of preference.

PdfTextOptions options = PdfTextOptions.builder()
        .encoding(PdfTextEncoding.UTF_8)
        .format(PdfTextFormat.TABLE)
        .build();

PdfTextRequest request = PdfTextRequest.builder()
        .pdfFile(new File("~/docs/some.pdf"))
        .textFile(new File("~/docs/some.txt"))
        .options(options)
        .build();

$ ./pdftotext -enc "UTF-8" -table "~/docs/some.pdf" "~/docs/some.txt"

We provide a mechanism for manually injecting options into a command. PdfTextOptions implements many (but not all) of the options specified in the pdftotext source documentation, so this is mainly useful for options it does not implement. It works for any option, though, implemented or not.

Important: No validation is performed on options entered this way - they will be injected directly into the shell command, as is. Also be aware that you may inadvertently duplicate an option in the shell command if you both manually inject it and assign a value to the PdfTextOptions implementation of that option.

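// Map.entry rejects null values, so flag-only options such as -table need a mutable map
Map<String, String> nativeOptions = new LinkedHashMap<>();
nativeOptions.put("-enc", "UTF-8");
nativeOptions.put("-table", null);
nativeOptions.put("-opw", "Secret123");

PdfTextOptions.builder()
        .pageStart(1)
        .pageStop(5)
        .nativeOptions(nativeOptions)
        .build();

$ ./pdftotext -f "1" -l "5" -enc "UTF-8" -table -opw "Secret123" "~/docs/some.pdf" "~/docs/some.txt"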
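
To illustrate how a map of native options corresponds to the command line shown above, here is a small sketch that flattens such a map into an argument list, treating a null value as a flag that takes no argument. NativeOptionArgs and toArgs are hypothetical names. The library's actual command assembly may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Xpdf API.
final class NativeOptionArgs {

    // Flattens an option map into command-line arguments, in the map's iteration order.
    // A null value marks a flag that takes no argument, such as -table.
    static List<String> toArgs(Map<String, String> options) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> option : options.entrySet()) {
            args.add(option.getKey());
            if (option.getValue() != null) {
                args.add(option.getValue());
            }
        }
        return args;
    }

    public static void main(String[] ignored) {
        // LinkedHashMap keeps insertion order and permits null values
        Map<String, String> options = new LinkedHashMap<>();
        options.put("-enc", "UTF-8");
        options.put("-table", null);
        options.put("-opw", "Secret123");
        System.out.println(toArgs(options)); // prints [-enc, UTF-8, -table, -opw, Secret123]
    }
}
```

Because nothing in a flattening like this checks for repeats, an option supplied both here and through a PdfTextOptions field would simply appear twice in the command.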

PdfTextResponse

PdfTextResponse represents the result of invoking pdftotext.

It will include the text file created from a PDF, as well as any standard output that may have been captured from the shell process.
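
Once you have the text file from the response, reading it is plain Java. The sketch below takes a java.io.File and decodes it as UTF-8. The accessor that returns the file from PdfTextResponse is not shown in this README, so consult the JavaDocs for its name. TextFiles is a hypothetical helper, and the charset should match whatever encoding you requested in PdfTextOptions.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for illustration; not part of Xpdf API.
final class TextFiles {

    // Reads the whole file into memory and decodes it as UTF-8.
    // Suitable for typical documents; very large outputs may be better streamed.
    static String readUtf8(File textFile) throws IOException {
        return new String(Files.readAllBytes(textFile.toPath()), StandardCharsets.UTF_8);
    }
}
```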

Logging and Debugging

PdfTextTool logs through SLF4J, leaving the choice of logging implementation up to you.

We provide meaningful debug logs for anyone needing more detail. If you want the trace from pdftotext itself, then inject the "-verbose" command option into PdfTextOptions and inspect the standard output on your PdfTextResponse.

Building from Source

You do not need to build this project locally to use Xpdf API (packages are available in the Maven Central Repository).

But if you wish to build anyway, all you need is JDK 8 and our provided Maven wrapper.

$ ./mvnw install -DskipTests

Getting Help

Join our Discord and post a message in the #help channel for quick feedback with any issues you may have.

Reporting Bugs

If you find a bug, please visit our GitHub Issues page and open a new issue.

If you find a security vulnerability, please navigate to our Security Policy for instructions on how to privately report it.

License

Xpdf API is Open Source software released under the GNU General Public License, version 3 (GPLv3) only.