PDF Text Reader

Dead simple PDF text reader.

Install

npm install pdf-text-reader

Usage

Read all pages into a single string with readPdfText:

import {readPdfText} from 'pdf-text-reader';

async function main() {
    const pdfText: string = await readPdfText({url: 'path/to/pdf/file.pdf'});
    console.info(pdfText);
}

main();

Read a PDF into individual pages with readPdfPages:

import {readPdfPages} from 'pdf-text-reader';

async function main() {
    const pages = await readPdfPages({url: 'path/to/pdf/file.pdf'});
    console.info(pages[0]?.lines);
}

main();

See the types for detailed argument and return value types.
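
As a small illustration of working with the per-page output, the sketch below searches pages for a term. It does not call the library itself: `PageLike` and `findPagesContaining` are hypothetical names, and the sketch assumes each page exposes `lines` as an array of strings, as the `pages[0]?.lines` access above suggests. Check the package's types before relying on that shape.

```typescript
// Hypothetical minimal shape of a page: only the `lines` field used in the
// usage example above. Assumption: each entry in `lines` is a string.
interface PageLike {
    lines: string[];
}

// Returns the 1-indexed page numbers whose text contains the search term
// (case-insensitive).
function findPagesContaining(pages: ReadonlyArray<PageLike>, term: string): number[] {
    const needle = term.toLowerCase();
    const matches: number[] = [];
    pages.forEach((page, index) => {
        if (page.lines.some((line) => line.toLowerCase().includes(needle))) {
            matches.push(index + 1);
        }
    });
    return matches;
}
```

Under that assumption, the result of readPdfPages could be passed directly as the `pages` argument.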

Details

This uses Mozilla's pdf.js package through its pdfjs-dist distribution on npm.

This package reads the output of pdfjs.getDocument and sorts the text into lines based on its position in the document. It also inserts spaces between pieces of text on the same line that are far apart horizontally, and new lines between lines that are far apart vertically.

Example:

The text below in a PDF will be read with spaces between the cells, even if the PDF itself contains no space characters there.

cell 1               cell 2                 cell 3

The number of spaces to insert comes from a naive but simple calculation: Math.ceil(distance-between-text / text-height).
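
The calculation above can be sketched as follows. This is an illustration, not the package's actual source: `Fragment`, `countSpaces`, and `joinFragments` are hypothetical names, and measuring the gap from the end of one fragment to the start of the next, using a single text height for the whole line, and returning zero for non-positive values are all assumptions made for the example.

```typescript
// Hypothetical shape for a piece of text on one line (not a pdfjs type).
interface Fragment {
    text: string;
    x: number; // left edge of the fragment
    width: number; // horizontal extent of the fragment
}

// The formula from the text above: Math.ceil(distance / textHeight).
// Assumption: non-positive distances or heights produce no spaces.
function countSpaces(distanceBetweenText: number, textHeight: number): number {
    if (distanceBetweenText <= 0 || textHeight <= 0) {
        return 0;
    }
    return Math.ceil(distanceBetweenText / textHeight);
}

// Joins fragments left to right, padding each gap with the computed spaces.
function joinFragments(fragments: ReadonlyArray<Fragment>, textHeight: number): string {
    const sorted = [...fragments].sort((a, b) => a.x - b.x);
    let line = '';
    let previousEnd: number | undefined;
    for (const fragment of sorted) {
        if (previousEnd !== undefined) {
            line += ' '.repeat(countSpaces(fragment.x - previousEnd, textHeight));
        }
        line += fragment.text;
        previousEnd = fragment.x + fragment.width;
    }
    return line;
}
```

For example, with a text height of 10, a gap of 65 units between two fragments yields Math.ceil(6.5) = 7 spaces, while even a very small positive gap yields one space.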

Low Level Control

If you need lower-level parsing control, you can use the exported parsePageItems function, which reads one page at a time, as shown below. readPdfPages uses this function internally, so the output is identical for the same PDF page.

You may need to independently install the pdfjs-dist npm package for this to work.

import * as pdfjs from 'pdfjs-dist';
import type {TextItem} from 'pdfjs-dist/types/src/display/api';
import {parsePageItems} from 'pdf-text-reader';

async function main() {
    const doc = await pdfjs.getDocument('myDocument.pdf').promise;
    const page = await doc.getPage(1); // 1-indexed
    const content = await page.getTextContent();
    const items: TextItem[] = content.items.filter((item): item is TextItem => 'str' in item);
    const parsedPage = parsePageItems(items);
    console.info(parsedPage.lines);
}

main();
