Refactoring all PDF loader and parser: community
Description: refactoring of PDF parsers and loaders. See below
Issue: missing locks, parameter inconsistency, missing lazy approach, split loader and parser, etc.
Twitter handle: pprados
Add tests and docs: docs/docs/integrations directory
Lint and test: done
Rationale
Even though Document has a page_content parameter (rather than text or body), we believe it’s not good practice to work with pages. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won’t be relevant. These chunks are unlikely to be selected when there’s a question specifically about the split paragraph. If one of the chunks is selected, there’s little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven’t properly removed them), images, or tables at the end of a page, as most current implementations tend to do.
Why is it important to unify the different parsers? Each has its own characteristics and strategies, more or less effective depending on the family of PDF files. One strategy is to identify the family of the PDF file (by inspecting the metadata or the content of the first page) and then select the most efficient parser for that case. By unifying parsers, downstream code doesn't need to deal with the specifics of each parser, as the result is similar for all of them. We'll propose a Parser using this strategy in another PR.
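To make the split-paragraph problem concrete, here is a minimal sketch in plain Python. It does not use any parser; the two page strings and both helper functions are hypothetical and only illustrate the difference between chunking per page and chunking the joined text flow.

```python
# Two consecutive pages; the last paragraph of page 1 continues on page 2.
pages = [
    "Introduction.\n\nThe warranty covers parts and labour for",
    "a period of two years from delivery.\n\nConclusion.",
]


def chunks_per_page(pages):
    """Page-based approach: each page is split into paragraphs on its own."""
    return [paragraph for page in pages for paragraph in page.split("\n\n")]


def chunks_from_flow(pages, joiner=" "):
    """Flow-based approach: join the pages first, then split into paragraphs."""
    return joiner.join(pages).split("\n\n")


print(chunks_per_page(pages))   # the warranty sentence is cut into two chunks
print(chunks_from_flow(pages))  # the warranty sentence stays in one chunk
```

In the page-based result, neither half of the warranty sentence is a useful chunk for embedding; in the flow-based result the sentence is intact.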
The PR
We propose a substantial PR to improve the different PDF parser integrations. All my clients struggle with PDFs. I took the initiative to address this issue at its root by refactoring the various integrations of Python PDF parsers. The goal is to standardize a minimum set of parameters and metadata and bring improvements to each one (bug fixes, feature additions).
Don't worry about the size of the PR. In the end, there are only two modified files. The rest is just updating unit tests and docs.
langchain_community/document_loaders/pdf.py
langchain_community/document_loaders/parsers/pdf.py
langchain_community/tests/integration_tests/document_loaders/pdf.py
langchain_community/tests/integration_tests/document_loaders/parsers/pdf.py
docs/docs/integrations/document_loaders/*pdf*.ipynb
docs/docs/how_to/document_loader_pdf.ipynb
docs/docs/how_to/document_loader_custom.ipynb
We understand that it's important to ensure that changes don't have a significant impact on existing code. To qualify all the code, we therefore worked in a separate, parallel project that uses the langchain-common structure. It lets us run PDF readings before and after the modifications and compare the results of the historical implementation with the new ones. You'll find all the files here. The only difference is the name used to import the classes.
The whole parallel project is available here. Open the compare_old_new directory in your development environment and use a diff tool to identify the differences.
Metadata
All parsers use lowercase keys for PDF file metadata, except PDFPlumberParser. For this particular case, we've added a dictionary wrapper that warns when keys with upper-case letters are used.
Images
The current implementation in LangChain involves asking each parser for the text on a page, then retrieving images to apply OCR. The text extracted from images is then appended to the end of the page text, which may split paragraphs across pages, worsening the RAG model’s performance.
To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between two paragraphs of text (\n\n or \n), just before the end of the page. This allows a half-paragraph to be combined with the first paragraph of the following page.
Currently, the LangChain implementation uses RapidOCR to analyze images and extract any text. This algorithm is designed to work with Chinese and English, not other languages. Since the implementation uses a function rather than a method, it’s not possible to override it. We have modified the various parsers to allow for selecting the algorithm to analyze images. Now, it’s possible to use RapidOCR, Tesseract, or invoke a multimodal LLM to get a description of the image.
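The insertion strategy described above (image text placed between two paragraphs, just before the end of the page) can be sketched as follows. This is an illustration only: the function name and the fallback behaviour are assumptions, not the code of the PR.

```python
from typing import List


def insert_before_last_paragraph(page_text: str, image_texts: List[str]) -> str:
    """Insert image-derived text before the last paragraph of a page.

    Hypothetical helper: it looks for the last paragraph break ("\n\n"),
    then for the last line break ("\n"), and inserts the image text there,
    so the trailing (possibly unfinished) paragraph stays at the page end.
    """
    extras = "\n\n".join(t.strip() for t in image_texts if t.strip())
    if not extras:
        return page_text
    for sep in ("\n\n", "\n"):
        pos = page_text.rfind(sep)
        if pos != -1:
            head, tail = page_text[:pos], page_text[pos + len(sep):]
            return head + "\n\n" + extras + sep + tail
    # Assumption: with no break at all, fall back to appending the image text.
    return page_text + "\n\n" + extras


page = "First paragraph.\n\nThis sentence continues on"
print(insert_before_last_paragraph(page, ["[text read from a figure]"]))
```

Because the unfinished sentence remains the last thing on the page, joining this page with the next one reconnects it with its continuation.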
To standardize this, we propose a new abstract class:
For converting images to text, the possible formats are: text, markdown, and HTML. Why is this important? If it’s necessary to split a result based on the origin of the text fragments, it’s possible to do so at the level of image conversions. An identification rule such as ![text](...) or <img …/> allows us to identify text fragments originating from an image.
Tables
Tables present in PDF files are another challenge. Some algorithms can detect part of them. This typically involves a specialized process, separate from the text flow. That is, the text extracted from the page includes each cell's content, sometimes in columns, sometimes in rows. This text is challenging for the LLM to interpret. Depending on the capabilities of the libraries, it may be possible to detect tables, then identify the cell boxes during text extraction to inject the table in its entirety. This way, the flow remains coherent. It’s even possible to add a few paragraphs before and after the table to prompt an LLM to describe it. Only the description of the table will be used for embedding.
Tables identified in PDF pages can be translated into markdown (if there are no merged cells) or HTML (which consumes more tokens). LLMs can then make use of them.
Unfortunately, this approach isn’t always feasible. In such cases, we can apply the approach used for images, by injecting tables and images between two paragraphs in the page’s text flow. This is always better than placing them at the end of the page.
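As an illustration of the markdown translation mentioned above, here is a minimal sketch that renders a table without merged cells. The input shape (a list of rows, the first one being the header) and the function name are assumptions; each PDF library returns tables in its own structure.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a markdown table.

    Hypothetical helper; it assumes no merged cells, as noted in the text.
    """
    def cell(value):
        # Empty cells are often reported as None; escape pipes, flatten newlines.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    def line(cells):
        return "| " + " | ".join(cell(c) for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator] + [line(r) for r in body])


print(table_to_markdown([["Item", "Qty"], ["Bolt", 4], ["Nut", None]]))
```

The resulting string can then be injected into the text flow between two paragraphs, in the same way as image text.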
Combining Pages
As mentioned, in a RAG project, we want to work with the text flow of a document, rather than by page. A mode is dedicated to this, which can be configured to specify the character to use for page delimiters in the flow. This could simply be \n, ------\n or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection in a Markdown viewer without a visual effect.
Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk’s location when the LLM answers. While it’s possible to reference the entire PDF, this isn’t practical if it’s more than two pages long. It’s better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm calculate this parameter.
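Here is a minimal sketch of how a chunking step could derive the page metadata from the delimiter. The fixed-size chunking, the zero-based page index, and the function name are assumptions made for illustration; a real splitter would cut on paragraph boundaries.

```python
def chunk_with_pages(flow, pages_delimiter="\f", chunk_size=20):
    """Cut a single-document text flow into chunks and tag each with its page.

    Hypothetical helper: the page of a chunk is the number of delimiters
    found before the chunk's first character (zero-based). It assumes the
    delimiter is a single character, so it cannot straddle two chunks.
    """
    chunks = []
    for start in range(0, len(flow), chunk_size):
        piece = flow[start:start + chunk_size]
        page = flow.count(pages_delimiter, 0, start)
        chunks.append(
            {
                "page_content": piece.replace(pages_delimiter, "\n"),
                "metadata": {"page": page},
            }
        )
    return chunks


flow = "A" * 30 + "\f" + "B" * 30  # two pages joined with the delimiter
for chunk in chunk_with_pages(flow):
    print(chunk["metadata"]["page"], repr(chunk["page_content"]))
```

A chunk that straddles a page break is attributed to the page where it starts, which is usually the page a reader should open.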
Similarly, we’ve added metadata in all parsers with the total number of pages in the document. Why is this important? If we want to reference a document, we need to determine if it’s relevant. A reference is valid if it helps the user quickly locate the fragment within the document (using the page and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for various reasons) and the file has a large number of pages, we want to remove the reference that doesn’t assist the user. There’s no point in referencing a 100-page document! The total_pages metadata can then be used. We recommend this approach in an extension to LangChain that we propose for managing document references: langchain-reference.
Compatibility
We have tried, as much as possible, to maintain compatibility with the previous version. This is reflected in preserving the order of parameters and using the default values for each implementation so that the results remain similar. The unit and integration tests for the various parsers have not been modified; they are still valid.
Ideally, we would prefer an interface like:
but this could break compatibility for positional arguments.
Perhaps it would be feasible to plan a migration for LangChain v1.0 by modifying the default parameters to make them mandatory during the transition to v1.0. At that point, we could reintroduce default values.
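To illustrate why such a change breaks positional callers, here is a sketch that assumes the preferred interface would rely on keyword-only parameters. The class below is entirely hypothetical (name, parameter order, and default values included) and is not the interface proposed in the PR.

```python
class SketchPDFLoader:
    """Hypothetical loader: everything after file_path is keyword-only."""

    def __init__(
        self,
        file_path,
        *,  # parameters after this marker can no longer be passed positionally
        password=None,
        mode="page",
        pages_delimiter="\f",
        extract_images=False,
    ):
        self.file_path = file_path
        self.password = password
        self.mode = mode
        self.pages_delimiter = pages_delimiter
        self.extract_images = extract_images


# Keyword arguments keep working whatever the parameter order is.
loader = SketchPDFLoader("report.pdf", password="secret", mode="single")
print(loader.mode)

# Existing code that passes the second argument positionally would now fail.
try:
    SketchPDFLoader("report.pdf", "secret")
except TypeError as exc:
    print("positional call rejected:", exc)
```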
Normalisation
The AzureAIDocumentIntelligenceParser class introduces the mode parameter, which accepts the values single, page, and markdown.
The deprecated UnstructuredPDFLoader class introduces the mode parameter, which accepts the values single, paged, and markdown.
Based on this model, we are extending the mode parameter to most parsers, with the values single, page, and markdown. paged is declared deprecated.
The different Loader and BlobParser classes now offer the following parameters:
file_path: str or PurePath with the file name.
password: str with the file password, if needed.
mode: to return a single document per file or one document per page (extended with elements in the case of Unstructured or other specific parsers).
pages_delimiter: to specify how to join pages (\f by default).
extract_images: to enable image extraction (already present in most Loaders/Parsers).
images_to_text: to specify how to handle images (invoking OCR, LLM, etc.).
extract_tables: to allow extraction of tables detected by underlying libraries, for certain parsers.
The integration of image texts is now between two paragraphs.
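The handling of mode and pages_delimiter described above can be sketched like this. Both function names are hypothetical, and the sketch only covers the single and page values; it is not the code of the parsers.

```python
import warnings

_MODES = {"single", "page", "markdown"}


def normalize_mode(mode):
    """Map the deprecated 'paged' value to 'page' and validate the result."""
    if mode == "paged":
        warnings.warn(
            "mode='paged' is deprecated, use mode='page'",
            DeprecationWarning,
            stacklevel=2,
        )
        mode = "page"
    if mode not in _MODES:
        raise ValueError(f"mode must be one of {sorted(_MODES)}, got {mode!r}")
    return mode


def combine_pages(page_texts, mode="page", pages_delimiter="\f"):
    """Return one text per document ('single') or one text per page ('page')."""
    mode = normalize_mode(mode)
    if mode == "single":
        return [pages_delimiter.join(page_texts)]
    if mode == "page":
        return list(page_texts)
    raise ValueError(f"mode {mode!r} is not covered by this sketch")


print(combine_pages(["page one", "page two"], mode="single"))
print(combine_pages(["page one", "page two"], mode="page"))
```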
For the images_to_text parameter, we propose three functions:
convert_images_to_text_with_rapidocr()
convert_images_to_text_with_tesseract()
convert_images_to_description()
Here’s how it’s used:
Tables
Some parsers are able to extract tables, but this is not integrated into LangChain. We've added the necessary features to take this into account.
PyMuPDFLoader
PDFPlumberLoader
ZeroxPDFLoader
UnstructuredPDFLoader
Metadata
The different parsers offer a minimum set of common metadata:
source
page
total_pages
creationdate
creator
producer
All keys are in lowercase.
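A dictionary wrapper of the kind mentioned for PDFPlumberParser could look like the sketch below: keys are stored in lowercase, and a lookup with upper-case letters still works but emits a warning. The class name and the exact behaviour are assumptions, not the wrapper shipped in the PR.

```python
import warnings


class LowerKeyMetadata(dict):
    """Hypothetical metadata dict: lowercase keys, warning on other casing."""

    def __init__(self, data=None):
        super().__init__({k.lower(): v for k, v in (data or {}).items()})

    def _normalize(self, key):
        if isinstance(key, str) and key != key.lower():
            warnings.warn(
                f"Metadata key {key!r} should be written {key.lower()!r}",
                DeprecationWarning,
                stacklevel=3,
            )
            return key.lower()
        return key

    def __getitem__(self, key):
        return super().__getitem__(self._normalize(key))

    def get(self, key, default=None):
        # dict.get() does not go through __getitem__, so it is overridden too.
        return super().get(self._normalize(key), default)

    def __contains__(self, key):
        return super().__contains__(self._normalize(key))


metadata = LowerKeyMetadata({"Producer": "SomeTool", "CreationDate": "2024-01-01"})
print(sorted(metadata))       # keys are stored in lowercase
print(metadata["producer"])   # preferred spelling, no warning
print(metadata["Producer"])   # still works, but emits a DeprecationWarning
```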
Tests
We propose matrix tests to validate all parsers compatible with the new approach.
test_standard_parameters()
test_parser_with_table()
To validate all the parsers, we retrieved all the PDF files used by each parser for its own tests, and invoked all the parsers from langchain, along with all these files. This ensures that there are no crashes when parsing a PDF file.
New features of parsers
We summarize the modifications for each parser.
New loaders / parsers
New parsers will be introduced in a separate pull request.
UnstructuredPDF
LlamaIndexPDF
PyMuPDF4LLM
PDFRouter
DoclingPDF
PDFMulti
For example, with the unification of parsers, it will be possible to choose the parser according to the characteristics of the PDF file.
This will be present in other PRs.
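Pending those PRs, the routing idea can be sketched in a few lines: inspect the PDF metadata and pick a parser name from the first matching rule. Everything here is hypothetical (the function, the rules, and the parser names); it does not describe the PDFRouter that will be proposed.

```python
import re


def pick_parser(metadata, routes, default):
    """Return the parser name of the first route whose pattern matches.

    Each route is (metadata_key, regular_expression, parser_name).
    """
    for key, pattern, parser_name in routes:
        if re.search(pattern, str(metadata.get(key, "")), re.IGNORECASE):
            return parser_name
    return default


# Illustrative rules only; real rules would come from measuring each parser.
routes = [
    ("producer", r"microsoft.*word", "parser_for_office_exports"),
    ("creator", r"latex", "parser_for_scientific_papers"),
]

print(pick_parser({"producer": "Microsoft Word 2019"}, routes, "default_parser"))
print(pick_parser({"creator": "LaTeX with hyperref"}, routes, "default_parser"))
print(pick_parser({"producer": "Unknown"}, routes, "default_parser"))
```

This only works because the unified parsers expose the same lowercase metadata keys and return comparable results, so the calling code does not care which parser was selected.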