Bug: PDF file upload failed - Could not initialize tesseract #614

Open
azaylamba opened this issue Dec 4, 2024 · 5 comments

@azaylamba
Contributor

azaylamba commented Dec 4, 2024

I get the following error when uploading certain PDF files. It reproduces every time with those files, while most other PDF files upload without problems.

Starting file converter batch job
Workspace ID: d951f6fb-f8c0-4fa6-ad64-3d3a243154df
Document ID: 31c07ab2-434d-4bb1-b156-a90ee161010c
Input bucket name: devchatbotstack-ragenginesdataimportupload-6qhws4pdvker
Input object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/Introducing NitroX.pdf
Output bucket name: devchatbotstack-ragenginesdataimportproces-ptrkl9g0s1v7
Output object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/31c07ab2-434d-4bb1-b156-a90ee161010c/content.txt
loader: <langchain_community.document_loaders.s3_file.S3FileLoader object at 0x7fce8b60a110>
(1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
Traceback (most recent call last):
  File "/app/main.py", line 81, in <module>
    main()
  File "/app/main.py", line 64, in main
    raise error
  File "/app/main.py", line 49, in main
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/s3_file.py", line 135, in _get_elements
    return partition(filename=file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/auto.py", line 341, in partition
    elements = partition_pdf(
               ^^^^^^^^^^^^^^
  File "/app/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 210, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 346, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 899, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 933, in _partition_pdf_or_image_with_ocr_from_image
    ocr_data = ocr_agent.get_layout_elements_from_image(image=image)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 217, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 96, in get_layout_elements_from_image
    ocr_regions = self.get_layout_from_image(image)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 50, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 596, in image_to_data
    return {
           ^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 598, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
                              ^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 573, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Sample file to reproduce the issue
FileUploadErrorSample.pdf
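
The error message names the exact file Tesseract could not open, so a quick check inside the container can confirm whether the language data is really missing. This is a minimal sketch, assuming the default directory shown in the log (`/usr/share/tessdata`) and assuming that `TESSDATA_PREFIX`, when set, points directly at the tessdata directory. `candidate_dirs` and `find_traineddata` are hypothetical helpers, not part of unstructured or pytesseract.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping, Optional

# Default directory taken from the error message in the log above.
DEFAULT_TESSDATA_DIR = "/usr/share/tessdata"


def candidate_dirs(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Directories to search: TESSDATA_PREFIX first (if set), then the default."""
    env = os.environ if env is None else env
    dirs = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        dirs.append(prefix)
    dirs.append(DEFAULT_TESSDATA_DIR)
    return dirs


def find_traineddata(
    lang: str = "eng", dirs: Optional[Iterable[str]] = None
) -> Optional[Path]:
    """Return the path of <lang>.traineddata if it exists in any directory, else None."""
    for directory in candidate_dirs() if dirs is None else dirs:
        path = Path(directory) / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_traineddata("eng")
    print(found if found else "eng.traineddata not found")
```

If this prints "eng.traineddata not found" in the file-import container, the image is missing the English language data rather than the PDF itself being corrupt.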

@azaylamba
Contributor Author

One observation: the issue seems to affect PDF files generated with the print function on Windows. For every file that fails, the PDF producer is "Microsoft: Print to PDF".
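
To test this observation across a batch of files, the Producer field can be read from each PDF. This is a rough, stdlib-only sketch. It assumes the Info dictionary is stored uncompressed and that the Producer value is a plain literal string; PDFs that keep metadata in compressed object streams or in hex or UTF-16 strings would need a real PDF library instead. `pdf_producer` and `is_print_to_pdf` are hypothetical helper names.

```python
import re
from typing import Optional

# Matches "/Producer (some text)" where the value is a literal string.
# Escaped or nested parentheses inside the value are not handled.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(([^()]*)\)")


def pdf_producer(data: bytes) -> Optional[str]:
    """Return the Producer value found in raw PDF bytes, or None if absent."""
    match = _PRODUCER_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("latin-1").strip()


def is_print_to_pdf(data: bytes) -> bool:
    """True if the Producer value mentions 'Print to PDF' (case-insensitive)."""
    producer = pdf_producer(data)
    return producer is not None and "print to pdf" in producer.lower()
```

Running `is_print_to_pdf(open(path, "rb").read())` over the failing and the working uploads would show whether the producer really separates the two groups.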

@charles-marion
Collaborator

charles-marion commented Dec 4, 2024

@azaylamba,

It looks like Tesseract needs its training data to convert these files.

Removing this line might fix the problem, but the Docker image will be bigger (and processing slower). Note that it's not the same folder.

https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5

An alternative, listed in Unstructured-IO/unstructured#3290 (comment), would be to run apk add tesseract-eng in the Dockerfile (that issue appears to be resolved, though, so maybe this image uses an older base image?)

@azaylamba
Contributor Author

Sample file to reproduce the issue
FileUploadErrorSample.pdf

@azaylamba
Contributor Author

@charles-marion I tried the latest version of unstructured (0.16.9), but the issue persisted.

The issue is resolved after adding RUN apk add --no-cache tesseract-eng to https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile

So it seems tesseract-eng is required to process such PDF files.
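
After rebuilding the image, one way to confirm the fix is to ask Tesseract which languages it can load. This is a minimal sketch, assuming the `tesseract` binary is on `PATH`. It uses the `--list-langs` flag and treats every non-empty line other than the "List of available languages" header as a language code, so error text printed by a broken install would be misread as languages. `parse_list_langs` and `installed_languages` are hypothetical helper names.

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line that precedes the codes.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Run `<binary> --list-langs` and return the language codes it reports."""
    proc = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)
```

Inside the rebuilt container, `"eng" in installed_languages()` should be true once tesseract-eng is installed.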

@azaylamba
Contributor Author

@charles-marion Please let me know whether you think this is the correct fix and whether you would like me to raise a PR.
