Bug: PDF file upload failed - Could not initialize tesseract #614

azaylamba · 2024-12-04T13:47:36Z

I am getting the following error when uploading certain PDF files. It is reproducible every time with those files.

Uploads work fine for most other PDF files.

Starting file converter batch job
Workspace ID: d951f6fb-f8c0-4fa6-ad64-3d3a243154df
Document ID: 31c07ab2-434d-4bb1-b156-a90ee161010c
Input bucket name: devchatbotstack-ragenginesdataimportupload-6qhws4pdvker
Input object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/Introducing NitroX.pdf
Output bucket name: devchatbotstack-ragenginesdataimportproces-ptrkl9g0s1v7
Output object key: d951f6fb-f8c0-4fa6-ad64-3d3a243154df/31c07ab2-434d-4bb1-b156-a90ee161010c/content.txt
loader: <langchain_community.document_loaders.s3_file.S3FileLoader object at 0x7fce8b60a110>
(1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
Traceback (most recent call last):
  File "/app/main.py", line 81, in <module>
    main()
  File "/app/main.py", line 64, in main
    raise error
  File "/app/main.py", line 49, in main
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/langchain_community/document_loaders/s3_file.py", line 135, in _get_elements
    return partition(filename=file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/auto.py", line 341, in partition
    elements = partition_pdf(
               ^^^^^^^^^^^^^^
  File "/app/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 210, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 346, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 899, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 933, in _partition_pdf_or_image_with_ocr_from_image
    ocr_data = ocr_agent.get_layout_elements_from_image(image=image)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 217, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 96, in get_layout_elements_from_image
    ocr_regions = self.get_layout_from_image(image)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 50, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 596, in image_to_data
    return {
           ^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 598, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
                              ^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 573, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Sample file to reproduce the issue
FileUploadErrorSample.pdf

azaylamba · 2024-12-04T14:01:48Z

One observation: the issue seems to affect PDF files generated via the print function on Windows. For the files that fail, the PDF producer is Microsoft: Print to PDF.

charles-marion · 2024-12-04T15:27:51Z

@azaylamba ,

It looks like Tesseract needs the training data to convert these files.

Removing this line might fix the problem, but the Docker image will be bigger (and processing slower). Note that it's not the same folder.

https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5

An alternative solution, listed in Unstructured-IO/unstructured#3290 (comment), would be to run apk add tesseract-eng in the Dockerfile (but that issue seems resolved, so maybe this image is using an older base image?).
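
To confirm that missing training data is the cause, one could check inside the container whether eng.traineddata is actually present. The following is a minimal sketch, not part of the project: the helper name `find_traineddata` and its parameters are hypothetical, the default directory is taken from the error message above, and it only looks in TESSDATA_PREFIX and that default directory rather than reproducing Tesseract's full lookup logic.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Directory named in the error message above; other images may use a different one.
DEFAULT_TESSDATA_DIR = "/usr/share/tessdata"


def find_traineddata(
    lang: str = "eng",
    env: Optional[Mapping[str, str]] = None,
    default_dir: str = DEFAULT_TESSDATA_DIR,
) -> Optional[Path]:
    """Return the path of <lang>.traineddata if it exists, otherwise None.

    Hypothetical helper: checks TESSDATA_PREFIX (if set) and then default_dir.
    """
    env = os.environ if env is None else env
    # Skip TESSDATA_PREFIX when it is unset or empty.
    search_dirs = [d for d in (env.get("TESSDATA_PREFIX"), default_dir) if d]
    for directory in search_dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_traineddata("eng")
    print(f"eng.traineddata: {found if found else 'not found'}")
```

If this reports "not found" in the file-import image, the OCR fallback used for these PDFs would be expected to fail in the way shown in the traceback.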

azaylamba · 2024-12-05T11:21:07Z

Sample file to reproduce the issue
FileUploadErrorSample.pdf

azaylamba · 2024-12-05T11:29:15Z

@charles-marion I tried the latest version of unstructured, 0.16.9, but the issue persisted.

The issue is resolved after adding RUN apk add --no-cache tesseract-eng to https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile

So it seems tesseract-eng is required to process such PDF files.
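
After rebuilding the image, one way to verify that the language pack is visible is to ask Tesseract which languages it can load. The sketch below is an illustration only: it relies on the `tesseract --list-langs` command, and the function names `parse_langs` and `installed_langs` are hypothetical. It reads both stdout and stderr on the assumption that different Tesseract versions may print the list to either stream.

```python
import re
import subprocess
from typing import List

# Language entries look like "eng", "chi_sim" or "script/Latin" (no spaces).
_LANG_RE = re.compile(r"[A-Za-z0-9_/.-]+")


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Header and error lines contain spaces, so fullmatch rejects them.
        if line and _LANG_RE.fullmatch(line):
            langs.append(line)
    return langs


def installed_langs(binary: str = "tesseract") -> List[str]:
    """Return the languages Tesseract reports, or [] if the binary is missing."""
    try:
        proc = subprocess.run(
            [binary, "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return []
    return parse_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("eng available" if "eng" in langs else f"eng missing, found: {langs}")
```

Running this in the container before and after adding tesseract-eng should show whether the package change is what makes 'eng' available.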

azaylamba · 2024-12-05T11:35:58Z

@charles-marion Please let me know if you think this is the correct approach to fix this and you want me to raise a PR.

github-project-automation bot added this to AWS GenAI Chatbot Dec 4, 2024

azaylamba mentioned this issue Dec 7, 2024

bug: PDF file upload failed - Could not initialize tesseract #618

Open
