Bug: PDF file upload failed - Could not initialize tesseract #614
Comments
One observation is that the issue seems to affect PDF files generated via the print function on a Windows system. The PDF producer is
It looks like it needs the training data to convert these files. Removing this line might fix the problem, but the Docker image will be bigger (and processing slower). Note it's not the same folder. https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5 An alternative solution listed here would be to run Unstructured-IO/unstructured#3290 (comment)
Sample file to reproduce the issue
@charles-marion I tried with the latest version.
The issue is resolved after adding
So it seems tesseract-eng is required to process such PDF files.
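For anyone wanting to confirm the diagnosis inside the container before changing the Dockerfile, here is a minimal sketch that looks for the English training data file. It assumes Tesseract reads `eng.traineddata` from a directory given by `TESSDATA_PREFIX` or from a conventional install location. The function name `find_traineddata` and the directory list are hypothetical and not part of this project; the actual paths depend on the base image and Tesseract version.

```python
import os
from pathlib import Path

# Assumed candidate locations; real paths vary by distribution and version.
DEFAULT_TESSDATA_DIRS = [
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/local/share/tessdata",
]


def find_traineddata(lang="eng", search_dirs=None, env=None):
    """Return the Path of <lang>.traineddata if it exists, else None."""
    env = os.environ if env is None else env
    candidates = []

    # TESSDATA_PREFIX may point at the tessdata directory itself or at
    # its parent, depending on the Tesseract version, so check both.
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        candidates.append(Path(prefix))
        candidates.append(Path(prefix) / "tessdata")

    dirs = DEFAULT_TESSDATA_DIRS if search_dirs is None else search_dirs
    candidates.extend(Path(d) for d in dirs)

    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = find_traineddata("eng")
    if found is None:
        print("eng.traineddata not found; OCR initialization will likely fail")
    else:
        print(f"found {found}")
```

If this reports the file as missing in the file-import image, that would be consistent with the "Could not initialize tesseract" error on PDFs that need OCR.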
@charles-marion Please let me know if you think this is the correct approach to fix this and whether you want me to raise a PR.
I am getting the following error while uploading certain PDF files. It is reproducible every time with those files.
Uploads work fine for most other PDF files.
Sample file to reproduce the issue
FileUploadErrorSample.pdf