🚀 Feature: OCR #924
Comments
I am unable to optimize the code myself and open a pull request. The function worked on my computer, but very slowly. If anyone can take on this improvement, I would be grateful. I believe it would be a substantial improvement to the tool, not only for me but for several other usage scenarios.
Appreciate the effort, @Fagner-lourenco
🔖 Feature description
I suggest enabling PDF file ingestion with OCR. I am evaluating the project for use in the legal field, where many documents are scanned images with no searchable text, so OCR is required to extract the text. The idea: if the number of characters extracted from a page is less than X, OCR is triggered.
🎤 Why is this feature needed?
I wrote this code, but I am an amateur and did not consider speed or performance. It would be great if you could review these functions and implement them in an optimized way that does not hurt performance. The code checks whether standard text extraction yields fewer than X characters. If it does, the page is most likely an image, so OCR is triggered. Does that make sense?
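The threshold check described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this issue: the names `needs_ocr`, `extract_pages`, `pdf_pages`, and the default `min_chars=50` are hypothetical, and the PyMuPDF and pytesseract wiring is one possible way to plug in real pages. OCR is wrapped in a callable so that it runs only for pages that fall below the threshold, which is where most of the time would go.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A page is represented as (embedded_text, run_ocr), where run_ocr is a
# zero-argument callable that performs the (slow) OCR only when invoked.
Page = Tuple[str, Callable[[], str]]


def needs_ocr(text: str, min_chars: int = 50) -> bool:
    """Return True when the embedded text is shorter than the threshold X."""
    return len(text.strip()) < min_chars


def extract_pages(pages: Iterable[Page], min_chars: int = 50) -> List[str]:
    """Use the embedded text where it is long enough, otherwise fall back to OCR."""
    results = []
    for embedded_text, run_ocr in pages:
        if needs_ocr(embedded_text, min_chars):
            results.append(run_ocr())  # OCR runs only for this page
        else:
            results.append(embedded_text)
    return results


def pdf_pages(path: str, dpi: int = 300) -> Iterator[Page]:
    """Hypothetical adapter: yield (text, run_ocr) pairs for each PDF page.

    Third-party imports are kept local so the threshold logic above can be
    used and tested without PyMuPDF or Tesseract installed.
    """
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(path) as doc:
        for page in doc:
            def run_ocr(page=page) -> str:
                # Render the page to an RGB bitmap, then hand it to Tesseract.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                return pytesseract.image_to_string(image)

            yield page.get_text(), run_ocr
```

With this shape, `extract_pages(pdf_pages("contract.pdf"), min_chars=50)` would rasterize and OCR only the pages whose text layer is nearly empty (the file name is a placeholder). The right value for X depends on the documents, so it is left as a parameter.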
✌️ How do you aim to achieve this?
docs_parser.py
from pathlib import Path
from typing import Dict

import fitz  # PyMuPDF
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

from application.parser.file.base_parser import BaseParser


class PDFParser(BaseParser):
    """PDF parser with optional OCR support."""
🔄️ Additional Information
No response
👀 Have you spent some time to check if this feature request has been raised before?
Are you willing to submit PR?
Yes I am willing to submit a PR!