The immediate font fallback solution addresses the Cyrillic character issue, but overall text extraction quality could still be improved.
The extraction pipeline could be enhanced to:
1. Extract/OCR the first page only
2. Detect the language
3. Re-run extraction/OCR with language-specific optimizations
This would improve accuracy for documents in various languages, but it requires more significant changes to the extraction pipeline and adds processing time. It could be considered as a separate enhancement.
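The three steps could be wired together roughly as follows. This is a minimal sketch and not the project's actual pipeline: the `Extractor` signature, `detect_script`, and `extract_with_language_pass` are hypothetical names, and the code-point check is a crude stand-in for real language detection (it only separates Cyrillic from Latin).

```python
from typing import Callable, Optional

# Hypothetical signature: (pdf_path, first_page_only, language_hint) -> extracted text.
Extractor = Callable[[str, bool, Optional[str]], str]


def detect_script(text: str) -> str:
    """Crude stand-in for language detection: pick the dominant script."""
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cyrillic == 0 and latin == 0:
        return "unknown"
    return "cyrillic" if cyrillic > latin else "latin"


def extract_with_language_pass(pdf_path: str, extract: Extractor) -> str:
    # Step 1: cheap pass over the first page only, with no language hint.
    sample = extract(pdf_path, True, None)
    # Step 2: guess the script from that sample.
    script = detect_script(sample)
    # Step 3: full pass with the hint; pass no hint when detection failed.
    hint = None if script == "unknown" else script
    return extract(pdf_path, False, hint)
```

The extractor is passed in as a callable so the two-pass logic can be tried against whatever extraction or OCR backend is in use. The extra first-page pass is where the additional processing time mentioned above comes from.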
Description
PDFs containing Cyrillic text (without embedded fonts) show corrupted characters after processing due to Ghostscript's limited fallback font support.
Visual Evidence
Before processing:
After processing:
Anonymized Test PDF:
input.pdf
Technical Details
Ghostscript currently uses a fallback font with limited character support:
/usr/share/ghostscript/10.02.1/Resource/CIDFSubst/DroidSansFallback.ttf
Solution
Use Noto Sans as fallback font for better Unicode coverage:
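The original snippet for this step did not survive on the page. One possible way to make the swap, offered as an assumption and not as the fix actually applied here, is to back up the fallback font file and overwrite it with a Noto Sans file. The helper name `replace_fallback_font` is hypothetical, and whether Ghostscript picks up the replaced file should be verified against the test PDF.

```python
import shutil
from pathlib import Path


def replace_fallback_font(fallback: Path, replacement: Path) -> Path:
    """Overwrite the fallback font file with another font, keeping a backup.

    Returns the path of the backup copy of the original fallback font.
    """
    if not replacement.is_file():
        raise FileNotFoundError(f"replacement font not found: {replacement}")
    backup = fallback.with_name(fallback.name + ".orig")
    # Keep the first original only, so repeated runs do not clobber the backup.
    if fallback.is_file() and not backup.exists():
        shutil.copy2(fallback, backup)
    # Copy the replacement over the path Ghostscript is configured to read.
    shutil.copyfile(replacement, fallback)
    return backup
```

For the setup described above, `fallback` would be `/usr/share/ghostscript/10.02.1/Resource/CIDFSubst/DroidSansFallback.ttf`. The location of the Noto Sans file depends on the distribution and font package, so it is left as a parameter. Writing to that directory normally requires root.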
Environment