Cyrillic characters mangled in PDF text extraction due to limited Ghostscript font fallback #2921

Open
tiborrr opened this issue Jan 15, 2025 · 1 comment

tiborrr commented Jan 15, 2025

Description

PDFs that contain Cyrillic text without embedded fonts come out with corrupted characters after processing, because Ghostscript's default fallback font has limited glyph coverage.

Visual Evidence

Before processing:
image

After processing:
image

Anonymized Test PDF:
input.pdf

Technical Details

Ghostscript currently uses a fallback font with limited character support:
/usr/share/ghostscript/10.02.1/Resource/CIDFSubst/DroidSansFallback.ttf
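
One way to check a font's coverage before and after swapping the fallback is to ask fontconfig which code points the file provides. This is a rough sketch, assuming `fc-query` is installed and prints the charset as space-separated hex ranges. The helper names `charset_contains` and `font_covers` are made up for illustration.

```bash
# Return 0 if the hex code point $2 (e.g. 0410) falls inside the
# space-separated list of hex ranges in $1 (e.g. "20-7e 400-45f 2116").
charset_contains() {
    local ranges cp r lo hi
    ranges=$1
    cp=$((0x$2))
    for r in $ranges; do
        lo=${r%%-*}   # text before the dash (the whole item if no dash)
        hi=${r##*-}   # text after the dash (the whole item if no dash)
        if [ "$cp" -ge "$((0x$lo))" ] && [ "$cp" -le "$((0x$hi))" ]; then
            return 0
        fi
    done
    return 1
}

# Return 0 if the font file $1 provides a glyph for hex code point $2.
# Assumption: fc-query's %{charset} element yields hex ranges.
font_covers() {
    charset_contains "$(fc-query --format='%{charset}' "$1")" "$2"
}

# Example (U+0410 is CYRILLIC CAPITAL LETTER A):
#   font_covers /usr/share/fonts/truetype/noto/NotoSans-Regular.ttf 0410 \
#       && echo "covered" || echo "missing"
```

Running the same check against the DroidSansFallback.ttf path above would show whether the glyphs in question are actually missing from it.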

Solution

Use Noto Sans as fallback font for better Unicode coverage:

  1. Download and install the font:
wget https://github.com/notofonts/notofonts.github.io/raw/refs/heads/main/fonts/NotoSans/hinted/ttf/NotoSans-Regular.ttf
sudo mkdir -p /usr/share/fonts/truetype/noto
sudo mv NotoSans-Regular.ttf /usr/share/fonts/truetype/noto/
sudo fc-cache -f -v
  2. Configure Ghostscript:
% In /usr/share/ghostscript/10.02.1/Resource/Init/cidfmap
/CIDFallBack (/usr/share/fonts/truetype/noto/NotoSans-Regular.ttf) ;
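
The cidfmap edit can be scripted so that re-running it does not pile up duplicate entries. This is a minimal sketch that reuses the entry syntax shown above as-is. `set_cid_fallback` is a hypothetical helper name, and the script needs write access to the cidfmap file (typically root).

```bash
# Point /CIDFallBack in the cidfmap file $1 at the font file $2.
# Keeps a copy of the previous file as "$1.bak".
set_cid_fallback() {
    local map font
    map=$1
    font=$2
    cp -p "$map" "$map.bak" || return 1
    # Drop any earlier /CIDFallBack line; grep exits 1 when nothing is
    # left to print, which is not an error here.
    grep -v '^/CIDFallBack[[:space:]]' "$map.bak" > "$map" \
        || [ "$?" -eq 1 ] || return 1
    printf '/CIDFallBack (%s) ;\n' "$font" >> "$map"
}

# Example, using the paths from this issue:
#   set_cid_fallback \
#       /usr/share/ghostscript/10.02.1/Resource/Init/cidfmap \
#       /usr/share/fonts/truetype/noto/NotoSans-Regular.ttf
```

Afterwards, text extraction can be re-run on the test PDF to confirm the Cyrillic characters survive.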

Environment

  • Operating System: Linux
  • Ghostscript version: 10.02.1

tiborrr commented Jan 15, 2025

Future Enhancement Note

While the immediate font fallback solution addresses the Cyrillic character issue, there's potential for improving the overall text extraction quality:

The extraction pipeline could be enhanced to:

  1. Extract/OCR first page only
  2. Detect language
  3. Re-run extraction/OCR with language-specific optimizations

This would improve accuracy for documents in various languages, but it requires more significant changes to the extraction pipeline and adds processing time. It could be considered as a separate enhancement.
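
The three steps above could look roughly like this. This is a sketch only: it swaps real language detection for a crude script heuristic (counting Cyrillic versus Latin letters), and it assumes Ghostscript's `txtwrite` device for the first-page pass. The function names are hypothetical, and `rus`/`eng` are Tesseract language codes chosen for illustration.

```bash
# Step 1: extract text from the first page only.
first_page_text() {
    gs -q -dNOPAUSE -dBATCH -dFirstPage=1 -dLastPage=1 \
       -sDEVICE=txtwrite -sOutputFile=- "$1"
}

# Step 2: guess an OCR language from text on stdin.
# UTF-8 encodes U+0400..U+04FF (the Cyrillic block) with lead bytes
# 0xD0..0xD3, so counting those bytes counts Cyrillic characters.
guess_ocr_lang() {
    local text cyr lat
    text=$(cat)
    cyr=$(( $(printf '%s' "$text" | LC_ALL=C tr -cd '\320-\323' | wc -c) ))
    lat=$(( $(printf '%s' "$text" | LC_ALL=C tr -cd 'A-Za-z' | wc -c) ))
    if [ "$cyr" -gt "$lat" ]; then
        echo rus
    else
        echo eng
    fi
}

# Step 3 (not shown): re-run extraction/OCR with the detected language.
# Example:
#   lang=$(first_page_text input.pdf | guess_ocr_lang)
```

The heuristic only separates Cyrillic from Latin script. If the first-page text is already mangled by the fallback font, the guess will be wrong, so an OCR-based first pass may be the safer starting point.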
