Cyrillic characters mangled in PDF text extraction due to limited Ghostscript font fallback #2921

tiborrr · 2025-01-15T16:35:50Z

Description

PDFs that contain Cyrillic text but do not embed their fonts show corrupted characters after processing, because Ghostscript's default fallback font has limited character support.

Visual Evidence

Before processing:

After processing:

Anonymized Test PDF:
input.pdf

Technical Details

Ghostscript currently uses a fallback font with limited character support:
/usr/share/ghostscript/10.02.1/Resource/CIDFSubst/DroidSansFallback.ttf
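
One way to check whether a font covers Cyrillic is to inspect its character set. The following is a minimal sketch, assuming fontconfig's fc-query is installed and prints the charset as space-separated hex ranges (for example "20-7e 400-52f"). charset_covers is a hypothetical helper written for this illustration, not part of Ghostscript or fontconfig.

```shell
# Hypothetical helper: succeed if hex code point $2 lies inside the
# space-separated hex ranges in $1 (single code points such as "18f"
# are treated as one-element ranges).
charset_covers() (
  cp=$((0x$2))
  for range in $1; do
    lo=${range%%-*}
    hi=${range##*-}
    if [ "$cp" -ge "$((0x$lo))" ] && [ "$cp" -le "$((0x$hi))" ]; then
      exit 0
    fi
  done
  exit 1
)

# Usage (assumes fc-query is available); 0416 is CYRILLIC CAPITAL LETTER ZHE:
# font=/usr/share/ghostscript/10.02.1/Resource/CIDFSubst/DroidSansFallback.ttf
# charset_covers "$(fc-query --format '%{charset}\n' "$font")" 0416 \
#   && echo "covers U+0416" || echo "missing U+0416"
```

Running the same check against the replacement font before switching shows whether the change is likely to help.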

Solution

Use Noto Sans, which includes Cyrillic glyphs, as the fallback font:

Download and install the font:

wget https://github.com/notofonts/notofonts.github.io/raw/refs/heads/main/fonts/NotoSans/hinted/ttf/NotoSans-Regular.ttf
sudo mkdir -p /usr/share/fonts/truetype/noto
sudo mv NotoSans-Regular.ttf /usr/share/fonts/truetype/noto/
sudo fc-cache -f -v

Configure Ghostscript:

# In /usr/share/ghostscript/10.02.1/Resource/Init/cidfmap
/CIDFallBack (/usr/share/fonts/truetype/noto/NotoSans-Regular.ttf) ;
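
Editing cidfmap by hand is easy to repeat by accident, for example in a container build that runs more than once. The sketch below appends the entry shown above only if it is not already present and keeps a copy of the original file. add_cid_fallback and the .bak suffix are hypothetical names chosen for this illustration. The entry format is taken from this issue as-is.

```shell
# Hypothetical helper: append the CIDFallBack entry from this issue to a
# cidfmap file exactly once.
# $1: path to cidfmap, $2: path to the replacement TrueType font.
add_cid_fallback() (
  cidfmap=$1
  font=$2
  entry="/CIDFallBack ($font) ;"
  # Nothing to do if an identical line already exists.
  if grep -qxF -e "$entry" "$cidfmap" 2>/dev/null; then
    exit 0
  fi
  # Keep a one-time copy of the original file next to it.
  if [ -f "$cidfmap" ] && [ ! -f "$cidfmap.bak" ]; then
    cp "$cidfmap" "$cidfmap.bak" || exit 1
  fi
  printf '%s\n' "$entry" >> "$cidfmap"
)

# Usage (must run with write access to the Ghostscript resource directory):
# add_cid_fallback /usr/share/ghostscript/10.02.1/Resource/Init/cidfmap \
#   /usr/share/fonts/truetype/noto/NotoSans-Regular.ttf
```

Trying the function on a copy of cidfmap first is a cheap way to confirm the result before touching the system file.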

Environment

Operating System: Linux
Ghostscript version: 10.02.1

tiborrr · 2025-01-15T16:37:25Z

Future Enhancement Note

While the font fallback fix addresses the immediate Cyrillic issue, overall text extraction quality could be improved further. The extraction pipeline could be enhanced to:

Extract/OCR first page only
Detect language
Re-run extraction/OCR with language-specific optimizations

This would improve accuracy for documents in various languages, but it requires more significant changes to the extraction pipeline and adds processing time. It could be considered as a separate enhancement.
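
As a rough illustration of the "detect language" step, the sketch below guesses a language from the script of the first-page text. It is a heuristic under stated assumptions, not real language detection: it only compares counts of Cyrillic and basic Latin letters in UTF-8 input, and it maps Cyrillic to "rus" even though the text could be in another Cyrillic-script language. guess_ocr_lang is a hypothetical name, and the Tesseract-style codes "rus" and "eng" are an assumption about the OCR backend.

```shell
# Hypothetical helper: read UTF-8 text on stdin and print "rus" if it has
# more Cyrillic than basic Latin letters, otherwise "eng".
guess_ocr_lang() (
  text=$(cat)
  # In UTF-8, code points U+0400..U+04FF start with a lead byte in the
  # range 0xD0..0xD3 (octal 320..323), so counting those bytes counts
  # Cyrillic characters.
  cyr=$(( $(printf '%s' "$text" | LC_ALL=C tr -cd '\320-\323' | wc -c) ))
  lat=$(( $(printf '%s' "$text" | LC_ALL=C tr -cd 'A-Za-z' | wc -c) ))
  if [ "$cyr" -gt "$lat" ]; then
    echo rus
  else
    echo eng
  fi
)

# Usage: feed it the text extracted from the first page, then pass the
# result to the second extraction/OCR run.
# lang=$(guess_ocr_lang < first_page.txt)
```

A production pipeline would more likely use a dedicated language-detection library. The point here is only that the first pass can be cheap.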
