Character and Unicode mapping is incorrect for CID fonts with embedded CMaps #1072

dhdaines · 2024-12-13T17:33:46Z

In theory, pdfminer.six has a CMapParser that is capable of parsing embedded CMaps defined in the Encoding field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. pdfminer.six cannot get the right widths, etc., for characters in PDFs that use these, because it cannot map them to any CIDs.

There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of ToUnicode CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).

Specifically, pdfminer.six assumes that the mapping from a byte sequence in an object stream to a Unicode string goes like this:

b'ABC' => [cid(A), cid(B), cid(C)] => ["A", "B", "C"]

This is incorrect. Instead, ToUnicode is intended to map byte sequences directly to Unicode characters, so:

b'ABC' => ["A", "B", "C"]

The Encoding CMap (which could be an embedded one as noted above) does a separate mapping of byte sequences to CIDs, which has nothing to do with text extraction. The pdfminer.six approach only happens to work most of the time because one of three things is true: there is no CMap; the CMap is an identity CMap, so the input bytes and the CIDs are the same; or one of the predefined Unicode CMaps is used (see below).
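To make the difference between the two chains concrete, here is a minimal sketch with toy data. The dictionaries and function names are hypothetical and are not pdfminer.six or PLAYA APIs; the codes are assumed to be one byte wide, which real CMaps do not guarantee.

```python
from typing import Dict, List, Optional

# Toy ToUnicode map: character code (bytes) -> Unicode string.
TOUNICODE: Dict[bytes, str] = {b"A": "A", b"B": "B", b"C": "C"}
# Toy non-identity Encoding CMap: character code -> CID.
EXOTIC: Dict[bytes, int] = {b"A": 1, b"B": 2, b"C": 3}
# Toy identity Encoding CMap: the CID equals the byte value.
IDENTITY: Dict[bytes, int] = {b"A": 0x41, b"B": 0x42, b"C": 0x43}


def split_codes(data: bytes) -> List[bytes]:
    """Split input into 1-byte character codes (a simplification)."""
    return [data[i:i + 1] for i in range(len(data))]


def text_via_cids(data: bytes, encoding: Dict[bytes, int]) -> List[Optional[str]]:
    """Chain described as incorrect: code -> CID -> ToUnicode keyed by CID."""
    out: List[Optional[str]] = []
    for code in split_codes(data):
        cid = encoding[code]
        # The CID is reused as if it were the character code.
        out.append(TOUNICODE.get(bytes([cid])))
    return out


def text_via_codes(data: bytes) -> List[Optional[str]]:
    """Chain described as intended: code -> ToUnicode, no CID involved."""
    return [TOUNICODE.get(code) for code in split_codes(data)]


print(text_via_cids(b"ABC", IDENTITY))  # ['A', 'B', 'C'] (works by coincidence)
print(text_via_cids(b"ABC", EXOTIC))    # [None, None, None]
print(text_via_codes(b"ABC"))           # ['A', 'B', 'C']
```

With the identity map both chains agree, which would explain why the problem stays hidden for most files; with the non-identity map only the direct lookup recovers the text.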

Here are some samples from pdf.js that illustrate the problem (pdfminer.six cannot extract text from them):

https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf


dhdaines · 2024-12-15T21:27:50Z

The pdf.js code is really quite clear for this.

From an input byte string, first it reads variable-width character codes according to the ranges defined in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3454
The CID (called widthCode here but it's the CID) is looked up in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3350
The Unicode string representation is looked up in the ToUnicode map: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3363

And then some other stuff happens ;-) but the important point here is that Encoding and ToUnicode maps, while they both have the form of CMaps, are really totally separate and different things.
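The three steps above can be sketched roughly as follows. This is not a port of the pdf.js code; the codespace ranges, the maps, and the function names are all made up for illustration, and the one-byte fallback for codes that match no codespace range is a simplifying assumption.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical codespace ranges: (low, high) byte strings of equal length.
CODESPACES: List[Tuple[bytes, bytes]] = [
    (b"\x00", b"\x80"),          # 1-byte codes
    (b"\x81\x40", b"\xfe\xfe"),  # 2-byte codes
]


def read_codes(data: bytes, codespaces: List[Tuple[bytes, bytes]]) -> Iterator[bytes]:
    """Step 1: split the byte string into variable-width character codes."""
    pos = 0
    while pos < len(data):
        for low, high in codespaces:
            code = data[pos:pos + len(low)]
            # Every byte must fall inside the range for its position.
            if len(code) == len(low) and all(
                lo <= c <= hi for lo, c, hi in zip(low, code, high)
            ):
                break
        else:
            code = data[pos:pos + 1]  # assumption: consume one byte on no match
        yield code
        pos += len(code)


def decode(
    data: bytes,
    code_to_cid: Dict[bytes, int],
    code_to_unicode: Dict[bytes, str],
) -> List[Tuple[bytes, int, Optional[str]]]:
    """Steps 2 and 3: look up CID and Unicode, each keyed by character code."""
    result = []
    for code in read_codes(data, CODESPACES):
        cid = code_to_cid.get(code, 0)    # Encoding CMap (used for widths, glyphs)
        text = code_to_unicode.get(code)  # ToUnicode map (used for text)
        result.append((code, cid, text))
    return result


# Toy maps, invented for this example.
cid_map = {b"A": 10, b"\x81\x40": 500, b"B": 11}
uni_map = {b"A": "A", b"\x81\x40": "\u3000", b"B": "B"}
for code, cid, text in decode(b"A\x81\x40B", cid_map, uni_map):
    print(code, cid, repr(text))
```

Both lookups are keyed by the character code, so neither depends on the result of the other.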

dhdaines · 2024-12-15T23:29:54Z

The confusion here stems from the special case (represented by pdfminer/cmap/to-unicode-*) of Unicode conversion for predefined CMaps. This is indeed done by mapping the CID to a "Unicode value" (presumably a code point in UCS-2) using a special CMap.

But this particular CMap is not a ToUnicode map, it is simply a special CMap whose CID values can be interpreted as Unicode code points. See PDF 1.7 section 9.10.2.

This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
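As a rough sketch of how the two cases could be kept apart, assuming my reading of section 9.10.2 is right (the second CMap is named by joining the registry and ordering with "UCS2"): the function names and the toy tables below are hypothetical and are not taken from pdfminer.six, PLAYA, or pdf.js.

```python
from typing import Dict, Optional


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Name of the CID -> Unicode CMap for a character collection."""
    return f"{registry}-{ordering}-UCS2"


def to_text(
    code: bytes,
    tounicode: Optional[Dict[bytes, str]],
    code_to_cid: Optional[Dict[bytes, int]],
    cid_to_ucs2: Optional[Dict[int, str]],
) -> Optional[str]:
    """Return the Unicode string for one character code, or None."""
    if tounicode is not None:
        # A ToUnicode map is keyed by character code, never by CID.
        return tounicode.get(code)
    if code_to_cid is not None and cid_to_ucs2 is not None:
        # Predefined CMap case: code -> CID -> Unicode value.
        cid = code_to_cid.get(code)
        return None if cid is None else cid_to_ucs2.get(cid)
    return None


# Toy tables, invented for this example (not real Adobe-Japan1 data).
toy_code_to_cid = {b"\x00\x41": 34}
toy_cid_to_ucs2 = {34: "A"}

print(ucs2_cmap_name("Adobe", "Japan1"))  # Adobe-Japan1-UCS2
print(to_text(b"\x00\x41", None, toy_code_to_cid, toy_cid_to_ucs2))  # A
print(to_text(b"\x00\x41", {b"\x00\x41": "B"}, toy_code_to_cid, toy_cid_to_ucs2))  # B
```

Only the second branch goes through a CID, which is the distinction the comment above is drawing.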

dhdaines changed the title ~~Encoding CMaps are not actually parsed~~ Embedded CMaps are not actually parsed Dec 15, 2024

dhdaines changed the title ~~Embedded CMaps are not actually parsed~~ Embedded CMaps are not actually parsed, and character codes are not mapped Dec 15, 2024

dhdaines changed the title ~~Embedded CMaps are not actually parsed, and character codes are not mapped~~ Character and Unicode mapping is incorrect for CID fonts with embedded CMaps Dec 15, 2024

dhdaines mentioned this issue Dec 15, 2024

ToUnicode maps should map character codes, not CIDs dhdaines/playa#28

Open

dhdaines mentioned this issue Dec 16, 2024

Use PLAYA instead of pdfminer jsvine/pdfplumber#1226

Draft
