Character and Unicode mapping is incorrect for CID fonts with embedded CMaps #1072

Open
dhdaines opened this issue Dec 13, 2024 · 2 comments

dhdaines commented Dec 13, 2024

In theory pdfminer.six has a CMapParser which is capable of parsing embedded CMaps defined in the Encoding entry of a Type0 font dictionary.

In practice, it doesn't do that at all... it only parses ToUnicode CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. pdfminer.six cannot map the character codes in these PDFs to any CIDs, so it is not able to get the right widths, etc., for those characters.

There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of ToUnicode CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).

Specifically, pdfminer.six assumes that the mapping from a byte sequence in a content stream to a Unicode string goes like this:

b'ABC' => [cid(A), cid(B), cid(C)] => ["A", "B", "C"]

This is incorrect. Instead, ToUnicode is intended to map byte sequences directly to Unicode characters, so:

b'ABC' => ["A", "B", "C"]

The Encoding CMap (which could be an embedded one, as noted above) does a separate mapping of byte sequences to CIDs, which has nothing to do with text extraction. The pdfminer.six approach only happens to work most of the time because one of three things is usually true: there is no CMap; the CMap is an identity CMap, so the input bytes and the CIDs are the same; or one of the predefined Unicode CMaps is used (see below).
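
To make the difference concrete, here is a minimal sketch that uses plain dictionaries in place of real CMap objects. The names `encoding_cmap`, `to_unicode`, `decode_chained` and `decode_direct` are made up for illustration and are not pdfminer.six APIs, and the table contents are toy data for a font with one-byte codes.

```python
from typing import List

# Toy data (hypothetical): a font whose Encoding CMap is NOT an identity map.
encoding_cmap = {b"\x01": 100, b"\x02": 101, b"\x03": 102}  # code -> CID
to_unicode = {b"\x01": "A", b"\x02": "B", b"\x03": "C"}     # code -> Unicode


def decode_chained(data: bytes) -> List[str]:
    """Chained model: code -> CID -> Unicode.

    The ToUnicode table is keyed by character code, so looking it up
    with a CID only works when codes and CIDs happen to coincide.
    """
    out = []
    for i in range(len(data)):
        cid = encoding_cmap[data[i:i + 1]]
        key = cid.to_bytes(1, "big")  # treat the CID as if it were the code
        out.append(to_unicode.get(key, "\ufffd"))
    return out


def decode_direct(data: bytes) -> List[str]:
    """Direct model: the ToUnicode table is looked up by character code."""
    return [to_unicode.get(data[i:i + 1], "\ufffd") for i in range(len(data))]


print(decode_chained(b"\x01\x02\x03"))  # three replacement characters
print(decode_direct(b"\x01\x02\x03"))   # ['A', 'B', 'C']
```

With an identity `encoding_cmap` the two functions would agree, which is why the chained model appears to work for many files.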

Here are some samples from pdf.js that illustrate the problem (pdfminer.six cannot extract text from them):

https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf

@dhdaines dhdaines changed the title Encoding CMaps are not actually parsed Embedded CMaps are not actually parsed Dec 15, 2024
@dhdaines dhdaines changed the title Embedded CMaps are not actually parsed Embedded CMaps are not actually parsed, and character codes are not mapped Dec 15, 2024
@dhdaines dhdaines changed the title Embedded CMaps are not actually parsed, and character codes are not mapped Character and Unicode mapping is incorrect for CID fonts with embedded CMaps Dec 15, 2024

dhdaines commented Dec 15, 2024

The pdf.js code is really quite clear for this.

  1. From an input byte string, first it reads variable-width character codes according to the ranges defined in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3454
  2. The CID (called widthCode here but it's the CID) is looked up in the CMap: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3350
  3. The Unicode string representation is looked up in the ToUnicode map: https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3363

And then some other stuff happens ;-) but the important point here is that Encoding and ToUnicode maps, while they both have the form of CMaps, are really totally separate and different things.
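
The three steps above can be sketched roughly as follows. This is not the pdf.js or pdfminer.six code: `CODESPACE`, `ENCODING`, `TO_UNICODE`, `read_codes` and `decode` are hypothetical names, the tables are toy data, and the one-byte fallback for bytes outside every codespace range is a simplification.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical codespace ranges as (low, high) byte strings of equal length:
# one-byte codes 0x00-0x80 and two-byte codes 0x8140-0x9FFC.
CODESPACE: List[Tuple[bytes, bytes]] = [
    (b"\x00", b"\x80"),
    (b"\x81\x40", b"\x9f\xfc"),
]
ENCODING: Dict[bytes, int] = {b"A": 34, b"\x81\x40": 633}          # code -> CID (toy)
TO_UNICODE: Dict[bytes, str] = {b"A": "A", b"\x81\x40": "\u3000"}  # code -> text (toy)


def in_range(code: bytes, low: bytes, high: bytes) -> bool:
    """True if every byte of the code lies within the bounds for its position."""
    return len(code) == len(low) and all(
        lo <= c <= hi for c, lo, hi in zip(code, low, high)
    )


def read_codes(data: bytes) -> Iterator[bytes]:
    """Step 1: split a byte string into variable-width character codes."""
    pos = 0
    while pos < len(data):
        for low, high in CODESPACE:
            code = data[pos:pos + len(low)]
            if in_range(code, low, high):
                break
        else:
            code = data[pos:pos + 1]  # simplistic fallback: consume one byte
        yield code
        pos += len(code)


def decode(data: bytes) -> List[Tuple[int, str]]:
    """Steps 2 and 3: two independent lookups, both keyed by character code."""
    return [(ENCODING.get(c, 0), TO_UNICODE.get(c, "")) for c in read_codes(data)]


print(decode(b"A\x81\x40"))
```

The point of the sketch is that the CID is used for things like widths, while the text comes from a separate table that never sees the CID.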

dhdaines commented Dec 15, 2024

The confusion here stems from the special case (represented by pdfminer/cmap/to-unicode-*) of Unicode conversion for predefined CMaps. In that case, conversion is indeed done by mapping the CID to a "Unicode value" (presumably a code point in UCS-2) using a special CMap.

But this particular CMap is not a ToUnicode map; it is simply a special CMap whose CID values can be interpreted as Unicode code points. See PDF 1.7 section 9.10.2.

This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
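
As I read section 9.10.2, the predefined-CMap case looks roughly like the sketch below: the code is mapped to a CID with the font's CMap, a second CMap name is built from the CIDFont's registry and ordering plus the suffix `UCS2`, and that second CMap maps the CID to a Unicode value. The dictionaries and function names here are hypothetical stand-ins with toy contents, not the pdf.js or pdfminer.six implementation.

```python
from typing import Dict, Optional

# Toy stand-ins for CMap resources (hypothetical contents).
PREDEFINED_CMAPS: Dict[str, Dict[bytes, int]] = {
    "90ms-RKSJ-H": {b"\x81\x40": 633},   # code -> CID
}
UCS2_CMAPS: Dict[str, Dict[int, int]] = {
    "Adobe-Japan1-UCS2": {633: 0x3000},  # CID -> Unicode value
}


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Build the name of the second CMap from the CIDSystemInfo strings."""
    return f"{registry}-{ordering}-UCS2"


def code_to_unicode(
    code: bytes, encoding_name: str, registry: str, ordering: str
) -> Optional[str]:
    """Map code -> CID -> Unicode for a font that uses a predefined CMap."""
    cid = PREDEFINED_CMAPS[encoding_name].get(code)
    if cid is None:
        return None
    value = UCS2_CMAPS[ucs2_cmap_name(registry, ordering)].get(cid)
    return None if value is None else chr(value)


print(repr(code_to_unicode(b"\x81\x40", "90ms-RKSJ-H", "Adobe", "Japan1")))
```

This is the only path where going through the CID is the intended route to Unicode; it does not apply to a ToUnicode CMap embedded in the font.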
