Character and Unicode mapping is incorrect for CID fonts with embedded CMaps #1072
Comments
The pdf.js code is really quite clear for this.
And then some other stuff happens ;-) but the important point here is that
The source of the confusion here is due to the special case (represented by But this particular CMap is not a This logic is implemented in conformance with the PDF 1.7 specification in pdf.js here: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L3796
In theory `pdfminer.six` has a `CMapParser` which is capable of parsing embedded CMaps defined in the `Encoding` field of a Type0 font specification.

In practice, it doesn't do that at all... it only parses `ToUnicode` CMaps: https://github.com/search?q=repo%3Apdfminer%2Fpdfminer.six%20CMapParser&type=code

This is a problem because some PDFs will actually define their own, more exotic mappings of byte strings to CIDs in Type0 fonts. So `pdfminer.six` is not able to get the right widths, etc., for characters in PDFs that use these because it cannot map them to any CIDs.

There is a more visible problem, which is that it is also unable to extract any text from them. This is because its handling of `ToUnicode` CMaps is actually entirely incorrect (and unfortunately PLAYA has inherited this, which I am in the process of fixing at the moment).

Specifically, `pdfminer.six` assumes that the mapping from a byte sequence in an object stream to a Unicode string goes like this:

This is incorrect. Instead, `ToUnicode` is intended to map byte sequences directly to Unicode characters, so:

The `Encoding` CMap (which could be an embedded one as noted above) does a separate mapping of byte sequences to CIDs which has nothing to do with text extraction. This only happens to work most of the time in `pdfminer.six` because either there is no CMap, or the CMap is an identity CMap, so the input bytes and the CIDs are the same, or one of the predefined Unicode CMaps is used (see below).

Here are some samples from pdf.js that illustrate the problem (`pdfminer.six` cannot extract text from them):
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue2931.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue7901.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue9534_reduced.pdf
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue18117.pdf
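
To make the difference between the two lookup paths concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the tables `ENCODING` and `TO_UNICODE`, the byte values, and the helpers `decode_via_cid` and `decode_direct` are hypothetical toy names, not code from `pdfminer.six`, PLAYA, or pdf.js. In particular, turning the CID back into a byte key is a simplified stand-in for "look up `ToUnicode` by CID".

```python
from typing import Dict, Optional

# Hypothetical toy tables with made-up values (not taken from a real PDF).
# Encoding CMap: character code -> CID (used for glyph selection and widths).
ENCODING: Dict[bytes, int] = {b"\x81\x40": 1, b"\x81\x41": 2}
# ToUnicode CMap: character code -> Unicode string (used for text extraction).
TO_UNICODE: Dict[bytes, str] = {b"\x81\x40": "A", b"\x81\x41": "B"}


def decode_via_cid(
    code: bytes, encoding: Dict[bytes, int], to_unicode: Dict[bytes, str]
) -> Optional[str]:
    """Two-step path: code -> CID via Encoding, then ToUnicode keyed by the CID."""
    cid = encoding.get(code)
    if cid is None:
        return None
    # Simplification: re-encode the CID as a big-endian key of the same length.
    return to_unicode.get(cid.to_bytes(len(code), "big"))


def decode_direct(code: bytes, to_unicode: Dict[bytes, str]) -> Optional[str]:
    """Direct path: code -> Unicode via ToUnicode, ignoring the Encoding CMap."""
    return to_unicode.get(code)


# With an identity-style Encoding the code bytes and the CID coincide,
# so the two-step path gives the right answer by accident.
IDENTITY: Dict[bytes, int] = {b"\x00\x41": 0x41}
IDENTITY_TO_UNICODE: Dict[bytes, str] = {b"\x00\x41": "A"}

print(decode_direct(b"\x81\x40", TO_UNICODE))                       # A
print(decode_via_cid(b"\x81\x40", ENCODING, TO_UNICODE))            # None
print(decode_via_cid(b"\x00\x41", IDENTITY, IDENTITY_TO_UNICODE))   # A
```

With the non-identity `ENCODING` table the two-step path finds no entry, while the direct lookup succeeds. With the identity-style table both paths agree, which is consistent with the observation above that the current behaviour happens to work most of the time.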