CID characters when extracting text from Korean pdf #1035
Comments
Same problem as #1036 - again, try to copy and paste text out of the file and you will see that the mappings are just nonsense.
I'm not sure what the problem is. I copied text from the PDF and it does indeed come out as squares, but when I tried the same PDF with llamaparse, it returned the text as it appears in the PDF itself. Could it be something else?
@nnurmano I guess it's an encoding issue.
Oh, it could be that pdfminer has an old or broken version of
Also, @nnurmano, this is the first I have heard of llamaparse. It appears to be proprietary, maybe? Do you know what they are actually using to extract text from PDFs?
No idea, but I shall try to find out.
Some more digging in that PDF - the Most of the text in the tables (on page 3 for instance) is actually in
Looking at this in FontForge, it appears to have a collection of precomposed Hangul blocks and jamo at specific code points, which can be assumed to mean something, just not what pdfminer.six, PDFium and Poppler expect them to mean ;-) Do these code points mean anything to you?
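One way to check what those code points are is to paste a few characters copied out of the PDF into a small script and look at their Unicode names and categories. The following is only a sketch using the Python standard library; `describe_code_points` is a hypothetical helper name, not part of pdfminer.six, and the sample characters are illustrative rather than taken from the PDF in this issue.

```python
import unicodedata


def describe_code_points(text):
    """Return (code point, general category, Unicode name) for each character.

    Characters without a name (for example private-use code points,
    category "Co") are reported as "<unnamed>".
    """
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        category = unicodedata.category(ch)
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((code_point, category, name))
    return rows


if __name__ == "__main__":
    # Illustrative input: a precomposed syllable, a leading jamo,
    # and a private-use code point.
    for row in describe_code_points("\ud55c\u1100\ue000"):
        print(*row)
```

If the copied text is mostly private-use or otherwise unexpected code points, that would suggest the font's own mapping is the problem rather than the extraction tool.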
Hi,
I am testing a PDF file, and when I run it through pdfminer.six the characters come out broken. My PDF is encoded with /UniKS-UTF16-H.
This is the output I get:
(cid:53)(cid:51)(cid:53)(cid:54)(cid:15434)(cid:4738)(cid:11182)(cid:6530)(cid:35) (cid:11206)(cid:9838)(cid:11542)(cid:35) (cid:9967)(cid:4794)(cid:8882)(cid:4766)(cid:9946)
(cid:977)(cid:20) (cid:20) (cid:1923)
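pdfminer.six writes a `(cid:N)` token for each glyph it cannot map to Unicode, so output like the above can be measured rather than eyeballed. The sketch below uses only the standard library; `extract_cids` and `unmapped_ratio` are hypothetical helper names, and counting every non-whitespace character as one mapped glyph is a simplifying assumption.

```python
import re
from typing import List

# Matches the "(cid:N)" placeholder emitted for unmapped glyphs.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def extract_cids(text: str) -> List[int]:
    """Return the CID numbers found in extracted text, in order."""
    return [int(number) for number in CID_TOKEN.findall(text)]


def unmapped_ratio(text: str) -> float:
    """Fraction of glyphs that came out as (cid:N) placeholders.

    Assumption: each remaining non-whitespace character is one mapped glyph.
    """
    unmapped = len(CID_TOKEN.findall(text))
    remainder = CID_TOKEN.sub("", text)
    mapped = len(re.sub(r"\s", "", remainder))
    total = unmapped + mapped
    return unmapped / total if total else 0.0


if __name__ == "__main__":
    sample = "(cid:53)(cid:51)(cid:53)(cid:54)(cid:15434)(cid:4738)(cid:35) (cid:11206)"
    print(extract_cids(sample))
    print(unmapped_ratio(sample))
```

A ratio close to 1.0, as in this report, means almost nothing on the page was mapped, which points at the font or CMap rather than at a few stray glyphs.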
Here is my environment and pdfminer version:
Test PDF
2023_..9..-7-12.pdf
code