Clipping paths implementation #414
Comments
Hi @kelvin0, are you experiencing problems due to this issue? I assume that the clipping operator is more often used to exclude parts of a drawing than to exclude parts of the text. Anyway, it would be nice to have a PDF to test this on. If you want to start implementing this, have a look at section 4.4.3 of the PDF reference manual. You should also adjust the
Clipping paths are indeed used to hide text in PDF documents. Here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with pdfbox, but not with pdfminer, as path clipping is not supported.
Feel free to create a PR. I can do reviews and merge it when ready. I don't mind if the first implementation only focuses on adding clipping-path behaviour and ignores additional top-level arguments for enabling/disabling the behaviour. We can create another issue for that, if needed.
@kelvin0 Just a quick bump on this issue as we're trying to sort through them. Are you still willing to work on this? As commented above, a PR would be appreciated if you're still interested and able to.
Hi! I just ran into this issue as well. It seems fairly common to use the clipping path to hide text in legal documents specifically (academic documents often use the more prosaic method of setting the text colour to white). You can see this pretty clearly in https://www.legisquebec.gouv.qc.ca/fr/pdf/lc/C-1.pdf - on the first page (PDF object 5) there is a bunch of hidden text. The way the formatter in question (Antenna House 6.3) renders text is somewhat annoying to follow, but it appears that it simply sets the clipping path to something arbitrary which excludes the text in question. For example, at the top of page 1, the hidden text "CADASTRE":
Implementing the winding rules seems rather complicated, though there are plenty of implementations out there that can serve as a reference.
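For reference, the PDF clipping operators `W` and `W*` select the nonzero winding number rule and the even-odd rule respectively. Below is a minimal sketch of both rules as a point-in-polygon test. It is plain Python and does not use any pdfminer API; the names `winding_number`, `crossing_count`, `is_inside` and `STAR` are made up for illustration, and it assumes the path has already been flattened into a single closed polygon (no curves, one subpath).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def winding_number(point: Point, polygon: List[Point]) -> int:
    """Signed number of times the polygon winds around the point."""
    px, py = point
    wn = 0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]  # wrap around to close the path
        # cross > 0: the point lies to the left of the directed edge
        cross = (x1 - x0) * (py - y0) - (px - x0) * (y1 - y0)
        if y0 <= py < y1 and cross > 0:    # upward edge, point on its left
            wn += 1
        elif y1 <= py < y0 and cross < 0:  # downward edge, point on its right
            wn -= 1
    return wn


def crossing_count(point: Point, polygon: List[Point]) -> int:
    """Number of edges crossed by a horizontal ray from the point towards +x."""
    px, py = point
    count = 0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Half-open test so a vertex on the ray is counted only once
        if (y0 <= py < y1) or (y1 <= py < y0):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > px:
                count += 1
    return count


def is_inside(point: Point, polygon: List[Point], even_odd: bool = False) -> bool:
    """even_odd=False mimics W (nonzero rule); even_odd=True mimics W*."""
    if even_odd:
        return crossing_count(point, polygon) % 2 == 1
    return winding_number(point, polygon) != 0


# A pentagram drawn as one self-intersecting path: the two rules
# disagree about its central pentagon.
STAR = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
        for a in (90, 234, 18, 162, 306)]

print(is_inside((0.0, 0.0), STAR))                 # nonzero rule
print(is_inside((0.0, 0.0), STAR, even_odd=True))  # even-odd rule
```

A real implementation would also need to flatten Bézier segments, handle multiple subpaths, and intersect each new clipping path with the current one, none of which is covered here.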
Thanks for the example! In this case the clipping path is a simple rectangle and all the hidden text is placed outside that rectangle. My first idea is to make a PR that minimally supports these two examples by deriving a visible rectangle from the clipping path and intersecting it with the bbox of characters when they are added to the layout - at that point the converter (or another library like pdfplumber) can call is_empty() on them to decide if they should be shown or not. Edit: That does not seem like such a great idea, actually, since objects that are clipped out are not in the layout by definition. If you want to get at them you could use the interpreter directly.
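The rectangle idea described above could look roughly like the following sketch. It is plain Python, not pdfminer code: `BBox`, `intersect_bbox` and `visible_fraction` are hypothetical names, and it assumes both the clipping path and the character are axis-aligned boxes given as `(x0, y0, x1, y1)` with `x0 < x1` and `y0 < y1`.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersect_bbox(a: BBox, b: BBox) -> Optional[BBox]:
    """Return the overlap of two boxes, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None  # empty intersection: nothing is visible
    return (x0, y0, x1, y1)


def visible_fraction(char_bbox: BBox, clip_bbox: BBox) -> float:
    """Fraction of the character's box that survives clipping (0.0 to 1.0)."""
    overlap = intersect_bbox(char_bbox, clip_bbox)
    if overlap is None:
        return 0.0
    char_area = (char_bbox[2] - char_bbox[0]) * (char_bbox[3] - char_bbox[1])
    overlap_area = (overlap[2] - overlap[0]) * (overlap[3] - overlap[1])
    return overlap_area / char_area


# Hypothetical numbers: a clip rectangle and two character boxes
clip = (0.0, 0.0, 100.0, 50.0)
print(visible_fraction((10.0, 10.0, 20.0, 22.0), clip))    # fully inside the clip
print(visible_fraction((10.0, 200.0, 20.0, 212.0), clip))  # entirely outside
```

A caller would still have to choose a threshold for "hidden" (for example, treating a character as clipped only when the fraction is 0.0), which is a policy decision this sketch does not make.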
Hi Everyone,
I've been using Pdfminer for the last few months, and I really think it's a very helpful codebase.
But recently I noticed that clipping paths do not seem to be implemented. I inspected:
\pdfminer\pdfinterp.py
The effect of this is that ALL text is extracted from the PDF, even text that should not be visible (since it should be clipped).
I am not a PDF expert but I can surely help implement the following features:
I hope I can clarify this and contribute to the project if necessary.