TypeError: 'PDFObjRef' object is not iterable #1004

Open
corobin opened this issue Jul 10, 2024 · 5 comments · May be fixed by #1027

Comments

corobin commented Jul 10, 2024

After updating to version 20240706, calling extract_text() on a PDF raises TypeError: 'PDFObjRef' object is not iterable.

This did not occur on the previous version, 20231228.

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> from pdfminer.high_level import extract_text
>>> text = extract_text("Working.pdf")
>>> text = extract_text("Error.pdf")
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    text = extract_text(path)
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
TypeError: 'PDFObjRef' object is not iterable
>>>

Working.pdf: a newly created blank page made with Acrobat.

Error.pdf: a downloaded file; I cannot change how it is created. I deleted all visible text on the page, which did not appear to affect the error.

felixxm (Contributor) commented Jul 10, 2024

We hit the same issue with next(high_level.extract_pages(pdf_page_path)) calls.

myhloli commented Jul 23, 2024

Same error here: opendatalab/MinerU#198

dhdaines (Contributor) commented

Probably need to call resolve1 on self.attrs["MediaBox"] as well... it's indirect objects all the way down...
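
To illustrate the suggestion, here is a minimal, self-contained sketch of why resolving only the elements of MediaBox is not enough when the array itself is an indirect reference. `FakeObjRef`, the local `resolve1`, `mediabox_values`, and the sample numbers are hypothetical stand-ins written for this example; they are not pdfminer's classes and not the code in the linked pull request.

```python
from typing import Any, Dict, List


class FakeObjRef:
    """Hypothetical stand-in for an indirect reference (not pdfminer's PDFObjRef)."""

    def __init__(self, table: Dict[int, Any], objid: int) -> None:
        self.table = table
        self.objid = objid

    def resolve(self) -> Any:
        # Look up the object this reference points to.
        return self.table[self.objid]


def resolve1(x: Any) -> Any:
    """Follow references until a direct (non-reference) object is reached."""
    while isinstance(x, FakeObjRef):
        x = x.resolve()
    return x


def mediabox_values(attrs: Dict[str, Any]) -> List[Any]:
    # Resolve the array itself first, then each of its elements.
    box = resolve1(attrs["MediaBox"])
    return [resolve1(value) for value in box]


# Made-up object table: object 1 is the box array, object 2 is one of its numbers.
objects: Dict[int, Any] = {}
objects[1] = [0, 0, FakeObjRef(objects, 2), 792]
objects[2] = 612
attrs = {"MediaBox": FakeObjRef(objects, 1)}

try:
    # Iterating over the reference directly fails, as in the traceback above.
    unresolved = [value for value in attrs["MediaBox"]]
except TypeError as exc:
    print("without resolving:", exc)

print("with resolving:", mediabox_values(attrs))
```

The point of the sketch is the order of operations: the container is resolved before it is iterated, and each element is resolved afterwards, since either level may be indirect.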

dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Jul 31, 2024
dhdaines linked a pull request Jul 31, 2024 that will close this issue

MarcoPeli commented

> Probably need to call resolve1 on self.attrs["MediaBox"] as well... it's indirect objects all the way down...

I had the same error; using resolve1 fixed it for me.

jroakes commented Oct 26, 2024

@MarcoPeli, would you mind providing a bit more detail on the fix here, for users calling:

from pdfminer.high_level import extract_text
text = extract_text("Working.pdf")
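
For context, the traceback in this issue points into pdfminer's own pdfpage.py, so the resolve1 change is made inside the library rather than in code that calls extract_text. One thing a caller can do is check which version is installed. The sketch below uses only the standard library; the helper names are made up for this example, and the two version strings are the only ones reported in this thread, so it says nothing about other releases.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Versions taken from this thread only; other releases are not covered here.
REPORTED_FAILING = "20240706"
REPORTED_WORKING = "20231228"


def installed_pdfminer_version() -> Optional[str]:
    """Return the installed pdfminer.six version, or None if it is not installed."""
    try:
        return version("pdfminer.six")
    except PackageNotFoundError:
        return None


def describe(installed: Optional[str]) -> str:
    """Summarize what this issue reports about the given version string."""
    if installed is None:
        return "pdfminer.six is not installed"
    if installed == REPORTED_FAILING:
        return f"{installed}: reported to fail in this issue"
    if installed == REPORTED_WORKING:
        return f"{installed}: reported to work in this issue"
    return f"{installed}: not discussed in this issue"


print(describe(installed_pdfminer_version()))
```

If the failing version is installed, pinning the earlier one (for example `pip install pdfminer.six==20231228`) is one possible stopgap, assuming nothing else in the project needs the newer release.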
