extract_pages(f) is very slow when the PDF file has more than 300 pages #1071

Open
tianyongliu opened this issue Dec 13, 2024 · 1 comment

Comments

@tianyongliu

Attached file: 中国神华:中国神华2023年度报告.PDF (China Shenhua: China Shenhua 2023 Annual Report)

The code is very simple:

from pdfminer.high_level import extract_pages

file_path = "***.pdf"
with open(file_path, "rb") as f:
    # extract_pages yields one laid-out page object at a time
    for page in extract_pages(f):
        page_index = page.pageid
        print(page_index)

It takes about 10 minutes to run.

@dhdaines
Contributor

Well, extract_pages is actually doing layout analysis on every page, so one might expect this to be slow! Do you find that it gets slower with each successive page?
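
One way to answer that question is to time each page as it comes out of the generator. The following is only a rough sketch: `timed` and `report_page_times` are hypothetical helper names, not part of pdfminer.six, and the first page's time will also include whatever setup `extract_pages` does before yielding anything.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed(items: Iterable[T]) -> Iterator[Tuple[T, float]]:
    """Yield (item, seconds spent producing that item) for a lazy iterable.

    Hypothetical helper: it measures only the time spent inside next(),
    so work done by the caller between items is not counted.
    """
    it = iter(items)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            return
        yield item, time.perf_counter() - start


def report_page_times(file_path: str) -> None:
    """Print how long extract_pages takes to produce each page."""
    # Imported here so that timed() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    with open(file_path, "rb") as f:
        for number, (page, seconds) in enumerate(timed(extract_pages(f)), start=1):
            print(f"page {number} (pageid={page.pageid}): {seconds:.2f}s")
```

Calling `report_page_times` with the path to the PDF should show whether the per-page time stays roughly flat (layout analysis is simply expensive on every page) or climbs as the page number increases (something is accumulating across pages).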
