extract_pages(f) is very slow when the PDF file has more than 300 pages #1071

tianyongliu · 2024-12-13T03:23:57Z

China Shenhua: China Shenhua 2023 Annual Report.PDF

The code is very simple:

from pdfminer.high_level import extract_pages

file_path = "***.pdf"
with open(file_path, "rb") as f:
    for page in extract_pages(f):
        page_index = page.pageid
        print(page_index)

It takes about 10 minutes.

dhdaines · 2024-12-13T17:38:11Z

Well, extract_pages is actually doing layout analysis on every page, so one might expect this to be slow! Do you find that it gets slower with each successive page?
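One way to check whether it gets slower with each successive page is to time each page as it comes out of the generator. Below is a minimal sketch, not a definitive diagnostic: `timed` and `report_page_times` are hypothetical helper names introduced here for illustration, and the sketch assumes that `extract_pages` yields pages lazily, so that the time spent waiting for each item roughly corresponds to the parsing and layout analysis of that page.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed(items: Iterable[T]) -> Iterator[Tuple[T, float]]:
    """Yield (item, seconds spent producing it) for each item of an iterable."""
    it = iter(items)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)  # for a lazy generator, the work happens here
        except StopIteration:
            return
        yield item, time.perf_counter() - start


def report_page_times(file_path: str) -> None:
    """Print how long each page took (hypothetical helper, not part of pdfminer)."""
    # Imported here so that timed() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    pages = timed(extract_pages(file_path))
    for index, (page, seconds) in enumerate(pages, start=1):
        print(f"page {index} (pageid={page.pageid}): {seconds:.2f}s")
```

Calling `report_page_times("***.pdf")` on the file above would show whether the per-page time stays roughly constant (the total is then simply the number of pages times the layout cost) or grows as the page index increases.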
