
PDF with chisq fails with unclear error message due to extraction issue (improve message / use pdftools?) #84

Open
LukasWallrich opened this issue May 14, 2023 · 1 comment

@LukasWallrich

This PDF file
10.1111:apps.12362.pdf

fails with

Error in if (grepl(pattern = RGX_Q, x = test_raw)) { : 
  the condition has length > 1

This is because the chisq tests get read as follows:

a good model fit (2 (199) = 627.73, p < .001, CFI = .94, RMSEA = .07, SRMR = .05), and [...] loading on one factor (2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and the one-factor model with all items loading on one common factor (2 (209) = 4489.05, p < .001, CFI = .40, RMSEA = .20, SRMR = .17).

This is really odd xpdf behaviour, because I can copy-paste the tests from the PDF without trouble, so they seem to be embedded as characters rather than images.

So, two questions here:

  • Can the error message be clearer? If it were something like: could not process "(2 (199) = 627.73", troubleshooting would be much easier.
  • Is xpdf the best choice? I have very limited knowledge of this, but the "pdftools" package is much easier to install (just install.packages, with no separate installation of dependencies) and gets this correct:

a good model fit (χ 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07, SRMR = .05), and [...] loading on one factor (χ 2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and\nthe one-factor model with all items loading on one common factor (χ 2 (209) = 4489.05,\np < .001, CFI = .40, RMSEA = .20, SRMR = .17).

(Getting this to work requires two minor pre-processing steps:
pdftools::pdf_text(f) |> paste(collapse = "") |> gsub("\n", "", x = _) |> statcheck:::extract_stats("chisq")
)
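To illustrate the first question, here is a minimal sketch of the kind of guard that could produce a clearer message. The helper name `check_single_match` and the example contents of `test_raw` are assumptions for illustration only; they are not taken from statcheck's source, and the real internal value of `test_raw` in the failing case may look different.

```r
# Hypothetical helper, not part of statcheck: make sure the matched test
# string is a single element before it is used in an `if ()` condition,
# and name the offending text if it is not.
check_single_match <- function(test_raw) {
  if (length(test_raw) != 1L) {
    stop(
      "Could not process extracted statistic(s): ",
      paste0('"', test_raw, '"', collapse = ", "),
      ". The PDF text extraction may have dropped or garbled characters.",
      call. = FALSE
    )
  }
  invisible(test_raw)
}

# Assumed stand-in for a value that ended up with more than one element
test_raw <- c("2 (199)", "= 627.73")

# Capture the message instead of stopping, so it can be inspected
msg <- tryCatch(
  check_single_match(test_raw),
  error = function(e) conditionMessage(e)
)
print(msg)
```

A message along these lines would point straight at the garbled string, so users could tell an extraction problem from a bug in their own input.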

@LukasWallrich
Author

In a larger set of files, this issue occurred another 8 times (e.g., 10.1016/j.jbusres.2018.11.029 and 10.1002/smj.2976), so it is reasonably common, and it is resolved by using pdftools. Another issue is that xpdf inserts As into two tests in the attached PDF, which are then no longer found.
10.1016--j.obhdp.2013.04.003.pdf

However, pdftools failed to extract some tests (without causing errors) due to problems in reading multi-column layouts. Another tool that I tried (pdfminer.six) resolves that, but also fails with some chisq ... so for now, I would no longer recommend changing the default PDF engine, only clearer error messages. Also, it might be worth recommending that users re-OCR PDFs that fail (or, if striving for completeness, all PDFs). For instance, statcheck does not find results in the following file due to issues with the =, but works after running ocrmypdf --force-ocr:

10.1002--job.220.pdf
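The re-OCR step could be scripted roughly as follows. This is only a sketch: `ocr_args` and `reocr_pdf` are hypothetical helper names, it assumes the ocrmypdf command-line tool is installed and on the PATH, and it relies on ocrmypdf's usual `ocrmypdf --force-ocr input.pdf output.pdf` calling convention.

```r
# Build the argument vector for `ocrmypdf --force-ocr <input> <output>`.
# shQuote() protects file names that contain spaces.
ocr_args <- function(input, output) {
  c("--force-ocr", shQuote(input), shQuote(output))
}

# Hypothetical wrapper: re-OCR a PDF into a new file and return its path.
reocr_pdf <- function(input, output = tempfile(fileext = ".pdf")) {
  if (!nzchar(Sys.which("ocrmypdf"))) {
    stop("ocrmypdf not found on PATH", call. = FALSE)
  }
  # system2() returns the exit status when output is not captured
  status <- system2("ocrmypdf", ocr_args(input, output))
  if (status != 0) {
    stop("ocrmypdf failed for ", input, call. = FALSE)
  }
  output
}

# Usage (not run here; needs ocrmypdf, statcheck and the PDF itself):
# res <- statcheck::checkPDF(reocr_pdf("10.1002--job.220.pdf"))
```

Writing the OCR result to a temporary file leaves the original PDF untouched, so results before and after re-OCR can be compared.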
