PDF with chisq fails with unclear error message due to extraction issue (improve message / use pdftools?) #84

LukasWallrich · 2023-05-14T13:47:33Z

Running statcheck on a PDF that reports chi-square tests fails with:

Error in if (grepl(pattern = RGX_Q, x = test_raw)) { : 
  the condition has length > 1

This is because the chisq tests get read as follows:

a good model fit (2 (199) = 627.73, p < .001, CFI = .94, RMSEA = .07, SRMR = .05), and [...] loading on one factor (2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and the one-factor model with all items loading on one common factor (2 (209) = 4489.05, p < .001, CFI = .40, RMSEA = .20, SRMR = .17).
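For context: in R 4.2.0 and later, `if` raises "the condition has length > 1" whenever its condition is a vector with more than one element. A guard of the following kind would name the offending text instead. This is only a sketch: `assert_single_test` is a hypothetical helper, not part of statcheck, and the length-2 input merely stands in for whatever the garbled extraction actually produced.

```r
# Hypothetical guard, not part of statcheck: fail early and show the text.
assert_single_test <- function(test_raw) {
  if (length(test_raw) != 1L) {
    stop(
      "Could not process extracted test (expected 1 string, got ",
      length(test_raw), "): ",
      paste0('"', test_raw, '"', collapse = ", "),
      call. = FALSE
    )
  }
  invisible(test_raw)
}

# A length-2 vector stands in for the garbled extraction result (assumption).
msg <- tryCatch(
  assert_single_test(c("2 (199) = 627.73", "2 (206) = 2533.69")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With a message like this, the user sees which extracted strings caused the failure rather than an internal `if` error.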

This is really odd xpdf behaviour, because I can copy-paste the tests from the PDF without trouble, so they seem to be embedded as characters rather than images.

So, two questions here:

1. Can the error message be clearer? If it were something like could not process "(2 (199) = 627.73", troubleshooting would be much easier.
2. Is xpdf the best choice? I have very limited knowledge of this, but the "pdftools" package is much easier to install (just install.packages, with no separate installation of dependencies) and gets this correct:

a good model fit (χ 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07, SRMR = .05), and [...] loading on one factor (χ 2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and\nthe one-factor model with all items loading on one common factor (χ 2 (209) = 4489.05,\np < .001, CFI = .40, RMSEA = .20, SRMR = .17).

(Getting this to work requires two minor pre-processing steps:
pdftools::pdf_text(f) |> paste(collapse = "") |> gsub("\n", "", x = _) |> statcheck:::extract_stats("chisq")
)
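The two pre-processing steps can be tried without a PDF. The sketch below uses two literal strings in place of the per-page output of pdftools::pdf_text(), and a deliberately simplified, hypothetical pattern (`chisq_rgx`) rather than statcheck's own regexes. It also replaces line breaks with a space instead of deleting them, which is an assumption about keeping adjacent words apart.

```r
# Two strings stand in for the per-page output of pdftools::pdf_text().
pages <- c(
  "a good model fit (\u03c7 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07), and",
  "loading on one factor (\u03c7 2 (206) = 2533.69,\np < .001, CFI = .67)"
)

# Step 1: collapse pages into one string. Step 2: turn line breaks into spaces.
txt <- paste(pages, collapse = " ")
txt <- gsub("\n", " ", txt, fixed = TRUE)

# Hypothetical, simplified chi-square pattern (not statcheck's regex).
chisq_rgx <- "\u03c7 ?2 ?\\(([0-9]+)\\) ?= ?([0-9]+\\.?[0-9]*)"
find_chisq <- function(x) regmatches(x, gregexpr(chisq_rgx, x))[[1]]

hits <- find_chisq(txt)
print(hits)

# Dropping the chi symbol, as in the xpdf output, leaves nothing to match.
garbled <- gsub("\u03c7 ", "", txt, fixed = TRUE)
print(length(find_chisq(garbled)))
```

The second call illustrates why the xpdf text yields no usable chi-square tests: without the χ, a pattern anchored on that symbol has nothing to match.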


LukasWallrich · 2024-02-14T14:22:31Z

In a larger set of files, this issue occurred another 8 times (e.g., 10.1016/j.jbusres.2018.11.029 and 10.1002/smj.2976), so it is reasonably common, and it is resolved by using pdftools. Another issue is that xpdf inserts As into two tests in the attached PDF, so those tests are no longer found.
10.1016--j.obhdp.2013.04.003.pdf

However, pdftools failed to extract some tests (without causing errors) due to problems in reading multi-column layouts. Another tool that I tried (pdfminer.six) resolves that, but also fails with some chisq ... So for now, I would no longer recommend changing the default PDF engine, only clearer error messages. It might also be worth recommending that users re-OCR PDFs that fail (or, if striving for completeness, all PDFs). For instance, statcheck does not find results in the following file due to issues with the =, but works after running ocrmypdf --force-ocr:

10.1002--job.220.pdf
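A re-OCR step could be scripted from R along these lines. This is a minimal sketch, assuming ocrmypdf is installed and on the PATH; `ocr_args` and `reocr_pdf` are hypothetical helper names, and the `_ocr.pdf` output naming is an arbitrary choice.

```r
# Build the arguments for: ocrmypdf --force-ocr <input> <output>
ocr_args <- function(input, output) {
  c("--force-ocr", shQuote(input), shQuote(output))
}

# Hypothetical helper: re-OCR a PDF if ocrmypdf is available, else return NULL.
reocr_pdf <- function(input, output = sub("\\.pdf$", "_ocr.pdf", input)) {
  if (!nzchar(Sys.which("ocrmypdf"))) {
    warning("ocrmypdf not found on PATH; skipping re-OCR", call. = FALSE)
    return(NULL)
  }
  # system2() returns the exit status of the command; 0 means success.
  status <- system2("ocrmypdf", ocr_args(input, output))
  if (status != 0L) stop("ocrmypdf failed for ", input, call. = FALSE)
  output
}
```

Calling `reocr_pdf("10.1002--job.220.pdf")` would write the re-OCRed copy next to the original, and that copy could then be checked with statcheck as before.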
