crop: removes original image ref #201

Open
bertsky opened this issue Feb 29, 2024 · 7 comments
@bertsky
Collaborator

bertsky commented Feb 29, 2024

I tried to run the selected_page_ocr workflow on reichsanzeiger_random_selected_pages_ocr (removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE to start with) and encountered a different problem (using latest ocrd/all:maximum image):

15:51:21.928 INFO ocrd.task_sequence.run_tasks - Start processing task 'skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -p '{"level-of-operation": "page", "dpi": 0, "protect": 0.0, "maxsize": 1.0}''
15:51:24.932 INFO processor.SkimageDenoise - INPUT FILE 0 / P_1879_45_0344
15:51:31.599 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-skimage-denoise'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/build/ocrd_wrap/ocrd_wrap/skimage_denoise.py", line 75, in process
    page_image, page_coords, page_image_info = self.workspace.image_from_page(
  File "/usr/local/lib/python3.8/site-packages/ocrd/workspace.py", line 781, in image_from_page
    raise Exception('Found no AlternativeImage that satisfies all requirements ' +
Exception: Found no AlternativeImage that satisfies all requirements selector="binarized" in page "P_1879_45_0344"

Workspace at this point, if someone wants to have a look: https://qurator-data.de/~mike.gerber/2024-02-quiver-benchmarks-issue-22/reichsanzeiger_random_selected_pages_ocr.zip (includes an ocrd.log)

Originally posted by @mikegerber in OCR-D/quiver-benchmarks#22 (comment)

@bertsky
Collaborator Author

bertsky commented Feb 29, 2024

@mikegerber
Thanks for digging and sharing the data! Indeed, that's a bug (if a different one). Something really weird is going on during ocrd-tesserocr-crop: It removes the binarized image as AlternativeImage, and replaces the original @imageFilename with the binarized image...
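
To check a PAGE-XML file for the symptom described here, one could list the page's @imageFilename and its AlternativeImage entries. The following is only a minimal sketch: the helper names and the sample file names are hypothetical, and it assumes that derived images are recorded as AlternativeImage elements whose @comments attribute holds a comma-separated feature list (such as "binarized").

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; file names are made up for illustration only.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="OCR-D-IMG/example.tif" imageWidth="100" imageHeight="100">
    <AlternativeImage filename="OCR-D-BIN/example.IMG-BIN.png" comments="binarized"/>
  </Page>
</PcGts>"""


def describe_page_images(page_xml):
    """Return (imageFilename, [(filename, comments), ...]) for a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    # "{*}" matches any namespace (Python 3.8+), so the PAGE schema version does not matter.
    page = root.find("{*}Page")
    if page is None:
        raise ValueError("no Page element found")
    alternatives = [
        (alt.get("filename"), alt.get("comments", ""))
        for alt in page.findall("{*}AlternativeImage")
    ]
    return page.get("imageFilename"), alternatives


def has_feature(alternatives, feature):
    """True if any AlternativeImage lists `feature` in its comma-separated @comments."""
    return any(feature in comments.split(",") for _, comments in alternatives)


image_filename, alternatives = describe_page_images(SAMPLE_PAGE)
print(image_filename, alternatives, has_feature(alternatives, "binarized"))
```

On an affected page, one would expect `has_feature(alternatives, "binarized")` to be false after the crop step, with `image_filename` pointing at the binarized file instead of the original.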

@bertsky
Collaborator Author

bertsky commented Feb 29, 2024

Looking deeper, the most astonishing fact about this workspace is that – somehow – your METS now contains a broken (invalid) physical structMap, which repeats the page ID divs across fptrs (instead of subsuming all fptrs under one div):

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
 ...
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
...
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-CROP_1879_45_0344.IMG-BIN.IMG-CROP"/>
      </mets:div>
...
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN2_1879_45_0344.IMG-BIN.IMG-CROP.IMG-BIN"/>
      </mets:div>

Could this be related to OCR-D/quiver-benchmarks#29?
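
A quick way to spot this kind of breakage is to look for page IDs that occur on more than one div in the physical structMap. This is a rough sketch using only the standard library, not anything from OCR-D core; the function name is hypothetical, and the sample is a shortened, well-formed variant of the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened, well-formed variant of the broken structMap (illustration only).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def duplicate_page_divs(mets_xml):
    """Map each page ID that occurs on several divs to its per-div FILEID lists."""
    root = ET.fromstring(mets_xml)
    fptrs_by_page = defaultdict(list)  # page ID -> one FILEID list per div
    path = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.iterfind(path, METS_NS):
        file_ids = [fptr.get("FILEID") for fptr in div.findall("mets:fptr", METS_NS)]
        fptrs_by_page[div.get("ID")].append(file_ids)
    # A valid structMap has exactly one div per page ID.
    return {pid: groups for pid, groups in fptrs_by_page.items() if len(groups) > 1}


for page_id, groups in duplicate_page_divs(SAMPLE_METS).items():
    print(page_id, "appears on", len(groups), "divs:", groups)
```

For a real workspace, the same function can be fed the contents of mets.xml; an empty result means every page ID has a single div.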

@bertsky
Collaborator Author

bertsky commented Feb 29, 2024

Sorry, I cannot reproduce how you got here. The broken METS is definitely the reason for the cropper to misbehave, which in turn is the reason for the second binarizer to fail. But if I manually run prepare_reichsanzeiger_sets.sh for reichsanzeiger_random.list, I get a correct METS (and correct workflow results).

@mikegerber
Contributor

mikegerber commented Mar 1, 2024

The workspace was originally from https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces, I just remove-grouped everything except images and GT and then tried to reproduce @stweil's latest problem by running the selected_page_ocr workflow (see OCR-D/quiver-benchmarks#22 for the details).

It looks like the physical structMap was corrupted at some point, and that led to this problem. Because I also got a similar problem with add (see OCR-D/core#1179), I would say there's a problem in core/ocrd workspace.

The original METS (https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/mets.xml) looks OK.

@mikegerber
Contributor

(I'll investigate and open an issue at core for this)

@bertsky
Collaborator Author

bertsky commented Apr 29, 2024

@mikegerber I think we established that the root cause was an outdated OCR-D version in Quiver, which had a bug that produced broken METS prior to this step. Is that correct? Can we close then?

@mikegerber
Contributor

> @mikegerber I think we established that the root cause was an outdated OCR-D version in Quiver, which had a bug that produced broken METS prior to this step. Is that correct? Can we close then?

Yes, this was "the METS caching bug".
