crop: removes original image ref #201
Comments
@mikegerber
Looking deeper, the most astonishing fact about this workspace is that – somehow – your METS now contains a broken (invalid) physical structMap, which repeats the page ID divs across fptrs (instead of subsuming all fptrs under one div):
<mets:structMap TYPE="PHYSICAL">
<mets:div TYPE="physSequence">
<mets:div TYPE="page" ID="P_1879_45_0344">
<mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
<mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
...
<mets:div TYPE="page" ID="P_1879_45_0344">
<mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
</mets:div>
...
<mets:div TYPE="page" ID="P_1879_45_0344">
<mets:fptr FILEID="OCR-D-CROP_1879_45_0344.IMG-BIN.IMG-CROP"/>
</mets:div>
...
<mets:div TYPE="page" ID="P_1879_45_0344">
<mets:fptr FILEID="OCR-D-BIN2_1879_45_0344.IMG-BIN.IMG-CROP.IMG-BIN"/>
</mets:div>
Could this be related to OCR-D/quiver-benchmarks#29?
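The defect described in this comment (one page ID repeated across several sibling page divs) can be checked for mechanically. The following is only a minimal sketch using Python's standard library, not part of OCR-D core: the function names `duplicate_page_ids` and `merge_duplicate_pages` are hypothetical, and the merge step assumes the duplicated divs are siblings under the same parent, as in the excerpt above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_NS = "http://www.loc.gov/METS/"
DIV = f"{{{METS_NS}}}div"
STRUCT_MAP = f"{{{METS_NS}}}structMap"


def physical_maps(root):
    """Return all structMap elements with TYPE="PHYSICAL"."""
    return [sm for sm in root.iter(STRUCT_MAP) if sm.get("TYPE") == "PHYSICAL"]


def duplicate_page_ids(mets_xml):
    """Map each page div ID that occurs more than once to its count."""
    root = ET.fromstring(mets_xml)
    counts = Counter(
        div.get("ID")
        for sm in physical_maps(root)
        for div in sm.iter(DIV)
        if div.get("TYPE") == "page"
    )
    return {page_id: n for page_id, n in counts.items() if n > 1}


def merge_duplicate_pages(mets_xml):
    """Move the children of repeated sibling page divs into the first one.

    Assumption: duplicates share the same parent div (hypothetical helper,
    for illustration only).
    """
    ET.register_namespace("mets", METS_NS)
    root = ET.fromstring(mets_xml)
    for sm in physical_maps(root):
        # Materialise the list first, because the tree is modified below.
        for parent in list(sm.iter(DIV)):
            first_seen = {}
            for child in list(parent):
                if child.tag != DIV or child.get("TYPE") != "page":
                    continue
                page_id = child.get("ID")
                if page_id in first_seen:
                    # Subsume the fptrs under the first div with this ID.
                    first_seen[page_id].extend(list(child))
                    parent.remove(child)
                else:
                    first_seen[page_id] = child
    return ET.tostring(root, encoding="unicode")
```

Calling `duplicate_page_ids` on the contents of a mets.xml returns an empty dict for a well-formed physical structMap. The merge function only illustrates what "all fptrs under one div" means; it does not address whatever produced the broken METS in the first place.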
Sorry, I cannot reproduce how you got here. The broken METS is definitely the reason for the cropper to misbehave, which in turn is the reason for the second binarizer to fail. But if I manually do the prepare_reichsanzeiger_sets.sh for the
The workspace was originally from https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces, I just
This looks like, at some point, the physical
The original METS (https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/mets.xml) looks OK.
(I'll investigate and open an issue at core for this)
@mikegerber I think we established that the root cause was an outdated OCR-D version in Quiver, which had a bug that produced broken METS prior to this step. Is that correct? Can we close then?
Yes, this was "the METS caching bug".
I tried to run the selected_page_ocr workflow on reichsanzeiger_random_selected_pages_ocr (removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE to start with) and encountered a different problem (using latest ocrd/all:maximum image):
Workspace at this point - if someone wants to have a look: https://qurator-data.de/~mike.gerber/2024-02-quiver-benchmarks-issue-22/reichsanzeiger_random_selected_pages_ocr.zip (includes an ocrd.log)
Originally posted by @mikegerber in OCR-D/quiver-benchmarks#22 (comment)