In its current state, neither getRawText() nor getTextAsMiniOcr() in Alto.php captures full words in edge cases such as hyphenated word parts and abbreviations. The XPath in getRawText() currently selects only the @CONTENT attributes, which is not enough. The same applies to the routines in getTextAsMiniOcr() and getWords, respectively.
getRawText() only extracts "Durch" "messer)," "De" "putation,"
the Solr index might dismiss some word parts, like "durch" or "de", wrongly treating them as stop words
we do not get the full potential of the full-text search, even though the OCR engine recognized that those are parts of one hyphenated word, because we won't index "Durchmesser" or "Deputation" but only "messer" or "putation"
Expected Behavior
When extracting words, we should check whether they are parts of a hyphenated word, like SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation," or (even though I have not seen one in the wild yet) abbreviations (SUBS_TYPE="Abbreviation" SUBS_CONTENT="Abkürzung"), and take the SUBS_CONTENT (for hyphenated words only once). Otherwise we extract the value of the CONTENT attribute.
The easiest solution (at least for getRawText()) would have been to change the XPath to account for it, like:
$words = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[@SUBS_TYPE="HypPart1" or @SUBS_TYPE="Abbreviation"]/@SUBS_CONTENT | ./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[not(@SUBS_TYPE)]/@CONTENT ');
But sadly this is not possible, because we do not have any XPath 2.0 support and can't use those XPath 2.0 functions (boolean operators, union via |). See also: #823
Solving it through other means would likely not be trivial.
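One such "other means" could be to loop over the String elements in PHP and pick the attribute per element. The following is only a minimal sketch of that idea, not the code in Alto.php: the function name extractWords(), the sample document, and the ALTO v2 namespace URI are assumptions made for illustration, and it assumes that both halves of a hyphenated word carry the same SUBS_CONTENT.

```php
<?php
// Hypothetical helper, not part of Alto.php: collects the words of an ALTO
// page, preferring SUBS_CONTENT over CONTENT where it is present.
function extractWords(SimpleXMLElement $xml): array
{
    // Assumption: ALTO v2 namespace; adjust to the namespace of the actual file.
    $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
    $strings = $xml->xpath(
        './alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String'
    ) ?: [];

    $words = [];
    foreach ($strings as $string) {
        $type = (string) $string['SUBS_TYPE'];
        $full = (string) $string['SUBS_CONTENT'];
        if ($type === 'HypPart2' && $full !== '') {
            // The full word was already taken from the HypPart1 element.
            continue;
        }
        // HypPart1 and Abbreviation carry the full word in SUBS_CONTENT;
        // everything else falls back to CONTENT.
        $words[] = $full !== '' ? $full : (string) $string['CONTENT'];
    }
    return $words;
}

// Made-up sample page with one word hyphenated across two lines.
$alto = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="die"/>
      <String CONTENT="De" SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="putation," SUBS_TYPE="HypPart2" SUBS_CONTENT="Deputation,"/>
      <String CONTENT="welche"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

echo implode(' ', extractWords(simplexml_load_string($alto))), "\n";
// prints "die Deputation, welche"
```

A loop like this could in principle be shared by getRawText() and getWords, since it only decides which attribute to read for each String element.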
Screenshots and Examples
Environment
does not apply
Additional Context
I am not sure how this might interfere with the OCR highlighting. Since the new OCR highlighter plugin parses the XML directly, it could account for this if it checks SUBS_CONTENT as well.
I don't see the need for XPath 2 here; the union operator | and the boolean or are already part of XPath 1.0, so just use .../@SUBS_CONTENT | .../@CONTENT.
But additional string processing (outside of XPath) might be useful for the case where no @SUBS_CONTENT is provided: concatenating both neighbouring String/@CONTENT values, optionally downcasing.
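Both suggestions can be sketched with SimpleXML, which evaluates XPath 1.0 through libxml2. This is an illustration under assumptions rather than a tested patch: the helper names rawTextViaUnion() and wordsWithJoinedParts(), the two constants, the sample document, and the ALTO v2 namespace URI are all made up for the example. The second helper ignores SUBS_CONTENT entirely and only concatenates a HypPart1 String with the HypPart2 String that follows it; downcasing is left out.

```php
<?php
// Assumption: ALTO v2 namespace URI; adjust for the files at hand.
const ALTO_NS = 'http://www.loc.gov/standards/alto/ns-v2#';
const STRINGS = './alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String';

// Hypothetical helper: raw text via an XPath 1.0 union of attribute nodes.
function rawTextViaUnion(SimpleXMLElement $xml): string
{
    $xml->registerXPathNamespace('alto', ALTO_NS);
    $nodes = $xml->xpath(
        STRINGS . '[@SUBS_TYPE="HypPart1" or @SUBS_TYPE="Abbreviation"]/@SUBS_CONTENT | '
        . STRINGS . '[not(@SUBS_TYPE)]/@CONTENT'
    ) ?: [];
    return implode(' ', array_map('strval', $nodes));
}

// Hypothetical helper: fallback that joins the CONTENT of a HypPart1 String
// with the CONTENT of the HypPart2 String directly after it.
function wordsWithJoinedParts(SimpleXMLElement $xml): array
{
    $xml->registerXPathNamespace('alto', ALTO_NS);
    $words = [];
    $pending = null; // first half of a hyphenated word, waiting for its second half
    foreach ($xml->xpath(STRINGS) ?: [] as $string) {
        $type = (string) $string['SUBS_TYPE'];
        $content = (string) $string['CONTENT'];
        if ($pending !== null) {
            if ($type === 'HypPart2') {
                $words[] = $pending . $content;
                $pending = null;
                continue;
            }
            // No second half followed: keep the first half as its own word.
            $words[] = $pending;
            $pending = null;
        }
        if ($type === 'HypPart1') {
            $pending = $content;
        } else {
            $words[] = $content;
        }
    }
    if ($pending !== null) {
        $words[] = $pending;
    }
    return $words;
}

// Made-up sample page with one word hyphenated across two lines.
$alto = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="die"/>
      <String CONTENT="De" SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="putation," SUBS_TYPE="HypPart2" SUBS_CONTENT="Deputation,"/>
      <String CONTENT="welche"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

$page = simplexml_load_string($alto);
echo rawTextViaUnion($page), "\n";                     // prints "die Deputation, welche"
echo implode(' ', wordsWithJoinedParts($page)), "\n";  // prints "die Deputation, welche"
```

Note that the union variant silently drops a hyphenated word whose HypPart1 element has no SUBS_CONTENT, which is the case the concatenation fallback is meant to cover.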
Reproduction
Take an XML with hyphenated word parts, like:
https://img.sub.uni-hamburg.de/kitodo/PPN872169685_0021/00000106.xml
or
https://digital.slub-dresden.de/data/kitodo/sachubdiv_20028347Z_1845/sachubdiv_20028347Z_1845_ocr/00000116.xml