Incomplete word extraction from ALTO/XML hinders fulltext search/retrieval #824

michaelkubina · 2022-06-27T16:00:02Z

Description

In its current state, the getRawText() function as well as getTextAsMiniOcr() in Alto.php do not capture full words in edge cases such as hyphenated word parts and abbreviations. The XPath in getRawText() currently selects only the @CONTENT attributes, which is not enough. The same applies to the routines in getTextAsMiniOcr() and getWords respectively.

Reproduction

Take an XML with hyphenated word parts, like:

https://img.sub.uni-hamburg.de/kitodo/PPN872169685_0021/00000106.xml

<String WC="0.8659999967" CONTENT="Durch" HEIGHT="29" WIDTH="110" VPOS="1974" HPOS="1941" SUBS_TYPE="HypPart1" SUBS_CONTENT="Durchmesser),"/>
<HYP CONTENT=""/>
</TextLine>
<TextLine HEIGHT="60" WIDTH="1611" VPOS="2022" HPOS="457">
<String WC="0.8625000119" CONTENT="messer)," HEIGHT="39" WIDTH="142" VPOS="2042" HPOS="457" SUBS_TYPE="HypPart2" SUBS_CONTENT="Durchmesser),"/>

or

https://digital.slub-dresden.de/data/kitodo/sachubdiv_20028347Z_1845/sachubdiv_20028347Z_1845_ocr/00000116.xml

<String WC="0.63999998569488525" CONTENT="De" HEIGHT="34" WIDTH="43" VPOS="1630" HPOS="2146" SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,"/>
<HYP CONTENT=""/>
</TextLine>
<TextLine HEIGHT="175" WIDTH="1926" VPOS="1616" HPOS="278">
<String WC="0.56888890266418457" CONTENT="putation," HEIGHT="49" WIDTH="142" VPOS="1680" HPOS="278" SUBS_TYPE="HypPart2" SUBS_CONTENT="Deputation,"/>

getRawText() only extracts "Durch", "messer),", "De" and "putation,".
The Solr index might discard some word parts, like "durch" or "de", wrongly treating them as stopwords.
As a result, fulltext search falls short of its potential: even though the OCR engine recognized these as parts of one hyphenated word, we won't index "Durchmesser" or "Deputation", only "messer" or "putation".

Expected Behavior

When extracting words, we should check whether they are parts of a hyphenated word, like SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,", or abbreviations (SUBS_TYPE="Abbreviation" SUBS_CONTENT="Abkürzung"), even though I have not seen those in the wild yet, and take the SUBS_CONTENT (for hyphenated words only once). Otherwise we proceed to extract the content of the CONTENT attributes.

The easiest would have been (at least for getRawText()) to change the XPath to account for it, like: $words = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[@SUBS_TYPE="HypPart1" or @SUBS_TYPE="Abbreviation"]/@SUBS_CONTENT | ./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[not(@SUBS_TYPE)]/@CONTENT ');

But sadly this is not possible, because we do not have any XPath 2.0 support and can't use those XPath 2.0 functions (boolean operators, union via |). See also: #823

Solving it through other means would likely not be trivial.
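To make the expected behavior concrete, here is a minimal sketch of the selection logic outside of XPath. It is written in Python with the standard library rather than in the PHP of Alto.php, so it illustrates the rule only and is not a patch. The function name extract_words, the ALTO v2 namespace, and the filler words "im" and "und" in the sample are assumptions made for illustration; the hyphenated String elements are shortened from the first reproduction file.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample uses the ALTO v2 namespace; real files may use v3 or v4.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Shortened sample; "im" and "und" are made-up filler words.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="im"/>
      <String CONTENT="Durch" SUBS_TYPE="HypPart1" SUBS_CONTENT="Durchmesser),"/>
      <HYP CONTENT=""/>
    </TextLine>
    <TextLine>
      <String CONTENT="messer)," SUBS_TYPE="HypPart2" SUBS_CONTENT="Durchmesser),"/>
      <String CONTENT="und"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def extract_words(alto_xml):
    """Return one entry per word, preferring SUBS_CONTENT where it exists."""
    root = ET.fromstring(alto_xml)
    words = []
    for string in root.findall(".//alto:TextLine/alto:String", NS):
        subs_type = string.get("SUBS_TYPE")
        subs_content = string.get("SUBS_CONTENT")
        if subs_type == "HypPart2" and subs_content:
            # Assumes the matching HypPart1 already contributed the full word.
            continue
        if subs_type in ("HypPart1", "Abbreviation") and subs_content:
            words.append(subs_content)
        else:
            words.append(string.get("CONTENT", ""))
    return words


print(extract_words(SAMPLE))
```

With this rule the hyphenated word is emitted once in full, and plain String elements still contribute their CONTENT.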

Screenshots and Examples

Environment

does not apply

Additional Context

I am not sure how this might interfere with the OCR highlighting. With the new OCR highlighter plugin parsing the XML directly, it could be accounted for if the plugin checks SUBS_CONTENT as well.

bertsky · 2023-02-17T16:29:39Z

I don't see the need for XPath 2 here, just use .../@SUBS_CONTENT | .../@CONTENT.

But additional string processing (outside of XPath) might be useful for the case where no @SUBS_CONTENT is provided: concatenating both neighbouring String/@CONTENT, optionally downcasing.
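A rough sketch of that string processing, assuming the two halves are marked with SUBS_TYPE but carry no @SUBS_CONTENT. It is written in Python with the standard library rather than PHP, purely as an illustration. The function name join_hyphenated, the lowercase flag, and the ALTO v2 namespace are assumptions; the sample is the "Deputation," example from above with SUBS_CONTENT removed to simulate the missing case.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample uses the ALTO v2 namespace; real files may use v3 or v4.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# The "Deputation," example with SUBS_CONTENT stripped on purpose.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="De" SUBS_TYPE="HypPart1"/>
      <HYP CONTENT=""/>
    </TextLine>
    <TextLine>
      <String CONTENT="putation," SUBS_TYPE="HypPart2"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def join_hyphenated(alto_xml, lowercase=False):
    """Concatenate neighbouring HypPart1/HypPart2 CONTENT values.

    This sketch ignores SUBS_CONTENT entirely and always joins the halves.
    """
    root = ET.fromstring(alto_xml)
    words = []
    pending = None  # CONTENT of a HypPart1 still waiting for its HypPart2
    for string in root.findall(".//alto:TextLine/alto:String", NS):
        content = string.get("CONTENT", "")
        subs_type = string.get("SUBS_TYPE")
        if pending is not None:
            if subs_type == "HypPart2":
                content = pending + content
            else:
                words.append(pending)  # orphaned first half, keep it as-is
            pending = None
        if subs_type == "HypPart1":
            pending = content
        else:
            words.append(content)
    if pending is not None:
        words.append(pending)  # document ended on a first half
    return [w.lower() for w in words] if lowercase else words


print(join_hyphenated(SAMPLE))
print(join_hyphenated(SAMPLE, lowercase=True))
```

Whether lowercasing belongs here or in the index analyzer is a design choice left open; the flag is included only to mirror the "optionally downcasing" remark.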

michaelkubina added the 🐛 bug label Jun 27, 2022

beatrycze-volk mentioned this issue Sep 11, 2023

Display and index hyphenated words as normal words #1009

Merged

sebastian-meyer linked a pull request Sep 28, 2023 that will close this issue

Display and index hyphenated words as normal words #1009

Merged

sebastian-meyer closed this as completed in #1009 Sep 28, 2023
