Incomplete word extraction from ALTO/XML hinders fulltext search/retrieval #824

Closed
michaelkubina opened this issue Jun 27, 2022 · 1 comment · Fixed by #1009
Labels
🐛 bug A non-security related bug.

Comments

Collaborator

michaelkubina commented Jun 27, 2022

Description

In its current state, getRawText() and getTextAsMiniOcr() in Alto.php do not capture full words in edge cases such as hyphenated word parts and abbreviations. The XPath in getRawText() currently selects only the @CONTENT attributes, which is not enough. The same applies to the routines in getTextAsMiniOcr() and getWords, respectively.

Reproduction

Take an XML with hyphenated word parts, like:

https://img.sub.uni-hamburg.de/kitodo/PPN872169685_0021/00000106.xml

<String WC="0.8659999967" CONTENT="Durch" HEIGHT="29" WIDTH="110" VPOS="1974" HPOS="1941" SUBS_TYPE="HypPart1" SUBS_CONTENT="Durchmesser),"/>
<HYP CONTENT="­"/>
</TextLine>
<TextLine HEIGHT="60" WIDTH="1611" VPOS="2022" HPOS="457">
<String WC="0.8625000119" CONTENT="messer)," HEIGHT="39" WIDTH="142" VPOS="2042" HPOS="457" SUBS_TYPE="HypPart2" SUBS_CONTENT="Durchmesser),"/>

or

https://digital.slub-dresden.de/data/kitodo/sachubdiv_20028347Z_1845/sachubdiv_20028347Z_1845_ocr/00000116.xml

<String WC="0.63999998569488525" CONTENT="De" HEIGHT="34" WIDTH="43" VPOS="1630" HPOS="2146" SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,"/>
<HYP CONTENT="­"/>
</TextLine>
<TextLine HEIGHT="175" WIDTH="1926" VPOS="1616" HPOS="278">
<String WC="0.56888890266418457" CONTENT="putation," HEIGHT="49" WIDTH="142" VPOS="1680" HPOS="278" SUBS_TYPE="HypPart2" SUBS_CONTENT="Deputation,"/>
  1. getRawText() only extracts "Durch" "messer)," "De" "putation,"
  2. the Solr index might discard some word parts, like "durch" or "de", wrongly treating them as stopwords
  3. we lose part of the potential of fulltext search, even though the OCR engine recognized that these are parts of one hyphenated word, because we won't index "Durchmesser" or "Deputation" but only "messer" or "putation"

Expected Behavior

When extracting words, we should check whether they are parts of a hyphenated word, like SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,", or abbreviations (SUBS_TYPE="Abbreviation" SUBS_CONTENT="Abkürzung"; I have not seen these in the wild yet), and take the SUBS_CONTENT (for hyphenated words only once). Otherwise we extract the value of the CONTENT attribute.
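The rule described above could be sketched roughly as follows. This is only an illustration, not code from Alto.php: the function name extractWordsPreferringSubsContent(), the ALTO v2 namespace URI and the surrounding words "im" and "und" in the sample are assumptions made for the example.

```php
<?php
// Hypothetical helper, not part of Alto.php: emits one token per ALTO String.
function extractWordsPreferringSubsContent(SimpleXMLElement $xml): array
{
    // Assumption: ALTO v2 namespace; other ALTO versions use a different URI.
    $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
    $words = [];
    foreach ($xml->xpath('//alto:TextLine/alto:String') ?: [] as $string) {
        $type = (string) $string['SUBS_TYPE'];
        $subs = (string) $string['SUBS_CONTENT'];
        if ($type === 'HypPart2' && $subs !== '') {
            // Assumes the full word was already emitted for the matching HypPart1.
            continue;
        }
        if (($type === 'HypPart1' || $type === 'Abbreviation') && $subs !== '') {
            $words[] = $subs;
        } else {
            // Plain word, or SUBS_CONTENT missing: fall back to CONTENT.
            $words[] = (string) $string['CONTENT'];
        }
    }
    return $words;
}

// Made-up sample modelled on the first reproduction snippet.
$sampleAlto = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="im"/>
      <String CONTENT="Durch" SUBS_TYPE="HypPart1" SUBS_CONTENT="Durchmesser),"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="messer)," SUBS_TYPE="HypPart2" SUBS_CONTENT="Durchmesser),"/>
      <String CONTENT="und"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

$extractedWords = extractWordsPreferringSubsContent(simplexml_load_string($sampleAlto));
echo implode(' ', $extractedWords), PHP_EOL;
```

The sketch skips a HypPart2 only when it carries its own SUBS_CONTENT, so a second half without that attribute is still indexed by its CONTENT.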

The easiest way (at least for getRawText()) would have been to change the XPath to account for it, like: $words = $xml->xpath('./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[@SUBS_TYPE="HypPart1" or @SUBS_TYPE="Abbreviation"]/@SUBS_CONTENT | ./alto:Layout/alto:Page/alto:PrintSpace//alto:TextBlock/alto:TextLine/alto:String[not(@SUBS_TYPE)]/@CONTENT');

But sadly this is not possible, because we do not have any XPath 2.0 support and can't use those XPath 2.0 features (boolean operators, union via |). See also: #823

Solving it by other means would likely not be trivial.

Screenshots and Examples

Environment

does not apply

Additional Context

I am not sure how this might interfere with the OCR highlighting. Since the new OCR highlighter plugin parses the XML directly, this could be accounted for if the plugin checks SUBS_CONTENT as well.

@michaelkubina michaelkubina added the 🐛 bug A non-security related bug. label Jun 27, 2022

bertsky commented Feb 17, 2023

I don't see the need for XPath 2 here, just use .../@SUBS_CONTENT | .../@CONTENT.
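For illustration, here is a minimal sketch of that union with PHP's SimpleXML, which evaluates XPath 1.0 (the `|` operator, `or` and `not()` are all part of XPath 1.0). The function name wordsViaUnion(), the ALTO v2 namespace URI and the sample document are assumptions for this example, and it is not necessarily what the eventual fix does.

```php
<?php
// Hypothetical helper: selects SUBS_CONTENT or CONTENT with one XPath 1.0 union.
function wordsViaUnion(SimpleXMLElement $xml): array
{
    // Assumption: ALTO v2 namespace; adjust for other ALTO versions.
    $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
    $strings = './/alto:TextBlock/alto:TextLine/alto:String';
    $attributes = $xml->xpath(
        // Full word once per hyphenation (HypPart1) or abbreviation ...
        $strings . '[@SUBS_TYPE="HypPart1" or @SUBS_TYPE="Abbreviation"]/@SUBS_CONTENT'
        // ... united with CONTENT of all Strings that carry no SUBS_TYPE.
        . ' | ' . $strings . '[not(@SUBS_TYPE)]/@CONTENT'
    );
    $words = [];
    foreach ($attributes ?: [] as $attribute) {
        $words[] = (string) $attribute; // attribute node cast to its value
    }
    return $words;
}

// Made-up sample modelled on the second reproduction snippet.
$unionSample = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="die"/>
      <String CONTENT="De" SUBS_TYPE="HypPart1" SUBS_CONTENT="Deputation,"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="putation," SUBS_TYPE="HypPart2" SUBS_CONTENT="Deputation,"/>
      <String CONTENT="welche"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

$unionWords = wordsViaUnion(simplexml_load_string($unionSample));
echo implode(' ', $unionWords), PHP_EOL;
```

XPath 1.0 node-sets are formally unordered, so if the reading order matters it may be worth checking that the implementation in use returns the union in document order.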

But additional string processing (outside of XPath) might be useful for the case where no @SUBS_CONTENT is provided: concatenating both neighbouring String/@CONTENT, optionally downcasing.
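A rough sketch of such post-processing, assuming a HYP element marks the String before it as the first half of a word that continues in the next String. The function name joinHyphenatedWords(), the $lowercase flag and the ALTO v2 namespace URI are made up for this example.

```php
<?php
// Hypothetical helper: rejoins hyphenated halves when SUBS_CONTENT is absent.
function joinHyphenatedWords(SimpleXMLElement $xml, bool $lowercase = false): array
{
    // Assumption: ALTO v2 namespace; adjust for other ALTO versions.
    $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
    $nodes = $xml->xpath('//alto:TextLine/alto:*[self::alto:String or self::alto:HYP]');
    $words = [];
    $pending = null; // first half of a word waiting for its continuation
    foreach ($nodes ?: [] as $node) {
        if ($node->getName() === 'HYP') {
            // Take back the String emitted just before the hyphen.
            if ($pending === null && $words !== []) {
                $pending = array_pop($words);
            }
            continue;
        }
        $content = (string) $node['CONTENT'];
        if ($pending !== null) {
            $content = $pending . $content; // concatenate both halves
            $pending = null;
        }
        $words[] = $content;
    }
    if ($pending !== null) {
        $words[] = $pending; // no continuation found, keep the fragment
    }
    if ($lowercase) {
        // mbstring handles umlauts; fall back to strtolower() if it is missing.
        $lower = function_exists('mb_strtolower') ? 'mb_strtolower' : 'strtolower';
        $words = array_map($lower, $words);
    }
    return $words;
}

// Made-up sample without any SUBS_TYPE / SUBS_CONTENT attributes.
$plainSample = <<<XML
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Die"/><String CONTENT="De"/><HYP CONTENT="-"/></TextLine>
    <TextLine><String CONTENT="putation,"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
XML;

$joinedWords = joinHyphenatedWords(simplexml_load_string($plainSample));
$joinedLower = joinHyphenatedWords(simplexml_load_string($plainSample), true);
echo implode(' ', $joinedWords), PHP_EOL;
```

The sketch does not look across page boundaries, so a word hyphenated at the end of a page stays split.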
