Display and index hyphenated words as normal words #1009

Merged
merged 2 commits into from
Sep 28, 2023

Conversation

beatrycze-volk
Collaborator

Fix for issue #824

The search term Bericht der Comminions is used as an example. In the last line of the snippet, Commision was not highlighted because of the hyphen. After the fix, that word is highlighted as well. This does not fully work on the image, because the highlighting there comes from the solr-ocrhighlighting plugin.

Before:
[screenshot: miniocr]

After:
[screenshot: miniocr2]

This PR also contains a workaround in the full text for use with the solr-ocrhighlighting plugin, which does not support hyphens in Mini OCR (dbmdz/solr-ocrhighlighting#35). The first part of a hyphenated word is indexed as the full word, and the second part is indexed as a space:

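// Hyphenated words carry the full, unhyphenated word in SUBS_CONTENT.
if (!empty($attributes['SUBS_CONTENT'])) {
    // First part of the hyphenated word: index the full word instead.
    if ($attributes['SUBS_TYPE'] == 'HypPart1') {
        return htmlspecialchars((string) $attributes['SUBS_CONTENT']);
    }
    // Any other part (the second half): index a single space.
    return ' ';
}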
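
To illustrate the effect of the workaround, here is a minimal, self-contained sketch that applies the same rule to a small ALTO-like fragment. The helper name wordForIndex, the sample XML (namespace omitted for brevity), and the fallback to CONTENT for non-hyphenated words are assumptions made for this example; they are not taken from the PR's actual code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the workaround above.
 * Returns the text to index for one ALTO <String> element.
 */
function wordForIndex(SimpleXMLElement $string): string
{
    $subsContent = (string) $string['SUBS_CONTENT'];
    if ($subsContent !== '') {
        // First half of a hyphenated word: emit the full word.
        if ((string) $string['SUBS_TYPE'] === 'HypPart1') {
            return htmlspecialchars($subsContent);
        }
        // Second half: emit a space as a placeholder.
        return ' ';
    }
    // Assumption: ordinary words are taken from CONTENT unchanged.
    return htmlspecialchars((string) $string['CONTENT']);
}

// Simplified sample: "Commission" is split across two text lines.
$alto = <<<XML
<TextBlock>
  <TextLine>
    <String CONTENT="der"/>
    <String CONTENT="Com" SUBS_TYPE="HypPart1" SUBS_CONTENT="Commission"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="mission" SUBS_TYPE="HypPart2" SUBS_CONTENT="Commission"/>
    <String CONTENT="wurde"/>
  </TextLine>
</TextBlock>
XML;

$words = [];
foreach (simplexml_load_string($alto)->TextLine as $line) {
    foreach ($line->String as $string) {
        $words[] = wordForIndex($string);
    }
}

var_dump($words);
```

In this sketch the full word appears once, at the position of its first half, and the second half contributes only a space, so the number of indexed tokens still matches the number of String elements.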

@sebastian-meyer sebastian-meyer merged commit 72e9635 into kitodo:master Sep 28, 2023
5 checks passed
Labels
🐛 bug A non-security related bug.
Development

Successfully merging this pull request may close these issues.

Incomplete word extraction from ALTO/XML hinders fulltext search/retrieval
2 participants