Tesseract OCR wrapper for Apache Tika and/or Open Semantic ETL that caches OCR results, so that Tika-Server or Open Semantic ETL does not have to repeat slow and expensive OCR on the same images.

opensemanticsearch/tesseract-ocr-cache

Tesseract OCR Cache

A Tesseract OCR wrapper that caches OCR results, so that Apache Tika-Server does not have to repeat slow and expensive OCR on images it has already processed.

For example, the same images (logos or corporate identity elements) often appear in many PDF or Word documents.

The cache also helps when you reindex or reanalyze your documents because of changed ETL settings or new analysis features.

For this purpose there is a Tesseract wrapper that is called with the same parameters as the original Tesseract command line interface (CLI):

tesseract_cache

The command-line tool tesseract_cache/tesseract calls Tesseract OCR and caches the result in a file directory before returning the recognized text.

If you OCR the same image again, it does not call Tesseract OCR again but returns the text from the cache.
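
The caching behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the code of tesseract_cache/tesseract: the function names, the SHA-256 cache key built from the image bytes plus the CLI options, and the one-text-file-per-key layout are all assumptions. The real Tesseract call uses the CLI's special output base stdout so that the recognized text can be captured directly.

```python
import hashlib
import subprocess
from pathlib import Path


def cache_key(image_path, options):
    """Build a key from the image content and the CLI options (assumed scheme)."""
    image_hash = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    # Options such as the language can change the recognized text,
    # so they are part of the key as well.
    options_hash = hashlib.sha256("\0".join(options).encode("utf-8")).hexdigest()
    return image_hash + "-" + options_hash[:16]


def run_tesseract(image_path, options):
    """Call the real Tesseract CLI; the output base 'stdout' prints the text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", *options],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_cached(image_path, options, cache_dir, run_ocr=run_tesseract):
    """Return the cached text if present, otherwise run OCR and store the result."""
    cache_file = Path(cache_dir) / (cache_key(image_path, options) + ".txt")
    if cache_file.is_file():
        return cache_file.read_text(encoding="utf-8")  # cache hit: no OCR call
    text = run_ocr(image_path, options)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Because the key depends on the image content rather than the file name, a logo embedded in many documents would be recognized only once under this scheme, even though Tika extracts it to a different temporary file each time.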

tesseract_fake

The command-line tool tesseract_fake/tesseract does not forward the call to Tesseract OCR.

It returns OCR results only if they have already been cached by earlier runs of tesseract_cache/tesseract.

If the image has not been processed by OCR yet, it returns only the string [Image (no OCR yet)].
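
A cache-only lookup of this kind might look like the following sketch. Only the placeholder string is taken from this README; the function names and the cache key scheme (SHA-256 of the image bytes plus the CLI options, one text file per key) are hypothetical and would have to match whatever the caching wrapper actually writes.

```python
import hashlib
from pathlib import Path

# Placeholder string as quoted in this README.
NO_OCR_PLACEHOLDER = "[Image (no OCR yet)]"


def cache_key(image_path, options):
    """Assumed key scheme: hash of the image bytes plus hash of the options."""
    image_hash = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    options_hash = hashlib.sha256("\0".join(options).encode("utf-8")).hexdigest()
    return image_hash + "-" + options_hash[:16]


def ocr_from_cache_only(image_path, options, cache_dir):
    """Never run OCR: return the cached text or the placeholder."""
    cache_file = Path(cache_dir) / (cache_key(image_path, options) + ".txt")
    if cache_file.is_file():
        return cache_file.read_text(encoding="utf-8")
    return NO_OCR_PLACEHOLDER
```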

Since OCR consumes the most resources while often contributing only a little additional information, this approach indexes most document content without expensive OCR processing, so that most content becomes searchable much earlier.

The fake OCR text serves as a temporary status that tells us whether Apache Tika found images in the document, so such documents are added to another task queue for expensive OCR with lower priority than the standard text extraction tasks.
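
The routing step can be illustrated with a small priority queue. This is a sketch under assumptions, not how Open Semantic ETL implements its task queues: the priority values, the task tuple layout, and the function name are invented for the example, and only the placeholder string comes from this README.

```python
import queue

# Placeholder string as quoted in this README.
NO_OCR_PLACEHOLDER = "[Image (no OCR yet)]"

# Hypothetical priorities: a smaller number is processed first.
TEXT_PRIORITY = 0
OCR_PRIORITY = 10


def schedule_ocr_if_needed(tasks, doc_id, extracted_text):
    """Enqueue a low-priority OCR task if the text contains the placeholder."""
    if NO_OCR_PLACEHOLDER in extracted_text:
        tasks.put((OCR_PRIORITY, doc_id, "ocr"))
        return True
    return False


tasks = queue.PriorityQueue()
tasks.put((TEXT_PRIORITY, "b.docx", "extract"))
schedule_ocr_if_needed(tasks, "a.pdf", "Annual report\n[Image (no OCR yet)]\n")
```

With this ordering, the text extraction task for b.docx is taken from the queue before the OCR task for a.pdf, even if the OCR task was added earlier.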

Setup

Just configure Apache Tika to use the tesseract command in the tesseract_cache directory instead of the directory of the original tesseract binary.
