Skip to content

TransparencyToolkit/OCRServer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 

Repository files navigation

This is the software for the Transparency Toolkit OCR server. It receives data from the document upload form, OCRs the documents, and saves the results.

  1. Install the following packages:
  • graphicsmagick
  • poppler-data
  • poppler-utils
  • ghostscript
  • tesseract-ocr
  • pdftk
  • libreoffice
  • openjdk-8-jdk
  • openjdk-8-jre
  • libcurl3
  • libcurl3-gnutls
  • libcurl4-openssl-dev
  • libmagic-dev
  • unoconv
  1. Install the following gems:
  • mimemagic
  • docsplit
  • curb
  • ruby filemagic
  • pry
  • mail
  • listen
  • rubyzip
  1. Install Apache Tika server by downloading the .jar from https://tika.apache.org/download.html

  2. Start Tika by running: java -jar tika-server-1.18.jar

  3. Optionally: Install ABBYY. This is not free software, but has higher quality OCR for some file types. Images and image-style PDFs as well as office documents that fail OCR with Tika will default to using ABBYY if it is installed. A license for the command line version can be purchased at https://www.ocr4linux.com/en:pricing:start. The OCR server will default to Tesseract if ABBYY is not installed.

  4. Setup and start https://github.com/TransparencyToolkit/DocUpload Documents need to be uploaded for the rest to work.

  5. Set the following environment variables:

  • OCR_IN_PATH: The path for documents and metadata to input
  • OCR_OUT_PATH: The path for documents to output
  • PROJECT_INDEX: The index name in elastic.
  1. In this directory (for the OCRServer), run: ruby run_ocr.rb It will then listen for new documents in OCR_IN. This must be started BEFORE documents are uploaded.

Note: To run on an existing directory, set inotify_works = false in input_output/load_files.rb

About

OCR server for hosted archiving service

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages