title | authors | |
---|---|---|
Crawl and index files, file folders or file servers | | |
How do you index files such as Word documents, PDF files, and whole document folders into Apache Solr or Elasticsearch?
This connector and its command line tools crawl directories and files on your filesystem and index them into Apache Solr or Elasticsearch for full-text search and text mining.
On Linux, this means you can crawl anything that can be mounted as a filesystem and index it into Apache Solr, Elasticsearch, or a triplestore.
This can be a hard disk or:
- partitions formatted with FAT, ext3, or ext4
- a file server connected via NFS
- file shares like SMB, or even SSHFS or SFTP on servers
- private file sharing services like Seafile or ownCloud on your own servers
- Dropbox, Amazon or other storage services in the cloud.
This connector integrates data enrichment and data analysis plugins, such as automatic text recognition (OCR) using Tesseract OCR for images and photos (e.g. PNG, JPG, or GIF files) and for images inside PDFs (e.g. scanned documents).
Index a file or directory:
Using the web admin interface:
- Open the page Files
- Enter the filename in the form
- Press the "crawl" button
Using the command line interface (CLI):
opensemanticsearch-index-file *filename*
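
To index several files or directories in one go, the command above can be called from a script. The following is a minimal Python sketch, not part of the connector itself: the helper names `build_command` and `index_paths` are made up for illustration, and it assumes that `opensemanticsearch-index-file` is on the `PATH` and takes the file or directory name as its only argument, as shown above. How the tool reports failures through its exit code is not documented here, so the sketch only collects the exit codes and leaves their interpretation to you.

```python
import subprocess

# Command name taken from the CLI example above.
INDEX_COMMAND = "opensemanticsearch-index-file"


def build_command(path):
    """Return the argument list for indexing one file or directory (hypothetical helper)."""
    return [INDEX_COMMAND, path]


def index_paths(paths):
    """Run the indexer once per path and map each path to the exit code it returned.

    Raises FileNotFoundError if the command is not installed or not on the PATH.
    """
    results = {}
    for path in paths:
        # Passing a list (no shell) keeps spaces in filenames intact.
        completed = subprocess.run(build_command(path), check=False)
        results[path] = completed.returncode
    return results
```

For example, `index_paths(["/home/opensemanticsearch/readme.txt"])` would run the indexer on that single file and return a dictionary with its exit code.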
Using the REST API:
http://127.0.0.1/search-apps/api/index-file?uri=*/home/opensemanticsearch/readme.txt*
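
The same request can be sent from a script. The sketch below uses only the Python standard library and is meant as an illustration under a few assumptions: the endpoint and the `uri` parameter are taken from the URL above, the helper names `build_index_url` and `index_file` are hypothetical, and the request is assumed to be a plain HTTP GET. Percent-encoding the path is a precaution for filenames that contain spaces or other special characters.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the REST API example above; adjust the host for your server.
API_ENDPOINT = "http://127.0.0.1/search-apps/api/index-file"


def build_index_url(path, endpoint=API_ENDPOINT):
    """Build the request URL for one file or directory (hypothetical helper).

    Slashes are kept readable; spaces and other special characters are percent-encoded.
    """
    query = urlencode({"uri": path}, safe="/", quote_via=quote)
    return endpoint + "?" + query


def index_file(path, endpoint=API_ENDPOINT, timeout=60):
    """Send the indexing request and return the HTTP status code.

    Assumes a plain GET request is sufficient; urlopen raises an
    exception on connection problems and HTTP error responses.
    """
    with urlopen(build_index_url(path, endpoint), timeout=timeout) as response:
        return response.status
```

Calling `build_index_url("/home/opensemanticsearch/readme.txt")` reproduces the URL shown above, and `index_file(...)` with the same argument sends it to the server.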
Config file for indexing files: */etc/opensemanticsearch/connector-files*