RyanQuey/es-index-onedrive

An Apache Tika integration, built in Scala, for indexing OneDrive files into Elasticsearch.

Why build this tool?

Because Windows search functionality just doesn't cut it.

Use Cases

  • Primarily, indexing all the notes that I keep in OneDrive
  • Can also be used to index files stored on external hard drives

How does it work?

Well...right now it doesn't. It's a work in progress. But the idea is that Apache Tika can parse my .doc, .pdf, .docx, .pptx, .ppt, and even OneNote files so that they are machine readable. Then I will use an Elasticsearch Java client to index the extracted text from all these files.
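As a rough sketch of that pipeline (not the code in this repo), the snippet below walks a folder, keeps the file types listed above, and builds a request body in the newline-delimited format that Elasticsearch's `_bulk` endpoint expects. Everything named here is an assumption made for illustration: the `IndexSketch` object, the `path` and `content` field names, using the file path as the document `_id`, and the `.one` extension for OneNote. Text extraction is passed in as a plain function so the sketch runs with only the standard library (Scala 2.13+); in practice that function could be backed by Tika, for example its `Tika` facade's `parseToString`.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object IndexSketch {
  // Extensions mentioned in this README; ".one" for OneNote is an assumption.
  val supported: Set[String] = Set("doc", "docx", "pdf", "ppt", "pptx", "one")

  // Lower-cased extension without the dot, or "" if the name has none.
  def extensionOf(p: Path): String = {
    val name = p.getFileName.toString
    val dot = name.lastIndexOf('.')
    if (dot < 0) "" else name.substring(dot + 1).toLowerCase
  }

  // Recursively collect regular files whose extension is supported.
  def findDocuments(root: Path): List[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && supported(extensionOf(p)))
        .toList
    } finally stream.close()
  }

  // Minimal JSON string escaping so extracted text can be embedded safely.
  def jsonString(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'          => sb.append("\\\"")
      case '\\'         => sb.append("\\\\")
      case '\n'         => sb.append("\\n")
      case '\r'         => sb.append("\\r")
      case '\t'         => sb.append("\\t")
      case c if c < ' ' => sb.append('\\').append('u').append("%04x".format(c.toInt))
      case c            => sb.append(c)
    }
    sb.append('"').toString
  }

  // One action line plus one source line per document, each newline-terminated.
  def bulkBody(index: String, docs: Seq[(Path, String)]): String =
    docs.map { case (path, text) =>
      val id = jsonString(path.toString)
      s"""{"index":{"_index":${jsonString(index)},"_id":$id}}""" + "\n" +
        s"""{"path":$id,"content":${jsonString(text)}}""" + "\n"
    }.mkString

  // `extract` is a stand-in for the Tika step (hypothetical wiring).
  def bulkFor(root: Path, index: String, extract: Path => String): String =
    bulkBody(index, findDocuments(root).map(p => p -> extract(p)))
}
```

The resulting string would be sent as the body of a `POST` to `_bulk` with content type `application/x-ndjson`. A real Elasticsearch client would handle the JSON encoding itself; building it by hand here just keeps the sketch dependency-free.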

For a better user experience, I can then make an easy GUI (probably a simple web server on localhost, using Play Framework? Maybe Electron? React Native?). Ideally it would run on Windows 10 (unfortunately a requirement, but this is the primary use case and the rationale for building it in the first place) and index often enough that it finds changes made in the last couple of days.

Honestly, Windows 10 File Explorer search works well enough (if you use content:"your phrase") to get by for a quick filename search or minor content search, but it bugs out too often and doesn't return consistent results. This tool is for when you want to find all docs that mention a topic, and you want it to work reliably and not miss anything. So indexing even once a day is probably sufficient.
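Since the goal is only to pick up changes from the last couple of days, a daily run could re-index just the recently modified files instead of everything. Here is a minimal sketch of that selection step, assuming file modification times are trustworthy for synced OneDrive folders (not verified here). The `ChangedFiles` object and its method names are hypothetical.

```scala
import java.nio.file.{Files, Path}
import java.time.{Duration, Instant}
import scala.jdk.CollectionConverters._

object ChangedFiles {
  // Regular files under `root` whose last-modified time is after `cutoff`.
  def modifiedSince(root: Path, cutoff: Instant): List[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter(p => Files.getLastModifiedTime(p).toInstant.isAfter(cutoff))
        .toList
    } finally stream.close()
  }

  // Convenience wrapper: files changed within `window` before `now`.
  def modifiedInLast(root: Path, window: Duration, now: Instant = Instant.now()): List[Path] =
    modifiedSince(root, now.minus(window))
}
```

For example, `ChangedFiles.modifiedInLast(root, Duration.ofDays(2))` would return the files to re-index on a daily run, with a one-day overlap as a safety margin.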

Setup

Install SBT

(see sbt instructions)

Start Elasticsearch

docker-compose up -d

Compile and run

sbt compile

And then:

sbt run

Run test script to see if Tika is working

This is just a quick proof-of-concept script to get up and running:

./tika-test.sh

You should get a result something like this:

[Screenshot: example output of tika-test.sh]

Feature Ideas

  • Create different indexes for complex searches
  • Create sample queries for common use cases that are difficult to do through the default Windows search or OneDrive search GUI
  • Programmatically access and scrape my OneDrive files for use in other programs, such as my intertextuality graph project
  • Add other integrations for more features. For example, use rsync to copy OneDrive files onto a Linux box, since there is no official OneDrive client for Linux, and then index them there if needed.
  • An improved GUI over the Windows 10 File Explorer search, with expandable results for each file, so I can quickly see whether the text that returned a hit is relevant to what I'm actually looking for
  • Find a way to search several PowerPoint presentations and show the matching text in the context of the slide it was found in (currently not possible without doing your own coding)
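To make the "sample queries" and "matching text in context" ideas a bit more concrete, here is a sketch of one such query: an exact-phrase search using Elasticsearch's `match_phrase` query, with a `highlight` section so each hit comes back with the surrounding snippet. The `SearchSketch` object, the `content` field name, and the 150-character fragment size are assumptions for illustration, not anything this project defines yet.

```scala
object SearchSketch {
  // Minimal JSON string escaping for the values placed into the query.
  private def jsonString(s: String): String =
    "\"" + s.flatMap {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case '\n' => "\\n"
      case c    => c.toString
    } + "\""

  // Search request body: exact phrase match plus highlighted fragments.
  def phraseQuery(phrase: String, field: String = "content", fragmentSize: Int = 150): String = {
    val f = jsonString(field)
    val p = jsonString(phrase)
    s"""{"query":{"match_phrase":{$f:$p}},"highlight":{"fields":{$f:{"fragment_size":$fragmentSize}}}}"""
  }
}
```

The returned string is meant to be sent as the body of a search request against an index's `_search` endpoint. Showing a match in the context of its slide would additionally require indexing each slide as its own document or nested field, which this sketch does not attempt.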

Why not just use something like Agent Ransack?

https://www.mythicsoft.com/agentransack/information/#features

Hmm...Actually maybe you should...

Credits
