Easysearch

A simple way to add search to your website, featuring:

  • Automated crawling and indexing of your site
  • Scheduled content refreshes
  • Sitemap scanning
  • An API for search results
  • Multi-tenancy
  • Vector similarity search, allowing search by semantic meaning instead of exact matches
  • A prebuilt search page that works without JavaScript (with progressive enhancement for users with JS enabled)

This project is built with Go and requires CGo due to the SQLite dependency.

Why?

I wanted to add a search function to my website. When I researched my available options, I found that:

  • Google Programmable Search Engine (formerly Custom Search) wasn't customizable with the prebuilt widget and costs $5 per 1000 queries via the JSON API. It also only includes pages that are indexed by Google (obviously), so my results would be incomplete.
  • Algolia is a fully-managed SaaS that you can't self-host. While its results are incredibly good, they could change their offerings or pricing model at any time. Also, crawling is an additional cost.
  • A custom search solution that uses my data would return the best quality results, but it takes time to build for each site and must be updated whenever my schema changes.

This is a FOSS alternative to the aforementioned products that addresses my primary pain points. It's simple, runs anywhere, and lets you own your data.

Alternatives

  • If you have a static site, check out Pagefind. It runs search on the client side and builds an index whenever you generate your site.
  • For very small, personal sites, check out Kagi Sidekick when it launches.
  • If you're a nonprofit, school, or government agency, you can disable ads on your Google Programmable Search Engine. See this article for more info.

To-do list

  • Basic canonicalization
  • Build a common representation of query features (like AND, OR, exact matches, negation, fuzzy matches) and use it to build queries for the user's database driver
  • Implement something like Readability (or at least remove the contents of non-text resources)
  • SPA support using a headless browser
  • Guarantee that pages in the queue are only crawled once, even in distributed scenarios
  • Prebuilt components for React, Vue, Svelte, etc.
  • Exponential backoff for crawl errors
  • Vector search
  • Generating and indexing transcripts of video and audio recordings?
  • Image search?

Configuration

Easysearch requires a config file named config.yml in the current working directory.

See the example in config-sample.yml for more information.

Development

  1. Clone the repository:
git clone https://github.com/FluxCapacitor2/easysearch
  2. Run the app locally:
go run --tags="fts5" ./app

If you are using VS Code, you can press F5 to run the project automatically.

Note: You have to add the fts5 Go build tag to enable full-text search with the fts5 extension, which Easysearch requires. See this section of the go-sqlite3 README for more info.

  3. Make sure you have a sqlite3 development package installed on your system that provides sqlite3.h. For example:

    • Fedora/RHEL: dnf install libsqlite3x-devel
    • Ubuntu/Debian: apt install libsqlite3-dev

    If your system does not provide such a package, you can run Go builds in a Docker container and use one of the commands above to install the required package.

For automatic code formatting, make sure you have Node.js installed. Then, install Prettier via NPM:

npm install

You can now format the HTML template using prettier -w . or enable the recommended VS Code extension to format whenever you save. Running npm install also installs a Git hook that formats Go and Go template files before committing.

For Go source files, use go fmt instead of Prettier. To format the whole source tree, run go fmt ./app/... (the trailing ... is part of the package pattern).

Building and Running an Executable

You can build a binary with this command:

go build --tags "fts5" -o easysearch ./app

Then, you can run it like this:

./easysearch

If you're on Windows, use easysearch.exe instead of easysearch as the output file name.

Building and Running with Docker

You can build an Easysearch Docker image with this command:

docker build . -t ghcr.io/fluxcapacitor2/easysearch:test

Then, to run it, use this:

docker run -p 8080:8080 -v ./config.yml:/var/run/easysearch/config.yml ghcr.io/fluxcapacitor2/easysearch:test

To use the latest version from the main branch of this repository, you can run:

docker run -p 8080:8080 -v ./config.yml:/var/run/easysearch/config.yml ghcr.io/fluxcapacitor2/easysearch:main

Both commands publish port 8080 from the container on the host and mount config.yml from your current working directory into the container.

The image is built automatically with a GitHub Actions workflow, so it's always up-to-date.

Search Results API

When you start up Easysearch, the API server address is printed to the process's standard output.

To get search results, make a GET request to the /search endpoint with the following URL parameters:

  • source: The ID of your search source. This must match the value of one of the id properties in your config.yml file.
  • q: Your search query.
  • page: The page number, starting at 1. Each page will contain 10 results.

For example:

GET http://localhost:8080/search?source=brendan&q=typescript
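If you are calling the API from Go, a request URL like the one above could be assembled as in the following sketch. The helper name buildSearchURL and the base address are assumptions for illustration; use the address that Easysearch prints at startup. Only the three documented parameters (source, q, page) are set.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// buildSearchURL is a hypothetical helper (not part of Easysearch) that
// assembles a /search URL from the three documented query parameters.
func buildSearchURL(baseURL, source, query string, page int) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	// Append /search to whatever path the base URL already has.
	u.Path = strings.TrimSuffix(u.Path, "/") + "/search"

	params := url.Values{}
	params.Set("source", source) // must match an id in config.yml
	params.Set("q", query)
	params.Set("page", strconv.Itoa(page)) // pages start at 1
	u.RawQuery = params.Encode()           // escapes values and sorts keys

	return u.String(), nil
}

func main() {
	u, err := buildSearchURL("http://localhost:8080", "brendan", "go sqlite", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

Using url.Values keeps queries with spaces or punctuation correctly escaped, which is easy to get wrong with string concatenation.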

The response is returned as a JSON object.

{
  "success": true,
  "results": [
    {
      "url": "https://www.bswanson.dev/blog/nextauth-oauth-passing-errors-to-the-client/",
      "title": [
        {
          "highlighted": false,
          "content": "Passing user-friendly NextAuth v5 error messages to the client"
        }
      ],
      "description": [
        { "highlighted": false, "content": "In Auth.js v5, you can only pass…" }
      ],
      "content": [
        { "highlighted": false, "content": "…First, if you’re using " },
        { "highlighted": true, "content": "TypeScript" },
        {
          "highlighted": false,
          "content": ", augment the JWT and Session interfaces:\nsrc/auth.ts// This can be anything, just make sure the same…"
        }
      ],
      "rank": -3.657958588047788
    }
  ],
  "pagination": { "page": 1, "pageSize": 10, "total": 1 },
  "responseTime": 0.000778
}
  • results: The list of search results
    • url: The canonical URL of the matching page
    • title: A snippet of the page title, taken from the <title> HTML tag
    • description: A snippet of the page's meta description, taken from the <meta name="description"> HTML tag
    • content: A snippet of the page's text content. Text is parsed using go-readability by default. If Readability doesn't find an article, text is taken from all elements except those on this list.
    • rank: The relative ranking of the item. Lower numbers indicate greater relevance to the search query.
  • pagination:
    • page: The page specified in the request.
    • pageSize: The maximum number of items returned. Currently, this value is always 10.
    • total: The total number of results that match the query. The number of pages can be computed by dividing the total by the pageSize and rounding up.
  • responseTime: The amount of time, in seconds, that it took to process the request.
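The fields described above could be decoded in Go with structs like these. This is a minimal sketch: the type and function names (Segment, Result, Pagination, SearchResponse, decodeSearchResponse, totalPages) are assumptions chosen for illustration, and only the JSON field names come from the sample response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Segment is one piece of a title, description, or content snippet.
type Segment struct {
	Highlighted bool   `json:"highlighted"`
	Content     string `json:"content"`
}

// Result mirrors one entry of the "results" array.
type Result struct {
	URL         string    `json:"url"`
	Title       []Segment `json:"title"`
	Description []Segment `json:"description"`
	Content     []Segment `json:"content"`
	Rank        float64   `json:"rank"` // lower means more relevant
}

// Pagination mirrors the "pagination" object.
type Pagination struct {
	Page     int `json:"page"`
	PageSize int `json:"pageSize"`
	Total    int `json:"total"`
}

// SearchResponse mirrors the top-level response object.
type SearchResponse struct {
	Success      bool       `json:"success"`
	Results      []Result   `json:"results"`
	Pagination   Pagination `json:"pagination"`
	ResponseTime float64    `json:"responseTime"` // seconds
}

// decodeSearchResponse parses a raw JSON body into a SearchResponse.
func decodeSearchResponse(body []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

// totalPages divides total by pageSize, rounding up.
func totalPages(p Pagination) int {
	if p.PageSize <= 0 {
		return 0
	}
	return (p.Total + p.PageSize - 1) / p.PageSize
}

func main() {
	body := []byte(`{
		"success": true,
		"results": [{"url": "https://example.com/", "title": [{"highlighted": true, "content": "TypeScript"}], "rank": -3.65}],
		"pagination": {"page": 1, "pageSize": 10, "total": 21},
		"responseTime": 0.000778
	}`)
	resp, err := decodeSearchResponse(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(resp.Results), totalPages(resp.Pagination))
}
```

Integer arithmetic is used for the page count so that a total of 21 with a pageSize of 10 yields 3 pages rather than 2.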

title, description, and content are arrays of text segments. A segment with highlighted set to true directly matches the query. This allows you to bold relevant keywords in search results when building a user interface.
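As one way to use the highlighted flag, the sketch below turns a list of segments into an HTML string with matches wrapped in <b> tags. The Segment type and the renderSegments function are hypothetical names for illustration, not part of Easysearch. The sketch assumes segment content is plain text, so it is escaped before being written.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Segment is one piece of a title, description, or content snippet.
type Segment struct {
	Highlighted bool
	Content     string
}

// renderSegments joins segments into HTML, bolding the highlighted ones.
// Content is HTML-escaped so page text cannot inject markup.
func renderSegments(segments []Segment) string {
	var b strings.Builder
	for _, s := range segments {
		text := html.EscapeString(s.Content)
		if s.Highlighted {
			b.WriteString("<b>" + text + "</b>")
		} else {
			b.WriteString(text)
		}
	}
	return b.String()
}

func main() {
	snippet := []Segment{
		{Highlighted: false, Content: "First, if you're using "},
		{Highlighted: true, Content: "TypeScript"},
		{Highlighted: false, Content: ", augment the JWT and Session interfaces"},
	}
	fmt.Println(renderSegments(snippet))
}
```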

If there was an error processing the request, the response will look like this:

{ "success": false, "error": "Internal server error" }

Error messages are intentionally vague to obscure details about your environment or database schema. However, full errors are printed to the process's standard output.
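A client can check the success field before reading results. The following sketch converts a failed response into a Go error; the envelope type and the checkResponse function are hypothetical names, and the fallback message for a missing error field is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// envelope holds only the fields needed to tell success from failure.
type envelope struct {
	Success bool   `json:"success"`
	Error   string `json:"error"`
}

// checkResponse returns nil for a successful response and an error
// carrying the server's message otherwise.
func checkResponse(body []byte) error {
	var env envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return fmt.Errorf("decoding response: %w", err)
	}
	if !env.Success {
		if env.Error == "" {
			// Assumed fallback for a failure without a message.
			return errors.New("search request failed")
		}
		return errors.New(env.Error)
	}
	return nil
}

func main() {
	err := checkResponse([]byte(`{ "success": false, "error": "Internal server error" }`))
	fmt.Println(err)
}
```

Because the message is intentionally vague, a client would typically log it and show a generic failure notice, while the full error is read from the server's standard output.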