CompleteSearch

CompleteSearch is a fast and interactive search engine for context-sensitive prefix search on a given collection of documents. Like a regular search engine, it provides search results, but it also provides completions of the last (possibly only partially typed) query word that lead to a hit. This can be used to efficiently support a variety of features: query autocompletion, faceted search, synonym search, error-tolerant search, and semantic search. A list of publications on the techniques behind CompleteSearch and its many applications is provided at the end of this page.

For a demo on various datasets, just check out this repository and follow the instructions below. With a single command line, you get a working demo. You can choose from several datasets, each with a few million documents (not particularly large, but not small either). CompleteSearch scales to collections with tens or even hundreds of millions of documents without losing its interactivity.

1. Checkout

Check out the repository and build the docker image:

git clone https://github.com/ad-freiburg/completesearch
cd completesearch
docker build -t completesearch .

2. Quickstart by Demo

The following command line builds a search index and then starts the search server for the dataset specified via the DB variable (the name of any subdirectory of applications works). Under the specified PORT, you then have a generic UI as well as an API (see the section on the CompleteSearch engine below).

    export DB=movies && PORT=1622 && docker run -it --rm -e DB=${DB} -p ${PORT}:8080 -v $(pwd)/applications:/applications -v $(pwd)/data/:/data -v $(pwd)/ui:/ui --name completesearch.${DB} completesearch -c "make DATA_DIR=/data/${DB} DB=${DB} csv pall start"

This command line downloads and uncompresses the CSV, builds the index, and starts the server, all in one go. If you have already downloaded the CSV, it will not be downloaded again (the Makefile target csv: then has no effect). If you have already built the index once, you can omit the Makefile target pall: (which stands for precompute all).
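As a minimal sketch of the last point, the snippet below composes the same docker command for an arbitrary list of Makefile targets and prints it, here with only the start target. The helper name cs_run_cmd is an illustrative assumption and not part of the repository; everything else is copied from the command line above.

```shell
# Sketch: compose the docker command from above for a chosen list of Makefile targets.
# cs_run_cmd is a hypothetical helper, not part of the repository.
DB="${DB:-movies}"
PORT="${PORT:-1622}"

cs_run_cmd() {
  # All arguments are Makefile targets, for example "csv pall start" or just "start".
  echo "docker run -it --rm -e DB=${DB} -p ${PORT}:8080" \
    "-v $(pwd)/applications:/applications" \
    "-v $(pwd)/data/:/data" \
    "-v $(pwd)/ui:/ui" \
    "--name completesearch.${DB} completesearch" \
    "-c \"make DATA_DIR=/data/${DB} DB=${DB} $*\""
}

# Print the command for a restart that reuses the existing CSV and index.
cs_run_cmd start
```

To actually run the printed command from the repository root, something like `eval "$(cs_run_cmd start)"` should work, assuming the path of the repository contains no spaces.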

3. Relevant files

Read this section if you want to understand in more depth what is going on with the fancy command line above. The docker image was built from the code in this repository in step 1. The command line runs a docker container from that image and mounts the following three volumes:

applications This folder contains the configuration for each application. Each configuration consists of just two files: a Makefile that specifies how to build the index (this is highly customizable, see below) and a config.js for customizing the generic UI.

data This folder contains the CSV file with the original data (one record per line, in columns) and the index files. They all have a common prefix. See below for more information on the index.

ui This folder contains the code for the generic UI. If you just want to use CompleteSearch as backend and build your own UI, you don't have to mount this volume. It's nice, however, to always have a working UI available for testing, without any extra work.
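The layout described above can be checked before starting the container. The following is a small sketch only: check_layout is a hypothetical helper, and the path applications/<DB>/ is inferred from the statement that DB names a subdirectory of applications.

```shell
# Sketch: verify that the folders to be mounted look as described above.
# check_layout is a hypothetical helper, not part of the repository.
check_layout() {
  root="$1"   # repository root, i.e. the directory containing applications, data and ui
  db="$2"     # application name, i.e. the value of DB
  status=0
  # Each application is expected to provide exactly these two files.
  for path in "applications/${db}/Makefile" "applications/${db}/config.js"; do
    if [ ! -f "${root}/${path}" ]; then
      echo "missing file: ${path}"
      status=1
    fi
  done
  # The data and ui folders are mounted as volumes.
  for dir in data ui; do
    if [ ! -d "${root}/${dir}" ]; then
      echo "missing directory: ${dir}"
      status=1
    fi
  done
  return "${status}"
}
```

For example, `check_layout "$(pwd)" movies && echo "layout ok"` run from the repository root prints "layout ok" only if nothing is missing. If you build your own UI and do not mount the ui volume, the check for ui can be dropped.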

4. The CompleteSearch index

Like all search engines, CompleteSearch builds an index that lets it answer queries efficiently. It is not an ordinary inverted index, but something fancier: a half-inverted index, or hybrid (HYB) index. You don't have to understand this if you just want to use CompleteSearch. But if you are interested, you can learn more about it in the publications below.

To build the index, CompleteSearch requires two input files, one with suffix .words and one with suffix .docs. The first contains the contents of your documents split into words. The second contains the data that you want to display as search engine hits. The two are usually related, but not exactly the same. The format is very simple and is described by example here.

If you have special wishes, you can build these two input files yourself, from whatever your data is. Then you have full control over what CompleteSearch will and can do for you. However, in most applications, you can use our generic CSV parser. It takes a CSV file (one record per line, with a fixed number of columns per line) as input and produces the .words and the .docs file from it.

The CSV parser is very powerful and highly customizable. You can see how it is used in the Makefile of the various example applications (in the subdirectories of the directory applications). A subset of the options is described in more detail here. For a complete list, look at the code that parses the options.

4. The CompleteSearch engine

The binary to start the CompleteSearch engine is called startCompletionServer. It is very powerful and has a lot of options. For some example uses, you can have a look at the Makefile in the directory applications and at the included Makefile of one of the example applications. Detailed documentation of all the options can be found in the README.md in the src directory.

Once the engine is started, you can either ask queries using our generic and customizable UI (see above) or ask the backend directly, via the HTTP API provided by startCompletionServer. The API is very simple and described at the end of this page. Play around with it for one of the example applications to get a feeling for what it does. You can also look at the (rather simple) JavaScript code of the generic UI to get a feeling for how it works and what it can be used for.

5. (Optional) Set up a subdomain

To show off your CompleteSearch instance to your friends, you may want it to run under a fancy URL, and not http://my.weird.hostname.somewhere:76154. Let us assume you have an Apache webserver running on your machine. Then you can add the following section to your apache.conf or to a separate config file included by apache.conf. You have to replace servername by the fully qualified domain name (FQDN) of the machine on which your Apache webserver is running. You have to replace hostname by the FQDN of the machine on which the CompleteSearch frontend is running. This can be the same machine as servername, but does not have to be.

<VirtualHost *:80>
  ServerName example.cs.uni-freiburg.de
  ServerAlias dblp example.cs.uni-freiburg.de
  ServerAdmin webmaster@localhost

  ProxyPreserveHost On
  ProxyRequests Off

  ProxyPass / http://<hostname>:5000/
  ProxyPassReverse / http://<hostname>:5000/

  ...
</VirtualHost>
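
To avoid editing the placeholders by hand, the section can be generated from the two names. This is a sketch under assumptions: render_vhost and its arguments are hypothetical, the ServerAlias line is left out, and port 5000 is simply the value used in the example above. Note that ProxyPass and ProxyPassReverse require the Apache modules mod_proxy and mod_proxy_http to be enabled.

```shell
# Sketch: print the VirtualHost section above with the placeholders filled in.
# render_vhost is a hypothetical helper, not part of the repository.
render_vhost() {
  servername="$1"     # FQDN of the machine running the Apache webserver
  hostname="$2"       # FQDN of the machine running the CompleteSearch frontend
  port="${3:-5000}"   # frontend port; 5000 is the value from the example above
  cat <<EOF
<VirtualHost *:80>
  ServerName ${servername}
  ServerAdmin webmaster@localhost

  ProxyPreserveHost On
  ProxyRequests Off

  ProxyPass / http://${hostname}:${port}/
  ProxyPassReverse / http://${hostname}:${port}/
</VirtualHost>
EOF
}
```

For example, `render_vhost search.example.org backend.example.org > completesearch.conf` writes a config file that can then be included from apache.conf; both host names here are placeholders.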

6. Publications

Here are some of the publications explaining the techniques behind CompleteSearch and what it can be used for. This work was done at the Max-Planck-Institute for Informatics. That was a while ago, but it turns out that the features and the efficiency provided by CompleteSearch are still very much state of the art.

Type Less, Find More: Fast Autocompletion with a Succinct Index @ SIGIR 2006

The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration @ CIDR 2007

ESTER: Efficient Search on Text, Entities, and Relations @ SIGIR 2007

Efficient Interactive Query Expansion with Complete Search @ CIKM 2007

Output-Sensitive Autocompletion Search @ Information Retrieval 2008

Semantic Full-Text Search with ESTER: Scalable, Easy, Fast @ ICDM 2008