Life of a Segment
While the life of a segment will be unchanged, some of the finer details of segment state are implemented implicitly in tantivy 0.6. In that sense, this document describes the refactoring of #325.
A tantivy index consists of a set of smaller, immutable, independent indexes called segments. Adding new documents and committing does not modify the existing segments; it adds fresh segments to the segment list.
A merging mechanism then keeps the number of segments from exploding: tantivy can decide to merge N small segments into a larger one. Once the merge terminates successfully, the files of the N smaller segments are no longer needed. These files are guaranteed to be eventually deleted by a file garbage collector.
Tantivy does not work like a regular database in the sense that documents are not available right after being added. It does not really enforce a notion of transaction either.
Instead, the user is in charge of grouping `add` and `delete` operations into batches and explicitly calling `.commit()` on them.
Once a commit is successful, tantivy guarantees that all the operations preceding the `.commit()` are reflected in search and persisted.
In case of a hardware or software failure, upon restart, the index is in the state of the last `.commit()`.[^1]
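The commit semantics above can be illustrated with a toy model. This is a minimal sketch, not tantivy's API: `ToyIndex`, `Op`, `search_all` and `crash_and_restart` are hypothetical names, and "persistence" is simulated by an in-memory vector.

```rust
/// One batched operation. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Add(String),
    Delete(String),
}

/// Toy model of commit visibility (NOT tantivy's API).
#[derive(Default)]
struct ToyIndex {
    /// Documents visible to search; also the state restored after a failure.
    committed: Vec<String>,
    /// Operations batched since the last commit; invisible to search.
    pending: Vec<Op>,
}

impl ToyIndex {
    fn add(&mut self, doc: &str) {
        self.pending.push(Op::Add(doc.to_string()));
    }

    fn delete(&mut self, doc: &str) {
        self.pending.push(Op::Delete(doc.to_string()));
    }

    /// Applies the whole batch in order, making it visible to search.
    fn commit(&mut self) {
        for op in self.pending.drain(..) {
            match op {
                Op::Add(doc) => self.committed.push(doc),
                Op::Delete(doc) => self.committed.retain(|d| *d != doc),
            }
        }
    }

    /// Simulates a failure followed by a restart: uncommitted operations are lost.
    fn crash_and_restart(&mut self) {
        self.pending.clear();
    }

    fn search_all(&self) -> &[String] {
        &self.committed
    }
}
```

In this model, documents added after the last `commit()` never show up in `search_all()` if a failure occurs first, which mirrors the guarantee described above.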
Deletes are a bit of an exception to the immutability of segments.
Deleting documents in an existing segment works by creating a tombstone file that stores a bitset of the `DocId`s that have been deleted. None of the previous segment files are modified. The previous delete tombstone file is not modified either. Instead, a segment can have more than one tombstone associated with it, each tied to a specific commit opstamp.
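The tombstone scheme can be sketched as a map from commit opstamp to bitset. This is only an illustration: `Tombstone`, `SegmentDeletes` and their methods are hypothetical names, the on-disk format is not modelled, and the sketch assumes that each new tombstone is cumulative (it starts from a copy of the latest one) and that opstamps are passed in increasing order.

```rust
use std::collections::BTreeMap;

type DocId = u32;
type Opstamp = u64;

/// Bitset over the DocIds of one segment: bit `i` set means doc `i` is deleted.
#[derive(Clone, Debug, PartialEq)]
struct Tombstone {
    bits: Vec<u64>,
}

impl Tombstone {
    fn new(max_doc: DocId) -> Self {
        Tombstone { bits: vec![0; (max_doc as usize + 63) / 64] }
    }

    fn delete(&mut self, doc: DocId) {
        self.bits[doc as usize / 64] |= 1u64 << (doc % 64);
    }

    fn is_deleted(&self, doc: DocId) -> bool {
        self.bits[doc as usize / 64] & (1u64 << (doc % 64)) != 0
    }
}

/// All tombstones of one segment, keyed by the opstamp of the commit that wrote them.
struct SegmentDeletes {
    max_doc: DocId,
    tombstones: BTreeMap<Opstamp, Tombstone>,
}

impl SegmentDeletes {
    fn new(max_doc: DocId) -> Self {
        SegmentDeletes { max_doc, tombstones: BTreeMap::new() }
    }

    /// Writes a NEW tombstone for `opstamp`. Older tombstones are left untouched.
    /// Assumption: the new tombstone starts as a copy of the most recent one.
    fn commit_deletes(&mut self, opstamp: Opstamp, deleted: &[DocId]) {
        let mut tombstone = self
            .tombstones
            .values()
            .next_back()
            .cloned()
            .unwrap_or_else(|| Tombstone::new(self.max_doc));
        for &doc in deleted {
            tombstone.delete(doc);
        }
        self.tombstones.insert(opstamp, tombstone);
    }

    /// Is `doc` deleted from the point of view of a reader at `opstamp`?
    fn is_deleted_at(&self, opstamp: Opstamp, doc: DocId) -> bool {
        self.tombstones
            .range(..=opstamp)
            .next_back()
            .map_or(false, |(_, tombstone)| tombstone.is_deleted(doc))
    }
}
```

Because older tombstones are never rewritten, a reader pinned to an earlier commit keeps seeing exactly the deletes that were known at that commit.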
tantivy needs to keep track of the files that it creates to be able to remove them on garbage collection. For that, it relies on a wrapper around a `Directory` and records each file in the list of created files BEFORE creating it.
The `.managed_files.json` file lists the files that, if they exist[^2] and if tantivy identifies them as no longer needed, should be deleted upon garbage collection.
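The "record first, create second" rule and the garbage collection pass can be sketched with an in-memory stand-in. Everything here is hypothetical (`ToyManagedDirectory`, `create`, `garbage_collect`): the `files` map plays the role of the filesystem, the `managed` set plays the role of the persisted managed-file list, and the set of living files is assumed to be supplied by the caller.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for a directory wrapper (NOT tantivy's implementation).
#[derive(Default)]
struct ToyManagedDirectory {
    /// Stand-in for the filesystem: path -> content.
    files: BTreeMap<String, Vec<u8>>,
    /// Stand-in for the persisted list of managed files.
    managed: BTreeSet<String>,
}

impl ToyManagedDirectory {
    /// Registers the path BEFORE creating the file, so that a failure between
    /// the two steps can leave a listed-but-missing file, never the opposite.
    fn create(&mut self, path: &str, data: &[u8]) {
        self.managed.insert(path.to_string()); // step 1: record
        self.files.insert(path.to_string(), data.to_vec()); // step 2: create
    }

    /// Deletes every managed file that is not in `living`.
    /// A managed file that does not exist is simply skipped.
    fn garbage_collect(&mut self, living: &BTreeSet<String>) {
        let dead: Vec<String> = self.managed.difference(living).cloned().collect();
        for path in dead {
            self.files.remove(&path); // no-op if the file was never created
            self.managed.remove(&path);
        }
    }
}
```

The ordering in `create` is the point of the sketch: the managed list may over-approximate the files on disk, but a file written through the wrapper is never missing from it.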
A living segment can be in one of the following states:
- Construction
- Uncommitted
- UncommittedInMerge
- Committed
- CommittedInMerge
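The states listed above map naturally onto a Rust enum. The variant names come from the list; the helper methods and the transitions in `on_commit` and `start_merge` are assumptions inferred from the state names, not a description of tantivy's actual code.

```rust
/// The five states of a living segment, as listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SegmentState {
    Construction,
    Uncommitted,
    UncommittedInMerge,
    Committed,
    CommittedInMerge,
}

impl SegmentState {
    fn is_committed(self) -> bool {
        matches!(self, SegmentState::Committed | SegmentState::CommittedInMerge)
    }

    fn is_in_merge(self) -> bool {
        matches!(
            self,
            SegmentState::UncommittedInMerge | SegmentState::CommittedInMerge
        )
    }

    /// ASSUMED transition: a commit promotes uncommitted segments and
    /// leaves every other state unchanged.
    fn on_commit(self) -> SegmentState {
        match self {
            SegmentState::Uncommitted => SegmentState::Committed,
            SegmentState::UncommittedInMerge => SegmentState::CommittedInMerge,
            other => other,
        }
    }

    /// ASSUMED transition: only a finished segment that is not already being
    /// merged can enter a merge. Returns `None` otherwise.
    fn start_merge(self) -> Option<SegmentState> {
        match self {
            SegmentState::Uncommitted => Some(SegmentState::UncommittedInMerge),
            SegmentState::Committed => Some(SegmentState::CommittedInMerge),
            _ => None,
        }
    }
}
```

Making the state an explicit value, rather than something implied by which list a segment sits in, lets invalid transitions be rejected in one place.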
[^1] Synchronisation with an external process (for instance when indexing logs) can be achieved by attaching a payload to the `.commit()` or, conversely, by bookkeeping the notion of opstamp.
[^2] They are not guaranteed to exist. For instance, a failure right after an update of the `managed.json` file can leave the index in a state where a managed file has been deleted or has never been created. This is not a problem. What we want to guard against is the converse: a file that has been created by tantivy and still exists on the filesystem should always be listed in the `managed.json` file.