Initial implementation of the metadata log #61

SeastarFS is a log-structured filesystem. Every shard will have 3 private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it looks like one)

Disk space is divided into clusters (typically several MiB each) of equal size, which is a multiple of the alignment (typically 4096 bytes). Each shard has its private pool of clusters (the assignment is stored in the bootstrap record). Each log consumes clusters one by one: it writes to the current cluster and, when that cluster becomes full, switches to a new one obtained from the pool of free clusters managed by cluster_allocator. The metadata log and the medium data log write data in the same manner: they fill the cluster gradually from left to right. The big data log takes a cluster and fills it completely with data at once; it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on them, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium data)
- a basic higher-level interface, e.g. path lookup, iterating over a directory
- bootstrapping the metadata log, i.e. reading it from disk and reconstructing the shard's filesystem structure as it was just before shutdown

File content is stored as a set of data vectors, each of one of three kinds: in-memory data, on-disk data, hole. Small writes are written directly to the metadata log, and because all metadata is kept in memory, these writes are also kept in memory, hence the in-memory kind. Medium and large data are not kept in memory, so they are represented using the on-disk kind. Enlarging a file via truncate may produce holes, hence the hole kind.

Directory entries are stored as metadata log entries; directory inodes have no content.

To-disk buffers hold data that will be written to disk. There are two kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The latter is implemented using the former, but provides a higher-level interface for appending metadata log entries rather than raw bytes. The normal to-disk buffer appends data sequentially, but when a flush occurs, the offset at which the next data will be appended is aligned up to the alignment, to ensure that writes to the same cluster do not overlap. The metadata to-disk buffer appends data using the normal to-disk buffer but does some formatting along the way.

The structure of the metadata log on disk is as follows:

    | checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
                   | <---- checkpointed data -----> |
    etc.

Every batch of metadata log entries is preceded by a checkpoint entry. Appending to the metadata log appends to the current batch of entries. A flush or a lack of space ends the current batch; the checkpoint entry is then updated (because it holds the CRC code of all checkpointed data), a write of the whole batch is requested, and a new checkpoint is started (if there is space for it). The last checkpoint in a cluster contains a special entry pointing to the next cluster used by the metadata log.

Bootstrapping is, in fact, just replaying all the actions from the metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, processes every checkpoint and the entries it covers, then follows the pointer in the last checkpoint to the next cluster, if there is one
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log ends here (the last checkpoint was partially written, or the metadata log really ended here, or there was some data corruption...) and we stop
  - if it is valid, it contains the length of the checkpointed data (metadata log entries), so we process all of them (an error there indicates data corruption that the CRC somehow did not catch, so we abort the whole bootstrap with an error)

Locking ensures that concurrent modifications of the metadata do not corrupt it. For example, creating a file is a complex operation: you have to create an inode, add a directory entry that gives this inode a path, and write the corresponding metadata log entries to disk. Simultaneous attempts to create the same file could corrupt the filesystem, not to mention a concurrent create and unlink on the same path. Thus a careful and robust locking mechanism is used. For details see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <[email protected]>
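The checkpoint-and-replay scheme described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `toy_checkpoint`, `append_batch` and `replay` are illustration names, not the types in this PR; the checksum is FNV-1a standing in for the real CRC; and next-cluster pointers, alignment and entry decoding are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in checksum (FNV-1a). The commit message says CRC; this is NOT it.
inline uint32_t toy_checksum(const uint8_t* data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

// Hypothetical checkpoint header: checksum and length of the batch after it.
struct toy_checkpoint {
    uint32_t checksum;
    uint32_t length;
};

// Appends one batch: a checkpoint header followed by the raw entry bytes.
inline void append_batch(std::vector<uint8_t>& log, const std::string& entries) {
    toy_checkpoint cp{
        toy_checksum(reinterpret_cast<const uint8_t*>(entries.data()), entries.size()),
        static_cast<uint32_t>(entries.size())};
    const auto* raw = reinterpret_cast<const uint8_t*>(&cp);
    log.insert(log.end(), raw, raw + sizeof(cp));
    log.insert(log.end(), entries.begin(), entries.end());
}

// Replays batches in order and stops at the first invalid checkpoint,
// which is treated as the end of the log.
inline std::vector<std::string> replay(const std::vector<uint8_t>& log) {
    std::vector<std::string> batches;
    size_t pos = 0;
    while (pos + sizeof(toy_checkpoint) <= log.size()) {
        toy_checkpoint cp;
        std::memcpy(&cp, log.data() + pos, sizeof(cp));
        pos += sizeof(cp);
        if (cp.length > log.size() - pos) {
            break;  // batch was only partially written
        }
        if (toy_checksum(log.data() + pos, cp.length) != cp.checksum) {
            break;  // checksum mismatch: log ends here
        }
        batches.emplace_back(reinterpret_cast<const char*>(log.data() + pos), cp.length);
        pos += cp.length;
    }
    return batches;
}
```

In this sketch a corrupted or partially written final batch is dropped silently while all earlier batches are still replayed, which mirrors the "invalid checkpoint means the log ends here" rule.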

Creating an unlinked file is useful for temporary files, or for exposing a file via a path only after it has been filled with contents. Signed-off-by: Krzysztof Małysa <[email protected]>

Signed-off-by: Krzysztof Małysa <[email protected]>

Some operations need to schedule the deletion of an inode in the background. One of these is closing an unlinked file when nobody else holds it open. Signed-off-by: Krzysztof Małysa <[email protected]>

Allows the same file to be visible via different paths or to give a path to an unlinked file. Signed-off-by: Krzysztof Małysa <[email protected]>

Signed-off-by: Krzysztof Małysa <[email protected]>

Marks that the file is opened by increasing the opened file counter. Signed-off-by: Michał Niciejewski <[email protected]>

Decreases the opened file counter. If the file is unlinked and the counter reaches zero, the file is automatically removed. Signed-off-by: Michał Niciejewski <[email protected]>
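The interaction between the opened file counter and unlinking can be sketched as below. This is a minimal model under stated assumptions: `toy_inode_table` and its methods are hypothetical names, removal happens synchronously rather than being scheduled in the background, and there is no locking or logging.

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-inode bookkeeping.
struct toy_inode_info {
    uint32_t links = 0;   // directory entries pointing at this inode
    uint32_t opened = 0;  // opened file counter
};

class toy_inode_table {
    std::map<uint64_t, toy_inode_info> _inodes;
    uint64_t _next = 1;

    // Removes the inode once it has no path and nobody holds it open.
    void maybe_remove(uint64_t ino) {
        auto it = _inodes.find(ino);
        if (it != _inodes.end() && it->second.links == 0 && it->second.opened == 0) {
            _inodes.erase(it);
        }
    }

public:
    // Creates an inode that is visible via exactly one path.
    uint64_t create_linked() {
        _inodes[_next].links = 1;
        return _next++;
    }
    void open(uint64_t ino) { ++_inodes.at(ino).opened; }
    void unlink(uint64_t ino) {
        --_inodes.at(ino).links;
        maybe_remove(ino);
    }
    void close(uint64_t ino) {
        --_inodes.at(ino).opened;
        maybe_remove(ino);
    }
    bool exists(uint64_t ino) const { return _inodes.count(ino) > 0; }
};
```

The point of the sketch is the ordering: unlinking an open file keeps the inode alive, and the last close is what triggers removal.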

Each write can be divided into multiple smaller writes, each falling into one of the following categories:
- small write: fewer than SMALL_WRITE_THRESHOLD bytes; stored fully in memory
- medium write: more than SMALL_WRITE_THRESHOLD bytes but fewer than cluster_size bytes; stored on disk by appending to the on-disk data log, where data from different writes can share one cluster
- big write: fills a whole cluster; stored on disk

For example, one write can be divided into multiple big writes, some small writes and some medium writes.

The current implementation avoids unnecessary data copying: data given by the caller is either written to disk directly or copied as a small write.

Added a cluster writer, which is used to perform medium writes. The cluster writer keeps a current position in the data log and appends new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <[email protected]>
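A rough sketch of how a single write length might be split into these categories follows. It is a deliberate simplification: `split_write`, `write_kind` and `write_part` are hypothetical names, only the length is considered (the real split also depends on buffer alignment and on the free space left in the current data-log cluster), and a remainder of exactly the threshold size is assumed to count as medium.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class write_kind { small, medium, big };

struct write_part {
    write_kind kind;
    size_t length;
};

// Splits a write of `length` bytes into whole-cluster big writes plus at most
// one trailing medium or small write. Alignment is ignored in this sketch.
inline std::vector<write_part> split_write(size_t length, size_t small_threshold,
                                           size_t cluster_size) {
    if (cluster_size == 0) {
        throw std::invalid_argument("cluster_size must be positive");
    }
    std::vector<write_part> parts;
    while (length >= cluster_size) {
        parts.push_back({write_kind::big, cluster_size});
        length -= cluster_size;
    }
    if (length >= small_threshold && length > 0) {
        parts.push_back({write_kind::medium, length});  // boundary choice is an assumption
    } else if (length > 0) {
        parts.push_back({write_kind::small, length});
    }
    return parts;
}
```

For instance, with a 1 MiB cluster and a 1024-byte threshold, a write of two clusters plus 5000 bytes yields two big parts and one medium part in this model.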

Truncate can be used on a file to change its size. When the new size is lower than current, the data at higher offsets will be lost, and when it's larger, the file will be filled with null bytes. Signed-off-by: Wojciech Mitros <[email protected]>
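How truncate could act on a file's data vectors is sketched below, assuming the vectors are sorted, contiguous and cover the whole file. The names `data_vec`, `toy_file` and `truncate_file` are hypothetical, and the sketch ignores logging and the actual storage behind each vector.

```cpp
#include <cstdint>
#include <vector>

enum class data_kind { in_memory, on_disk, hole };

// Hypothetical data vector: a contiguous range of the file of a given kind.
struct data_vec {
    data_kind kind;
    uint64_t offset;
    uint64_t length;
};

struct toy_file {
    std::vector<data_vec> vecs;  // sorted by offset, covering [0, size)
    uint64_t size = 0;
};

inline void truncate_file(toy_file& f, uint64_t new_size) {
    if (new_size > f.size) {
        // Growing: the new range reads as null bytes, represented by a hole.
        f.vecs.push_back({data_kind::hole, f.size, new_size - f.size});
    } else {
        // Shrinking: drop vectors that start at or beyond the new size...
        while (!f.vecs.empty() && f.vecs.back().offset >= new_size) {
            f.vecs.pop_back();
        }
        // ...and trim the one that straddles the new end, if any.
        if (!f.vecs.empty()) {
            data_vec& last = f.vecs.back();
            if (last.offset + last.length > new_size) {
                last.length = new_size - last.offset;
            }
        }
    }
    f.size = new_size;
}
```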

Reads file data from disk and memory based on information stored in inode's data vectors. Not optimized version - reads from disk are always read into temporary buffers before copying to the buffer given by the caller. Signed-off-by: Michał Niciejewski <[email protected]>

Provides an interface to query file attributes, which include permissions, btime, mtime and ctime. Signed-off-by: Krzysztof Małysa <[email protected]>

The test checks, on small examples, that the data a to_disk_buffer writes to disk matches the data appended to the buffer and that the remaining buffer space is calculated correctly. Signed-off-by: Wojciech Mitros <[email protected]>
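The remaining-space bookkeeping that such a test exercises can be modelled with a small sketch. It tracks only positions, not bytes; `toy_to_disk_buffer` and its `result` values are hypothetical names, and the capacity is assumed to be a multiple of the alignment, as clusters are.

```cpp
#include <algorithm>
#include <cstddef>

// Position-only model of a to-disk buffer backed by one cluster.
class toy_to_disk_buffer {
    size_t _capacity;
    size_t _alignment;
    size_t _pos = 0;

public:
    enum class result { appended, too_big };

    toy_to_disk_buffer(size_t capacity, size_t alignment)
        : _capacity(capacity), _alignment(alignment) {}

    size_t bytes_left() const { return _capacity - _pos; }

    // Appends sequentially, or refuses if the data would not fit.
    result append(size_t len) {
        if (len > bytes_left()) {
            return result::too_big;
        }
        _pos += len;
        return result::appended;
    }

    // After a flush the next append starts at an aligned offset, so later
    // writes never overlap a block that has already been written.
    void flush() {
        size_t aligned = (_pos + _alignment - 1) / _alignment * _alignment;
        _pos = std::min(aligned, _capacity);
    }
};
```

Note that a flush can consume space: in this model, flushing after a 100-byte append to an 8192-byte buffer with 4096-byte alignment leaves 4096 bytes, not 8092.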

Added mockers:
- mockers store information about every operation
- mockers store a list of virtually created mockers

Added tests for the metadata_to_disk_buffer mocker. The tests check that the mocker behaves similarly to metadata_to_disk_buffer.

Signed-off-by: Michał Niciejewski <[email protected]>

- random tests
- tests for corner cases
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in data-log cluster
  * split single write into more smaller writes because of unaligned buffer
  * split big write (bigger than cluster size) into multiple writes

Signed-off-by: Michał Niciejewski <[email protected]>

Checks whether the data that will be written to disk after truncate is correct, the reads from a truncated file are accurate, and the file's metadata is set to the new size. Signed-off-by: Wojciech Mitros <[email protected]>

For every ondisk entry, check that:
- it's correctly appended to the buffer when it would fit
- the buffer returns TOO_BIG when it wouldn't fit
- it's written to disk after a successful append and flush

Signed-off-by: Wojciech Mitros <[email protected]>

Optimization for aligned reads. When the on-disk data and the given buffer are properly aligned, the data read from disk is not stored in a temporary buffer but is read directly into the buffer given by the caller. Added device_reader to perform unaligned reads with caching. Signed-off-by: Michał Niciejewski <[email protected]>
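The decision between the direct path and the temporary-buffer path could look roughly like the check below. `can_read_directly` is a hypothetical helper, not a function from this PR, and requiring the length to be aligned as well is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when a read can go straight into the caller's buffer:
// the disk offset, the buffer address and (by assumption) the length
// must all be multiples of the alignment.
inline bool can_read_directly(uint64_t disk_offset, const void* buffer, size_t length,
                              size_t alignment) {
    const auto addr = reinterpret_cast<std::uintptr_t>(buffer);
    return disk_offset % alignment == 0 && addr % alignment == 0 &&
           length % alignment == 0;
}
```

When the check fails, a reader would fall back to reading aligned blocks into a temporary buffer and copying the requested range out of it.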

Random test checking aligned writes and reads optimizations. Signed-off-by: Michał Niciejewski <[email protected]>

Checks if there is access to the newly created directories after bootstrapping. Signed-off-by: Aleksander Sorokin <[email protected]>

Commits on Apr 20, 2020