Initial implementation of the metadata log #61
Commits on Apr 20, 2020
fs: metadata_log: add base implementation
SeastarFS is a log-structured filesystem. Every shard will have 3 private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it behaves like one)

Disk space is divided into clusters (typically several MiB each) of equal size, which is a multiple of the alignment (typically 4096 bytes). Each shard has its private pool of clusters (the assignment is stored in the bootstrap record). Each log consumes clusters one by one: it writes to the current cluster and, when that cluster becomes full, switches to a new one obtained from the pool of free clusters managed by cluster_allocator. The metadata log and the medium data log write data in the same manner: they fill the cluster gradually from left to right. The big data log takes a cluster and completely fills it with data at once; it is used only during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on them, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium data)
- a basic higher-level interface, e.g. path lookup, iterating over a directory
- bootstrapping the metadata log, i.e. reading the metadata log from disk and reconstructing the shard's filesystem structure from just before shutdown

File content is stored as a set of data vectors, each of one of three kinds: in-memory data, on-disk data, hole. Small writes are written directly to the metadata log and, because all metadata is kept in memory, these writes are also kept in memory, hence the in-memory kind. Medium and large data are not stored in memory, so they are represented using the on-disk kind. Enlarging a file via truncate may produce holes, hence the hole kind.

Directory entries are stored as metadata log entries; directory inodes have no content.

To disk buffers accumulate data that will be written to disk. There are two kinds: the (normal) to disk buffer and the metadata to disk buffer. The latter is implemented using the former, but provides a higher-level interface for appending metadata log entries rather than raw bytes.

The normal to disk buffer appends data sequentially, but when a flush occurs, the offset at which the next data will be appended is aligned up to the alignment, to ensure that writes to the same cluster do not overlap.

The metadata to disk buffer appends data using the normal to disk buffer but does some formatting along the way. The structure of the metadata log on disk is as follows:

    | checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
                   | <---- checkpointed data -----> |
    etc.

Every batch of metadata log entries is preceded by a checkpoint entry. Appending to the metadata log appends to the current batch of entries. A flush or a lack of space ends the current batch: the checkpoint entry is updated (because it holds the CRC code of all the checkpointed data), a write of the whole batch is requested, and a new checkpoint is started (if there is space for it). The last checkpoint in a cluster contains a special entry pointing to the next cluster used by the metadata log.

Bootstrapping is, in fact, just replaying all the actions from the metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, processes each checkpoint and the entries it covers, until it reaches the last checkpoint, which contains the pointer to the next cluster
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log ends here (the last checkpoint was partially written, or the metadata log really ended here, or there was some data corruption...) and we stop
  - if it is correct, it contains the length of the checkpointed data (metadata log entries), so we process all of them (an error there indicates that there was data corruption but the CRC is still somehow correct, so we abort the whole bootstrapping with an error)

Locking ensures that concurrent modifications of the metadata do not corrupt it. E.g. creating a file is a complex operation: you have to create an inode, add a directory entry that represents this inode with a path, and write the corresponding metadata log entries to disk. Simultaneous attempts to create the same file could corrupt the file system, not to mention a concurrent create and unlink on the same path. Thus a careful and robust locking mechanism is used. For details see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <[email protected]>
Commit acd7baf
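The checkpoint-and-replay scheme described in the base implementation commit can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the code from the commit: `checkpointed_batch`, `make_batch`, `replay`, and `crc32_bitwise` are hypothetical names, entries are modelled as plain strings, and a standard bitwise CRC-32 stands in for whatever CRC variant the real checkpoint entry stores.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Standard bitwise CRC-32 (reflected polynomial 0xEDB88320). Assumption: a
// stand-in for the CRC that the real checkpoint entry holds.
std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
    }
    return ~crc;
}

// Hypothetical model of one checkpoint plus the entries it covers.
struct checkpointed_batch {
    std::uint32_t stored_crc;          // CRC recorded in the checkpoint entry
    std::vector<std::string> entries;  // serialized metadata log entries
};

// CRC over all checkpointed data of a batch.
std::uint32_t batch_crc(const std::vector<std::string>& entries) {
    std::string joined;
    for (const auto& entry : entries) {
        joined += entry;
    }
    return crc32_bitwise(joined);
}

// Ends a batch: the checkpoint is filled in once all entries are known.
checkpointed_batch make_batch(std::vector<std::string> entries) {
    std::uint32_t crc = batch_crc(entries);
    return {crc, std::move(entries)};
}

// Replays batches in order and stops at the first invalid checkpoint,
// treating it as the end of the log.
std::vector<std::string> replay(const std::vector<checkpointed_batch>& log) {
    std::vector<std::string> applied;
    for (const auto& batch : log) {
        if (batch_crc(batch.entries) != batch.stored_crc) {
            break;  // partially written checkpoint, real end of log, or corruption
        }
        applied.insert(applied.end(), batch.entries.begin(), batch.entries.end());
    }
    return applied;
}
```

In this model a batch whose checkpoint fails validation hides every later batch as well, which mirrors the "metadata log ends here" rule in the commit message.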
fs: metadata_log: add operation for creating and opening unlinked file
Creating an unlinked file may be useful as a temporary file, or to expose the file via a path only after it has been filled with contents. Signed-off-by: Krzysztof Małysa <[email protected]>
Commit fdccca0
fs: metadata_log: add creating files and directories
Signed-off-by: Krzysztof Małysa <[email protected]>
Commit 477b6d4
fs: metadata_log: add private operation for deleting inode
Some operations need to schedule the deletion of an inode in the background. One of these is closing an unlinked file when nobody else holds it open. Signed-off-by: Krzysztof Małysa <[email protected]>
Commit ac00461
fs: metadata_log: add link operation
Allows the same file to be visible via different paths or to give a path to an unlinked file. Signed-off-by: Krzysztof Małysa <[email protected]>
Commit b20b693
fs: metadata_log: add unlinking files and removing directories
Signed-off-by: Krzysztof Małysa <[email protected]>
Commit f18a1a1
fs: metadata_log: add opening file
Marks that the file is opened by increasing the opened file counter. Signed-off-by: Michał Niciejewski <[email protected]>
Commit d03440b
fs: metadata_log: add closing file
Decreases opened file counter. If the file is unlinked and the counter is zero then the file is automatically removed. Signed-off-by: Michał Niciejewski <[email protected]>
Commit a36fd7f
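The interplay between the opened file counter and unlinking, as described in the open and close commits, might look roughly like the following. This is a minimal sketch, assuming a hypothetical `inode_table_sketch` class that tracks only a link count and an opened file counter per inode; the real metadata log also handles locking and writes log entries, which are omitted here.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-inode bookkeeping: how many paths point at the inode and
// how many times it is currently opened.
struct inode_state {
    unsigned links = 0;
    unsigned opened_files_count = 0;
};

class inode_table_sketch {
    std::unordered_map<std::uint64_t, inode_state> _inodes;
    std::uint64_t _next_inode = 1;

    // Removes the inode once it is neither linked nor opened.
    void remove_if_unused(std::uint64_t inode) {
        auto it = _inodes.find(inode);
        if (it != _inodes.end() && it->second.links == 0
                && it->second.opened_files_count == 0) {
            _inodes.erase(it);
        }
    }

public:
    // Creates a file that is not reachable via any path yet.
    std::uint64_t create_unlinked() {
        std::uint64_t inode = _next_inode++;
        _inodes[inode] = inode_state{};
        return inode;
    }

    bool exists(std::uint64_t inode) const { return _inodes.count(inode) != 0; }

    // The callers are assumed to pass existing inodes and to balance
    // link/unlink and open/close calls; at() throws for unknown inodes.
    void link(std::uint64_t inode) { ++_inodes.at(inode).links; }
    void open(std::uint64_t inode) { ++_inodes.at(inode).opened_files_count; }

    void unlink(std::uint64_t inode) {
        --_inodes.at(inode).links;
        remove_if_unused(inode);
    }

    void close(std::uint64_t inode) {
        --_inodes.at(inode).opened_files_count;
        remove_if_unused(inode);
    }
};
```

With this model, an unlinked file stays alive while at least one opener remains and disappears on the last close, whereas a linked file survives closing and is removed only after it is unlinked and closed.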
fs: metadata_log: add write operation
Each write can be divided into multiple smaller writes, each of which falls into one of the following categories:
- small write: writes below SMALL_WRITE_THRESHOLD bytes; these writes are stored fully in memory
- medium write: writes above SMALL_WRITE_THRESHOLD and below cluster_size bytes; these writes are stored on disk, appended to the on-disk data log, where data from different writes can share one cluster
- big write: writes that fully fit into one cluster; stored on disk

For example, one write can be divided into multiple big writes, some small writes and some medium writes.

The current implementation does not copy data unnecessarily. Data given by the caller is either used directly to write to disk or is copied as a small write.

Added cluster writer, which is used to perform medium writes. Cluster writer keeps a current position in the data log and appends new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <[email protected]>
Commit d963289
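The write categories from the write operation commit can be sketched as a simple planner. This is a deliberately simplified illustration, not the algorithm from the commit: `plan_write`, `write_kind`, and `write_chunk` are hypothetical names, a big write is assumed to be a chunk that fills exactly one cluster, a remainder exactly equal to the threshold is treated as medium, and the buffer-alignment and free-space splits that the tests mention are ignored.

```cpp
#include <cstddef>
#include <vector>

enum class write_kind { small_write, medium_write, big_write };

// One piece of a caller's write: which category it falls into and which byte
// range of the caller's buffer it covers.
struct write_chunk {
    write_kind kind;
    std::size_t offset;
    std::size_t length;
};

// Splits a write of `length` bytes into whole-cluster big writes followed by
// at most one small or medium remainder. Requires cluster_size > 0.
std::vector<write_chunk> plan_write(std::size_t length,
                                    std::size_t small_write_threshold,
                                    std::size_t cluster_size) {
    std::vector<write_chunk> plan;
    std::size_t offset = 0;
    while (length - offset >= cluster_size) {
        plan.push_back({write_kind::big_write, offset, cluster_size});
        offset += cluster_size;
    }
    std::size_t rest = length - offset;
    if (rest > 0) {
        write_kind kind = rest < small_write_threshold ? write_kind::small_write
                                                       : write_kind::medium_write;
        plan.push_back({kind, offset, rest});
    }
    return plan;
}
```

For instance, with a 1 MiB cluster and a 4096-byte threshold, a write of two clusters plus 5000 bytes would be planned as two big writes followed by one medium write.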
fs: metadata_log: add truncate operation
Truncate can be used on a file to change its size. When the new size is lower than current, the data at higher offsets will be lost, and when it's larger, the file will be filled with null bytes. Signed-off-by: Wojciech Mitros <[email protected]>
Commit dc45656
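The effect of truncate on a file's data vectors can be sketched as follows. This is an illustration under assumptions: `data_vector`, `file_layout`, and `truncate_layout` are hypothetical names, a file is modelled as an ordered list of (kind, length) pairs, and growing the file is modelled as appending a hole, which reads back as null bytes.

```cpp
#include <cstdint>
#include <vector>

enum class data_kind { in_memory, on_disk, hole };

// Hypothetical simplified data vector: only its kind and length are tracked.
struct data_vector {
    data_kind kind;
    std::uint64_t length;
};

using file_layout = std::vector<data_vector>;

std::uint64_t file_size(const file_layout& file) {
    std::uint64_t size = 0;
    for (const auto& vec : file) {
        size += vec.length;
    }
    return size;
}

// Shrinking drops data at higher offsets; growing appends a hole.
void truncate_layout(file_layout& file, std::uint64_t new_size) {
    std::uint64_t size = file_size(file);
    if (new_size > size) {
        file.push_back({data_kind::hole, new_size - size});
        return;
    }
    std::uint64_t excess = size - new_size;
    while (excess > 0) {
        data_vector& last = file.back();
        if (last.length <= excess) {
            excess -= last.length;  // the whole vector lies past the new end
            file.pop_back();
        } else {
            last.length -= excess;  // trim the tail of the last vector
            excess = 0;
        }
    }
}
```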
fs: metadata_log: add read operation
Reads file data from disk and memory based on the information stored in the inode's data vectors. This version is not optimized: data read from disk always goes through temporary buffers before being copied to the buffer given by the caller. Signed-off-by: Michał Niciejewski <[email protected]>
Commit d87caf0
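A read that walks the inode's data vectors could be sketched like this. It is a toy model with assumed names (`read_data_vector`, `read_file`): the disk is a `std::string`, every data vector carries its own length, and on-disk data is first copied into a temporary buffer, as in the unoptimized version described in the read operation commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

enum class read_data_kind { in_memory, on_disk, hole };

// Hypothetical data vector for the read path.
struct read_data_vector {
    read_data_kind kind;
    std::size_t length;
    std::string memory;       // payload, used when kind == in_memory
    std::size_t disk_offset;  // position on "disk", used when kind == on_disk
};

// Reads up to `count` bytes starting at file offset `offset`. Reads past the
// end of the file return fewer bytes.
std::string read_file(const std::vector<read_data_vector>& vectors,
                      const std::string& disk,
                      std::size_t offset, std::size_t count) {
    std::string out;
    std::size_t vec_begin = 0;
    for (const auto& vec : vectors) {
        std::size_t vec_end = vec_begin + vec.length;
        // Intersection of the requested range with this data vector.
        std::size_t from = std::max(offset, vec_begin);
        std::size_t to = std::min(offset + count, vec_end);
        if (from < to) {
            std::size_t skip = from - vec_begin;
            std::size_t len = to - from;
            switch (vec.kind) {
            case read_data_kind::in_memory:
                out += vec.memory.substr(skip, len);
                break;
            case read_data_kind::on_disk: {
                // Temporary buffer first, then copy to the output.
                std::string tmp = disk.substr(vec.disk_offset + skip, len);
                out += tmp;
                break;
            }
            case read_data_kind::hole:
                out.append(len, '\0');  // holes read back as null bytes
                break;
            }
        }
        vec_begin = vec_end;
    }
    return out;
}
```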
fs: metadata_log: add stat() operation
Provides an interface to query file attributes, which include permissions, btime, mtime and ctime. Signed-off-by: Krzysztof Małysa <[email protected]>
Commit 4943abe
tests: fs: add to_disk_buffer test
The test checks whether the data written by a to_disk_buffer to disk is the same as the data appended to the buffer and the remaining buffer space is correctly calculated on small examples. Signed-off-by: Wojciech Mitros <[email protected]>
Commit 4dab3d9
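The behaviour that the to_disk_buffer test exercises (appended data reaches disk unchanged, remaining space is tracked, and the append offset is aligned up after a flush) can be modelled in a few lines. This is a sketch with assumed names: `to_disk_buffer_sketch` and its methods are hypothetical and do not reproduce the real class's interface, and the "disk" is just a second in-memory vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Rounds value up to the nearest multiple of alignment (alignment > 0).
std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical model of a buffer covering one cluster. cluster_size is
// assumed to be a multiple of alignment.
class to_disk_buffer_sketch {
    std::vector<char> _buffer;     // staging area, one cluster long
    std::vector<char> _disk;       // stand-in for the cluster on disk
    std::size_t _alignment;
    std::size_t _flushed_end = 0;  // everything before this offset was written
    std::size_t _next_offset = 0;  // where the next append lands

public:
    to_disk_buffer_sketch(std::size_t cluster_size, std::size_t alignment)
        : _buffer(cluster_size, 0), _disk(cluster_size, 0), _alignment(alignment) {}

    std::size_t bytes_left() const { return _buffer.size() - _next_offset; }
    std::size_t next_offset() const { return _next_offset; }
    const std::vector<char>& disk() const { return _disk; }

    // Appends sequentially; refuses data that does not fit.
    bool append(const std::string& data) {
        if (data.size() > bytes_left()) {
            return false;
        }
        std::copy(data.begin(), data.end(), _buffer.begin() + _next_offset);
        _next_offset += data.size();
        return true;
    }

    // Writes the unflushed, alignment-padded range and moves the append
    // offset to the next aligned position, so later writes never overlap it.
    void flush() {
        std::size_t end = align_up(_next_offset, _alignment);
        std::copy(_buffer.begin() + _flushed_end, _buffer.begin() + end,
                  _disk.begin() + _flushed_end);
        _flushed_end = end;
        _next_offset = end;
    }
};
```

The cost of this scheme is that every flush may waste up to one alignment unit of space, which is why the remaining-space calculation is worth testing on small examples.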
tests: fs: added metadata_to_disk_buffer and cluster_writer mockers
Added mockers:
- mockers store information about every operation
- store list of virtually created mockers

Added tests for metadata_to_disk_buffer mocker. Tests check that the mocker behaves similarly to metadata_to_disk_buffer.

Signed-off-by: Michał Niciejewski <[email protected]>
Commit 48f5095
- random tests
- tests for corner cases
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in data-log cluster
  * split single write into more smaller writes because of unaligned buffer
  * split big write (bigger than cluster size) into multiple writes

Signed-off-by: Michał Niciejewski <[email protected]>
Commit b1b184a
tests: fs: add truncate operation test
Checks whether the data that will be written to disk after truncate is correct, whether reads from a truncated file are accurate, and whether the file's metadata is set to the new size. Signed-off-by: Wojciech Mitros <[email protected]>
Commit d053418
tests: fs: add metadata_to_disk_buffer unit tests
For every ondisk entry check if:
- it's correctly appended to the buffer when it would fit
- the buffer returns TOO_BIG when it wouldn't fit
- it's written to disk after successful append and flush

Signed-off-by: Wojciech Mitros <[email protected]>
Commit 47f685e
fs: read: add optimization for aligned reads
Optimization for aligned reads. When the on-disk data and the given buffer are properly aligned, the data read from disk is not stored in a temporary buffer but is read directly into the buffer given by the caller. Added device_reader to perform unaligned reads with caching. Signed-off-by: Michał Niciejewski <[email protected]>
Commit 23f9e3e
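The decision behind the aligned-read optimization can be sketched as a simple predicate. This is an assumption-laden illustration: `can_read_directly` is a hypothetical helper, and requiring the length to be aligned as well is a guess that follows from how direct disk I/O usually works, not something the commit message states.

```cpp
#include <cstdint>

// True when value is a multiple of alignment (alignment > 0).
bool is_aligned(std::uint64_t value, std::uint64_t alignment) {
    return value % alignment == 0;
}

// Decides whether disk data can be read straight into the caller's buffer.
// When it returns false, the read would go through a temporary aligned
// buffer (or a caching reader) and be copied afterwards.
bool can_read_directly(const void* buffer, std::uint64_t disk_offset,
                       std::uint64_t length, std::uint64_t alignment) {
    auto address = reinterpret_cast<std::uintptr_t>(buffer);
    return is_aligned(address, alignment)
        && is_aligned(disk_offset, alignment)
        && is_aligned(length, alignment);
}
```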
tests: fs: add tests for aligned reads and writes
Random test checking aligned writes and reads optimizations. Signed-off-by: Michał Niciejewski <[email protected]>
Commit 7b96070
tests: fs: add basic test for metadata log bootstrapping
Checks if there is access to the newly created directories after bootstrapping. Signed-off-by: Aleksander Sorokin <[email protected]>
Commit 47620f0