
Initial implementation of the metadata log #61

Merged
merged 20 commits into from
Apr 20, 2020

Commits on Apr 20, 2020

  1. fs: metadata_log: add base implementation

    SeastarFS is a log-structured filesystem. Every shard will have 3
    private logs:
    - metadata log
    - medium data log
    - big data log (not actually a log, but in the big picture it behaves
      like one)
    
    Disk space is divided into clusters (typically several MiB each) that
    all have the same size, which is a multiple of the alignment (typically
    4096 bytes). Each shard has its private pool of clusters (the assignment
    is stored in the bootstrap record). Each log consumes clusters one by
    one -- it writes the current one, and when the cluster becomes full, the
    log switches to a new one obtained from the pool of free clusters
    managed by the cluster_allocator. The metadata log and the medium data
    log write data in the same manner: they fill up the cluster gradually
    from left to right. The big data log takes a cluster and fills it
    completely with data at once -- it is used only during big writes.
    
    This commit adds the skeleton of the metadata log:
    - data structures for holding metadata in memory, with all operations on
      them, i.e. manipulating files and their contents
    - locking logic (a detailed description can be found in metadata_log.hh)
    - buffers for writing logs to disk (one for metadata and one for medium
      data)
    - a basic higher-level interface, e.g. path lookup and iterating over a
      directory
    - bootstrapping the metadata log, i.e. reading the metadata log from
      disk and reconstructing the shard's filesystem structure from just
      before shutdown
    
    File content is stored as a set of data vectors that may have one of
    three kinds: in-memory data, on-disk data, or a hole. Small writes are
    written directly to the metadata log, and because all metadata is kept
    in memory, these writes are in memory as well -- hence the in-memory
    kind. Medium and large data are not stored in memory, so they are
    represented using the on-disk kind. Enlarging a file via truncate may
    produce holes, hence the hole kind.
    
    Directory entries are stored as metadata log entries -- directory inodes
    have no content.
    
    To-disk buffers accumulate data that will be written to disk. There are
    two kinds: the (normal) to-disk buffer and the metadata to-disk buffer.
    The latter is implemented using the former, but provides a higher-level
    interface for appending metadata log entries rather than raw bytes.
    
    The normal to-disk buffer appends data sequentially, but when a flush
    occurs, the offset at which the next data will be appended is aligned up
    to the alignment to ensure that writes to the same cluster are
    non-overlapping.
    
    Metadata to disk buffer appends data using normal to disk buffer but
    does some formatting along the way. The structure of the metadata log on
    disk is as follows:
    | checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
                   | <---- checkpointed data -----> |
    etc. Every batch of metadata log entries is preceded by a checkpoint
    entry. Appending to the metadata log extends the current batch of
    entries. A flush, or running out of space, ends the current batch: the
    checkpoint entry is updated (because it holds the CRC of all the
    checkpointed data), a write of the whole batch is requested, and a new
    checkpoint is started (if there is space for it). The last checkpoint in
    a cluster contains a special entry pointing to the next cluster used by
    the metadata log.
    
    Bootstrapping is, in fact, just a replay of all the actions from the
    metadata log that were saved on disk. It works as follows:
    - metadata log clusters are read one by one
    - for each cluster, as long as the last checkpoint contains a pointer to
      the next cluster, each checkpoint and the entries it covers are
      processed
    - processing works as follows:
      - the checkpoint entry is read; if it is invalid, the metadata log
        ends here (the last checkpoint was partially written, or the
        metadata log really ended here, or there was some data
        corruption...) and we stop
      - if it is valid, it contains the length of the checkpointed data
        (metadata log entries), so we process all of them (an error there
        indicates data corruption where the CRC is still somehow correct, so
        we abort the whole bootstrap with an error)
    
    Locking ensures that concurrent modifications of the metadata do not
    corrupt it. E.g. creating a file is a complex operation: you have to
    create an inode, add a directory entry that associates a path with this
    inode, and write the corresponding metadata log entries to disk.
    Simultaneous attempts to create the same file could corrupt the file
    system, not to mention a concurrent create and unlink on the same
    path... Thus a careful and robust locking mechanism is used. For details
    see metadata_log.hh.
    
    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit acd7baf
  2. fs: metadata_log: add operation for creating and opening unlinked file

    Creating an unlinked file may be useful for temporary files, or to
    expose a file via a path only after it has been filled with contents.
    
    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit fdccca0
  3. fs: metadata_log: add creating files and directories

    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit 477b6d4
  4. fs: metadata_log: add private operation for deleting inode

    Some operations need to schedule inode deletion in the background. One
    of them is closing an unlinked file when nobody else holds it open.
    
    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit ac00461
  5. fs: metadata_log: add link operation

    Allows the same file to be visible via different paths, or gives a path
    to an unlinked file.
    
    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit b20b693
  6. fs: metadata_log: add unlinking files and removing directories

    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit f18a1a1
  7. fs: metadata_log: add opening file

    Marks the file as open by increasing the opened-file counter.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit d03440b
  8. fs: metadata_log: add closing file

    Decreases the opened-file counter. If the file is unlinked and the
    counter drops to zero, the file is automatically removed.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit a36fd7f
  9. fs: metadata_log: add write operation

    Each write can be divided into multiple smaller writes, each falling
    into one of the following categories:
    - small write: a write below SMALL_WRITE_THRESHOLD bytes; such writes
      are stored entirely in memory
    - medium write: a write above SMALL_WRITE_THRESHOLD and below
      cluster_size bytes; such writes are stored on disk, appended to the
      on-disk data log, where data from different writes can share one
      cluster
    - big write: a write that fills a whole cluster; stored on disk
    For example, one write can be divided into multiple big writes, some
    small writes and some medium writes. The current implementation avoids
    any unnecessary data copying: data given by the caller is either used
    directly for the write to disk or copied as a small write.
    
    Added the cluster writer, which is used to perform medium writes. The
    cluster writer keeps the current position in the data log and appends
    new data by writing it directly to disk.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit d963289
  10. fs: metadata_log: add truncate operation

    Truncate can be used on a file to change its size. When the new size is
    smaller than the current one, the data at higher offsets is lost; when
    it is larger, the file is extended with null bytes.
    
    Signed-off-by: Wojciech Mitros <[email protected]>
    wmitros authored and tropuq committed Apr 20, 2020
    Commit dc45656
  11. fs: metadata_log: add read operation

    Reads file data from disk and memory based on the information stored in
    the inode's data vectors. This is a non-optimized version: reads from
    disk always go through temporary buffers before being copied to the
    buffer given by the caller.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit d87caf0
  12. fs: metadata_log: add stat() operation

    Provides an interface to query file attributes, including permissions,
    btime, mtime and ctime.
    
    Signed-off-by: Krzysztof Małysa <[email protected]>
    varqox authored and tropuq committed Apr 20, 2020
    Commit 4943abe
  13. tests: fs: add to_disk_buffer test

    The test checks, on small examples, that the data written to disk by a
    to_disk_buffer matches the data appended to the buffer, and that the
    remaining buffer space is correctly calculated.
    
    Signed-off-by: Wojciech Mitros <[email protected]>
    wmitros authored and tropuq committed Apr 20, 2020
    Commit 4dab3d9
  14. tests: fs: added metadata_to_disk_buffer and cluster_writer mockers

    Added mockers:
    - mockers record information about every operation
    - a list of virtually created mockers is stored

    Added tests for the metadata_to_disk_buffer mocker. The tests check
    that the mocker behaves like metadata_to_disk_buffer.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit 48f5095
  15. tests: fs: add write test

    - random tests
    - tests for corner cases
      * basic single small writes
      * basic single medium writes
      * basic single large writes
      * new cluster allocation for medium writes
      * medium write split into two smaller writes due to lack of space in
        data-log cluster
      * split of a single write into several smaller writes because of an
        unaligned buffer
      * split of a big write (bigger than the cluster size) into multiple
        writes
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit b1b184a
  16. tests: fs: add truncate operation test

    Checks that the data written to disk after a truncate is correct, that
    reads from a truncated file are accurate, and that the file's metadata
    is set to the new size.
    
    Signed-off-by: Wojciech Mitros <[email protected]>
    wmitros authored and tropuq committed Apr 20, 2020
    Commit d053418
  17. tests: fs: add metadata_to_disk_buffer unit tests

    For every ondisk entry, check that:
    - it is correctly appended to the buffer when it fits
    - the buffer returns TOO_BIG when it would not fit
    - it is written to disk after a successful append and flush.
    
    Signed-off-by: Wojciech Mitros <[email protected]>
    wmitros authored and tropuq committed Apr 20, 2020
    Commit 47f685e
  18. fs: read: add optimization for aligned reads

    Optimization for aligned reads: when the on-disk data and the given
    buffer are properly aligned, the data read from disk is not staged in a
    temporary buffer but is read directly into the buffer given by the
    caller.

    Added device_reader to perform unaligned reads with caching.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit 23f9e3e
  19. tests: fs: add tests for aligned reads and writes

    A randomized test checking the aligned write and read optimizations.
    
    Signed-off-by: Michał Niciejewski <[email protected]>
    tropuq committed Apr 20, 2020
    Commit 7b96070
  20. tests: fs: add basic test for metadata log bootstrapping

    Checks that newly created directories are accessible after
    bootstrapping.
    
    Signed-off-by: Aleksander Sorokin <[email protected]>
    rokinsky authored and tropuq committed Apr 20, 2020
    Commit 47620f0