feat(mito): Implement SST format for mito2 #2178

Merged (38 commits) on Aug 17, 2023

@evenyag evenyag commented Aug 15, 2023

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

This PR implements the SST format for the mito2 engine.

The new SST format encodes the primary keys in a memory-comparable format and stores them as dictionary arrays. We distinguish different time series by comparing the keys of the dictionary array while decoding the RecordBatch.
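
As a rough illustration of the idea above, the sketch below encodes tag values into a byte string whose byte order matches tuple order, dictionary-encodes the keys of sorted rows, and then finds series boundaries by comparing dictionary indices only. The escaping scheme (`0x00` escaped as `0x00 0xFF`, each value terminated by `0x00 0x00`) and the function names `encode_primary_key`, `dictionary_encode`, and `series_ranges` are assumptions made for this example; the PR description does not specify the actual encoding used by mito2.

```rust
use std::ops::Range;

/// Encodes tag values into one byte string whose lexicographic byte order
/// matches the tuple order of the tags.
/// Hypothetical scheme (not taken from the PR): a 0x00 byte inside a value
/// is escaped as 0x00 0xFF, and every value ends with 0x00 0x00.
fn encode_primary_key(tags: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for tag in tags {
        for &b in tag.as_bytes() {
            out.push(b);
            if b == 0 {
                out.push(0xFF);
            }
        }
        out.extend_from_slice(&[0, 0]);
    }
    out
}

/// Dictionary-encodes the keys of rows that are already sorted by key.
/// Returns the distinct keys and, for every row, the index of its key.
fn dictionary_encode(keys: &[Vec<u8>]) -> (Vec<Vec<u8>>, Vec<u32>) {
    let mut values: Vec<Vec<u8>> = Vec::new();
    let mut indices = Vec::with_capacity(keys.len());
    for key in keys {
        // Rows are sorted, so a new key can only differ from the last one.
        if values.last() != Some(key) {
            values.push(key.clone());
        }
        indices.push((values.len() - 1) as u32);
    }
    (values, indices)
}

/// Splits rows into per-series ranges by comparing adjacent dictionary
/// indices; the encoded key bytes are never inspected here.
fn series_ranges(indices: &[u32]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=indices.len() {
        if i == indices.len() || indices[i] != indices[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}
```

With this layout, detecting a change of time series costs one integer comparison per row, and each distinct key is stored once per batch rather than once per row.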

We store three internal columns in parquet:

  • __primary_key, the primary key of the row (tags).
  • __sequence, the sequence number of a row.
  • __op_type, the op type of the row.

The schema of a parquet file is:

field 0, field 1, ..., field N, time index, primary key, sequence, op type
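
The column order above can be sketched as follows. The internal column names come from the list in this description; the helper names `sst_column_names` and `projection_indices`, and the rule that a projection always keeps the four trailing columns, are assumptions for illustration rather than the code in this PR.

```rust
/// Internal column names, as listed in the PR description.
const PRIMARY_KEY_COLUMN: &str = "__primary_key";
const SEQUENCE_COLUMN: &str = "__sequence";
const OP_TYPE_COLUMN: &str = "__op_type";

/// Returns the column names of an SST file in storage order:
/// all fields, then the time index, then the three internal columns.
fn sst_column_names(fields: &[&str], time_index: &str) -> Vec<String> {
    let mut names: Vec<String> = fields.iter().map(|f| f.to_string()).collect();
    names.push(time_index.to_string());
    names.push(PRIMARY_KEY_COLUMN.to_string());
    names.push(SEQUENCE_COLUMN.to_string());
    names.push(OP_TYPE_COLUMN.to_string());
    names
}

/// Computes the column indices to read for a projection over fields.
/// Assumption: the time index and the three internal columns are always
/// read, so only the field columns are filtered.
fn projection_indices(fields: &[&str], wanted: &[&str]) -> Vec<usize> {
    let mut indices = Vec::new();
    for (i, name) in fields.iter().enumerate() {
        if wanted.contains(name) {
            indices.push(i);
        }
    }
    // Time index, primary key, sequence, op type follow the fields.
    indices.extend(fields.len()..fields.len() + 4);
    indices
}
```

Because the four non-field columns sit at fixed offsets from the end of the schema, a reader can locate them without looking up names.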

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.

@evenyag evenyag changed the title feat(mito): Implement SST format feat(mito): Implement SST format for mito2 Aug 15, 2023
@evenyag evenyag marked this pull request as ready for review August 16, 2023 08:21

codecov bot commented Aug 16, 2023

Codecov Report

Merging #2178 (241ff3e) into develop (8ea1763) will decrease coverage by 0.39%.
Report is 4 commits behind head on develop.
The diff coverage is 76.72%.

@@             Coverage Diff             @@
##           develop    #2178      +/-   ##
===========================================
- Coverage    84.68%   84.29%   -0.39%     
===========================================
  Files          698      700       +2     
  Lines       112701   113147     +446     
===========================================
- Hits         95437    95377      -60     
- Misses       17264    17770     +506     

Review threads (outdated, resolved):

  • src/mito2/src/sst/parquet/reader.rs
  • src/mito2/src/sst/parquet/format.rs
  • src/mito2/src/read.rs (2 threads)
  • src/mito2/src/error.rs


@waynexia waynexia left a comment

LGTM except the missing tests

@evenyag evenyag self-assigned this Aug 17, 2023

@v0y4g3r v0y4g3r left a comment

LGTM

@waynexia waynexia added this pull request to the merge queue Aug 17, 2023
Merged via the queue into GreptimeTeam:develop with commit 4ba1215 Aug 17, 2023
13 checks passed
paomian pushed a commit to paomian/greptimedb that referenced this pull request Oct 19, 2023
* chore: update comment

* feat: stream writer takes arrow's types

* feat: Define Batch struct

* feat: arrow_schema_to_store

* refactor: rename

* feat: write parquet in new format with tsids

* feat: reader support projection

* feat: Impl read compat

* refactor: rename SchemaCompat to CompatRecordBatch

* feat: changing sst format

* feat: make it compile

* feat: remove tsid and some structs

* feat: from_sst_record_batch wip

* chore: push array

* chore: wip

* feat: decode batches from RecordBatch

* feat: reader converts record batches

* feat: remove compat mod

* chore: remove some codes

* feat: sort fields by column id

* test: test to_sst_arrow_schema

* feat: do not sort fields

* test: more test helpers

* feat: simplify projection

* fix: projection indices is incorrect

* refactor: define write/read format

* test: test write format

* test: test projection

* test: test convert record batch

* feat: remove unused errors

* refactor: wrap get_field_batch_columns

* chore: clippy

* chore: fix clippy

* feat: build arrow schema from region meta in ReadFormat

* feat: initialize the parquet reader at `build()`

* chore: fix typo