[FEATURE] Folder Structure Redesign #320

Open
joocer opened this issue Feb 15, 2022 · 0 comments

joocer commented Feb 15, 2022

Partitions => Partitions and Clusters

The current model partitions data like this:

dataset
    year_YYYY
        month_MM
            day_DD

The date parts of the name are the date the record was 'written' (this can be changed, but it is what they usually mean).

Problems:

  • These dates cannot be addressed directly; they are only available indirectly through the start_date and end_date variables.
  • Partitioning by fields is clunky: it requires the stream writer, which does not know when the set is finished, so subsequent reruns need to clean the slate before they can start writing.

Goals:

  • more than 100 files/subfolders in a folder only by exception
  • able to filter by dates, addressable like "DATASET_2022%" or the existing start_date, end_date
  • able to partition by arbitrary fields
  • able to read specific partitions (read all of the blobs below; if there's a set of by_ folders, read only one set)
  • able to query the list of partitions at a given level to enable looping or searching them
  • partitions able to be marked as complete and include some summary statistics
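
The goal of querying the partitions at a given level could look something like the sketch below. It assumes the store can return a flat list of blob names; `list_partitions` and the sample blob names are hypothetical and not part of any existing API.

```python
from typing import Iterable, List


def list_partitions(blob_names: Iterable[str], prefix: str) -> List[str]:
    """Return the distinct folder names found directly under `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    children = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        # Keep only entries with something beneath them (folders, not blobs).
        if "/" in remainder:
            children.add(remainder.split("/", 1)[0])
    return sorted(children)


# Hypothetical blob names, for illustration only.
blobs = [
    "dataset/year_2022/month_01/day_31/data.jsonl",
    "dataset/year_2022/month_02/day_14/data.jsonl",
    "dataset/year_2022/month_02/day_15/data.jsonl",
]
months = list_partitions(blobs, "dataset/year_2022")
days = list_partitions(blobs, "dataset/year_2022/month_02")
```

The returned names can then be looped over or searched, one level at a time.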

Change this to:

dataset
    year_YYYY
        month_MM
            day_DD
                hour_HH
                    minute_mm
                        by_<field_name>
                            <field_name>=<value>
                                as_at_<timestamp>
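
As a rough illustration of the layout above, a path could be composed like this. `build_partition_path` is a hypothetical helper, and the `as_at_` timestamp format is an assumption, since the proposal does not fix one.

```python
from datetime import datetime, timezone
from typing import Optional


def build_partition_path(
    dataset: str,
    written: datetime,
    field_name: Optional[str] = None,
    value: Optional[str] = None,
    as_at: Optional[datetime] = None,
) -> str:
    """Compose a blob prefix following the proposed folder layout."""
    parts = [
        dataset,
        f"year_{written:%Y}",
        f"month_{written:%m}",
        f"day_{written:%d}",
        f"hour_{written:%H}",
        f"minute_{written:%M}",
    ]
    if field_name is not None:
        # One by_ folder per partitioning field, then one folder per value.
        parts.append(f"by_{field_name}")
        parts.append(f"{field_name}={value}")
    if as_at is not None:
        # Assumed timestamp format; not specified in the proposal.
        parts.append(f"as_at_{as_at:%Y%m%dT%H%M%S}")
    return "/".join(parts)


written = datetime(2022, 2, 15, 9, 30, tzinfo=timezone.utc)
path = build_partition_path("dataset", written, "user", "joocer", as_at=written)
print(path)
```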

When writing, the partition can be set manually, taken from config, or left to a default. Information about each partition is kept in memory and can be preloaded if needed (e.g. when a job is continued).
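
A minimal sketch of how the writer might pick the partition. The precedence (manual, then config, then default) is an assumption, and `resolve_partition`, the `"partition"` config key and `DEFAULT_PARTITION` are all hypothetical names.

```python
from typing import Mapping, Optional

# Hypothetical default, mirroring the current date-based layout.
DEFAULT_PARTITION = "year_{yyyy}/month_{mm}/day_{dd}"


def resolve_partition(
    manual: Optional[str] = None,
    config: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the partition scheme: manual first, then config, then default."""
    if manual is not None:
        return manual
    if config is not None and "partition" in config:
        return config["partition"]
    return DEFAULT_PARTITION


from_manual = resolve_partition(manual="by_user", config={"partition": "by_country"})
from_config = resolve_partition(config={"partition": "by_country"})
from_default = resolve_partition()
```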

The metadata collector stops at the DATASET level (as it does now).

When reading, if a level above the by_ folders is specified, the by_ folder with the fewest partitions is selected. Opteryx will use the query plan to decide whether one by_ folder is better than another.
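
The fewest-partitions rule could be sketched as below. `select_by_folder` is a hypothetical helper that takes a mapping of by_ folder name to the partitions inside it; the alphabetical tie-break is an assumption, and query-plan-based selection is not covered.

```python
from typing import Dict, List, Optional


def select_by_folder(by_folders: Dict[str, List[str]]) -> Optional[str]:
    """Return the by_ folder holding the fewest partitions, or None if empty."""
    if not by_folders:
        return None
    # Iterating names in sorted order makes ties resolve alphabetically.
    return min(sorted(by_folders), key=lambda name: len(by_folders[name]))


# Hypothetical partition sets, for illustration only.
candidates = {
    "by_user": ["user=a", "user=b", "user=c"],
    "by_country": ["country=UK", "country=US"],
}
chosen = select_by_folder(candidates)
```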

For backwards compatibility, if there's no by_ partition, the reader will treat every subfolder as being in a virtual * partition.
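
One way the fallback might work, as a sketch only: `resolve_partition_sets` is a hypothetical helper that takes a mapping of subfolder name to its contents at one level and returns the partition sets a reader would choose from.

```python
from typing import Dict, List


def resolve_partition_sets(children: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the by_ sets at this level, or a single virtual '*' set."""
    by_sets = {
        name: entries for name, entries in children.items() if name.startswith("by_")
    }
    if by_sets:
        return by_sets
    # Legacy layout: every subfolder belongs to the virtual '*' partition.
    return {"*": sorted(children)}


legacy = resolve_partition_sets({"hour_01": ["b.jsonl"], "hour_00": ["a.jsonl"]})
redesigned = resolve_partition_sets({"by_user": ["user=a", "user=b"]})
```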

@joocer joocer added the feature label Feb 15, 2022
@joocer joocer self-assigned this Feb 15, 2022