The current model is to partition data like this:
dataset/
  year_YYYY/
    month_MM/
      day_DD/
Where the date parts of the name are the date the record is 'written' (although it can be changed, this is what it usually means).
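As a minimal sketch of the current layout, the path for a record can be derived from the date it was written. The function name `date_partition_path` is hypothetical and is not taken from the existing code.

```python
import datetime


def date_partition_path(dataset: str, written: datetime.date) -> str:
    """Build the folder path for the date a record was written.

    Hypothetical helper: it mirrors the dataset/year_YYYY/month_MM/day_DD
    layout described above, with zero-padded month and day.
    """
    return f"{dataset}/year_{written:%Y}/month_{written:%m}/day_{written:%d}"


if __name__ == "__main__":
    print(date_partition_path("dataset", datetime.date(2022, 3, 7)))
```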
Problems:
These dates cannot be addressed directly; they are only available indirectly, through the start_date and end_date variables.
Partitioning by fields is clunky: it relies on the stream writer, which does not know when the set is finished, so a subsequent rerun has to clean the slate before it can start writing.
Goals:
no more than 100 files or subfolders in a folder, except by exception
able to filter by dates, addressable like "DATASET_2022%" or the existing start_date, end_date
able to partition by arbitrary fields
able to read specific partitions (read all of the blobs below the chosen level; if there's a set of by_ folders, only read one set)
able to query the list of partitions at a given level to enable looping or searching them
partitions able to be marked as complete and include some summary statistics
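Two of the goals above, filtering by dates and querying the list of partitions at a given level, could be sketched over a flat list of blob paths as follows. The helper names (`partition_date`, `filter_by_dates`, `list_partitions`) are assumptions made for illustration, and the sketch assumes the existing year_/month_/day_ naming.

```python
import datetime
import re
from typing import List, Optional

# Matches the existing year_YYYY/month_MM/day_DD folder naming.
_DATE_PARTS = re.compile(r"year_(\d{4})/month_(\d{2})/day_(\d{2})")


def partition_date(path: str) -> Optional[datetime.date]:
    """Extract the partition date from a blob path, or None if it has none."""
    match = _DATE_PARTS.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return datetime.date(year, month, day)


def filter_by_dates(
    paths: List[str], start_date: datetime.date, end_date: datetime.date
) -> List[str]:
    """Keep only paths whose partition date is within [start_date, end_date]."""
    kept = []
    for path in paths:
        found = partition_date(path)
        if found is not None and start_date <= found <= end_date:
            kept.append(path)
    return kept


def list_partitions(paths: List[str], prefix: str) -> List[str]:
    """Return the distinct entry names directly beneath `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    children = {
        path[len(prefix):].split("/", 1)[0]
        for path in paths
        if path.startswith(prefix)
    }
    return sorted(children)
```

Listing one level at a time keeps looping and searching cheap, because a caller only needs to enumerate the folder it is interested in rather than the whole dataset.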
When writing, the partition can be set manually, taken from config, or left to a default. Information about each partition is kept in memory and can be preloaded if needed (e.g. when a job is continued).
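A rough sketch of how the writer might resolve the partition and track per-partition information is shown below. The precedence order (manual, then config, then default) is an assumption, as are the names `PartitionRegistry`, `PartitionInfo`, `DEFAULT_PARTITION` and the `"partition"` config key.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed default: the existing date-based layout.
DEFAULT_PARTITION = "year_{yyyy}/month_{mm}/day_{dd}"


@dataclass
class PartitionInfo:
    """Summary kept in memory for one partition (fields are illustrative)."""
    records: int = 0
    complete: bool = False


class PartitionRegistry:
    def __init__(
        self,
        config: Optional[Dict[str, str]] = None,
        preload: Optional[Dict[str, PartitionInfo]] = None,
    ) -> None:
        self.config = config or {}
        # Preloading lets a continued job pick up where it left off.
        self.partitions: Dict[str, PartitionInfo] = dict(preload or {})

    def resolve(self, manual: Optional[str] = None) -> str:
        """Assumed precedence: manual setting, then config, then default."""
        return manual or self.config.get("partition") or DEFAULT_PARTITION

    def record_write(self, partition: str, count: int = 1) -> PartitionInfo:
        """Update the in-memory record count for a partition."""
        info = self.partitions.setdefault(partition, PartitionInfo())
        info.records += count
        return info

    def mark_complete(self, partition: str) -> PartitionInfo:
        """Flag a partition as finished so reruns can tell it is done."""
        info = self.partitions.setdefault(partition, PartitionInfo())
        info.complete = True
        return info
```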
The metadata collector stops at the DATASET level (as it does now).
When reading, if a level above the by_ folders is specified, the by_ folder with the fewest partitions is selected. Opteryx will use the query plan to decide whether one by_ folder is better than another.
For backwards compatibility, if there's no by_ partition, every subfolder is treated as being in a virtual * partition.
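The read-side selection could look roughly like the sketch below. It assumes the reader already has a mapping from each subfolder name to the partitions inside it; `choose_partition_set` is a hypothetical name, ties are broken alphabetically as an arbitrary choice, and the query-plan-based choice that Opteryx would make is not modelled.

```python
from typing import Dict, List, Tuple


def choose_partition_set(children: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Pick which set of partitions to read beneath a folder.

    `children` maps each subfolder name to the partitions it contains.
    """
    by_sets = {
        name: parts for name, parts in children.items() if name.startswith("by_")
    }
    if not by_sets:
        # Backwards compatibility: no by_ folders, so every subfolder is
        # treated as belonging to a single virtual "*" partition.
        return "*", sorted(children)
    # Select the by_ folder with the fewest partitions; ties go to the
    # alphabetically first name so the result is deterministic.
    chosen = min(by_sets, key=lambda name: (len(by_sets[name]), name))
    return chosen, by_sets[chosen]
```

Reading only one by_ set matters because each set holds the same records arranged differently, so reading more than one would return duplicates.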
Partitions => Partitions and Clusters