Snap 2358 Sorted Column Batches on partitioning keys #1054

Open: wants to merge 392 commits into base branch master.
Conversation

@vibhaska (Contributor) commented Jun 18, 2018

Changes proposed in this pull request

Now users can create sorted column batches on partitioning keys using the DDL described below. This keeps each column batch sorted, which can be leveraged for better performance of point queries, range queries and colocated join queries on the partitioning columns. For more details please refer to https://jira.snappydata.io/browse/SNAP-2358
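As an illustration (not part of the PR text), the kinds of queries expected to benefit are point and range lookups on the partitioning column, for example:

// Illustrative only: queries on the partitioning column 'id' that sorted column batches
// are intended to speed up (table and bucket names as in the DDL samples below).
session.sql(s"select * from $colTableName where id = 42").collect()
session.sql(s"select * from $colTableName where id between 100 and 200").collect()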

TODO:

  1. Open Jira tickets for pending items and for any suggestions from code review.
  2. Take care of one known debugging issue, and of similar bugs.

Patch testing

Unit test
Precheckin

ReleaseNotes.txt changes

A sample DDL to create a table with sorted partitioning columns is:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id SORTING ASC')")

If no sorting is required, the above DDL would be:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id')")

Valid sorting identifiers are:
SORTING ASC
SORTING DESC
SORTING Ascending
SORTING Descending
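For example (an illustrative variation on the sample above, not taken from the PR text), a descending sort on the partitioning column would be declared as:

session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id SORTING DESC')")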

Other PRs

TIBCOSoftware/snappy-store#395

vivekwiz and others added 30 commits March 7, 2018 13:08
Includes the changes for the two issues and a bunch of other fixes found in testing.

- Implementation of StoreCallbacks.columnTableScan that translates Filters to Expressions
  and generates the code to apply the same locally to ColumnBatchIterator stats rows
- Changed smart connector iterator to use the new COLUMN_TABLE_SCAN procedure instead
  of multiple queries.
- Added passing of Filters to the plans and recreation of those in getPartitions if
  the parameter values have changed for ParamLiterals
- Fixed RowFormatScanRDD to regenerate filter clause in getPartitions if the parameter
  values have changed for ParamLiterals (not seen earlier because index columns were
  incorrect, which has been fixed in store)
- Perf fix to ColumnFormatIterator: keep track of updated delta stats separately, with
  forced fault-in like the full stats in DiskMultiColumnBatch, so that the entire batch
  does not need to be read if the filter can skip it using stats (a generic sketch of
  such stats-based skipping follows this commit's notes)
- Perf fix to RemoteEntriesIterator:
  - fetch both full stats and delta stats rows when fetching keys first time
  - sort and ensure both rows of a batch are together when fetching other columns
- Updated ParamLiteral serialization to replace its value with the updated one in LiteralValue
  since parameter may have changed (but the base Literal.value is a val and cannot be changed)
- Corrected RDD to be cleared in CachedDataFrame to use either the cachedRDD or the
  last used one for execution.
- Added handling of the new DECOMPRESS_IF_IN_MEMORY fetch type to return self (or null)
  if decompression cannot replace the underlying in-memory value
- Updated ColumnFormatValue to store disk RegionEntry instead of diskId since latter can change.
- Improved performance of the Snappy stats iterator: instead of looking up the deleted
  bitmask column for every stats column, iterate both together and add a negative size
  for deletes, if any.
- Added more transient expected exception types to SnappyTestRunner.
- Updated store link.
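To make the stats-based batch skipping mentioned in the notes above concrete, here is a minimal, hypothetical Scala sketch (BatchStats and canSkipBatch are illustrative names, not the actual DiskMultiColumnBatch/ColumnFormatIterator code):

// Hypothetical sketch only: a batch whose recorded [min, max] range cannot satisfy a
// range predicate on the filtered column is skipped without reading its column data.
case class BatchStats(min: Int, max: Int, numRows: Int)

def canSkipBatch(stats: BatchStats, lower: Int, upper: Int): Boolean =
  stats.max < lower || stats.min > upper  // no overlap with [lower, upper]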
Also refactored PooledKryoSerializer to add generic serialize/deserialize methods
that accept closures
- primary reason being that the StringStartsWith Filter requires a String as pattern
  and cannot hold "Any", so the hack of stuffing a ParamLiteral inside a Filter does
  not work; now using Expressions which are translated to Filters just before use if
  required (a rough sketch of this translation follows these notes)
- pushdown of filters from smart connector to server still uses Filter after conversion
  from Expression and when the ParamLiterals have been substituted with current values
- removed awkward handling of ParamLiterals inside Filters as a result of above changes
- fixed the StartsWith stats filter to use a ParamLiteral and generated code for the
  comparison against stats row bounds
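As a rough illustration of the Expression-to-Filter conversion described above (a hedged sketch built only on Spark's public classes, not the actual SnappyData code path), a Catalyst StartsWith expression can be translated to a sources.StringStartsWith Filter only once the pattern is a concrete string literal, which is why an unbound ParamLiteral cannot be stuffed inside the Filter:

import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal, StartsWith}
import org.apache.spark.sql.sources.{Filter, StringStartsWith}
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: translate a Catalyst expression to a data source Filter only when
// the pattern has been substituted with a concrete value; otherwise keep the Expression.
def toSourceFilter(expr: Expression): Option[Filter] = expr match {
  case StartsWith(a: Attribute, Literal(v: UTF8String, StringType)) =>
    Some(StringStartsWith(a.name, v.toString))
  case _ => None // not translatable yet; defer and keep using the Expression
}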