Snap 2358 Sorted Column Batches on partitioning keys #1054

Open: wants to merge 392 commits into base branch master.
Conversation

@vibhaska (Contributor) commented Jun 18, 2018

Changes proposed in this pull request

Now users can create sorted column batches on partitioning keys using the DDL described below. This keeps each column batch sorted, which can be leveraged for better performance of point queries, range queries and colocated join queries on the partitioning columns. For more details please refer to https://jira.snappydata.io/browse/SNAP-2358
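As an illustration (not part of the PR text), the kinds of queries expected to benefit are point and range lookups on the partitioning column, for example:

// Illustrative only: queries on the partitioning column 'id' that sorted column batches
// are intended to speed up (table and bucket names as in the DDL samples below).
session.sql(s"select * from $colTableName where id = 42").collect()
session.sql(s"select * from $colTableName where id between 100 and 200").collect()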

TODO:

  1. Open Jira tickets for pending items and for any suggestions from code review.
  2. Take care of one known debugging issue, and of similar bugs.

Patch testing

Unit test
Precheckin

ReleaseNotes.txt changes

A sample DDL to create a table with sorted partitioning columns is:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id SORTING ASC')")

If no sorting is required, the above DDL would be:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id')")

Valid sorting identifiers are:
SORTING ASC
SORTING DESC
SORTING Ascending
SORTING Descending
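For example (an illustrative variation on the sample above, not taken from the PR text), a descending sort on the partitioning column would be declared as:

session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
  s"using column options(buckets '$numBuckets', partition_by 'id SORTING DESC')")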

Other PRs

TIBCOSoftware/snappy-store#395

vivekwiz and others added 30 commits March 7, 2018 13:08
Includes the changes for the two issues and a bunch of other fixes found in testing.

- Implementation of StoreCallbacks.columnTableScan that translates Filters to Expressions
  and generates the code to apply the same locally to ColumnBatchIterator stats rows
- Changed smart connector iterator to use the new COLUMN_TABLE_SCAN procedure instead
  of multiple queries.
- Added passing of Filters to the plans and recreation of those in getPartitions if
  the parameter values have changed for ParamLiterals
- Fixed RowFormatScanRDD to regenerate filter clause in getPartitions if the parameter
  values have changed for ParamLiterals (not seen earlier because index columns were
  incorrect, which has been fixed in store)
- Perf fix to ColumnFormatIterator: keep track of updated delta stats separately, with
  forced fault-in like the full stats in DiskMultiColumnBatch, so that the entire batch
  does not need to be read if the filter can skip it using stats (a generic sketch of
  such stats-based skipping follows this commit's notes)
- Perf fix to RemoteEntriesIterator:
  - fetch both full stats and delta stats rows when fetching keys first time
  - sort and ensure both rows of a batch are together when fetching other columns
- Updated ParamLiteral serialization to replace its value with the updated one in LiteralValue
  since parameter may have changed (but the base Literal.value is a val and cannot be changed)
- Corrected RDD to be cleared in CachedDataFrame to use either the cachedRDD or the
  last used one for execution.
- Added handling of the new DECOMPRESS_IF_IN_MEMORY fetch type to return self (or null)
  if decompression cannot replace the underlying in-memory value
- Updated ColumnFormatValue to store disk RegionEntry instead of diskId since latter can change.
- Improved performance of the Snappy stats iterator: instead of looking up the deleted
  bitmask column for every stats column, iterate both together and add a negative size
  for deletes, if any.
- Added more transient expected exception types to SnappyTestRunner.
- Updated store link.
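To make the stats-based batch skipping mentioned in the notes above concrete, here is a minimal, hypothetical Scala sketch (BatchStats and canSkipBatch are illustrative names, not the actual DiskMultiColumnBatch/ColumnFormatIterator code):

// Hypothetical sketch only: a batch whose recorded [min, max] range cannot satisfy a
// range predicate on the filtered column is skipped without reading its column data.
case class BatchStats(min: Int, max: Int, numRows: Int)

def canSkipBatch(stats: BatchStats, lower: Int, upper: Int): Boolean =
  stats.max < lower || stats.min > upper  // no overlap with [lower, upper]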
Also refactored PooledKryoSerializer to add generic serialize/deserialize methods
that accept closures
- primary reason being that the StringStartsWith Filter requires a String as pattern
  and cannot hold "Any", so the hack of stuffing a ParamLiteral inside a Filter does
  not work; now using Expressions which are translated to Filters just before use if
  required (a rough sketch of this translation follows these notes)
- pushdown of filters from smart connector to server still uses Filter after conversion
  from Expression and when the ParamLiterals have been substituted with current values
- removed awkward handling of ParamLiterals inside Filters as a result of above changes
- fixed the StartsWith stats filter to use a ParamLiteral and generated code for the
  comparison against stats row bounds
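As a rough illustration of the Expression-to-Filter conversion described above (a hedged sketch built only on Spark's public classes, not the actual SnappyData code path), a Catalyst StartsWith expression can be translated to a sources.StringStartsWith Filter only once the pattern is a concrete string literal, which is why an unbound ParamLiteral cannot be stuffed inside the Filter:

import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal, StartsWith}
import org.apache.spark.sql.sources.{Filter, StringStartsWith}
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: translate a Catalyst expression to a data source Filter only when
// the pattern has been substituted with a concrete value; otherwise keep the Expression.
def toSourceFilter(expr: Expression): Option[Filter] = expr match {
  case StartsWith(a: Attribute, Literal(v: UTF8String, StringType)) =>
    Some(StringStartsWith(a.name, v.toString))
  case _ => None // not translatable yet; defer and keep using the Expression
}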