HIVE-28256: Iceberg: Major QB Compaction on partition level with evolution #5248

Open · wants to merge 3 commits into master

Conversation

@difin (Contributor) commented May 13, 2024


What changes were proposed in this pull request?

Adds support for compacting a given partition of a Hive Iceberg table even if the table has undergone partition evolution. The partition spec may be the table's current spec or one of its older specs.

Why are the changes needed?

Until now, partition-level compaction wasn't supported for Hive Iceberg tables that have undergone partition evolution.

Does this PR introduce any user-facing change?

Yes. Users can now submit partition-level compaction requests for Hive Iceberg tables with a partition spec that conforms to the current or any previous partition spec of the table.

Is the change a dependency upgrade?

No

How was this patch tested?

New q-tests added

throw new HiveException(ErrorMsg.INVALID_PARTITION_SPEC);
}
partitions = partitions.stream().filter(part -> part.getSpec().size() == partitionSpec.size()).collect(Collectors.toList());
Member:
what are we checking here? the number of partition columns in the table spec vs. in the compaction request?

difin (Contributor, Author):

This validates that the partition spec given in the compaction command matches exactly one partition spec in the table, not a partial partition spec.

Let's say a table has partitions with specs (a,b) and (a,b,c) because of evolution, and a compaction command is run with spec (a,b). On line 144 it will find both partition specs; after filtering, only (a,b) remains, and validation passes.

Another case: assume the same table with specs (a,b) and (a,b,c), and a compaction command run with spec (a). On line 144 it will find both partition specs; after filtering, zero partitions remain, and validation fails with a TOO_MANY_COMPACTION_PARTITIONS exception.
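
A minimal sketch of the validation flow described above, assuming a hypothetical helper that matches the request against every historical spec (the helper and method names are illustrative, not the PR's exact code; only the filter line and the two error names come from the conversation):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

static void validateCompactionPartSpec(Table table, Map<String, String> partitionSpec)
    throws HiveException {
  // Hypothetical lookup standing in for line 144 of the PR: match the request
  // against every partition spec the table has ever had.
  List<Partition> partitions = findPartitionsMatchingAnySpec(table, partitionSpec);
  if (partitions.isEmpty()) {
    throw new HiveException(ErrorMsg.INVALID_PARTITION_SPEC);
  }
  // Keep only specs with exactly as many columns as the request, so a partial
  // spec like (a) against table specs (a,b) and (a,b,c) filters down to zero.
  partitions = partitions.stream()
      .filter(part -> part.getSpec().size() == partitionSpec.size())
      .collect(Collectors.toList());
  if (partitions.size() != 1) {
    throw new HiveException(ErrorMsg.TOO_MANY_COMPACTION_PARTITIONS);
  }
}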

… by spec by any past table specs.

Moved Iceberg compaction constant to a class in Iceberg module.

Use VirtualColumn.PARTITION_SPEC_ID.getName() instead of partition__spec__id.
.map(x -> x.getJobConf().get(CompactorContext.COMPACTION_PARTITION_PATH))
.orElse(null);

if (rewritePolicy != RewritePolicy.DEFAULT || compactionPartSpecId != null) {
Contributor:

So is the value of rewritePolicy = ALL_PARTITIONS in the case of table-level compaction on a fully partitioned table, and rewritePolicy = PARTITION in the case of partition-level compaction?

@@ -55,6 +57,9 @@ public void analyzeInternal(ASTNode root) throws SemanticException {
if (command.getType() == HiveParser.TOK_ALTERTABLE_RENAMEPART) {
partitionSpec = getPartSpec(partitionSpecNode);
} else {
if (command.getType() == HiveParser.TOK_ALTERTABLE_COMPACT) {
HiveConf.setVar(conf, HiveConf.ConfVars.REWRITE_POLICY, Context.RewritePolicy.PARTITION.name());
Contributor:

Is it applied only in the partition compaction case, or is it added generically to all cases?

difin (Contributor, Author):

rewritePolicy = ALL_PARTITIONS is set in the case of full-table compaction, for both partitioned and unpartitioned tables.

HiveConf.setVar(conf, HiveConf.ConfVars.REWRITE_POLICY, Context.RewritePolicy.PARTITION.name());

is set for the partition compaction case only. That code branch is reachable only when partitionSpecNode != null and command.getType() == HiveParser.TOK_ALTERTABLE_COMPACT.
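
Schematically, the two call sites amount to something like the following (a hedged consolidation for illustration; in the PR the ALL_PARTITIONS assignment lives at a different call site, and isFullTableCompaction is a hypothetical flag):

static void setRewritePolicy(HiveConf conf, ASTNode command, ASTNode partitionSpecNode,
    boolean isFullTableCompaction) {
  if (partitionSpecNode != null && command.getType() == HiveParser.TOK_ALTERTABLE_COMPACT) {
    // Partition-level compaction: only reachable for ALTER TABLE ... COMPACT
    // with an explicit partition spec.
    HiveConf.setVar(conf, HiveConf.ConfVars.REWRITE_POLICY,
        Context.RewritePolicy.PARTITION.name());
  } else if (isFullTableCompaction) {
    // Full-table compaction, for both partitioned and unpartitioned tables.
    HiveConf.setVar(conf, HiveConf.ConfVars.REWRITE_POLICY,
        Context.RewritePolicy.ALL_PARTITIONS.name());
  }
}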

… partition spec in IcebergMajorQueryCompactor.
@SourabhBadhya (Contributor) left a comment:
LGTM +1 (pending tests)

sonarcloud bot commented Jun 11, 2024

Quality Gate passed

Issues
15 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

.map(Integer::valueOf)
.orElse(null);

String compactionPartitionPath = outputTable.jobContexts.stream()
Member:

could we please use consistent naming? not

compactionPartSpecId/COMPACTION_PART_SPEC_ID
compactionPartitionPath/COMPACTION_PARTITION_PATH

why do we need compaction in the name? how about:

partitionSpecId/PARTITION_SPEC_ID
partitionPath/PARTITION_PATH

Member:

also, why don't we compute the path only inside the compaction block?

return data;
}

public static Pair<List<DataFile>, List<DeleteFile>> getDataAndDeleteFiles(Table table, int specId,
@deniskuzZ (Member) commented Jun 13, 2024:

why not split it into 2 methods instead of this combined Pair<List, List>?
are we reusing something and saving CPU cycles?
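
A sketch of the suggested split, under the assumption that both halves come from the same planFiles() scan (method bodies are illustrative; the PR's extra parameters are omitted). The trade-off the second question hints at: the combined Pair version walks the FileScanTasks once for both lists, while the split below re-plans the scan per call.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

public static List<DataFile> getDataFiles(Table table, int specId) {
  List<DataFile> dataFiles = Lists.newArrayList();
  try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
    for (FileScanTask task : tasks) {
      // Keep only files written under the requested partition spec.
      if (task.file().specId() == specId) {
        dataFiles.add(task.file());
      }
    }
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
  return dataFiles;
}

public static List<DeleteFile> getDeleteFiles(Table table, int specId) {
  List<DeleteFile> deleteFiles = Lists.newArrayList();
  try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
    for (FileScanTask task : tasks) {
      if (task.file().specId() == specId) {
        deleteFiles.addAll(task.deletes());
      }
    }
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
  return deleteFiles;
}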

@@ -746,6 +746,11 @@ default void validatePartSpec(org.apache.hadoop.hive.ql.metadata.Table hmsTable,
throw new UnsupportedOperationException("Storage handler does not support validation of partition values");
}

default void validatePartAnySpec(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Map<String, String> partitionSpec)
Member:

what does it do?
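
Based on the PR description, a javadoc along these lines would answer the question (wording and throws clause assumed, not taken from the PR):

/**
 * Validates the given partition spec against any of the partition specs the
 * table has ever had, i.e. the current spec or any older spec left behind by
 * partition evolution, rather than only the current one as validatePartSpec does.
 * (Hypothetical javadoc sketch, inferred from the PR description.)
 */
default void validatePartAnySpec(org.apache.hadoop.hive.ql.metadata.Table hmsTable,
    Map<String, String> partitionSpec) throws SemanticException {
  throw new UnsupportedOperationException("Storage handler does not support validation of partition values");
}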

@@ -1690,7 +1690,12 @@ public static Map<String, String> getPartSpec(ASTNode node)
public static void validatePartSpec(Table tbl, Map<String, String> partSpec,
ASTNode astNode, HiveConf conf, boolean shouldBeFull) throws SemanticException {
if (tbl.getStorageHandler() != null && tbl.getStorageHandler().alwaysUnpartitioned()) {
tbl.getStorageHandler().validatePartSpec(tbl, partSpec);
if (Context.RewritePolicy.fromString(conf.get(HiveConf.ConfVars.REWRITE_POLICY.varname,
Context.RewritePolicy.DEFAULT.name())) == Context.RewritePolicy.PARTITION) {
Member:

why do you need to specify the default?
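
For context: Hadoop's Configuration.get(String) returns null for unset keys, so the two-argument form guards fromString against null (a hedged reading; fromString's null handling isn't shown in the diff):

// Without the fallback, an unset key yields null:
String raw = conf.get(HiveConf.ConfVars.REWRITE_POLICY.varname); // may be null
// With the fallback, parsing always receives a valid enum name:
String withDefault = conf.get(HiveConf.ConfVars.REWRITE_POLICY.varname,
    Context.RewritePolicy.DEFAULT.name());
Context.RewritePolicy policy = Context.RewritePolicy.fromString(withDefault);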

@@ -801,6 +806,11 @@ default Partition getPartition(org.apache.hadoop.hive.ql.metadata.Table table, M
throw new UnsupportedOperationException("Storage handler does not support getting partition for a table.");
}

default Partition getPartitionAnySpec(org.apache.hadoop.hive.ql.metadata.Table table,
Member:

what does it do? can you add a meaningful javadoc?
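
Following the pattern above, the requested javadoc might read (hypothetical wording, inferred from the PR's intent):

/**
 * Resolves the given partition spec against any of the table's partition
 * specs, past or present, and returns the matching partition; the plain
 * getPartition variant presumably resolves only against the current spec.
 * (Hypothetical javadoc sketch, not taken from the PR.)
 */
default Partition getPartitionAnySpec(org.apache.hadoop.hive.ql.metadata.Table table,
    Map<String, String> partitionSpec) throws HiveException {
  throw new UnsupportedOperationException("Storage handler does not support getting partition for a table.");
}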

@@ -3663,7 +3664,12 @@ public List<org.apache.hadoop.hive.metastore.api.Partition> getPartitionsByNames

public Partition getPartition(Table tbl, Map<String, String> partSpec) throws HiveException {
if (tbl.getStorageHandler() != null && tbl.getStorageHandler().alwaysUnpartitioned()) {
return tbl.getStorageHandler().getPartition(tbl, partSpec);
if (Context.RewritePolicy.fromString(conf.get(ConfVars.REWRITE_POLICY.varname,
Context.RewritePolicy.DEFAULT.name())) == Context.RewritePolicy.PARTITION) {
Member:

why do you need the default?

@@ -256,7 +256,8 @@ public String toString() {
public enum RewritePolicy {

DEFAULT,
ALL_PARTITIONS;
ALL_PARTITIONS,
PARTITION;
Member:

why do you need a separate policy for PARTITION? isn't ALL_PARTITIONS enough?
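
For context, a hedged reading of the three policies as used in this PR (comments inferred from the conversation, not from documented javadoc):

public enum RewritePolicy {
  DEFAULT,        // no explicit rewrite policy set
  ALL_PARTITIONS, // full-table compaction: rewrite every partition (also used for unpartitioned tables)
  PARTITION;      // compact the single partition identified by the request's spec
}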


public class IcebergCompactionContext {

public static final String COMPACTION_PART_SPEC_ID = "compaction_part_spec_id";
Member:

put this with the rest of the Iceberg constants; no need for this class

List<Pair<PartitionData, Integer>> partitionList = Lists.newArrayList();
try (CloseableIterable<FileScanTask> fileScanTasks = partitionsTable.newScan().planFiles()) {
fileScanTasks.forEach(task ->
partitionList.addAll(Sets.newHashSet(CloseableIterable.transform(task.asDataTask().rows(), row -> {
Member:

are you creating a HashSet from KV pairs?
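
Completing the visible snippet as a hedged sketch: yes, each row of the PARTITIONS metadata table appears to be mapped to a (PartitionData, specId) pair, and Sets.newHashSet(...) deduplicates identical pairs within a task before they are appended to the list. The column positions and the commons-lang3 Pair are assumptions, not the PR's actual code:

List<Pair<PartitionData, Integer>> partitionList = Lists.newArrayList();
try (CloseableIterable<FileScanTask> fileScanTasks = partitionsTable.newScan().planFiles()) {
  fileScanTasks.forEach(task ->
      partitionList.addAll(Sets.newHashSet(CloseableIterable.transform(task.asDataTask().rows(),
          // Assumed mapping: column 0 holds the partition struct, column 1 the spec id.
          row -> Pair.of(row.get(0, PartitionData.class), row.get(1, Integer.class))))));
} catch (IOException e) {
  throw new UncheckedIOException(e);
}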
