Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Xuanwo · 2024-11-11T16:09:20Z

          I'm not convinced that we need another type, such as `SchemalessPartitionSpec`. Why shouldn't we return an `UnboundPartitionSpec` instead? Looking at the signatures:

pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}

pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}

And the PartitionField and UnboundPartitionField are basically the same.

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

Originally posted by @Fokko in #645 (comment)

The text was updated successfully, but these errors were encountered:

Xuanwo · 2024-11-11T16:10:12Z

Hi, @Fokko, I created an issue to make it easier for us to continue the discussion.

Also cc @liurenjie1024 and @c-thiel.

Fokko · 2024-11-11T16:53:50Z

@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly.

liurenjie1024 · 2024-11-12T01:37:38Z

I asked same question, and here is the answer from @c-thiel :
#645 (comment)

And the reason I this is reasonable is here: #645 (comment)

liurenjie1024 · 2024-11-12T01:44:32Z

Here is the reason why I accept it:

SchemalessPartitionSpec is used when we load table metadata and build partition spec from table metadata. Since there is no schema id associated with partition spec, in java it's built with bindUncheck with current schema in current snapshot. That's to say, java implementation assumes that the partiton specs are valid with current schema without checking. I'm skeptical that it's a safe approach.

I agree with @c-thiel that it's not appropriate to use UnboundPartitionSpec since UnboundPartitionSpec means not validated, user constructed partition spec, which is not the case for partiton specs from table metadata.

c-thiel · 2024-11-13T16:39:58Z

I think I have most of my motivation in #645 (comment) and the comments referenced above. I follow the argument from Renjie.
For me UnboundPartitionSpec has a different semantic than a previously bound spec, manifested also in the Optional Field ID. Using UnboundPartitionSpec in TableMetadata would introduce lines of error handling in other places, where we expect the FieldId to be present. As field_id is a required field in the spec for TableMetadata, we would have to use SchemalessPartitionSpec at least to deserialize the json before transforming it to UnboundPartitionSpec.

c-thiel · 2024-11-16T10:46:11Z

@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release.
@Fokko is there already something in your Java PR which we can peek at? What do you have in mind for a solution?

Fokko · 2024-11-16T15:00:02Z

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

Regarding #645 (comment) there are some incorrect assumptions there:

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

I don't understand this statement. If I look at the UnboundPartitionSpec:

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

So going from the problem space to the solution space. I was looking at the Java code and the PyIceberg code, and at PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the source ID might evolve (assume compatible ones int -> long, float -> double, etc) over time. The most obvious example is the identity-partition where the sourceType is equal to the resultType. Therefore we allow binding with different schema's in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225 I think that might be a better approach for Rust as well, as this is now hard-coded:

iceberg-rust/crates/iceberg/src/spec/partition.rs

Lines 60 to 70 in 5034519

    
           #[derive(Debug, PartialEq, Eq, Clone)] 
        
           pub struct BoundPartitionSpec { 
        
               /// Identifier for PartitionSpec 
        
               spec_id: i32, 
        
               /// Details of the partition spec 
        
               fields: Vec<PartitionField>, 
        
               /// The schema this partition spec is bound to 
        
               schema: SchemaRef, 
        
               /// Type of the partition spec 
        
               partition_type: StructType, 
        
           }

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored. The downside is that with strongly typed languages, we need to know the type upfront, and otherwise we only know it when we read the actual Avro file (that has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: apache/iceberg#11542

I hope this helps!

c-thiel · 2024-11-16T23:31:44Z

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

They are not quite - a previously bound schema has a required field_id (reference to PartitionField), while unbound has an optional field_id (Option<i32>) as it references UnboundPartitionField. TableMetadata.partition_specs requires field_id, so we shouldn't use UnboundPartitionSpec there.

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

True, that was too strict. What I should have expressed it that it is not guaranteed to be compatible with the current schema, which is why Java uses bindUnchecked().
I think with the current implementation we are quite close to what you propose - binding a SchemalessPartitionSpec or UnboundPartitionSpec to a new schema is easily possible by using spec.to_unbound().bind(new_schema). Currently this binding can fail. What is missing is the notion that you described with

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored

This would require a more permissive bind method, or an additional flag, that would ignore unknown fields, similar to what you try to incorporate into Java with Any. The important point is that dynamic re-binding is already possible today.

The BoundPartitionSpec serves two additional purposes - it is a way to compute the binding and store the result typesafe. This is useful for

Pre-computing the bind, as we currently do for default_spec against the current_schema - which is the most important combination. I don't see why we shouldn't store the resulting type as we need to assert compatibility anyway for this combination.
Have the means to get the source field names for a partition specs. This is useful when a spec is transferred from one context to another. In Java this is used for addPartitionSpec which eventually runs this code: https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768. Note that currently we don't have this semantic in the builder, which in java performs name-mapping to potentially change the source_id. Instead we pass an UnboundSpec which trusts the provides source_id and not the field name.

Summing it up, my approach would be to keep all three types - I simply don't see a good way around them.
However, we should add a more permissive bind method that incorporates better the semantic of binding to an incompatible schema. This would be an alternative to the bindUnchecked() in Java. Maybe we can just follow the PR you are working on in Java once that's ready. This method would not necessarily be able to determine a type for each field in a spec.

c-thiel · 2024-11-30T10:23:54Z

@liurenjie1024, @Xuanwo, @Fokko may I ask for another round of Feedback for this?
I believe if we decide to not introduce a SchemalessPartitionSpec, we should do so before 0.4.0

c-thiel mentioned this issue Nov 30, 2024

Tracking issues of iceberg rust v0.4.0 Release #739

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Xuanwo commented Nov 11, 2024

Xuanwo commented Nov 11, 2024

Fokko commented Nov 11, 2024

liurenjie1024 commented Nov 12, 2024

liurenjie1024 commented Nov 12, 2024

c-thiel commented Nov 13, 2024

c-thiel commented Nov 16, 2024

Fokko commented Nov 16, 2024 •

edited

Loading

c-thiel commented Nov 16, 2024 •

edited

Loading

c-thiel commented Nov 30, 2024

Why shouldn't we return an UnboundPartitionSpec instead? #694

Why shouldn't we return an UnboundPartitionSpec instead? #694

Comments

Xuanwo commented Nov 11, 2024

Xuanwo commented Nov 11, 2024

Fokko commented Nov 11, 2024

liurenjie1024 commented Nov 12, 2024

liurenjie1024 commented Nov 12, 2024

c-thiel commented Nov 13, 2024

c-thiel commented Nov 16, 2024

Fokko commented Nov 16, 2024 • edited Loading

c-thiel commented Nov 16, 2024 • edited Loading

c-thiel commented Nov 30, 2024

Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Fokko commented Nov 16, 2024 •

edited

Loading

c-thiel commented Nov 16, 2024 •

edited

Loading