Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why shouldn't we return an UnboundPartitionSpec instead? #694

Open
Xuanwo opened this issue Nov 11, 2024 · 9 comments
Open

Why shouldn't we return an UnboundPartitionSpec instead? #694

Xuanwo opened this issue Nov 11, 2024 · 9 comments

Comments

@Xuanwo
Copy link
Member

Xuanwo commented Nov 11, 2024

          I'm not convinced that we need another type, such as `SchemalessPartitionSpec`. Why shouldn't we return an `UnboundPartitionSpec` instead? Looking at the signatures:
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}

And the PartitionField and UnboundPartitionField are basically the same.

Introducing SchemalessPartitionSpec might be our way to avoid apache/iceberg#4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

Originally posted by @Fokko in #645 (comment)

@Xuanwo
Copy link
Member Author

Xuanwo commented Nov 11, 2024

Hi, @Fokko, I created an issue to make it easier for us to continue the discussion.

Also cc @liurenjie1024 and @c-thiel.

@Fokko
Copy link
Contributor

Fokko commented Nov 11, 2024

@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly.

@liurenjie1024
Copy link
Contributor

I asked same question, and here is the answer from @c-thiel :
#645 (comment)

And the reason I this is reasonable is here: #645 (comment)

@liurenjie1024
Copy link
Contributor

Here is the reason why I accept it:

SchemalessPartitionSpec is used when we load table metadata and build partition spec from table metadata. Since there is no schema id associated with partition spec, in java it's built with bindUncheck with current schema in current snapshot. That's to say, java implementation assumes that the partiton specs are valid with current schema without checking. I'm skeptical that it's a safe approach.

I agree with @c-thiel that it's not appropriate to use UnboundPartitionSpec since UnboundPartitionSpec means not validated, user constructed partition spec, which is not the case for partiton specs from table metadata.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 13, 2024

I think I have most of my motivation in #645 (comment) and the comments referenced above. I follow the argument from Renjie.
For me UnboundPartitionSpec has a different semantic than a previously bound spec, manifested also in the Optional Field ID. Using UnboundPartitionSpec in TableMetadata would introduce lines of error handling in other places, where we expect the FieldId to be present. As field_id is a required field in the spec for TableMetadata, we would have to use SchemalessPartitionSpec at least to deserialize the json before transforming it to UnboundPartitionSpec.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 16, 2024

@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release.
@Fokko is there already something in your Java PR which we can peek at? What do you have in mind for a solution?

@Fokko
Copy link
Contributor

Fokko commented Nov 16, 2024

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

Regarding #645 (comment) there are some incorrect assumptions there:

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

I don't understand this statement. If I look at the UnboundPartitionSpec:

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

So going from the problem space to the solution space. I was looking at the Java code and the PyIceberg code, and at PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the source ID might evolve (assume compatible ones int -> long, float -> double, etc) over time. The most obvious example is the identity-partition where the sourceType is equal to the resultType. Therefore we allow binding with different schema's in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225 I think that might be a better approach for Rust as well, as this is now hard-coded:

#[derive(Debug, PartialEq, Eq, Clone)]
pub struct BoundPartitionSpec {
/// Identifier for PartitionSpec
spec_id: i32,
/// Details of the partition spec
fields: Vec<PartitionField>,
/// The schema this partition spec is bound to
schema: SchemaRef,
/// Type of the partition spec
partition_type: StructType,
}

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored. The downside is that with strongly typed languages, we need to know the type upfront, and otherwise we only know it when we read the actual Avro file (that has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: apache/iceberg#11542

I hope this helps!

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 16, 2024

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

They are not quite - a previously bound schema has a required field_id (reference to PartitionField), while unbound has an optional field_id (Option<i32>) as it references UnboundPartitionField. TableMetadata.partition_specs requires field_id, so we shouldn't use UnboundPartitionSpec there.

A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

True, that was too strict. What I should have expressed it that it is not guaranteed to be compatible with the current schema, which is why Java uses bindUnchecked().
I think with the current implementation we are quite close to what you propose - binding a SchemalessPartitionSpec or UnboundPartitionSpec to a new schema is easily possible by using spec.to_unbound().bind(new_schema). Currently this binding can fail. What is missing is the notion that you described with

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored

This would require a more permissive bind method, or an additional flag, that would ignore unknown fields, similar to what you try to incorporate into Java with Any. The important point is that dynamic re-binding is already possible today.

The BoundPartitionSpec serves two additional purposes - it is a way to compute the binding and store the result typesafe. This is useful for

  1. Pre-computing the bind, as we currently do for default_spec against the current_schema - which is the most important combination. I don't see why we shouldn't store the resulting type as we need to assert compatibility anyway for this combination.
  2. Have the means to get the source field names for a partition specs. This is useful when a spec is transferred from one context to another. In Java this is used for addPartitionSpec which eventually runs this code: https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768. Note that currently we don't have this semantic in the builder, which in java performs name-mapping to potentially change the source_id. Instead we pass an UnboundSpec which trusts the provides source_id and not the field name.

Summing it up, my approach would be to keep all three types - I simply don't see a good way around them.
However, we should add a more permissive bind method that incorporates better the semantic of binding to an incompatible schema. This would be an alternative to the bindUnchecked() in Java. Maybe we can just follow the PR you are working on in Java once that's ready. This method would not necessarily be able to determine a type for each field in a spec.

@c-thiel
Copy link
Collaborator

c-thiel commented Nov 30, 2024

@liurenjie1024, @Xuanwo, @Fokko may I ask for another round of Feedback for this?
I believe if we decide to not introduce a SchemalessPartitionSpec, we should do so before 0.4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants