Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve field name when casting List #13468

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Nov 18, 2024

Which issue does this PR close?

Closes #13481

This also pertains to delta-io/delta-rs#2886

Rationale for this change

During a cast operation, the field name for a list can be changed to "item" from whatever it is previously. This change adds an option to set the field name.

What changes are included in this PR?

Adds an option for array_into_list_array and the other variants to pass in the field name for the schema.

Are these changes tested?

Unit tests added and tested against the example in #13481

Are there any user-facing changes?

Internal change only. No user facing changes.

@gruuya
Copy link
Contributor

gruuya commented Nov 18, 2024

Thanks for looking into this!

FWIW, I've also been trying to see what the problem is, and the following is a minimal repro (test case) along the lines of the delta-rs test I've been able to come up with

    #[tokio::test]
    async fn test_list_item() -> Result<()> {
        use datafusion_functions_nested::expr_fn::make_array;

        let element_field = Arc::new(Field::new("element", DataType::Int32, true));
        let items_field = Field::new(
            "items",
            DataType::List(element_field.clone()),
            true,
        );
        let schema = Schema::new(vec![items_field.clone()]);

        let mut items_builder =
            ListBuilder::new(Int32Builder::new()).with_field(element_field.clone());
        items_builder.append_value([Some(1)]);
        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(items_builder.finish())])?;

        let ctx = SessionContext::new();
        let df = ctx.read_batch(batch).expect("source DataFrame")
            .with_column("condition", lit(false))?
            .select(vec![case(col("condition")).when(lit(false), make_array(vec![lit(2), lit(3)])).otherwise(col("items"))?.alias("items")])?;


        let _ = df.create_physical_plan().await?;

        Ok(())
    }

This might be a red herring, but I'm also looking at whether the issue arises somewhere in TypeCoercion analyzer, specifically in coerce_case_expression. This is because the CASE clause here is crucial for the repro, as without it the outermost plan doesn't get coerced to the child plan's list type (i.e. the one with items named element), but instead retains the types of the expression itself (with the generic item names).

@findepi
Copy link
Member

findepi commented Nov 18, 2024

During a cast operation, the field name for a list can be changed to "item" from whatever it is previously. This change adds an option to set the field name.

option, or is it always the case?

btw in #12622 we probably would want to have all List(x) to not be discern by their field name. cc @notfilippo @alamb

@timsaucer
Copy link
Contributor Author

@findepi I put it as an option because we also use these same functions to build arrays from things like iterators that don't have an associated field name.

@gruuya
Copy link
Contributor

gruuya commented Nov 19, 2024

I'm also looking at whether the issue arises somewhere in TypeCoercion analyzer

Ok, looks like TypeCoercion works as intended, and it seems like SimplifyExpressions is the culprit behind the particular problem in delta-rs. From the new issue #13481 I've opened:

Basically TypeCoercion injects the CAST, while SimplifyExpressions omits it. Consequently OptimizeProjections folds the two projections, but the schema is computed from the outermost projection expressions, which now lack the proper type info (i.e. the non-standard list item field names).

EDIT: Oh, I see now that the proposed fix in this PR resolves the bug with SimplifyExpressions described in #13481

@timsaucer timsaucer marked this pull request as ready for review November 19, 2024 14:03
@timsaucer
Copy link
Contributor Author

Ok, this should be good to go. I did test it against the examples shown in #13481 also. It does resolve the test_list_item test but not the test_simplify_case_cast_list. I will try to investigate that under a different PR.

@gruuya
Copy link
Contributor

gruuya commented Nov 19, 2024

but not the test_simplify_case_cast_list

Just a note that it should only really make the second assert pass (the one asserting on the output data type), not the first (i.e. the cast should be transmuted into the output data type).

pub fn array_into_list_array(
arr: ArrayRef,
nullable: bool,
field_name: Option<&str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make this Option<impl Into<String>> to align with the type in Field::new?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can but that means we'd now have to add type annotations to everywhere we call this, including those that are going to pass None because they don't have field elements. I think this would add bloat that doesn't really add much value. If you feel strongly, I'll make the adjustment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair point, let's keep it as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is public I don't think we can change its signature in a minor release

https://docs.rs/datafusion/latest/datafusion/common/utils/fn.array_into_list_array.html

Thus I suggest we keep the existing functions but mark them as deprecated and add a new API

Maybe we can do something builder style to make future modifications easier

let list_array = ListArrayWrapper::new(arr)
  .with_nullable(false)
  .with_field(field)

Or something -- that would also give us a way to take the FieldRef directly as well as document what the defaults are 🤔

let offsets = OffsetBuffer::from_lengths([arr.len()]);
ListArray::new(
Arc::new(Field::new_list_field(arr.data_type().to_owned(), nullable)),
Arc::new(Field::new(
field_name.unwrap_or("item"),
Copy link
Contributor

@gruuya gruuya Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not a concern of this PR, but looks as if this naming convention should be formalized/centralized somewhere (at least a constant), since otherwise there's a whole lot of "item" literals throughout the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree -- I think it should be done in arrow-rs

Maybe we could add some comments on https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html#method.new_list or something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be done in arrow-rs

Yeah, makes sense—I can pick that up if there's no objection, maybe something like this

impl Field {
+    /// Default list member field name
+    pub const LIST_FIELD_DEFAULT_NAME: &'static str = "item";
+
     /// Creates a new field with the given name, type, and nullability
     pub fn new(name: impl Into<String>, data_type: DataType, nullable: bool) -> Self {
         Field {
@@ -144,7 +147,7 @@ impl Field {
     /// );
     /// ```
     pub fn new_list_field(data_type: DataType, nullable: bool) -> Self {
-        Self::new("item", data_type, nullable)
+        Self::new(Self::LIST_FIELD_DEFAULT_NAME, data_type, nullable)
     }

could add some comments

I think new_list_field already has an adequate comment (Field::new_list probably doesn't need one since the name is explicitly passed there).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the very least having a default constant exposed would be a nice improvement I think

@timsaucer
Copy link
Contributor Author

Are we okay to merge?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good and makes sense to me -- than you @timsaucer and @gruuya

My only concern is that we won't be able to release this change to 43.0.0 as it would be a breaking API change.

let offsets = OffsetBuffer::from_lengths([arr.len()]);
ListArray::new(
Arc::new(Field::new_list_field(arr.data_type().to_owned(), nullable)),
Arc::new(Field::new(
field_name.unwrap_or("item"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the very least having a default constant exposed would be a nice improvement I think

pub fn array_into_list_array(
arr: ArrayRef,
nullable: bool,
field_name: Option<&str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is public I don't think we can change its signature in a minor release

https://docs.rs/datafusion/latest/datafusion/common/utils/fn.array_into_list_array.html

Thus I suggest we keep the existing functions but mark them as deprecated and add a new API

Maybe we can do something builder style to make future modifications easier

let list_array = ListArrayWrapper::new(arr)
  .with_nullable(false)
  .with_field(field)

Or something -- that would also give us a way to take the FieldRef directly as well as document what the defaults are 🤔

…d in a parallel function for the small cases where we want to explicitly set the field name
@timsaucer
Copy link
Contributor Author

Since it would be a breaking change and there is only one place that we internally need this feature, I just switched it to have a different function. I can change over to a builder method if you like, but I wasn't convinced this small use needed it. Happy to support if you think that's valuable.

@@ -342,6 +342,20 @@ pub fn array_into_list_array(arr: ArrayRef, nullable: bool) -> ListArray {
)
}

pub fn array_into_list_array_with_field_name(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the idea that we will backport this to 43.0.0 and then consolidate the other functions on main? That makes sense to me if so

@alamb
Copy link
Contributor

alamb commented Nov 24, 2024

Thank you very much @timsaucer

@timsaucer timsaucer merged commit 2e05648 into apache:main Nov 25, 2024
26 checks passed
findepi pushed a commit to sdf-labs/arrow-datafusion that referenced this pull request Nov 28, 2024
* Add option to pass in field name to create array to support retaining field name during cast

* add unit tests for list casting round trip

* Documentation example was missing parameter

* Rather than deprecate an existing function or change pub signature add in a parallel function for the small cases where we want to explicitly set the field name
findepi pushed a commit to sdf-labs/arrow-datafusion that referenced this pull request Nov 28, 2024
* Add option to pass in field name to create array to support retaining field name during cast

* add unit tests for list casting round trip

* Documentation example was missing parameter

* Rather than deprecate an existing function or change pub signature add in a parallel function for the small cases where we want to explicitly set the field name
@alamb
Copy link
Contributor

alamb commented Dec 2, 2024

Since it would be a breaking change and there is only one place that we internally need this feature, I just switched it to have a different function. I can change over to a builder method if you like, but I wasn't convinced this small use needed it. Happy to support if you think that's valuable.

Here is my attempt to migrate to a builder (and maybe avoid these kinds of issues in the future 🤔 ): #13623

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify expressions swallows a cast expression
4 participants