-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve field name when casting List #13468
Conversation
… field name during cast
Thanks for looking into this! FWIW, I've also been trying to see what the problem is, and the following is a minimal repro (test case) along the lines of the delta-rs test I've been able to come up with #[tokio::test]
async fn test_list_item() -> Result<()> {
use datafusion_functions_nested::expr_fn::make_array;
let element_field = Arc::new(Field::new("element", DataType::Int32, true));
let items_field = Field::new(
"items",
DataType::List(element_field.clone()),
true,
);
let schema = Schema::new(vec![items_field.clone()]);
let mut items_builder =
ListBuilder::new(Int32Builder::new()).with_field(element_field.clone());
items_builder.append_value([Some(1)]);
let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(items_builder.finish())])?;
let ctx = SessionContext::new();
let df = ctx.read_batch(batch).expect("source DataFrame")
.with_column("condition", lit(false))?
.select(vec![case(col("condition")).when(lit(false), make_array(vec![lit(2), lit(3)])).otherwise(col("items"))?.alias("items")])?;
let _ = df.create_physical_plan().await?;
Ok(())
} This might be a red herring, but I'm also looking at whether the issue arises somewhere in |
option, or is it always the case? btw in #12622 we probably would want to have all |
@findepi I put it as an option because we also use these same functions to build arrays from things like iterators that don't have an associated field name. |
Ok, looks like
EDIT: Oh, I see now that the proposed fix in this PR resolves the bug with |
Ok, this should be good to go. I did test it against the examples shown in #13481 also. It does resolve the |
Just a note that it should only really make the second assert pass (the one asserting on the output data type), not the first (i.e. the cast should be transmuted into the output data type). |
datafusion/common/src/utils/mod.rs
Outdated
pub fn array_into_list_array( | ||
arr: ArrayRef, | ||
nullable: bool, | ||
field_name: Option<&str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this Option<impl Into<String>>
to align with the type in Field::new
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can but that means we'd now have to add type annotations to everywhere we call this, including those that are going to pass None
because they don't have field elements. I think this would add bloat that doesn't really add much value. If you feel strongly, I'll make the adjustment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah fair point, let's keep it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function is public I don't think we can change its signature in a minor release
https://docs.rs/datafusion/latest/datafusion/common/utils/fn.array_into_list_array.html
Thus I suggest we keep the existing functions but mark them as deprecated and add a new API
Maybe we can do something builder
style to make future modifications easier
let list_array = ListArrayWrapper::new(arr)
.with_nullable(false)
.with_field(field)
Or something -- that would also give us a way to take the FieldRef directly as well as document what the defaults are 🤔
datafusion/common/src/utils/mod.rs
Outdated
let offsets = OffsetBuffer::from_lengths([arr.len()]); | ||
ListArray::new( | ||
Arc::new(Field::new_list_field(arr.data_type().to_owned(), nullable)), | ||
Arc::new(Field::new( | ||
field_name.unwrap_or("item"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not a concern of this PR, but looks as if this naming convention should be formalized/centralized somewhere (at least a constant), since otherwise there's a whole lot of "item"
literals throughout the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree -- I think it should be done in arrow-rs
Maybe we could add some comments on https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html#method.new_list or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be done in arrow-rs
Yeah, makes sense—I can pick that up if there's no objection, maybe something like this
impl Field {
+ /// Default list member field name
+ pub const LIST_FIELD_DEFAULT_NAME: &'static str = "item";
+
/// Creates a new field with the given name, type, and nullability
pub fn new(name: impl Into<String>, data_type: DataType, nullable: bool) -> Self {
Field {
@@ -144,7 +147,7 @@ impl Field {
/// );
/// ```
pub fn new_list_field(data_type: DataType, nullable: bool) -> Self {
- Self::new("item", data_type, nullable)
+ Self::new(Self::LIST_FIELD_DEFAULT_NAME, data_type, nullable)
}
could add some comments
I think new_list_field
already has an adequate comment (Field::new_list
probably doesn't need one since the name is explicitly passed there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the very least having a default constant exposed would be a nice improvement I think
Are we okay to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good and makes sense to me -- than you @timsaucer and @gruuya
My only concern is that we won't be able to release this change to 43.0.0 as it would be a breaking API change.
datafusion/common/src/utils/mod.rs
Outdated
let offsets = OffsetBuffer::from_lengths([arr.len()]); | ||
ListArray::new( | ||
Arc::new(Field::new_list_field(arr.data_type().to_owned(), nullable)), | ||
Arc::new(Field::new( | ||
field_name.unwrap_or("item"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the very least having a default constant exposed would be a nice improvement I think
datafusion/common/src/utils/mod.rs
Outdated
pub fn array_into_list_array( | ||
arr: ArrayRef, | ||
nullable: bool, | ||
field_name: Option<&str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function is public I don't think we can change its signature in a minor release
https://docs.rs/datafusion/latest/datafusion/common/utils/fn.array_into_list_array.html
Thus I suggest we keep the existing functions but mark them as deprecated and add a new API
Maybe we can do something builder
style to make future modifications easier
let list_array = ListArrayWrapper::new(arr)
.with_nullable(false)
.with_field(field)
Or something -- that would also give us a way to take the FieldRef directly as well as document what the defaults are 🤔
…d in a parallel function for the small cases where we want to explicitly set the field name
Since it would be a breaking change and there is only one place that we internally need this feature, I just switched it to have a different function. I can change over to a builder method if you like, but I wasn't convinced this small use needed it. Happy to support if you think that's valuable. |
@@ -342,6 +342,20 @@ pub fn array_into_list_array(arr: ArrayRef, nullable: bool) -> ListArray { | |||
) | |||
} | |||
|
|||
pub fn array_into_list_array_with_field_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is the idea that we will backport this to 43.0.0 and then consolidate the other functions on main? That makes sense to me if so
Thank you very much @timsaucer |
* Add option to pass in field name to create array to support retaining field name during cast * add unit tests for list casting round trip * Documentation example was missing parameter * Rather than deprecate an existing function or change pub signature add in a parallel function for the small cases where we want to explicitly set the field name
* Add option to pass in field name to create array to support retaining field name during cast * add unit tests for list casting round trip * Documentation example was missing parameter * Rather than deprecate an existing function or change pub signature add in a parallel function for the small cases where we want to explicitly set the field name
Here is my attempt to migrate to a builder (and maybe avoid these kinds of issues in the future 🤔 ): #13623 |
Which issue does this PR close?
Closes #13481
This also pertains to delta-io/delta-rs#2886
Rationale for this change
During a cast operation, the field name for a list can be changed to "item" from whatever it is previously. This change adds an option to set the field name.
What changes are included in this PR?
Adds an option for
array_into_list_array
and the other variants to pass in the field name for the schema.Are these changes tested?
Unit tests added and tested against the example in #13481
Are there any user-facing changes?
Internal change only. No user facing changes.