Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: upgrade to datafusion 43 #2886

Merged
merged 14 commits into from
Nov 21, 2024
Merged

Conversation

ion-elgreco
Copy link
Collaborator

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Sep 14, 2024
@rtyler
Copy link
Member

rtyler commented Sep 16, 2024

We need Datafusion to upgrade as well before we can do this. I have some patches to make the latest datafusion work but we're blocked on them to adopt the newest arrow and kernel

@rtyler rtyler self-assigned this Sep 16, 2024
@rtyler rtyler marked this pull request as draft September 16, 2024 16:24
@Xuanwo
Copy link
Contributor

Xuanwo commented Sep 18, 2024

Hi, datafusion has been released. Let's rock!

@alamb
Copy link
Contributor

alamb commented Sep 20, 2024

Thank you @ion-elgreco -- let me know if there is anything I can to do help with this PR (e.g. I could work on updating the code to use non deprecated APIs in the core for parquet statistics...)

@ion-elgreco
Copy link
Collaborator Author

Hey @alamb, I am not working on this atm, but @rtyler has picked it up from here

@rtyler rtyler changed the title chore: bump kernel chore: upgrade to datafusion 42 Sep 20, 2024
@rtyler rtyler force-pushed the chore/bump_kernel branch 2 times, most recently from 8db59c2 to b325e27 Compare September 20, 2024 12:35
@ion-elgreco
Copy link
Collaborator Author

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

I had a quick glance at the Changelog but can't really pinpoint where this change may have occurred, do you have any ideas?

DFSchema { inner: 
Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
new schema: 
DFSchema { inner: Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },

@alamb
Copy link
Contributor

alamb commented Sep 20, 2024

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

It sounds somewhat similar to a discussion happening here about nullability (but I think field name is similar in terms of how it is treated): apache/datafusion#11989 (comment)

Update; filed apache/datafusion#12560 to track the regression more

@rtyler
Copy link
Member

rtyler commented Sep 20, 2024

@alamb thanks for following up! I feel a little sheepish 🐑 insofar that my internal CI showed these errors weeks ago in my integration testing between datafusion main with delta-rs main, I just didn't have the time to triage them like @ion-elgreco has 😄

I've subscribed to the issue you created and can help testing if necessary! Depending on my availability this weekend I can try fixing it in datafusion too 🤔

@matthewmturner
Copy link
Contributor

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

@ion-elgreco
Copy link
Collaborator Author

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

It's not yet released though

@rtyler
Copy link
Member

rtyler commented Oct 7, 2024

I can confirm that the latest datafusion main passes tests in CI 🥳

@rtyler rtyler force-pushed the chore/bump_kernel branch from 4f30e16 to bac3306 Compare October 7, 2024 23:10
@rtyler rtyler changed the title chore: upgrade to datafusion 42 chore: upgrade to datafusion 43 Oct 8, 2024
@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

fix

I filed apache/datafusion#12813 to consider releasing a patch version of DataFusion

@alamb
Copy link
Contributor

alamb commented Nov 20, 2024

We plan to make another DataFusion patch release to unblock this upgrade:

Today the make_array function from Datafusion uses "item" as the list
element's field name. With recent changes in delta-kernel-rs we have
switched to calling it "element" which is more conventional related to
how Apache Parquet handles things

This change introduces a test which helps isolate the behavior seen in
Python tests within the core crate for easier regression testing

Signed-off-by: R. Tyler Croy <[email protected]>
@rtyler rtyler disabled auto-merge November 21, 2024 00:52
@rtyler rtyler merged commit 25c6c6f into delta-io:main Nov 21, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 21, 2024

its-happening

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

👋 hey there -- we are starting to get ready for the DataFusion 44 release and are trying to make sure it is less disruptive to delta-rs than this past release

@rtyler Would it be ok if we started working on the upgrade now before the DataFusion release to test and find any potential breakages before?

Maybe we can make a WIP PR to this repo with a crates.io pin (similar to what i am doing in DataFusion with arrow-rs):

If this sounds ok, I can start working on that PR

@roeap
Copy link
Collaborator

roeap commented Dec 13, 2024

hey @alamb - that would not only be OK, but a great thing to do :). Happy to support if needed along the way.

@alamb
Copy link
Contributor

alamb commented Dec 13, 2024

Cool -- thanks -- I'll give it a try hopefully in the next day or two. I am sure I will need help with the python stuff

@alamb
Copy link
Contributor

alamb commented Dec 18, 2024

I haven't made any actual progress on this, but I have created a ticket to track the work:

@rtyler
Copy link
Member

rtyler commented Dec 21, 2024

@alamb I had not seen this issue last week, sorry! I maintain a branch for customers with datafusion main on it, I will cherry-pick and queue up the delta-rs modifications I have shortly

@alamb
Copy link
Contributor

alamb commented Dec 21, 2024

@alamb I had not seen this issue last week, sorry! I maintain a branch for customers with datafusion main on it, I will cherry-pick and queue up the delta-rs modifications I have shortly

No worries at all -- thank you 🙏

I was actually about to start work on trying to upgrade DataFuson in delta.rs this morning (I have a few other things to do first) but if you have already done the work that is amazing. Thank you!!!!

@rtyler
Copy link
Member

rtyler commented Dec 21, 2024

#3073

@alamb
Copy link
Contributor

alamb commented Dec 21, 2024

#3073

😍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema metadata descriptions cause write failures in deltalake >= 0.19.0
9 participants