[WIP] Adding support for deltalake file #843

Closed
wants to merge 6 commits into from

cocoa-xu
Member

This PR tries to add support for Delta Lake files and should close #752 once it's done.

@philss
Member

philss commented Feb 1, 2024

hey @cocoa-xu, thank you for this PR!

I'm going to try to fix the missing parts, if you don't mind. I'll open another branch to avoid conflicting with yours, but I'll base it on this branch, ok?

@cocoa-xu
Member Author

cocoa-xu commented Feb 3, 2024

> hey @cocoa-xu, thank you for this PR!
>
> I'm going to try to fix the missing parts, if you don't mind. I'll open another branch to avoid conflicting with yours, but I'll base it on this branch, ok?

No problem!

@josevalim
Member

@cocoa-xu could we perhaps use a library that converts delta lake to an arrow stream, and then we can use the existing logic in df_from_arrow_stream_pointer to convert it to a dataframe?

Alternatively, we could even have Delta Lake as a separate library, and all it needs to do is expose a pointer to an Arrow stream, similar to what we do in Adbc. In Adbc we do:

  Adbc.Connection.query_pointer(conn, query, params, fn pointer, _num_rows ->
    Explorer.PolarsBackend.Native.df_from_arrow_stream_pointer(pointer)
  end)

But we could export df_from_arrow_stream_pointer(pointer) as a public function? This way you could do:

  DeltaLake.transfer_arrow_stream_pointer(lake, fn pointer ->
    Explorer.PolarsBackend.from_arrow_stream_pointer(pointer)
  end)

WDYT?
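
To make the pointer hand-off idea concrete, here is a minimal Rust sketch. It is not taken from Explorer, Adbc, or delta-rs. The `Stream` struct is a hypothetical, heavily simplified stand-in for the Arrow C stream interface: it keeps only the callback-plus-`private_data` convention and yields plain `i64` values instead of Arrow arrays. The `producer` and `consumer` modules, `export`, `collect`, and `round_trip` are illustrative names. The point is that two sides can each declare their own struct type and still exchange data, as long as both agree on the C layout behind the pointer.

```rust
/// Producer side: a stand-in for the library that owns the data.
mod producer {
    use std::ffi::c_void;
    use std::vec::IntoIter;

    /// Hypothetical, simplified stream handle. The real Arrow C stream
    /// interface has more callbacks and yields Arrow arrays; here each
    /// "batch" is a single i64.
    #[repr(C)]
    pub struct Stream {
        pub get_next: Option<unsafe extern "C" fn(*mut Stream, *mut i64) -> i32>,
        pub release: Option<unsafe extern "C" fn(*mut Stream)>,
        pub private_data: *mut c_void,
    }

    /// Writes the next value to `out` and returns 0, or returns 1 once the
    /// stream is exhausted (a simplification made for this sketch).
    unsafe extern "C" fn get_next(stream: *mut Stream, out: *mut i64) -> i32 {
        unsafe {
            let iter = &mut *((*stream).private_data as *mut IntoIter<i64>);
            match iter.next() {
                Some(value) => {
                    *out = value;
                    0
                }
                None => 1,
            }
        }
    }

    /// Frees the producer-owned state and marks the handle as released.
    unsafe extern "C" fn release(stream: *mut Stream) {
        unsafe {
            drop(Box::from_raw((*stream).private_data as *mut IntoIter<i64>));
            (*stream).private_data = std::ptr::null_mut();
            (*stream).get_next = None;
            (*stream).release = None;
        }
    }

    /// Moves `values` behind an opaque pointer and returns the handle.
    pub fn export(values: Vec<i64>) -> Stream {
        let state = Box::new(values.into_iter());
        Stream {
            get_next: Some(get_next),
            release: Some(release),
            private_data: Box::into_raw(state) as *mut c_void,
        }
    }
}

/// Consumer side: declares its own struct with the same C layout and never
/// names any producer type.
mod consumer {
    use std::ffi::c_void;

    #[repr(C)]
    pub struct Stream {
        pub get_next: Option<unsafe extern "C" fn(*mut Stream, *mut i64) -> i32>,
        pub release: Option<unsafe extern "C" fn(*mut Stream)>,
        pub private_data: *mut c_void,
    }

    /// Drains the stream, then calls `release` so the producer frees its state.
    ///
    /// Safety: `ptr` must point to a live, unreleased handle with this layout.
    pub unsafe fn collect(ptr: *mut Stream) -> Vec<i64> {
        let mut out = Vec::new();
        let mut value = 0i64;
        unsafe {
            let next = (*ptr).get_next.expect("stream already released");
            while next(ptr, &mut value) == 0 {
                out.push(value);
            }
            if let Some(release) = (*ptr).release {
                release(ptr);
            }
        }
        out
    }
}

/// Exports values on one side and reads them back through a raw pointer.
pub fn round_trip(values: Vec<i64>) -> Vec<i64> {
    let mut stream = producer::export(values);
    // Only the pointer crosses the boundary; the two struct types are unrelated.
    let ptr = (&mut stream as *mut producer::Stream).cast::<consumer::Stream>();
    unsafe { consumer::collect(ptr) }
}
```

In the terms of the proposal above, the producer plays the role of the hypothetical `DeltaLake` library and the consumer plays the role of `from_arrow_stream_pointer`. Whether the real crates can be wired together this cleanly is a separate question that this sketch does not answer.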

@cocoa-xu
Member Author

> @cocoa-xu could we perhaps use a library that converts delta lake to an arrow stream, and then we can use the existing logic in df_from_arrow_stream_pointer to convert it to a dataframe?
>
> Alternatively, we could even have Delta Lake as a separate library, and all it needs to do is expose a pointer to an Arrow stream, similar to what we do in Adbc.

Yeah, I think I'll try to have Delta Lake as a separate library. At the very least it would be a good starting point and probably easier to debug (and we could possibly integrate it into Explorer later, if that doesn't cause too much trouble).

@cocoa-xu
Member Author

So I read more in the deltalake repo and other relevant repos, and I found that we basically have to wait for upstream to support ADBC.

According to delta-rs issue #954 and the design doc of Delta Lake ADBC,

> We would aim to use this to support integration with DuckDB and Polars.

> Polars has a different problem: it uses arrow2, which is a separate Arrow implementation from apache/arrow-rs (used by DataFusion).

This was one of the difficulties I encountered here -- there's no easy way to cast from polars_arrow's array to the one deltalake is using, i.e., apache/arrow-rs' array, or the other way around.

Currently, I think we should wait for

Once they're available, we can easily add support for deltalake to explorer.

Besides that, we could try to replace the current ADBC binding for Elixir (https://github.com/elixir-explorer/adbc) with the Rust version of ADBC, if we want to and if it suits our use cases.

@josevalim
Member

ADBC sounds like a solid route forward. Shall we close this one then?

@cocoa-xu
Member Author

> ADBC sounds like a solid route forward. Shall we close this one then?

Yeah, I think we can close this one and wait for the ADBC version.

@cocoa-xu closed this on Feb 11, 2024
Development

Successfully merging this pull request may close these issues.

Add delta lake file support
3 participants