Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch rows in JSON format #152

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

pravic
Copy link
Contributor

@pravic pravic commented Sep 17, 2024

Summary

This is a draft PR which adds the support for fetching rows in JSON format.

The current Query::fetch fetches the results using FORMAT RowBinary only which requires a strict 1:1 mapping between the query schema and the deserialized type.
And that has a lot of limitations:

This PR allows the row deserialization into a T: Deserialize which eliminates the limitations of Query::fetch:

  • when the table schema is not known: SELECT * from ?
  • when the table schema is not specified: DESCRIBE TABLE ?
  • when we read less columns than we select

Unresolved questions

  • am aware (well, now) of https://github.com/suharev7/clickhouse-rs but while that library lifts some of the limitations, it's very different from this one and I'd rather stick to this in hope that Use RowBinaryWithNamesAndTypes #10 will eventually be supported
  • the naming Query::json, Query::json_one, Query::json_all is very unfortunate
    • maybe, Query::fetch_json, Query::fetch_json_one would be better albeit verbose
  • Query::json requires the watch feature which is a bit misleading - we can improve this

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided so that we can include it in CHANGELOG later
  • For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant
Copy link

CLAassistant commented Sep 17, 2024

CLA assistant check
All committers have signed the CLA.

@slvrtrn
Copy link
Contributor

slvrtrn commented Sep 18, 2024

Could it be a more flexible approach to provide a "raw" method such as

pub fn query_raw(format: DataFormat) -> Result<RawCursor>

So that then it is possible to extract either raw bytes or a JSON repr from that RawCursor? This way, it will be possible to support all JSON types and streaming into files with CSV/TSV/Parquet, etc.

WDYT?

CC @serprex, @loyd, @mshustov

@slvrtrn
Copy link
Contributor

slvrtrn commented Sep 18, 2024

Let's extract Debug impl for Query as a separate PR as it addresses #146 (thank you!), so that it can be merged quickly, while we are discussing the rest.

@serprex serprex requested a review from slvrtrn September 18, 2024 18:36
serprex
serprex previously approved these changes Sep 18, 2024
@serprex
Copy link
Member

serprex commented Sep 18, 2024

+1 on naming json fetch_json etc

@pravic
Copy link
Contributor Author

pravic commented Sep 18, 2024

Could it be a more flexible approach to provide a "raw" method such as

@slvrtrn That would be useful on its own as well - see #154, #60, etc.

@pravic pravic marked this pull request as ready for review September 19, 2024 07:50
@loyd
Copy link
Collaborator

loyd commented Sep 20, 2024

First of all, the only reason why this crate already contains (conditionally!) serde-json is only WA for WATCH, which will go away with moving to the Native format.

Secondly, the initial problem is the absence of dynamic schemas, not the "I want exactly JSON" problem.

Thus, we should avoid any additional formats that should be covered by basic cursor-like API (Apache Arrow can have another column-based API, for instance). I want to remember that the semantics of the current query() API is not "give me RowBinary". It's about providing Rust structures. RowJsonCursor<T> seems very unobvious to me. JSON is a string in some predefined format and it isn't related to any T.

By providing such specific interfaces, we're opening a black hole: the next request will be "I want CSV format", "I want Vertical (lol) format", and so on. All these formats will be difficult to support later, because Bytes -> T deserialization can be very differently implemented (think about Native or any other column-based formats). So, I think that Cursor<T> should have only one possible implementation (or several ones if it's required for some sort of WA), but only as implementation details. Yep, it can even use JSON internally if the user wants schemas not supported by RowBinary, but it's not "I want JSON"; it's "I want dynamic schemas" flag/inference.

@slvrtrn's suggestion about query_raw seems much more flexible and doesn't lead to new dependencies. But, again, it doesn't provide any -> T deserialization, only async iterator over chunks. And it's great: if a user wants to parse it on his own, it's okay.

@serprex serprex requested a review from loyd September 20, 2024 17:33
@pravic
Copy link
Contributor Author

pravic commented Sep 20, 2024

Secondly, the initial problem is the absence of dynamic schemas

Correct. I stated the original problem: it's impossible right now to get the result of DESCRIBE TABLE using clickhouse-rs. And #10 is 3 years old already.

Of course, query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when? If another 3 years, then it's better to have at least something that unblocks use cases similar to #10.

If only we had a working PostgreSQL interface, we'd stick with sqlx. Or a native sqlx-clickhouse driver would be even more perfect.

@loyd
Copy link
Collaborator

loyd commented Sep 21, 2024

query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when?

Actually, I think we can implement query_raw pretty fast. At least, I don't see any issues with it now.

If another 3 years

I totally understand your displeasure. There are several points why it wasn't resolved in time:

  • This crate was a personal initiative during my free time so that I couldn't solve any more-than-trivial issues unrelated to my needs.
  • Moving to RowBinaryWithNamesAndTypes seems to result in a performance penalty or unmaintainable code. Native could potentially solve these problems, but it's undocumented and, formally, unstable.

Although, we can use NamesAndTypes only for deserialize_any (dynamic schemas) and probably it won't affect anything.

I hope both reasons become irrelevant now that ClickHouse's team members have begun working on this crate.

Or a native sqlx-clickhouse driver would be even more perfect.

Maybe yes, maybe not. It's not a simple statement. I liked the SQLx-like approach much more until we (I mean, in my team) started actively using it (for PostgreSQL), and.. ok, let's say that it's full of pain for multiple nontrivial tasks. But I don't want to delve into this topic here, it's worth considering after the sqlx support by detailed comparison and benchmarking. I'll be totally fine if CH env team decides to focus on sqlx instead as the official approach.


So, query_raw()? Anyway it's useful to have in this crate.

@pravic
Copy link
Contributor Author

pravic commented Sep 22, 2024

So, query_raw()?

@loyd All right. Thanks for the very detailed explanation about the background of this!

Print the source JSON string in the error message.
Allows row deserialization into a `T: Deserialize`,
which eliminates the limitations of `Query::fetch`:

* when the table schema is not known: `SELECT * from ?`
* when the table schema is not specified: `DESCRIBE TABLE ?`
* when we read less columns than we select
@slvrtrn
Copy link
Contributor

slvrtrn commented Nov 28, 2024

@pravic, do you think something like this could work instead - #182? That draft also contains potential usage examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants