Fetch rows in JSON format #152

pravic · 2024-09-17T23:02:07Z

Summary

This is a draft PR that adds support for fetching rows in JSON format.

The current Query::fetch fetches results using FORMAT RowBinary only, which requires a strict 1:1 mapping between the query schema and the deserialized type.
That has a number of limitations.

This PR allows row deserialization into a T: Deserialize, which lifts the following limitations of Query::fetch:

- when the table schema is not known: SELECT * FROM ?
- when the table schema is not specified: DESCRIBE TABLE ?
- when we read fewer columns than we select

Unresolved questions

- I am aware (well, now) of https://github.com/suharev7/clickhouse-rs, but while that library lifts some of the limitations, it's very different from this one, and I'd rather stick with this one in the hope that Use RowBinaryWithNamesAndTypes #10 will eventually be supported
- the naming Query::json, Query::json_one, Query::json_all is very unfortunate
  - maybe Query::fetch_json, Query::fetch_json_one would be better, albeit verbose
- Query::json requires the watch feature, which is a bit misleading; we can improve this

Checklist

Delete items not relevant to your PR:

Unit and integration tests covering the common scenarios were added
A human-readable description of the changes was provided so that we can include it in CHANGELOG later
For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

CLAassistant · 2024-09-17T23:02:13Z

All committers have signed the CLA.

slvrtrn · 2024-09-18T16:49:45Z

Could it be a more flexible approach to provide a "raw" method such as

pub fn query_raw(format: DataFormat) -> Result<RawCursor>

That way, it would be possible to extract either raw bytes or a JSON representation from that RawCursor. It would also make it possible to support all JSON types and to stream into files as CSV/TSV/Parquet, etc.
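To make the proposal a bit more concrete, here is a minimal sketch of what the format selection could look like. Everything in it is hypothetical: `DataFormat`, its variants, and the `with_format` helper are illustrative names that are not part of the crate, and the variant list covers only a handful of ClickHouse's formats.

```rust
// Hypothetical sketch: none of these names exist in the crate today.

/// A few of ClickHouse's output formats; the real list is much longer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataFormat {
    RowBinary,
    JsonEachRow,
    Csv,
    TabSeparated,
    Parquet,
}

impl DataFormat {
    /// The format name as it is written in a `FORMAT` clause.
    pub fn as_str(self) -> &'static str {
        match self {
            DataFormat::RowBinary => "RowBinary",
            DataFormat::JsonEachRow => "JSONEachRow",
            DataFormat::Csv => "CSV",
            DataFormat::TabSeparated => "TabSeparated",
            DataFormat::Parquet => "Parquet",
        }
    }
}

/// Appends the `FORMAT` clause that a raw query would send to the server.
/// A trailing semicolon and trailing whitespace are dropped first.
pub fn with_format(sql: &str, format: DataFormat) -> String {
    let sql = sql.trim_end().trim_end_matches(';').trim_end();
    format!("{} FORMAT {}", sql, format.as_str())
}
```

With something like this, `query_raw` would only need to send the rewritten SQL and hand the response body back untouched; no per-format deserializer would live in the crate.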

WDYT?

CC @serprex, @loyd, @mshustov

slvrtrn · 2024-09-18T16:51:57Z

Let's extract the Debug impl for Query into a separate PR, as it addresses #146 (thank you!), so that it can be merged quickly while we discuss the rest.

serprex · 2024-09-18T18:38:29Z

+1 on renaming json to fetch_json, etc.

pravic · 2024-09-18T21:00:43Z

> Could it be a more flexible approach to provide a "raw" method such as

@slvrtrn That would be useful on its own as well - see #154, #60, etc.

loyd · 2024-09-20T17:06:40Z

First of all, the only reason this crate already contains serde-json (conditionally!) is as a workaround for WATCH, which will go away with the move to the Native format.

Secondly, the initial problem is the absence of dynamic schemas, not the "I want exactly JSON" problem.

Thus, we should avoid adding formats that ought to be covered by the basic cursor-like API (Apache Arrow could have a separate column-based API, for instance). I want to remind everyone that the semantics of the current query() API are not "give me RowBinary". It's about providing Rust structures. RowJsonCursor<T> seems very unobvious to me. JSON is a string in some predefined format, and it isn't related to any T.

By providing such specific interfaces, we're opening a black hole: the next request will be "I want CSV format", "I want Vertical (lol) format", and so on. All these formats will be difficult to support later, because Bytes -> T deserialization can be implemented very differently (think about Native or any other column-based format). So, I think that Cursor<T> should have only one possible implementation (or several if that's required for some sort of workaround), but only as an implementation detail. Yep, it can even use JSON internally if the user wants schemas not supported by RowBinary, but it's not "I want JSON"; it's an "I want dynamic schemas" flag/inference.

@slvrtrn's suggestion about query_raw seems much more flexible and doesn't lead to new dependencies. But, again, it doesn't provide any -> T deserialization, only an async iterator over chunks. And that's great: if users want to parse the data on their own, that's okay.
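As an illustration of what "parse it on your own" could involve, here is a rough sketch of reassembling newline-delimited rows from chunks that may end in the middle of a row. It is an assumption-laden example: `LineSplitter` and `collect_rows` are made-up names, and a plain synchronous `Iterator` of byte vectors stands in for the async chunk iterator discussed above.

```rust
/// Reassembles newline-delimited rows (for example JSONEachRow output)
/// from byte chunks that may end in the middle of a row.
/// Hypothetical helper, not part of the crate.
#[derive(Default)]
pub struct LineSplitter {
    /// Bytes of the row that has not been terminated by `\n` yet.
    pending: Vec<u8>,
}

impl LineSplitter {
    /// Feeds one chunk and returns every row completed by it.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        let mut rows = Vec::new();
        for &byte in chunk {
            if byte == b'\n' {
                // A newline closes the current row; start a fresh buffer.
                rows.push(std::mem::take(&mut self.pending));
            } else {
                self.pending.push(byte);
            }
        }
        rows
    }

    /// Returns the trailing row if the stream did not end with a newline.
    pub fn finish(self) -> Option<Vec<u8>> {
        if self.pending.is_empty() {
            None
        } else {
            Some(self.pending)
        }
    }
}

/// Collects all rows from any source of chunks. A synchronous iterator is
/// used here in place of the async cursor to keep the sketch std-only.
pub fn collect_rows<I>(chunks: I) -> Vec<String>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut splitter = LineSplitter::default();
    let mut rows = Vec::new();
    for chunk in chunks {
        for row in splitter.push(&chunk) {
            rows.push(String::from_utf8_lossy(&row).into_owned());
        }
    }
    if let Some(row) = splitter.finish() {
        rows.push(String::from_utf8_lossy(&row).into_owned());
    }
    rows
}
```

Splitting on `\n` is safe for JSON rows because JSON strings escape newlines; a format such as CSV with quoted multi-line fields would need a real parser instead. Each returned row could then be handed to whatever deserializer the user prefers.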

pravic · 2024-09-20T19:09:57Z

> Secondly, the initial problem is the absence of dynamic schemas

Correct. I stated the original problem: it's impossible right now to get the result of DESCRIBE TABLE using clickhouse-rs. And #10 is 3 years old already.

Of course, query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when? If another 3 years, then it's better to have at least something that unblocks use cases similar to #10.

If only we had a working PostgreSQL interface, we'd stick with sqlx. Or a native sqlx-clickhouse driver would be even more perfect.

loyd · 2024-09-21T21:51:43Z

> query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when?

Actually, I think we can implement query_raw pretty fast. At least, I don't see any issues with it now.

> If another 3 years

I totally understand your displeasure. There are several reasons why it wasn't resolved in time:

- This crate was a personal initiative in my free time, so I couldn't solve any more-than-trivial issues unrelated to my needs.
- Moving to RowBinaryWithNamesAndTypes seems to result in either a performance penalty or unmaintainable code. Native could potentially solve these problems, but it's undocumented and, formally, unstable.

That said, we could use NamesAndTypes only for deserialize_any (dynamic schemas), and it probably wouldn't affect anything else.

I hope both reasons become irrelevant now that ClickHouse's team members have begun working on this crate.
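For context on the NamesAndTypes idea, here is a small sketch of reading the header that precedes the rows in RowBinaryWithNamesAndTypes. It assumes the layout described in the ClickHouse format documentation as I understand it: a LEB128 column count N, then N column names, then N column types, each string being a LEB128 length followed by that many bytes. The function names are hypothetical and this is not how the crate does (or would necessarily do) it.

```rust
use std::convert::TryFrom;

/// Reads an unsigned LEB128 varint, advancing `buf` past it.
/// Returns `None` on truncated or over-long input.
fn read_varint(buf: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None; // more bytes than a u64 can hold
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Reads a length-prefixed UTF-8 string, advancing `buf` past it.
fn read_string(buf: &mut &[u8]) -> Option<String> {
    let len = usize::try_from(read_varint(buf)?).ok()?;
    if buf.len() < len {
        return None;
    }
    let (bytes, rest) = buf.split_at(len);
    *buf = rest;
    String::from_utf8(bytes.to_vec()).ok()
}

/// Reads the header and returns `(column name, column type)` pairs.
/// After it returns, `buf` points at the first row.
pub fn read_header(buf: &mut &[u8]) -> Option<Vec<(String, String)>> {
    let columns = usize::try_from(read_varint(buf)?).ok()?;
    let mut names = Vec::new();
    for _ in 0..columns {
        names.push(read_string(buf)?);
    }
    let mut types = Vec::new();
    for _ in 0..columns {
        types.push(read_string(buf)?);
    }
    Some(names.into_iter().zip(types).collect())
}
```

The header is read once per response, so a dynamic-schema path could consult it for deserialize_any while the typed path keeps decoding rows positionally, which is roughly the split suggested above.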

> Or a native sqlx-clickhouse driver would be even more perfect.

Maybe yes, maybe not. It's not a simple statement. I liked the SQLx-like approach much more until we (I mean, my team) started actively using it (for PostgreSQL), and... OK, let's say that it's full of pain for multiple nontrivial tasks. But I don't want to delve into this topic here; it's worth revisiting once sqlx support exists, with a detailed comparison and benchmarking. I'll be totally fine if the CH env team decides to focus on sqlx instead as the official approach.

So, query_raw()? Either way, it's useful to have in this crate.

pravic · 2024-09-22T16:23:53Z

> So, query_raw()?

@loyd All right. Thanks for the very detailed explanation about the background of this!

slvrtrn · 2024-11-28T12:37:53Z

@pravic, do you think something like this could work instead - #182? That draft also contains potential usage examples.

serprex requested a review from slvrtrn September 18, 2024 18:36

serprex previously approved these changes Sep 18, 2024

pravic dismissed serprex’s stale review via 684c2ce September 18, 2024 20:56

pravic force-pushed the fetch-json-format branch from 7ff222b to 684c2ce Compare September 18, 2024 20:56

pravic marked this pull request as ready for review September 19, 2024 07:50

serprex requested a review from loyd September 20, 2024 17:33

loyd mentioned this pull request Oct 18, 2024

Improve SELECTs performance and add RowCursor::{decoded_bytes,received_bytes} #169

Merged

3 tasks

slvrtrn mentioned this pull request Nov 8, 2024

Add methods that allow to insert/fetch a stream of bytes in arbitrary format #174

Open

pravic force-pushed the fetch-json-format branch from 476fd1a to 55da617 Compare November 27, 2024 12:24

pravic added 3 commits November 28, 2024 09:19

fix(cursor): A better error message in JsonCursor.

238d6ba

Print the source JSON string in the error message.

doc: Add a note about JSON object in results.

d41ac2b

pravic force-pushed the fetch-json-format branch from 55da617 to d41ac2b Compare November 28, 2024 06:19
