Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Support read with settings #367

Merged

Conversation

harryshi10
Copy link
Contributor

@harryshi10 harryshi10 commented Nov 8, 2024

Summary

allow read with settings.

close #272

@CLAassistant
Copy link

CLAassistant commented Nov 8, 2024

CLA assistant check
All committers have signed the CLA.

@@ -209,4 +209,13 @@ object ClickHouseSQLConf {
.stringConf
.transform(_.toLowerCase)
.createWithDefault("binary")

val READ_WITH_SETTINGS: ConfigEntry[String] =
buildConf("spark.clickhouse.read.withSettings")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
buildConf("spark.clickhouse.read.withSettings")
buildConf("spark.clickhouse.read.settings")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will also renaming the val to READ_SETTINGS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated


val READ_WITH_SETTINGS: ConfigEntry[String] =
buildConf("spark.clickhouse.read.withSettings")
.doc("Query-level settings when reading from ClickHouse. e.g. `final=1, max_execution_time=5`")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the spark perspective, it's a session-level configuration

SET spark.clickhouse.read.settings=final=1;
SELECT * FROM ch.db.t1; -- affected
SELECT * FROM ch.db.t2; -- affected too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. will update the doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -60,6 +61,7 @@ abstract class ClickHouseReader[Record](
|WHERE (${part.partFilterExpr}) AND (${scanJob.filtersExpr})
|${scanJob.groupByClause.getOrElse("")}
|${scanJob.limit.map(n => s"LIMIT $n").getOrElse("")}
|${if (readWithSettings.nonEmpty) s"SETTINGS $readWithSettings" else ""}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|${if (readWithSettings.nonEmpty) s"SETTINGS $readWithSettings" else ""}
|${readSettings.map(settings => s"SETTINGS $settings" else ""}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

.version("0.9.0")
.stringConf
.transform(_.toLowerCase)
.createWithDefault("")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.createWithDefault("")
.createOptional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update it to OptionalConfigEntry accordingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@mshustov mshustov requested review from mzitnik and pan3793 November 9, 2024 13:13
Copy link
Collaborator

@mzitnik mzitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harryshi10 Can we provide tests of the new feature?

@mzitnik
Copy link
Collaborator

mzitnik commented Nov 10, 2024

@pan3793, do you think we also need to provide write.settings?

@pan3793 pan3793 changed the title feature/allow-read-with-settings Spark: Support read with settings Nov 11, 2024
.doc("Settings when read from ClickHouse. e.g. `final=1, max_execution_time=5`")
.version("0.9.0")
.stringConf
.transform(_.toLowerCase)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should convert the input to lower case?

Copy link
Contributor Author

@harryshi10 harryshi10 Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case we'd like to log scanQuery which keeps SQL keywords in uppercase and other text in lowercase

@pan3793
Copy link
Collaborator

pan3793 commented Nov 11, 2024

code change lgtm, would be great if you could provide a test case

@pan3793
Copy link
Collaborator

pan3793 commented Nov 11, 2024

@pan3793, do you think we also need to provide write.settings?

yes, it could be implemented in another PR.

Additionally, SPARK-36680 (Spark 4.0) provides a more intuitive SQL syntax for this case

SELECT * FROM $t1 WITH (split-size = 5)

@harryshi10
Copy link
Contributor Author

sorry I'm still a rookie at Scala. but I will try to write a UT for this new feature

@mzitnik
Copy link
Collaborator

mzitnik commented Nov 14, 2024

@harryshi10 could please sign CLA

@harryshi10
Copy link
Contributor Author

@harryshi10 could please sign CLA

done

@harryshi10
Copy link
Contributor Author

code change lgtm, would be great if you could provide a test case

Sorry, I can’t provide a unit test, but here’s a test case I ran locally with PySpark.

env - ClickHouse = 24.10.2.80, Spark = 3.5.0


A SummingMergeTree with two records sharing the same key shows duplicates when queried without FINAL, but returns aggregated results when queried with FINAL.

image

In Spark, setting final=0 or final=1 in spark.clickhouse.read.settings controls whether the results are aggregated or not, with final=0 showing non-aggregated results and final=1 providing aggregated results.

image

I also tested that adding final=0 or 1 in spark.clickhouse.read.settings has no side effect on other engines, such as MergeTree.

image image

@mshustov mshustov requested a review from pan3793 December 3, 2024 09:01
@pan3793
Copy link
Collaborator

pan3793 commented Dec 3, 2024

please refer to the error message to fix the docs, other lgtm

@harryshi10
Copy link
Contributor Author

please refer to the error message to fix the docs, other lgtm

doc updated to indicate the default value is None. sorry for missing of that and the delay of reply

Copy link
Collaborator

@mzitnik mzitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we changed something in our tests?

@mzitnik mzitnik merged commit 7a06a13 into ClickHouse:main Dec 9, 2024
16 of 36 checks passed
@mzitnik
Copy link
Collaborator

mzitnik commented Dec 15, 2024

@harryshi10 can you please sign the CLA

@huashi-st
Copy link
Contributor

@harryshi10 can you please sign the CLA

Signed. Apologies, I just realized I used two different accounts to push this PR and signed with both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support to set ClickHouse query settings when reading table
5 participants