
Any benchmarks? #298

Open
dolfinus opened this issue Feb 13, 2024 · 2 comments

Comments


dolfinus commented Feb 13, 2024

Hi.

Do you have any benchmarks for reading & writing data using the Spark Housepower connector vs. others, like the official JDBC driver?

Spark ClickHouse Connector is advertised as a high-performance connector, but for me it is actually slower than JDBC. For example, writing 32 GB of data (3 columns, 2 billion rows):

| Connector  | Partitions | batchsize | Time    |
|------------|------------|-----------|---------|
| JDBC       | 1500       | 2_000_000 | 6.7 min |
| JDBC       | 40         | 5_000_000 | 1.8 min |
| Housepower | 1500       | 2_000_000 | 11 min  |
| Housepower | 40         | 5_000_000 | 6.8 min |
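For context, the per-row write throughput implied by the timings above is plain arithmetic over the table (2 billion rows total; nothing here comes from the connectors themselves):

```python
# Rough throughput implied by the benchmark table above.
# Timings are copied from the table; the rest is arithmetic.
ROWS = 2_000_000_000

runs = {
    ("JDBC", 1500): 6.7,       # minutes
    ("JDBC", 40): 1.8,
    ("Housepower", 1500): 11.0,
    ("Housepower", 40): 6.8,
}

for (connector, partitions), minutes in runs.items():
    rows_per_sec = ROWS / (minutes * 60)
    print(f"{connector:10s} {partitions:5d} partitions: {rows_per_sec / 1e6:6.1f}M rows/s")
```

On the best configuration for each, this works out to roughly 18.5M rows/s for JDBC vs. roughly 4.9M rows/s for Housepower, i.e. close to a 4x gap.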

Packages I've used:

```python
maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]
```
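For reference, coordinates like these are typically handed to Spark via `spark.jars.packages`. A minimal sketch, assuming the `maven_packages` list above (the app name is a placeholder, not from the issue):

```python
from pyspark.sql import SparkSession

# Hedged sketch: build a session with the connector jars on the classpath.
# "clickhouse-benchmark" is a hypothetical app name.
spark = (
    SparkSession.builder
    .appName("clickhouse-benchmark")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```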

Config:

spark.conf.set("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "my.clickhouse.domain")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "http")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "40101")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", "")
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.async", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.client_name", "onetl")
spark.conf.set("spark.sql.catalog.clickhouse.option.socket_keepalive", "true")
spark.conf.set("spark.clickhouse.ignoreUnsupportedTransform", "false")
spark.conf.set("spark.clickhouse.read.distributed.useClusterNodes", "false")
spark.conf.set("spark.clickhouse.read.distributed.convertLocal", "false")
spark.conf.set("spark.clickhouse.write.batchSize", 5_000_000)
spark.conf.set("spark.clickhouse.write.repartitionStrictly", "false")
spark.conf.set("spark.clickhouse.write.repartitionByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByKey", "false")
spark.conf.set("spark.clickhouse.write.distributed.useClusterNodes", "true")
spark.conf.set("spark.clickhouse.write.distributed.convertLocal", "false")
@pan3793
Collaborator

pan3793 commented Feb 20, 2024

No specific benchmarks, as Spark and ClickHouse usually run in large clusters.

There is a generic performance tuning guide mentioned in #265 (comment).

@dolfinus
Author

I've already set repartitionByPartition=false to avoid repartitioning on the connector side. In the Spark UI all executors (40 in my case) received the same number of rows, so there was no data skew. Both the JDBC and Housepower connectors received the same dataframe with the same distribution and number of partitions.
