
Any benchmarks? #298

Open
dolfinus opened this issue Feb 13, 2024 · 2 comments

Comments


dolfinus commented Feb 13, 2024

Hi.

Do you have any benchmarks for reading & writing data using the Spark Housepower connector vs. others, like the official JDBC driver?

Spark ClickHouse Connector is advertised as a high-performance connector, but for me it is actually slower than JDBC. For example, writing 32 GB of data (3 columns, 2 billion rows):

| Connector  | Partitions | batchsize | Time    |
|------------|------------|-----------|---------|
| JDBC       | 1500       | 2_000_000 | 6.7 min |
| JDBC       | 40         | 5_000_000 | 1.8 min |
| Housepower | 1500       | 2_000_000 | 11 min  |
| Housepower | 40         | 5_000_000 | 6.8 min |
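For context, the per-row write throughput implied by the timings above is plain arithmetic over the table (2 billion rows total; nothing here comes from the connectors themselves):

```python
# Rough throughput implied by the benchmark table above.
# Timings are copied from the table; the rest is arithmetic.
ROWS = 2_000_000_000

runs = {
    ("JDBC", 1500): 6.7,       # minutes
    ("JDBC", 40): 1.8,
    ("Housepower", 1500): 11.0,
    ("Housepower", 40): 6.8,
}

for (connector, partitions), minutes in runs.items():
    rows_per_sec = ROWS / (minutes * 60)
    print(f"{connector:10s} {partitions:5d} partitions: {rows_per_sec / 1e6:6.1f}M rows/s")
```

On the best configuration for each, this works out to roughly 18.5M rows/s for JDBC vs. roughly 4.9M rows/s for Housepower, i.e. close to a 4x gap.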

Packages I've used:

```python
maven_packages = [
    "com.github.housepower:clickhouse-spark-runtime-3.4_2.12:0.7.3",
    "com.clickhouse:clickhouse-jdbc:0.6.0",
    "com.clickhouse:clickhouse-client:0.6.0",
    "com.clickhouse:clickhouse-http-client:0.6.0",
    "org.apache.httpcomponents.client5:httpclient5:5.3.1",
]
```
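For reference, coordinates like these are typically handed to Spark via `spark.jars.packages`. A minimal sketch, assuming the `maven_packages` list above (the app name is a placeholder, not from the issue):

```python
from pyspark.sql import SparkSession

# Hedged sketch: build a session with the connector jars on the classpath.
# "clickhouse-benchmark" is a hypothetical app name.
spark = (
    SparkSession.builder
    .appName("clickhouse-benchmark")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```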

Config:

spark.conf.set("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "my.clickhouse.domain")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "http")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "40101")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", "")
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.async", "false")
spark.conf.set("spark.sql.catalog.clickhouse.option.client_name", "onetl")
spark.conf.set("spark.sql.catalog.clickhouse.option.socket_keepalive", "true")
spark.conf.set("spark.clickhouse.ignoreUnsupportedTransform", "false")
spark.conf.set("spark.clickhouse.read.distributed.useClusterNodes", "false")
spark.conf.set("spark.clickhouse.read.distributed.convertLocal", "false")
spark.conf.set("spark.clickhouse.write.batchSize", 5_000_000)
spark.conf.set("spark.clickhouse.write.repartitionStrictly", "false")
spark.conf.set("spark.clickhouse.write.repartitionByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByPartition", "false")
spark.conf.set("spark.clickhouse.write.localSortByKey", "false")
spark.conf.set("spark.clickhouse.write.distributed.useClusterNodes", "true")
spark.conf.set("spark.clickhouse.write.distributed.convertLocal", "false")
@pan3793
Collaborator

pan3793 commented Feb 20, 2024

No specific benchmarks, as Spark and ClickHouse usually run in large clusters.

There is a generic performance tuning guide mentioned in #265 (comment).

@dolfinus
Author

I've already set repartitionByPartition=false to avoid repartitioning on the connector side. In the Spark UI all executors (40 in my case) received the same number of rows, so there was no data skew. Both the JDBC and Housepower connectors received the same dataframe with the same distribution and number of partitions.
