[EPIC] Improved aggregate function performance #13548

alamb · 2024-11-24T13:51:38Z

Is your feature request related to a problem or challenge?

The basic aggregate functions like COUNT and SUM in DataFusion are very fast (see "Apache DataFusion is now the fastest single node engine for querying Apache Parquet files").

However, many of the other aggregate functions are not particularly fast, and this shows up specifically on some of the h2o benchmarks.

We saw this in the results in the 2024 DataFusion SIGMOD paper

(BTW we have made median faster)

@MrPowers has also observed similar results on discord (link):

DataFusion was added to the h2o benchmarks (which are now maintained by duckdb) and DataFusion performs quite well for most of the "basic" groupby queries. It performs poorly for some of the advanced questions on the 50GB dataset. Here are the results:
https://duckdblabs.github.io/db-benchmark/

See his version of the benchmarks here
https://github.com/MrPowers/mrpowers-benchmarks

Functions

Describe the solution you'd like

DataFusion has two APIs for implementing aggregate functions like SUM and COUNT:

Easy (but slow) way: Accumulator (api docs)
Fast (but complicated) way: GroupsAccumulator (api docs)

The basic aggregates are implemented using GroupsAccumulator, which is part of the reason for DataFusion's performance.

This ticket tracks the effort to improve the performance of the "more advanced" aggregate functions, likely by implementing GroupsAccumulator for them.
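To make the difference between the two styles concrete, here is a minimal Python sketch for SUM. It is only an illustration: the class and function names are hypothetical and are not DataFusion's Rust traits. The second class is loosely modeled on the shape of `GroupsAccumulator::update_batch` (values plus one group index per row), with the optional filter argument left out.

```python
from collections import defaultdict


class SumAccumulator:
    """Accumulator style (hypothetical): one small object per group."""

    def __init__(self):
        self.total = 0.0

    def update(self, values):
        for v in values:
            self.total += v

    def evaluate(self):
        return self.total


def sum_with_per_group_objects(group_keys, values):
    # One accumulator object is allocated for every distinct group,
    # and every row pays for a dictionary lookup plus a method call.
    accs = defaultdict(SumAccumulator)
    for key, v in zip(group_keys, values):
        accs[key].update([v])
    return {key: acc.evaluate() for key, acc in accs.items()}


class SumGroupsAccumulator:
    """GroupsAccumulator style (hypothetical): one object for all groups.

    State lives in a single flat list indexed by group index, so a whole
    batch is processed in one call with no per-group allocations.
    """

    def __init__(self):
        self.totals = []

    def update_batch(self, values, group_indices, total_num_groups):
        # Grow the flat state so that every group index is valid.
        self.totals.extend([0.0] * (total_num_groups - len(self.totals)))
        for v, g in zip(values, group_indices):
            self.totals[g] += v

    def evaluate(self):
        return list(self.totals)


values = [1.0, 2.0, 3.0, 4.0, 5.0]
group_indices = [0, 1, 0, 2, 1]

per_group = sum_with_per_group_objects(group_indices, values)

groups_acc = SumGroupsAccumulator()
groups_acc.update_batch(values, group_indices, total_num_groups=3)
flat = groups_acc.evaluate()
```

Both paths produce the same sums. The point of the second design is that its cost does not grow with one heap object per group, which matters most for high cardinality GROUP BY queries.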

Describe alternatives you've considered

For each function listed above, ideally we would:

Add a new benchmark in one PR: either a specific one for the h2o benchmarks, or a query in the ClickBench extended benchmark (documentation here).
Implement GroupsAccumulator for the relevant aggregate function in a second PR (along with tests for correctness). We would use the benchmark to verify the performance.

Here is a pretty good example of how @eejbyfeldt did this for STDDEV:

Implement groups accumulator for stddev and variance #12095

Additional context

No response


alamb · 2024-11-24T22:12:13Z

For posterity, here is a link to the discord chat: https://discord.com/channels/885562378132000778/1309883046886903870/1309887744595595324

MrPowers · 2024-11-25T13:31:52Z

Would like to note that the DataFusion performance really starts to lag when the dataset size grows.

Take a look at this query: select id2, id4, power(corr(v1, v2), 2) as r2 from x group by id2, id4.

When the dataset is 10 million rows, then Polars takes 3 seconds and DataFusion takes 3.6 seconds, so pretty similar.

When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.
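For readers unfamiliar with the query, the following Python sketch shows what it computes: the squared Pearson correlation of `v1` and `v2` for each `(id2, id4)` group. This is a plain reference computation under the assumption that rows with a NULL in either column are skipped, as SQL `corr` does. It is not how DataFusion or Polars implement it, and the running-sums formula used here is simple but less numerically stable than the incremental algorithms engines typically use.

```python
from collections import defaultdict


def r2_by_group(rows):
    """rows: iterable of (id2, id4, v1, v2). Returns {(id2, id4): r2 or None}."""
    # Per-group running state: [n, sum_x, sum_y, sum_xx, sum_yy, sum_xy]
    state = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])
    for id2, id4, x, y in rows:
        if x is None or y is None:
            continue  # skip rows where either input is NULL
        s = state[(id2, id4)]
        s[0] += 1
        s[1] += x
        s[2] += y
        s[3] += x * x
        s[4] += y * y
        s[5] += x * y

    result = {}
    for key, (n, sx, sy, sxx, syy, sxy) in state.items():
        cov = n * sxy - sx * sy
        var_x = n * sxx - sx * sx
        var_y = n * syy - sy * sy
        if n < 2 or var_x <= 0 or var_y <= 0:
            result[key] = None  # correlation is undefined for this group
        else:
            # power(corr(v1, v2), 2) == cov^2 / (var_x * var_y)
            result[key] = (cov * cov) / (var_x * var_y)
    return result


rows = [
    ("a", 1, 1.0, 2.0),
    ("a", 1, 2.0, 4.0),
    ("a", 1, 3.0, 6.0),
    ("b", 2, 1.0, 1.0),
    ("b", 2, 2.0, 3.0),
    ("b", 2, 3.0, 2.0),
    ("c", 3, 1.0, None),
]
r2 = r2_by_group(rows)
```

Note that the state per group is only six numbers, so the work per row is small. The cost of the query is dominated by how the engine manages state across many distinct `(id2, id4)` groups.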

alamb · 2024-11-25T20:53:37Z

> When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

What version are you working with?

@Rachelint has some ideas of how to improve this:

Dandandan · 2024-11-25T21:02:09Z

Hm this seems something quadratic in nature?

> > When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.
>
> What version are you working with?
>
> @Rachelint has some ideas of how to improve this:
>
> Sketch for aggregation intermediate results blocked management #11943
>
> Manage group values and states by blocks in aggregation #11931

Does it fully explain the dramatic difference? @MrPowers how do you generate the 10M vs 100M rows?
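As a rough check on the "quadratic" intuition, one can estimate an empirical scaling exponent from the two timings reported above, assuming run time grows like rows^k. This is only a back-of-the-envelope sketch: two data points cannot distinguish an algorithmic effect from an environmental one such as memory pressure. The function name is made up for illustration.

```python
import math


def scaling_exponent(t_small, t_large, size_ratio):
    """Solve t_large / t_small = size_ratio ** k for k."""
    return math.log(t_large / t_small) / math.log(size_ratio)


# Timings reported in this thread for 10M rows vs 100M rows (10x more data).
polars_k = scaling_exponent(3.0, 126.0, 10)
datafusion_k = scaling_exponent(3.6, 2100.0, 10)

print(f"Polars exponent:     {polars_k:.2f}")
print(f"DataFusion exponent: {datafusion_k:.2f}")
```

An exponent of 1 would mean linear scaling and 2 would mean quadratic. Both engines come out above 1 on these numbers, and the DataFusion figure comes out above 2.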

alamb · 2024-11-25T21:05:02Z

I would also expect this to help (it has already been merged, so whether you see the benefit depends on the version you are running):

Skipping partial aggregation when it is not helping for high cardinality aggregates #11627
I think the right thing to do is to get the query / dataset and profile it

MrPowers · 2024-11-26T01:40:29Z

@Dandandan - thanks to the great work by @SemyonSinchenko, it's easy to generate these datasets with falsa.

Here's the command to generate the 10 million row dataset: falsa groupby --path-prefix=~/data --size SMALL --data-format PARQUET. Just use MEDIUM to generate the 100 million row dataset.

Rachelint · 2024-11-26T02:02:34Z

> @Dandandan - thanks to the great work by @SemyonSinchenko, it's easy to generate these datasets with falsa.
>
> Here's the command to generate the 10 million row dataset: falsa groupby --path-prefix=~/data --size SMALL --data-format PARQUET. Just use MEDIUM to generate the 100 million row dataset.

Thanks, I will profile it and see what causes such a long run time in DataFusion.

Rachelint · 2024-11-26T04:28:49Z

> > When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.
>
> What version are you working with?
>
> @Rachelint has some ideas of how to improve this:
> * [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)
> * [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)

🤔 I guess this may have a similar cause to what we encountered during benchmarking in #11827

alamb · 2024-11-26T20:00:21Z

> > > When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.
> >
> > What version are you working with?
> > @Rachelint has some ideas of how to improve this:
> > * [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)
> > * [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)
>
> 🤔 I guess it may be caused by the similar reason of what we encountered during benchmarking in #11827

Specifically that power and corr need to support convert_to_state?
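For context, `convert_to_state` on `GroupsAccumulator` turns an input batch directly into intermediate aggregate state, treating each row as its own group, so that a partial aggregation step that is not reducing cardinality can be skipped. The Python sketch below illustrates the idea for a correlation-style aggregate. The state layout `(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy)` and both function names are assumptions made for this example, not DataFusion's actual representation.

```python
def convert_to_state(xs, ys):
    """Map each input row to a single-row intermediate state (hypothetical layout).

    No grouping or hashing happens here: row i simply becomes the state
    of a group that contains only row i.
    """
    return [(1, x, y, x * x, y * y, x * y) for x, y in zip(xs, ys)]


def merge_states(states):
    """Final-phase merge: combine intermediate states by summing component-wise."""
    total = [0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for s in states:
        for i, part in enumerate(s):
            total[i] += part
    return tuple(total)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

row_states = convert_to_state(xs, ys)
merged = merge_states(row_states)
```

The merged result is the same state that accumulating the rows one by one would have produced, which is what makes it safe to bypass the partial phase when it is not helping.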

Rachelint · 2024-11-27T15:18:22Z

> > > > When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.
> > >
> > > What version are you working with?
> > > @Rachelint has some ideas of how to improve this:
> > > * [Sketch for aggregation intermediate results blocked management #11943](https://github.com/apache/datafusion/pull/11943)
> > > * [Manage group values and states by blocks in aggregation #11931](https://github.com/apache/datafusion/issues/11931)
> >
> > 🤔 I guess it may be caused by the similar reason of what we encountered during benchmarking in #11827
>
> Specifically that power and corr need to support convert_to_state?

I am not sure, but I think it may really be related to GroupsAccumulatorAdapter, as in #11827.
I am running and profiling it to find the answer.

2010YOUY01 · 2024-11-27T16:07:12Z

I reran the H2O Q9 with GroupsAccumulator for corr() (see #13581).
The h2o dataset is in Parquet format.

Result
----
main, h2o_10m: 0.8s
main, h2o_100m: 12s
pr, h2o_10m: 0.2s
pr, h2o_100m: 4s

I didn't reproduce the drastic slowdown on the main branch 🤔

> When the dataset is 10 million rows, then Polars takes 3 seconds and DataFusion takes 3.6 seconds, so pretty similar.
>
> When the dataset is 100 million rows, then Polars takes 126 seconds and DataFusion takes 2,100 seconds.

MrPowers · 2024-11-27T17:17:17Z

The h2o benchmarks are run on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz machine with 128 cores and 250 GB of RAM.

DataFusion groupby queries perform well on the 100 million row dataset (~5GB of data in a CSV file):

Some don't run with the 1 billion row dataset (~50GB of data in an uncompressed CSV file):

I am using an M3 MacBook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.

Dandandan · 2024-11-27T22:24:35Z

GroupsAccumulator for median/corr reduces memory usage (should be by quite a bit).

Dandandan · 2024-11-27T22:25:26Z

Looking at the benchmark results, I think query 8 is worth analyzing / optimizing as well:
#13548

2010YOUY01 · 2024-11-28T12:09:45Z

> I am using a M3 Macbook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.

That explains it 👍🏼 I ran the benchmark on a MacBook with 48 GB of RAM.
It is likely that Q9 requires more than 16 GB of RAM, and OS memory swapping caused the performance regression.

We should also take a look at how much memory DataFusion consumes for those queries compared to other systems. Thanks for the report.

Rachelint · 2024-11-28T15:18:44Z

> > I am using a M3 Macbook with 16 GB of RAM. How much RAM does your machine have? Perhaps DataFusion only struggles with query 9 when the machine doesn't have lots of extra RAM.
>
> This explains 👍🏼 I ran the benchmark on a macbook with 48G of ram. It is likely Q9 requires > 16G RAM, and OS memory swapping caused the performance regression.
>
> We should also take a look at how much memory does DataFusion consume for those queries, comparing to other systems. Thanks for the report.

Yes, I ran it today, and my machine has only 16 GB of memory too... and I also found the query to be very, very slow due to swapping.

alamb · 2024-12-02T18:32:19Z

I think making DataFusion work better in lower memory situations would certainly be nice

alamb added the enhancement New feature or request label Nov 24, 2024

This was referenced Nov 24, 2024

Improve performance of corr function #13549

Open

Improve performance of median function #13550

Open

alamb mentioned this issue Dec 3, 2024

Dec 3. 2024: This week in DataFusion #13630

Open

3 tasks
