Implement GroupsAccumulator for corr(x,y) aggregate function #13581

2010YOUY01 · 2024-11-27T15:29:53Z

Which issue does this PR close?

Rationale for this change

Implement GroupsAccumulator for the corr aggregate function, for better performance when group cardinality is high.

I reran the h2o benchmark:

Data Generation

falsa groupby --path-prefix=/Users/yongting/data/ --size MEDIUM --data-format PARQUET
https://github.com/mrpowers-io/falsa

Run benchmark in datafusion-cli

CREATE EXTERNAL TABLE IF NOT EXISTS h2o_100m (
    id1 VARCHAR NOT NULL,
    id2 VARCHAR NOT NULL,
    id3 VARCHAR NOT NULL,
    id4 INTEGER NOT NULL,
    id5 INTEGER NOT NULL,
    id6 INTEGER NOT NULL,
    v1 INTEGER NOT NULL,
    v2 INTEGER NOT NULL,
    v3 DOUBLE PRECISION NOT NULL
)
STORED AS parquet
LOCATION '/Users/yongting/data/G1_1e8_1e8_100_0.parquet';

select id2, id4, power(corr(v1, v2), 2) as r2 from h2o_100m group by id2, id4;

Result

Main: 12s
This PR: 4s
(On my MacBook with an M4 Pro)

Remaining tasks
~~Implement convert_to_states()~~ This requires changes to the aggregate fuzzer for test coverage, so it can be done in a follow-up to keep this PR small

What changes are included in this PR?

Implement two utility functions, accumulate_multiple and accumulate_correlation_states, to accumulate states in the correlation function. (The existing util functions are for aggregate functions with one input expression, e.g. avg(expr1), whereas corr(expr1, expr2) takes two.)
Implement GroupsAccumulator for corr()
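
For readers unfamiliar with the pattern, here is a minimal, std-only sketch of what per-group correlation state can look like. It is not the code in this PR: the struct name `CorrGroupStates`, the `count`, `sum_x`, `sum_y` and `sum_xy` fields, and the handling of groups with fewer than two rows or zero variance are assumptions for illustration (only `sum_xx` and `sum_yy` appear in the diff). It uses the plain sum-of-products formula, which is simple but not the most numerically stable choice.

```rust
/// Hypothetical per-group running sums for Pearson correlation.
/// Each statistic is stored as one Vec indexed by group id.
#[derive(Default)]
struct CorrGroupStates {
    count: Vec<u64>,
    sum_x: Vec<f64>,
    sum_y: Vec<f64>,
    sum_xy: Vec<f64>,
    sum_xx: Vec<f64>,
    sum_yy: Vec<f64>,
}

impl CorrGroupStates {
    /// Fold one batch of (x, y) rows into the state of the group each row belongs to.
    fn update_batch(
        &mut self,
        xs: &[f64],
        ys: &[f64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        // Make room for any groups seen for the first time in this batch.
        self.count.resize(total_num_groups, 0);
        self.sum_x.resize(total_num_groups, 0.0);
        self.sum_y.resize(total_num_groups, 0.0);
        self.sum_xy.resize(total_num_groups, 0.0);
        self.sum_xx.resize(total_num_groups, 0.0);
        self.sum_yy.resize(total_num_groups, 0.0);

        for ((&x, &y), &g) in xs.iter().zip(ys).zip(group_indices) {
            self.count[g] += 1;
            self.sum_x[g] += x;
            self.sum_y[g] += y;
            self.sum_xy[g] += x * y;
            self.sum_xx[g] += x * x;
            self.sum_yy[g] += y * y;
        }
    }

    /// Produce one correlation per group.
    /// Assumption of this sketch: groups with fewer than two rows,
    /// or with zero variance in x or y, yield None.
    fn evaluate(&self) -> Vec<Option<f64>> {
        (0..self.count.len())
            .map(|g| {
                if self.count[g] < 2 {
                    return None;
                }
                let n = self.count[g] as f64;
                let cov = self.sum_xy[g] - self.sum_x[g] * self.sum_y[g] / n;
                let var_x = self.sum_xx[g] - self.sum_x[g] * self.sum_x[g] / n;
                let var_y = self.sum_yy[g] - self.sum_y[g] * self.sum_y[g] / n;
                let denom = (var_x * var_y).sqrt();
                if denom == 0.0 {
                    None
                } else {
                    Some(cov / denom)
                }
            })
            .collect()
    }
}
```

Keeping every statistic in its own flat vector means a batch update is a single pass over the rows with no per-group allocation, which is the general reason a GroupsAccumulator tends to help when there are many groups.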

Are these changes tested?

Unit tests for util functions
corr() is covered by existing tests

Are there any user-facing changes?

No

alamb · 2024-11-27T16:52:06Z

This looks amazing -- thank you @2010YOUY01

I plan to review it over the next day or two

It seems like maybe we should add the data generator for h2o benchmark to the bench.sh script 🤔

Dandandan · 2024-11-27T17:03:24Z

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs

+            let nulls = arr
+                .nulls()
+                .expect("If null_count() > 0, nulls must be present");
+            match combined_nulls {


If you pass combined_nulls to NullBuffer::union, it will take care of handling the Option
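
To illustrate the suggestion: arrow's `NullBuffer::union` takes two `Option<&NullBuffer>` arguments and returns an `Option<NullBuffer>`, so the caller does not need its own `match` on the running mask. The std-only sketch below models that behavior with `Vec<bool>` validity masks (`true` means valid). The helper names `union_validity` and `combine_validity` are hypothetical and stand in for the real arrow types.

```rust
/// Model of combining two optional validity masks.
/// `None` means "no nulls"; a row is valid only if it is valid in both inputs.
fn union_validity(lhs: Option<&[bool]>, rhs: Option<&[bool]>) -> Option<Vec<bool>> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => Some(l.iter().zip(r).map(|(&a, &b)| a && b).collect()),
        (Some(m), None) | (None, Some(m)) => Some(m.to_vec()),
        (None, None) => None,
    }
}

/// Fold the masks of several value columns into one combined mask.
/// Because the helper accepts Option on both sides, the fold needs no
/// special case for the first column or for columns without nulls.
fn combine_validity(columns: &[Option<Vec<bool>>]) -> Option<Vec<bool>> {
    columns
        .iter()
        .fold(None, |acc: Option<Vec<bool>>, col| {
            union_validity(acc.as_deref(), col.as_deref())
        })
}
```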

Dandandan · 2024-11-28T10:07:42Z

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs

+    T: ArrowPrimitiveType + Send,
+    F: FnMut(usize, &[T::Native]) + Send,
+{
+    let acc_cols: Vec<&[T::Native]> = value_columns


I think collecting into Vec might not be necessary?

That's true, I have updated it

Dandandan · 2024-11-28T12:27:42Z

datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs

+            for (idx, &group_idx) in group_indices.iter().enumerate() {
+                // Get `idx`-th row from all value(accumulate) columns
+                let row_values: Vec<_> =
+                    value_columns.iter().map(|col| col.value(idx)).collect();


It would be nice to avoid collecting here. Can value_fn take an iterator instead?

The bench query above went from 4s -> 3s after this change! Great catch
I had to structure the code differently to make it compile, though. It's a bit more complex, so I added more comments
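
One way to avoid the per-row allocation, sketched here with plain slices rather than arrow arrays, is to hand the callback the row index together with the untouched column slices, so each row is read in place. The function name `accumulate_rows` and the callback shape are assumptions for illustration and may differ from what the PR ended up with.

```rust
/// Call `value_fn(group_idx, row_idx, columns)` once per row.
/// No temporary Vec is built for the row: the callback indexes
/// directly into the column slices it needs.
fn accumulate_rows<F>(group_indices: &[usize], value_columns: &[&[f64]], mut value_fn: F)
where
    F: FnMut(usize, usize, &[&[f64]]),
{
    for (row_idx, &group_idx) in group_indices.iter().enumerate() {
        value_fn(group_idx, row_idx, value_columns);
    }
}
```

A caller accumulating a per-group sum of products would then write something like `accumulate_rows(&groups, &[&xs[..], &ys[..]], |g, i, cols| sum_xy[g] += cols[0][i] * cols[1][i])`.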

jayzhan211 · 2024-12-02T11:25:40Z

datafusion/functions-aggregate/src/correlation.rs

+        self.sum_xx.resize(total_num_groups, 0.0);
+        self.sum_yy.resize(total_num_groups, 0.0);
+
+        let array_x = &cast(&values[0], &DataType::Float64)?;


I think casting should be handled in the logical optimizer. Fixing the signature of Correlation might help

You're right, this line is redundant
Addressed in 98cba91

I tried removing it: all tests passed, but the benchmark query above won't run 🤔 The existing signature looks correct to me, so this might need further investigation
Opened #13721

Dandandan · 2024-12-07T13:49:36Z

Hi, I think this one is pretty close, do you have time to look at the review comments @2010YOUY01 ?

2010YOUY01 · 2024-12-07T17:45:01Z

> Hi, I think this one is pretty close, do you have time to look at the review comments @2010YOUY01 ?

Yes, I will be back and finish this PR in the next 2 days. I'm traveling and afk this week. Thanks for the attention to this ticket

alamb · 2024-12-07T19:24:29Z

I also hope to contribute a benchmark for corr

(to be clear not for this PR)

2010YOUY01 · 2024-12-10T17:46:01Z

Thank you all for the review, it's ready for another look

Dandandan · 2024-12-10T18:46:13Z

Very nice work @2010YOUY01

Implement GroupsAccumulator for corr(x,y)

a834fda

github-actions bot added the functions label Nov 27, 2024

2010YOUY01 mentioned this pull request Nov 27, 2024

[EPIC] Improved aggregate function performance #13548

Open

2 tasks

Dandandan reviewed Nov 27, 2024

Dandandan reviewed Nov 28, 2024

2010YOUY01 added 2 commits November 28, 2024 19:27

feedbacks

773b9c5

fix CI MSRV

380ef0a

2010YOUY01 force-pushed the faster-corr branch from 0d6e2c7 to 380ef0a Compare November 28, 2024 12:27

Dandandan reviewed Nov 28, 2024

Dandandan reviewed Nov 29, 2024

jayzhan211 reviewed Dec 2, 2024

jayzhan211 reviewed Dec 2, 2024

2010YOUY01 added 3 commits December 11, 2024 00:35

review

98cba91

avoid collect in accumulation

66fb41e

add back cast

8c84406

2010YOUY01 mentioned this pull request Dec 10, 2024

Avoid explicit cast during execution in corr aggregate function #13721

Open

Dandandan approved these changes Dec 10, 2024
