stats_agg (1D) produces invalid results when executed in parallel #811

pelder-lumenix · 2024-09-27T20:05:10Z

Relevant system information:

OS: RedHat linux, x86_64, 64bit
PostgreSQL version (output of SELECT version();): 15.8
TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.18.0
TimescaleDB version: 1.16.1
Installation method: Managed Services for Timescale (MST)

Describe the bug

A 1D stats_agg over a large number of rows (~300K) produces garbage results when executed on a 2CPU Managed Services for Timescale (MST) installation. Preventing the query planner from using parallelism (via SET max_parallel_workers_per_gather = 0;) produces correct results. Incorrect results occur even if a single parallel worker is used (SET max_parallel_workers_per_gather = 1;).

To Reproduce

I have been unable to reproduce using local docker (MacOS, M1) or a small Timescale Cloud instance (0.5CPU), however, I can reproduce on multiple MST instances.

Steps to reproduce the behavior:

Using Managed Service for Timescale, create a dev instance (2CPU/4MB)
Create a table as follows:

CREATE TABLE IF NOT EXISTS device_stats
(
    ts timestamp with time zone,
    device_location_id bigint,
    heatsink_temp_c double precision
);

Import the attached CSV data (this is zipped for compactness), sample_device_stats.csv.zip
Execute the following query, and note the stats_agg average is incorrect compared to avg

SET max_parallel_workers_per_gather = 1;

select
    time_bucket('1h', ts) as bucket,
    device_location_id,
    average(stats_agg(heatsink_temp_c)) as average,
    avg(heatsink_temp_c) as avg,
    stats_agg(heatsink_temp_c)
from aims_device_stats
group by 1, 2

Expected behavior
Correct averages from average(stats_agg()) that are equal to the values returned by avg().

Actual behavior
stats_agg 'average' is null, and the dumped statssummary1d contains zero or near-zero values.

Example incorrect output

ts	device_location_id	average(stats_agg)	avg	stats_agg
2024-09-26 21:00:00+00	98	null	50.81733333333338	(version:147,n:0,sx:0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046609608394683,sx2:0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046609608394691,sx3:0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000593,sx4:6098.080000000005)
2024-09-26 19:00:00+00	39	null	42.61495798319328	(version:1,n:0,sx:0,sx2:0,sx3:0,sx4:0)

When I compare the statssummary1d dump from correct results to the erroneous results, there are numerous instances where 'version' much higher than 1, n is 0 and the sx values are zero or extremely small fractions.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
This is definitely related to parallelism. Small queries or explicit actions that prevent the planner from choosing parallel execution produce correct results.

The text was updated successfully, but these errors were encountered:

JamesGuthrie · 2024-11-13T12:14:54Z

Thank you for the thorough bug report. What you are seeing is indeed a bug which is only triggered when parallel workers are used to process a stats_agg aggregate. PR #828 should fix it.

pelder-lumenix added the bug Something isn't working label Sep 27, 2024

JamesGuthrie mentioned this issue Nov 13, 2024

fix: correct broken serialization #828

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stats_agg (1D) produces invalid results when executed in parallel #811

stats_agg (1D) produces invalid results when executed in parallel #811

pelder-lumenix commented Sep 27, 2024

JamesGuthrie commented Nov 13, 2024

stats_agg (1D) produces invalid results when executed in parallel #811

stats_agg (1D) produces invalid results when executed in parallel #811

Comments

pelder-lumenix commented Sep 27, 2024

JamesGuthrie commented Nov 13, 2024