Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats_agg (1D) produces invalid results when executed in parallel #811

Open
pelder-lumenix opened this issue Sep 27, 2024 · 1 comment
Open
Labels
bug Something isn't working

Comments

@pelder-lumenix
Copy link

Relevant system information:

  • OS: RedHat linux, x86_64, 64bit
  • PostgreSQL version (output of SELECT version();): 15.8
  • TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.18.0
  • TimescaleDB version: 1.16.1
  • Installation method: Managed Services for Timescale (MST)

Describe the bug

A 1D stats_agg over a large number of rows (~300K) produces garbage results when executed on a 2CPU Managed Services for Timescale (MST) installation. Preventing the query planner from using parallelism (via SET max_parallel_workers_per_gather = 0;) produces correct results. Incorrect results occur even if a single parallel worker is used (SET max_parallel_workers_per_gather = 1;).

To Reproduce

I have been unable to reproduce using local docker (MacOS, M1) or a small Timescale Cloud instance (0.5CPU), however, I can reproduce on multiple MST instances.

Steps to reproduce the behavior:

  1. Using Managed Service for Timescale, create a dev instance (2CPU/4MB)
  2. Create a table as follows:
CREATE TABLE IF NOT EXISTS device_stats
(
    ts timestamp with time zone,
    device_location_id bigint,
    heatsink_temp_c double precision
);
  1. Import the attached CSV data (this is zipped for compactness), sample_device_stats.csv.zip
  2. Execute the following query, and note the stats_agg average is incorrect compared to avg
SET max_parallel_workers_per_gather = 1;

select
    time_bucket('1h', ts) as bucket,
    device_location_id,
    average(stats_agg(heatsink_temp_c)) as average,
    avg(heatsink_temp_c) as avg,
    stats_agg(heatsink_temp_c)
from aims_device_stats
group by 1, 2

Expected behavior
Correct averages from average(stats_agg()) that are equal to the values returned by avg().

Actual behavior
stats_agg 'average' is null, and the dumped statssummary1d contains zero or near-zero values.

Example incorrect output

ts device_location_id average(stats_agg) avg stats_agg
2024-09-26 21:00:00+00 98 null 50.81733333333338 (version:147,n:0,sx:0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046609608394683,sx2:0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046609608394691,sx3:0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000593,sx4:6098.080000000005)
2024-09-26 19:00:00+00 39 null 42.61495798319328 (version:1,n:0,sx:0,sx2:0,sx3:0,sx4:0)

When I compare the statssummary1d dump from correct results to the erroneous results, there are numerous instances where 'version' much higher than 1, n is 0 and the sx values are zero or extremely small fractions.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
This is definitely related to parallelism. Small queries or explicit actions that prevent the planner from choosing parallel execution produce correct results.

@pelder-lumenix pelder-lumenix added the bug Something isn't working label Sep 27, 2024
@JamesGuthrie
Copy link
Member

Thank you for the thorough bug report. What you are seeing is indeed a bug which is only triggered when parallel workers are used to process a stats_agg aggregate. PR #828 should fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants