We have encountered problems on large amounts of data. We tried to analyze the query that Qryn generates, and we have some questions.
SQL query:
```sql
EXPLAIN indexes = 1
WITH sel_a AS
(
    SELECT samples.string AS string,
           samples.fingerprint AS fingerprint,
           samples.timestamp_ns AS timestamp_ns
    FROM qryn.samples_v3_dist AS samples
    WHERE ((samples.timestamp_ns BETWEEN 1722865286192000000 AND 1722865586192000000)
       AND (samples.type IN (0, 0)))
      AND (samples.fingerprint IN
      (
          SELECT sel_1.fingerprint
          FROM
          (
              SELECT fingerprint
              FROM qryn.time_series_gin
              WHERE ((key = 'k8s_clusters') AND (val = 'k8s-omp-test'))
          ) AS sel_1
          INNER ANY JOIN
          (
              SELECT fingerprint
              FROM qryn.time_series_gin
              WHERE ((key = 'ClientId') AND (val = '123'))
          ) AS sel_2 ON sel_1.fingerprint = sel_2.fingerprint
      ))
    ORDER BY timestamp_ns DESC
    LIMIT 100
)
SELECT JSONExtractKeysAndValues(time_series.labels, 'String') AS labels,
       sel_a.*
FROM sel_a
LEFT ANY JOIN qryn.time_series_dist AS time_series
    ON sel_a.fingerprint = time_series.fingerprint
ORDER BY labels DESC, timestamp_ns DESC
```
The first part is fine; it is filtered by time:
```
Expression (Projection)
  Sorting (Sorting for ORDER BY)
    Expression (Before ORDER BY)
      Join (JOIN FillRightFirst)
        Expression ((Before JOIN + Projection))
          Limit (preliminary LIMIT (without OFFSET))
            Sorting (Merge sorted streams after aggregation stage for ORDER BY)
              Union
                Expression
                  CreatingSets (Create sets before main query execution)
                    Sorting (Sorting for ORDER BY)
                      Expression (Before ORDER BY)
                        ReadFromMergeTree (qryn.samples_v3)
                        Indexes:
                          MinMax
                            Keys:
                              timestamp_ns
                            Condition: and((timestamp_ns in (-Inf, 1722865586192000000]), (timestamp_ns in [1722865286192000000, +Inf)))
                            Parts: 1/53
                            Granules: 3/126065
                          Partition
                            Keys:
                              toStartOfDay(toDateTime(divide(timestamp_ns, 1000000000)))
                            Condition: and((toStartOfDay(toDateTime(divide(timestamp_ns, 1000000000))) in (-Inf, 1722805200]), (toStartOfDay(toDateTime(divide(timestamp_ns, 1000000000))) in [1722805200, +Inf)))
                            Parts: 1/1
                            Granules: 3/3
                          PrimaryKey
                            Keys:
                              timestamp_ns
                            Condition: and((timestamp_ns in (-Inf, 1722865586192000000]), (timestamp_ns in [1722865286192000000, +Inf)))
                            Parts: 1/1
                            Granules: 1/3
```
In the second part, however, we found that it performs a full table scan. The third part is similar, reading a large amount of data.
Is this scan unavoidable, or is there a way to also bound it by time, or to filter it in some other way?
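One possible direction, as a sketch only (not tested against Qryn's actual code): if the label-index table carries a `date` column, the fingerprint subqueries could be bounded by the same time window as the samples scan, so the index lookup no longer touches all historical series. The `date` column and the exact window arithmetic here are assumptions:

```sql
-- Sketch: bound the label-index lookup by date as well.
-- Assumes qryn.time_series_gin has a `date` column; the BETWEEN bounds
-- are the query's time range converted from nanoseconds to days.
SELECT sel_1.fingerprint
FROM
(
    SELECT fingerprint
    FROM qryn.time_series_gin
    WHERE (key = 'k8s_clusters') AND (val = 'k8s-omp-test')
      AND date BETWEEN toDate(1722865286) AND toDate(1722865586)
) AS sel_1
INNER ANY JOIN
(
    SELECT fingerprint
    FROM qryn.time_series_gin
    WHERE (key = 'ClientId') AND (val = '123')
      AND date BETWEEN toDate(1722865286) AND toDate(1722865586)
) AS sel_2 ON sel_1.fingerprint = sel_2.fingerprint
```

If the GIN table is partitioned or ordered by `date`, such a predicate would let ClickHouse prune parts instead of scanning the whole index.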
It looks like the approach with multiple joins does not work well on large amounts of data.
Denormalization, i.e. storing the label data in another format, may help. There are some options:
- Store labels as `Map(LowCardinality(String), String)`, as in the `otel.otel_logs` schema from this article: https://clickhouse.com/blog/storing-log-data-in-clickhouse-fluent-bit-vector-open-telemetry#querying-the-map-type
- See the section "Approach 3: JSON as pairwise arrays" here: https://www.propeldata.com/blog/how-to-store-json-in-clickhouse-the-right-way
- SigNoz also uses this approach: https://signoz.io/docs/userguide/logs_clickhouse_queries/
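To illustrate the first option, here is a minimal sketch of the denormalized layout. The table and column names are hypothetical, not Qryn's schema; the point is that label filtering becomes a single scan bounded by the same time predicates as the data, with no join:

```sql
-- Hypothetical table: labels stored inline with each log row as a Map.
CREATE TABLE logs_map
(
    timestamp_ns UInt64,
    string       String,
    labels       Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY toStartOfDay(toDateTime(intDiv(timestamp_ns, 1000000000)))
ORDER BY timestamp_ns;

-- The same query as above, now a single time-bounded scan:
SELECT timestamp_ns, string, labels
FROM logs_map
WHERE timestamp_ns BETWEEN 1722865286192000000 AND 1722865586192000000
  AND labels['k8s_clusters'] = 'k8s-omp-test'
  AND labels['ClientId'] = '123'
ORDER BY timestamp_ns DESC
LIMIT 100;
```

The trade-off is larger rows (labels repeated per entry) in exchange for avoiding the fingerprint join entirely; a `LowCardinality` key type keeps the repetition cheap.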