feat: clickhouse optimize S3 #4825
base: develop
Conversation
Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>
My main concern with this is that it will put the system under a lot of pressure whenever it is triggered (https://clickhouse.com/docs/en/optimize/avoidoptimizefinal). In a customer env, the pod running this could adversely affect everyone. How big was the data we tested this on? Could this process be controlled with an upper limit on resources?
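For what it's worth, clickhouse-go v2 lets you attach query-level settings via the context, so the OPTIMIZE statement could at least be issued with explicit caps. A minimal sketch, with illustrative values; whether these settings also bound the background merge that OPTIMIZE schedules, rather than just the issuing query, would need to be verified:

```go
import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// optimizeWithCaps issues OPTIMIZE with query-level resource limits.
// max_threads and max_memory_usage are real ClickHouse settings, but the
// values here are illustrative, and it needs verification whether they
// constrain the merge itself or only the query that triggers it.
func optimizeWithCaps(conn driver.Conn, table string) error {
	ctx := clickhouse.Context(context.Background(), clickhouse.WithSettings(clickhouse.Settings{
		"max_threads":      4,
		"max_memory_usage": 10_000_000_000, // ~10 GB
	}))
	return conn.Exec(ctx, "OPTIMIZE TABLE "+table)
}
```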
tables := []string{
	"signoz_logs.logs",
	"signoz_logs.tag_attributes",
	"signoz_metrics.samples_v2",
	"signoz_metrics.time_series_v4",
	"signoz_metrics.time_series_v3",
	"signoz_metrics.time_series_v2",
	"signoz_traces.usage_explorer",
	"signoz_traces.span_attributes",
	"signoz_traces.dependency_graph_minutes",
	"signoz_traces.dependency_graph_minutes_v2",
	"signoz_traces.signoz_error_index_v2",
	"signoz_traces.signoz_index_v2",
	"signoz_traces.signoz_spans",
	"signoz_traces.durationSort",
}
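For context, a minimal sketch of how a list like this would presumably be consumed; the connection handle, cluster variable, and logging are assumptions for illustration, not the PR's actual code:

```go
import (
	"context"
	"fmt"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
	"go.uber.org/zap"
)

// optimizeTables triggers an unscheduled merge on each listed table.
// Sketch only: conn, cluster, and the error handling are placeholders.
func optimizeTables(ctx context.Context, conn driver.Conn, tables []string, cluster string) {
	for _, table := range tables {
		query := fmt.Sprintf("OPTIMIZE TABLE %s ON CLUSTER %s", table, cluster)
		if err := conn.Exec(ctx, query); err != nil {
			zap.L().Error("optimize failed", zap.String("table", table), zap.Error(err))
		}
	}
}
```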
How is this list prepared? It doesn't seem to be in full sync with:
case constants.LogsTTL:
I see we are not allowing the user to change the TTL or move data to S3 for signoz_metrics.time_series_v4 or the other time_series tables. Any reason for that?
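For reference, a move to S3 in ClickHouse is expressed as a TTL ... TO VOLUME clause against a storage policy that includes an S3 disk. A hedged example of what allowing this for a time_series table might look like; the TTL column, interval, and volume name are illustrative and depend on the actual schema and storage policy:

```go
// Illustrative only: the column, interval, and volume name ('s3') depend on
// the actual table schema and the server's configured storage policy.
const moveTimeSeriesToS3 = `
ALTER TABLE signoz_metrics.time_series_v4 ON CLUSTER cluster
    MODIFY TTL toDateTime(unix_milli / 1000) + INTERVAL 30 DAY TO VOLUME 's3'`
```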
We use a hardcoded cluster name here; allow cluster to be a variable.
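Something along these lines, for example; the env var name is an assumption, and the actual config plumbing may differ:

```go
import (
	"fmt"
	"os"
)

// clusterName reads the cluster from the environment, falling back to the
// currently hardcoded value. CLICKHOUSE_CLUSTER is an assumed name.
func clusterName() string {
	if c := os.Getenv("CLICKHOUSE_CLUSTER"); c != "" {
		return c
	}
	return "cluster"
}

func optimizeQuery(table string) string {
	return fmt.Sprintf("OPTIMIZE TABLE %s ON CLUSTER %s", table, clusterName())
}
```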
IMO this is not one of those features where we merge the initial working piece and think about it more later. It has the potential to bring down the entire system: ingestion, alerting, and product usage. We know mutations are the worst thing in ClickHouse. Even on a table with no more than a few tens of millions of rows, CPU usage went near 100%, which delayed ingestion, led to false alerts, and made the product unusable the whole time (https://signoz-team.slack.com/archives/C06C5U3TUDP/p1706604434441239). Once it got into a bad state, there was no way to stop it immediately (the kill wouldn't work); we had to wait until it completed on its own.

Did we test this on a system with a non-trivial amount of data? What change in system resource usage was observed? How did it affect the rest of the product?

My main point about the resource limits is that there should be sensible limits when we initiate an unscheduled merge. I didn't say we should limit resources too much.
// General
const (
	CH_OPTIMIZE_INTERVAL_IN_HOURS = 24
	CH_TIMEOUT_WAIT_IN_MINUTES    = 30
)
This timeout is not respected in case of a slowdown.
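One way to enforce the wait client-side, as a sketch: bound the call with a context deadline derived from the constant. Note this only stops the client from waiting; it does not necessarily stop a merge the server has already started.

```go
import (
	"context"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// optimizeWithDeadline bounds the OPTIMIZE call with a client-side deadline
// derived from CH_TIMEOUT_WAIT_IN_MINUTES. Canceling the call only stops
// the client from waiting; the server-side merge may keep running.
func optimizeWithDeadline(conn driver.Conn, query string) error {
	ctx, cancel := context.WithTimeout(context.Background(), CH_TIMEOUT_WAIT_IN_MINUTES*time.Minute)
	defer cancel()
	return conn.Exec(ctx, query)
}
```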
Summary
The ClickHouse S3 optimizer runs OPTIMIZE TABLE queries to reduce excessive PUT calls to S3.
Co-authored-by: Prashant Shahi [email protected]