
feat: clickhouse optimize S3 #4825

Open · wants to merge 9 commits into develop

Conversation

@makeavish (Member) commented Apr 6, 2024

Summary

The ClickHouse S3 optimizer runs OPTIMIZE TABLE queries to reduce excessive PUT calls to S3.
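
For illustration, here is a minimal sketch of the kind of statement the optimizer issues per table, written against the clickhouse-go v2 driver. The connection address, table list, and wrapper code are placeholders for this example, not the PR's actual implementation (see optimize.go in the diff for that).

// Illustrative sketch only; not the code in this PR.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"}, // placeholder address
	})
	if err != nil {
		log.Fatal(err)
	}

	// Two example tables; the PR iterates over a longer list
	// (shown later in the review thread).
	tables := []string{
		"signoz_traces.signoz_index_v2",
		"signoz_logs.logs",
	}

	ctx := context.Background()
	for _, table := range tables {
		// OPTIMIZE TABLE ... FINAL asks ClickHouse to merge the table's data
		// parts; per the PR description, fewer and larger parts mean fewer
		// PUT requests when parts are written to S3.
		query := fmt.Sprintf("OPTIMIZE TABLE %s FINAL", table)
		if err := conn.Exec(ctx, query); err != nil {
			log.Printf("optimize %s failed: %v", table, err)
		}
	}
}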



Co-authored-by: Prashant Shahi <[email protected]>
github-actions bot added the enhancement (New feature or request) label on Apr 6, 2024
github-actions bot commented Apr 6, 2024

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@prashant-shahi marked this pull request as ready for review on April 10, 2024 at 09:34
github-actions bot commented:

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

2 similar comments
@srikanthccv (Member) left a comment

My main concern with this is that it's going to put the system under a lot of pressure whenever it is triggered (see https://clickhouse.com/docs/en/optimize/avoidoptimizefinal). In a customer environment, the pod running this can adversely affect everyone. How big was the data we tested this on? Could this process be controlled with an upper limit on resources?

pkg/query-service/app/clickhouseOptimizeS3/optimize.go (review thread outdated; resolved)
Comment on lines 98 to 113
tables := []string{
	"signoz_logs.logs",
	"signoz_logs.tag_attributes",
	"signoz_metrics.samples_v2",
	"signoz_metrics.time_series_v4",
	"signoz_metrics.time_series_v3",
	"signoz_metrics.time_series_v2",
	"signoz_traces.usage_explorer",
	"signoz_traces.span_attributes",
	"signoz_traces.dependency_graph_minutes",
	"signoz_traces.dependency_graph_minutes_v2",
	"signoz_traces.signoz_error_index_v2",
	"signoz_traces.signoz_index_v2",
	"signoz_traces.signoz_spans",
	"signoz_traces.durationSort",
}
Reviewer (Member):

How is this list prepared? It doesn't seem to be in full sync with

@makeavish (Member, Author) replied:

I see we are not allowing users to change the TTL or move data to S3 for signoz_metrics.time_series_v4 or the other time_series tables. Any reason for that?

pkg/query-service/app/clickhouseOptimizeS3/optimize.go (review thread outdated; resolved)
@makeavish (Member, Author) commented:

> My main concern with this is that it's going to put the system under a lot of pressure whenever it is triggered (see https://clickhouse.com/docs/en/optimize/avoidoptimizefinal). In a customer environment, the pod running this can adversely affect everyone. How big was the data we tested this on? Could this process be controlled with an upper limit on resources?

We use optimize_skip_merged_partitions=1, which ensures that already merged partitions are not merged again. Putting an upper limit on resources can be done, but the right limit depends on load: limiting resources too much can slow down merges, which eventually slows down queries. I would suggest we release this and then manually adjust resource limits based on load. After some feedback we can also automate the resource limits.
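
As a reference for reviewers, a hedged sketch of how optimize_skip_merged_partitions could be attached as a query-level setting with the clickhouse-go v2 driver. The package and function names and the wiring are illustrative assumptions, not necessarily how this PR passes the setting.

// Sketch only. optimize_skip_merged_partitions=1 makes OPTIMIZE ... FINAL
// skip partitions that already consist of a single merged part, so repeated
// runs do not rewrite (and re-upload) already merged data.
package optimizer

import (
	"context"
	"fmt"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

func optimizeTable(ctx context.Context, conn driver.Conn, table string) error {
	ctx = clickhouse.Context(ctx, clickhouse.WithSettings(clickhouse.Settings{
		"optimize_skip_merged_partitions": 1,
	}))
	return conn.Exec(ctx, fmt.Sprintf("OPTIMIZE TABLE %s FINAL", table))
}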

@srikanthccv (Member) commented Apr 16, 2024

IMO this is not one of those features where we merge the initial working piece and think about it more later. It has the potential to bring down the entire system across ingestion, alerting, and product usage. We know that mutations are the worst thing in ClickHouse. Even on a table with no more than a few tens of millions of rows, CPU usage went close to 100%, which delayed ingestion, led to false alerts, and made the product unusable the whole time (https://signoz-team.slack.com/archives/C06C5U3TUDP/p1706604434441239). Once it got into a bad state there was no way to stop it immediately (the kill wouldn't work); we had to wait until it completed on its own.

Did we test this on a system with a non-trivial amount of data? What change in system resource usage was observed? How did it affect the rest of the product? My main point about resource limits is that there should be sensible limits whenever we initiate an unscheduled merge; I didn't say we should limit resources too much.
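
To make the "sensible limits" idea concrete, here is one possible shape of a guard, sketched purely as an illustration and not something this PR implements: before issuing the next unscheduled OPTIMIZE, poll system.merges and wait until the number of in-flight merges drops below a threshold, so forced merges are not stacked on top of an already busy background pool.

// Sketch only; the threshold and poll interval are illustrative, not tuned values.
package optimizer

import (
	"context"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// waitForMergeHeadroom blocks until fewer than maxActiveMerges merges are
// running (as reported by system.merges), or until the context is done.
func waitForMergeHeadroom(ctx context.Context, conn driver.Conn, maxActiveMerges uint64) error {
	for {
		var active uint64
		if err := conn.QueryRow(ctx, "SELECT count() FROM system.merges").Scan(&active); err != nil {
			return err
		}
		if active < maxActiveMerges {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(30 * time.Second):
		}
	}
}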

// General
const (
	CH_OPTIMIZE_INTERVAL_IN_HOURS = 24
	CH_TIMEOUT_WAIT_IN_MINUTES    = 30
)
Collaborator:

This timeout is not respected in case of a slowdown.
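
For illustration, a sketch of how CH_TIMEOUT_WAIT_IN_MINUTES could be enforced on the client side with a context deadline, so a slow server does not make the wait unbounded. The helper below is an assumption for the example, not the PR's code, and cancelling the client context does not by itself stop a merge already running on the server (consistent with the earlier observation that killing the query did not help).

// Sketch only: bound how long we wait on a single OPTIMIZE call.
package optimizer

import (
	"context"
	"fmt"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// Mirrors the constant from the diff excerpt above.
const CH_TIMEOUT_WAIT_IN_MINUTES = 30

func optimizeWithTimeout(parent context.Context, conn driver.Conn, table string) error {
	ctx, cancel := context.WithTimeout(parent, CH_TIMEOUT_WAIT_IN_MINUTES*time.Minute)
	defer cancel()
	// Returns a deadline error if ClickHouse does not answer in time; the
	// server-side merge triggered by OPTIMIZE may still continue regardless.
	return conn.Exec(ctx, fmt.Sprintf("OPTIMIZE TABLE %s FINAL", table))
}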
