Avoiding rebalance on historical restart #17594

Open
Z9n2JktHlZDmlhSvqc9X2MmL3BwQG7tk opened this issue Dec 21, 2024 · 1 comment

Comments

Z9n2JktHlZDmlhSvqc9X2MmL3BwQG7tk commented Dec 21, 2024

High IO load on historical nodes during the upgrade/restart of a single node

Affected Version

31.0.0
smartSegmentLoading = true

Description

When changing the configuration or upgrading the version, we restart historical nodes one by one, waiting until the restarted node becomes available again (i.e. is registered in ZooKeeper). We have a replication factor of 2. However, it looks like the coordinator immediately assigns load tasks to the remaining historicals, which causes high IO load on almost all running historicals and on deep storage (we use CephFS). In previous versions of Druid we did not see this behavior; redundancy was recovered much more slowly.

Is there a way to tell the coordinator to delay redundancy recovery?

The only coordinator parameter related to this situation that I can see is replicationThrottleLimit, but it does not prevent the redundancy-recovery load queue from appearing. There must be another setting to delay recovery completely (i.e. not send load tasks to historicals at all).

kfaraz (Contributor) commented Dec 21, 2024

Thanks for reporting this issue, @Z9n2JktHlZDmlhSvqc9X2MmL3BwQG7tk !
When smartSegmentLoading is set to true, the value of replicationThrottleLimit provided in the coordinator's dynamic config is essentially ignored. Smart segment loading automatically computes replicationThrottleLimit as 5% of the total number of "used" segments in the cluster.

So, if you have 1000 used segments in the cluster, the replication throttle limit would be computed as 50.
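
For illustration, here is a sketch of that computation, assuming a straight 5% of the used-segment count (the Coordinator's exact rounding and minimum bounds may differ):

```python
# Sketch of the throttle computation described above; the Coordinator's
# actual rounding/minimum bounds may differ.
def smart_replication_throttle_limit(num_used_segments: int) -> int:
    # replicationThrottleLimit = 5% of the total "used" segments in the cluster
    return max(1, num_used_segments * 5 // 100)

print(smart_replication_throttle_limit(1000))  # -> 50, matching the example above
```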

For the time being, you could try setting the following (see the sketch after this list for one way to apply them):

- smartSegmentLoading: false
- maxSegmentsInNodeLoadingQueue: 0
- replicationThrottleLimit: 10 (or any other value that suits the total number of segments in your cluster)
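
A minimal sketch of applying these settings through the Coordinator's dynamic configuration endpoint (GET /druid/coordinator/v1/config to read, POST to the same path to update). The host/port and the absence of authentication are assumptions; adjust for your deployment.

```python
# Sketch only: update the coordinator dynamic config with the settings above.
# Assumes the Coordinator is reachable at COORDINATOR_URL with no auth.
import requests

COORDINATOR_URL = "http://coordinator:8081"  # placeholder host/port

# Read the current dynamic config so unrelated fields are preserved.
config = requests.get(f"{COORDINATOR_URL}/druid/coordinator/v1/config").json()

config.update({
    "smartSegmentLoading": False,
    "maxSegmentsInNodeLoadingQueue": 0,
    "replicationThrottleLimit": 10,  # tune to your cluster's segment count
})

resp = requests.post(f"{COORDINATOR_URL}/druid/coordinator/v1/config", json=config)
resp.raise_for_status()
```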

However, in the long run, we would like to improve the computation of replicationThrottleLimit performed by the Coordinator even when smartSegmentLoading is set to true.

Edit: Or at the very least, we can continue to honor the value of replicationThrottleLimit provided by the user even if smartSegmentLoading is true.

kfaraz self-assigned this Dec 21, 2024