High IO load on historical nodes during upgrade/restart of one of the nodes
Affected Version
31.0.0
smartSegmentLoading = true
Description
When changing the configuration or upgrading the version, we restart historical nodes one by one, waiting until the restarted node becomes available again (i.e. registered in ZooKeeper). We have a replication factor of 2. However, the coordinator appears to immediately assign load tasks to the remaining historicals, which causes high IO load on almost all running historicals and on deep storage (we use CephFS). In previous versions of Druid we didn't see this behavior; redundancy was recovered much more slowly.
Is there a way to tell the coordinator to delay redundancy recovery?
The only coordinator parameter related to this situation that I can find is replicationThrottleLimit, but it does not prevent the redundancy-recovery load queue from appearing. There should be another setting to delay recovery completely (i.e. not send load tasks to historicals at all).
Thanks for reporting this issue, @Z9n2JktHlZDmlhSvqc9X2MmL3BwQG7tk !
When smartSegmentLoading is set to true, the value of replicationThrottleLimit provided in the coordinator's dynamic config is essentially ignored. Smart segment loading automatically calculates replicationThrottleLimit as 5% of the total number of "used" segments in the cluster.
So, if you have 1000 used segments in the cluster, the replication throttle limit would be calculated as 50.
For the time being, you could try setting these:
smartSegmentLoading: false
maxSegmentsInNodeLoadingQueue: 0
replicationThrottleLimit: 10 (or any other value that suits you depending on the total number of segments in your cluster)
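For reference, here is a minimal sketch of how these values could be applied as coordinator dynamic config, assuming the standard coordinator dynamic config endpoint (/druid/coordinator/v1/config) and the default coordinator port; adjust the host, port, and any authentication to match your cluster:

# Illustrative only: post the suggested dynamic config to the Coordinator.
# Replace COORDINATOR_HOST with your Coordinator address; add auth headers if your cluster requires them.
curl -X POST "http://COORDINATOR_HOST:8081/druid/coordinator/v1/config" \
  -H "Content-Type: application/json" \
  -d '{
        "smartSegmentLoading": false,
        "maxSegmentsInNodeLoadingQueue": 0,
        "replicationThrottleLimit": 10
      }'

# Verify the currently active dynamic config afterwards:
curl "http://COORDINATOR_HOST:8081/druid/coordinator/v1/config"

Note that posting to this endpoint replaces the dynamic config as a whole, so it is worth fetching the current config first and including any other fields you have customized.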
However, in the long run, we would like to improve the computation of replicationThrottleLimit performed by the Coordinator even when smartSegmentLoading is set to true.
Edit: Or at the very least, we can continue to honor the value of replicationThrottleLimit provided by the user even if smartSegmentLoading is true.