Since people have been asking about the future of Refinery, here's a rough roadmap. Please note that there are no version numbers.
We're not expecting any new major releases of Refinery
The thing we had called "Refinery 3", which was being built to enable full dynamic autoscaling, turned out to be an experiment that won't be released. It worked well in smaller clusters and we were using it internally, but part of its design didn't scale well to large clusters. By taking it that far, though, we learned enough to roll the key functionality out in a series of minor releases that won't require breaking changes to configuration.
Dynamic Scaling is being released in smaller steps
The goal is to enable Refinery to scale dynamically. We've been accomplishing this in a series of steps, each of which ships as its own Refinery minor version.
Pubsub (Released in v2.7)
This release starts using a publish/subscribe (pubsub) system to allow the Refinery cluster to quickly communicate new information and thus act as a unit. Health, stress, peer membership, and config change notifications are all communicated through pubsub. The pubsub system is abstracted, but the first implementation is on Redis, since that's what Refinery was already using for peer communications. This release also includes readiness checks in addition to liveness checks, to play better with load balancers.
This work provides the foundation that will allow us to do dynamic scaling, and it also improves stability and data quality because the entire cluster reacts to changes as a single unit.
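For readers unfamiliar with the pattern, here is a minimal sketch of cluster-wide notification over Redis pubsub using the go-redis client. The channel name and payload are invented for this example and are not Refinery's actual internals:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Every node subscribes to a shared channel so the whole cluster hears
	// about health, stress, membership, and config changes at the same time.
	// The channel name "refinery-cluster" is made up for this sketch.
	sub := client.Subscribe(ctx, "refinery-cluster")
	defer sub.Close()

	// Wait for confirmation that the subscription exists before publishing.
	if _, err := sub.Receive(ctx); err != nil {
		panic(err)
	}

	// A node that detects a change publishes it once; every peer receives it.
	if err := client.Publish(ctx, "refinery-cluster", "config-updated").Err(); err != nil {
		panic(err)
	}

	// Each subscriber reacts to the notification (e.g. reloads its config).
	msg := <-sub.Channel()
	fmt.Println("cluster event:", msg.Payload)
}
```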
Smarter scaling (Released in v2.8)
When cluster size changes, the Refinery cluster redistributes spans in flight to the new correct nodes, which means that scaling events will not break traces in progress. This should allow Refinery to be scaled without significant disruption to telemetry.
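The redistribution relies on every node agreeing, deterministically, on which peer owns a given trace ID; when the peer list changes, each node can recompute ownership for the spans it holds and forward the ones whose owner has moved. Here's a deliberately simplified sketch of that idea (a plain hash over the peer list, standing in for Refinery's actual sharding code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerFor maps a trace ID onto one of the current peers. As long as every
// node agrees on the peer list, they all compute the same owner. This is a
// toy stand-in for Refinery's real sharder, not the actual algorithm.
func ownerFor(traceID string, peers []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return peers[h.Sum32()%uint32(len(peers))]
}

func main() {
	before := []string{"refinery-0", "refinery-1"}
	after := []string{"refinery-0", "refinery-1", "refinery-2"} // scale-up event

	traceID := "4bf92f3577b34da6a3ce929d0e0e4736"

	// When membership changes, each node recomputes ownership for the traces
	// it is holding and forwards any spans whose owner has moved.
	oldOwner := ownerFor(traceID, before)
	newOwner := ownerFor(traceID, after)
	if oldOwner != newOwner {
		fmt.Printf("trace %s moves from %s to %s\n", traceID, oldOwner, newOwner)
	} else {
		fmt.Printf("trace %s stays on %s\n", traceID, oldOwner)
	}
}
```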
Reduction in peer-to-peer data communications (Experimental in v2.9)
This is also known as "Elimination of Trace Locality".
When a span is received on Refinery A but Refinery B is the designated decider, instead of forwarding the entire span to B, Refinery A forwards only the key fields (those fields used in the sampling decision) to B and retains the original span. Once B makes the decision, it publishes the decision so that A can send or drop the span it retained, as appropriate. This should significantly reduce the amount of traffic traveling between peers, since only key fields and traceIDs are sent rather than entire spans.
An additional benefit is that no single Refinery node bears the burden of large traces; instead, memory usage across the entire cluster goes up and down together. This should help with autoscaling based on memory size.
However, we've made it an experimental feature in 2.9 because it requires a much larger Redis cluster to handle the trace traffic, and because we are still learning how it behaves at the largest scales.
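To illustrate the shape of the exchange (this is not Refinery's actual wire format; the type names below are invented), node A forwards only the trace ID and key fields, holds the span, and later applies the decision that B publishes:

```go
package main

import "fmt"

// KeyFieldsMsg is what node A forwards to the decider (node B): only the
// trace ID and the fields the sampling rules need, never the whole span.
type KeyFieldsMsg struct {
	TraceID   string
	KeyFields map[string]any
}

// DecisionMsg is what the decider publishes back to the cluster once the
// sampling decision has been made.
type DecisionMsg struct {
	TraceID    string
	Keep       bool
	SampleRate uint
}

func main() {
	// Node A receives a span, extracts only the key fields, and holds the span.
	fwd := KeyFieldsMsg{
		TraceID:   "4bf92f3577b34da6a3ce929d0e0e4736",
		KeyFields: map[string]any{"http.status_code": 500, "service.name": "checkout"},
	}
	fmt.Printf("A -> B: %+v\n", fwd)

	// Node B runs its sampling rules against the key fields and publishes the
	// decision; A then sends or drops the span it retained.
	decision := DecisionMsg{TraceID: fwd.TraceID, Keep: true, SampleRate: 10}
	fmt.Printf("B -> cluster: %+v\n", decision)
}
```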
Autoscaling
We are hopeful that, with all of the work listed above, it will then be possible to autoscale Refinery in Kubernetes based on memory usage. But we won't know for sure until we get there, so more to come as we learn more.
Log Sampling
As of v2.6, OTel logs containing traceIDs are sampled alongside traces using the same algorithms. In future releases, we will be enhancing these capabilities (details still TBD).
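As a rough illustration of what "sampled alongside traces" means in practice (not Refinery's implementation), a log record that carries a trace ID simply inherits the keep/drop decision made for that trace:

```go
package main

import "fmt"

// decisions records the sampling outcome per trace ID; a plain map here,
// standing in for whatever Refinery tracks internally.
var decisions = map[string]bool{
	"4bf92f3577b34da6a3ce929d0e0e4736": true, // this trace was kept
}

// keepLog applies the trace's decision to a log record carrying the same
// trace ID, so logs and spans for one trace share the same fate.
func keepLog(traceID string) bool {
	kept, known := decisions[traceID]
	return known && kept
}

func main() {
	fmt.Println(keepLog("4bf92f3577b34da6a3ce929d0e0e4736")) // true: kept with its trace
	fmt.Println(keepLog("deadbeef"))                          // false: no known decision in this toy example
}
```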