Since people have been asking about the future of Refinery, here's a rough roadmap. Please note that there are no version numbers.
We're not expecting any new major releases of Refinery
The thing we had called "Refinery 3", which was being built to enable full dynamic autoscaling, turned out to be an experiment that won't be released. It worked well in smaller clusters and we were using it internally, but part of its design didn't scale well to large clusters. By taking it that far, though, we learned enough to roll the key functionality out in a series of minor releases that won't require breaking changes to configuration.
Dynamic Scaling is being released in smaller steps
The goal is to enable Refinery to scale dynamically. We've been accomplishing this in a series of steps, each of which ships as its own Refinery minor version.
Pubsub (Released in v2.7)
This release starts using a publish/subscribe (pubsub) system to allow the Refinery cluster to quickly communicate new information and thus act as a unit. Health, stress, peer membership, and config change notifications are all communicated through pubsub. The pubsub system is abstracted, but the first implementation is on Redis, since that's what Refinery was already using for peer communications. This release also includes readiness checks in addition to liveness checks, to play better with load balancers.
This work provides the foundation that will allow us to do dynamic scaling, and it also improves stability and data quality because the entire cluster reacts to changes as a single unit.
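For readers unfamiliar with the pattern, here is a minimal sketch of cluster-wide notification over Redis pubsub using the go-redis client. The channel name and payload are invented for this example and are not Refinery's actual internals:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Every node subscribes to a shared channel so the whole cluster hears
	// about health, stress, membership, and config changes at the same time.
	// The channel name "refinery-cluster" is made up for this sketch.
	sub := client.Subscribe(ctx, "refinery-cluster")
	defer sub.Close()

	// Wait for confirmation that the subscription exists before publishing.
	if _, err := sub.Receive(ctx); err != nil {
		panic(err)
	}

	// A node that detects a change publishes it once; every peer receives it.
	if err := client.Publish(ctx, "refinery-cluster", "config-updated").Err(); err != nil {
		panic(err)
	}

	// Each subscriber reacts to the notification (e.g. reloads its config).
	msg := <-sub.Channel()
	fmt.Println("cluster event:", msg.Payload)
}
```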
Smarter scaling (Released in v2.8)
When cluster size changes, the Refinery cluster redistributes spans in flight to the new correct nodes, which means that scaling events will not break traces in progress. This should allow Refinery to be scaled without significant disruption to telemetry.
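The redistribution relies on every node agreeing, deterministically, on which peer owns a given trace ID; when the peer list changes, each node can recompute ownership for the spans it holds and forward the ones whose owner has moved. Here's a deliberately simplified sketch of that idea (a plain hash over the peer list, standing in for Refinery's actual sharding code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerFor maps a trace ID onto one of the current peers. As long as every
// node agrees on the peer list, they all compute the same owner. This is a
// toy stand-in for Refinery's real sharder, not the actual algorithm.
func ownerFor(traceID string, peers []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return peers[h.Sum32()%uint32(len(peers))]
}

func main() {
	before := []string{"refinery-0", "refinery-1"}
	after := []string{"refinery-0", "refinery-1", "refinery-2"} // scale-up event

	traceID := "4bf92f3577b34da6a3ce929d0e0e4736"

	// When membership changes, each node recomputes ownership for the traces
	// it is holding and forwards any spans whose owner has moved.
	oldOwner := ownerFor(traceID, before)
	newOwner := ownerFor(traceID, after)
	if oldOwner != newOwner {
		fmt.Printf("trace %s moves from %s to %s\n", traceID, oldOwner, newOwner)
	} else {
		fmt.Printf("trace %s stays on %s\n", traceID, oldOwner)
	}
}
```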
Reduction in peer-to-peer data communications (Experimental in v2.9)
This is also known as "Elimination of Trace Locality".
When a span is received on Refinery A but Refinery B is the designated decider, instead of forwarding the entire span to B, Refinery A forwards only the key fields (those fields used in the sampling decision) to B and retains the original span. Once B makes the decision, it publishes the decision so that A can send or drop the span it retained, as appropriate. This should significantly reduce the amount of traffic traveling between peers, since only key fields and traceIDs are sent rather than entire spans.
An additional benefit is that no single Refinery node bears the burden of large traces; instead, memory usage across the entire cluster goes up and down together. This should help with autoscaling based on memory size.
However, we've made it an experimental feature in 2.9 because it requires a much larger Redis cluster to handle the trace traffic, and because we are still learning how it behaves at the largest scales.
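To illustrate the shape of the exchange (this is not Refinery's actual wire format; the type names below are invented), node A forwards only the trace ID and key fields, holds the span, and later applies the decision that B publishes:

```go
package main

import "fmt"

// KeyFieldsMsg is what node A forwards to the decider (node B): only the
// trace ID and the fields the sampling rules need, never the whole span.
type KeyFieldsMsg struct {
	TraceID   string
	KeyFields map[string]any
}

// DecisionMsg is what the decider publishes back to the cluster once the
// sampling decision has been made.
type DecisionMsg struct {
	TraceID    string
	Keep       bool
	SampleRate uint
}

func main() {
	// Node A receives a span, extracts only the key fields, and holds the span.
	fwd := KeyFieldsMsg{
		TraceID:   "4bf92f3577b34da6a3ce929d0e0e4736",
		KeyFields: map[string]any{"http.status_code": 500, "service.name": "checkout"},
	}
	fmt.Printf("A -> B: %+v\n", fwd)

	// Node B runs its sampling rules against the key fields and publishes the
	// decision; A then sends or drops the span it retained.
	decision := DecisionMsg{TraceID: fwd.TraceID, Keep: true, SampleRate: 10}
	fmt.Printf("B -> cluster: %+v\n", decision)
}
```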
Autoscaling
We are hopeful that, with all of the work listed above, it will then be possible to autoscale Refinery in Kubernetes based on memory usage. But we won't know for sure until we get there, so more to come as we learn more.
Log Sampling
As of v2.6, OTel logs containing traceIDs are sampled alongside traces using the same algorithms. In future releases, we will be enhancing these capabilities (details still TBD).
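As a rough illustration of what "sampled alongside traces" means in practice (not Refinery's implementation), a log record that carries a trace ID simply inherits the keep/drop decision made for that trace:

```go
package main

import "fmt"

// decisions records the sampling outcome per trace ID; a plain map here,
// standing in for whatever Refinery tracks internally.
var decisions = map[string]bool{
	"4bf92f3577b34da6a3ce929d0e0e4736": true, // this trace was kept
}

// keepLog applies the trace's decision to a log record carrying the same
// trace ID, so logs and spans for one trace share the same fate.
func keepLog(traceID string) bool {
	kept, known := decisions[traceID]
	return known && kept
}

func main() {
	fmt.Println(keepLog("4bf92f3577b34da6a3ce929d0e0e4736")) // true: kept with its trace
	fmt.Println(keepLog("deadbeef"))                          // false: no known decision in this toy example
}
```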