This repository has been archived by the owner on Oct 17, 2018. It is now read-only.
I mentioned this issue to you a while back. The aggregator currently marks a timestamp as "flushed" and persists that timestamp in KV as soon as the metrics with that timestamp have been either flushed to the backends (m3msg ingesters/indexers/etc) or written out to the TCP connection to other aggregation servers as forwarded metrics. However, without acknowledgements there is no reliable way to know whether the metrics have actually reached the receiving end. Marking tiles as completed can therefore be premature, which in turn can cause the followers to discard metrics too early and lead to data loss during server deployments.
With the integration of m3msg into m3aggregator, fixing this should be achievable. When a timestamp is flushed, it should not be marked as completed until the metrics associated with that timestamp have been acked on the other side (or dropped locally because the buffer is full), so that we can mark metrics as written with confidence.
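To make the proposal concrete, here is a minimal sketch of ack-gated completion. The `flushTracker` type and its methods are hypothetical names made up for illustration, not existing m3aggregator or m3msg APIs, and the sketch assumes timestamps are flushed in increasing order.

```go
package main

import (
	"fmt"
	"sync"
)

// flushTracker is a hypothetical helper (not an m3aggregator type). It only
// advances the "completed" timestamp once every message flushed for that
// timestamp, and for all earlier timestamps, has been acked or dropped.
type flushTracker struct {
	mu             sync.Mutex
	order          []int64       // flushed timestamps, assumed to arrive in increasing order
	pending        map[int64]int // timestamp -> messages not yet acked or dropped
	completedNanos int64         // highest timestamp that is safe to persist as completed
}

func newFlushTracker() *flushTracker {
	return &flushTracker{pending: make(map[int64]int)}
}

// advanceLocked moves the completed watermark past every leading timestamp
// that has no outstanding messages. The caller must hold t.mu.
func (t *flushTracker) advanceLocked() {
	for len(t.order) > 0 && t.pending[t.order[0]] == 0 {
		t.completedNanos = t.order[0]
		delete(t.pending, t.order[0])
		t.order = t.order[1:]
	}
}

// onFlush records that n messages for timestampNanos were handed to the writer.
func (t *flushTracker) onFlush(timestampNanos int64, n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.pending[timestampNanos]; !ok {
		t.order = append(t.order, timestampNanos)
	}
	t.pending[timestampNanos] += n
	t.advanceLocked()
}

// onAckOrDrop records that one message for timestampNanos was acked by the
// receiver or dropped locally, and returns the current completed watermark.
func (t *flushTracker) onAckOrDrop(timestampNanos int64) int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.pending[timestampNanos] > 0 {
		t.pending[timestampNanos]--
	}
	t.advanceLocked()
	return t.completedNanos
}

func main() {
	t := newFlushTracker()
	t.onFlush(10, 2)
	t.onFlush(20, 1)
	fmt.Println(t.onAckOrDrop(20)) // 0: timestamp 10 still has unacked messages
	fmt.Println(t.onAckOrDrop(10)) // 0: one message for timestamp 10 is outstanding
	fmt.Println(t.onAckOrDrop(10)) // 20: both timestamps are fully acked
}
```

In this sketch, the value returned by `onAckOrDrop` is what would be persisted in KV in place of the timestamp that is currently written at flush time.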
In the short term, a workaround to mitigate the issue for forwarded metrics could be for the follower to use `lastFlushedNanos - maxSingleDelay` as the target timestamp for discarding its metrics, since forwarded metrics would be rejected after `maxSingleDelay` anyway. Nonetheless, this is certainly not ideal, and using m3msg-based acks would be a much cleaner solution.
cc @cw9