
feat: Add replicator retry #3107

Merged

Conversation

fredcarle
Collaborator

@fredcarle fredcarle commented Oct 7, 2024

Relevant issue(s)

Resolves #3072

Description

This PR adds replicator retry functionality to the database. Retries use exponential backoff until the interval reaches 32 minutes, after which the database keeps retrying every 32 minutes until the inactive peer is removed from the list of replicators.
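As a rough illustration of the schedule, here is a minimal sketch (the intervals below are the ones added in this PR; the helper function and loop are illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"time"
)

// The backoff schedule described above: exponential up to 32 minutes, then
// every retry waits 32 minutes until the peer is removed from the replicators.
var retryIntervals = []time.Duration{
	30 * time.Second,
	time.Minute,
	2 * time.Minute,
	4 * time.Minute,
	8 * time.Minute,
	16 * time.Minute,
	32 * time.Minute,
}

// nextInterval returns the wait before retry attempt n (0-indexed),
// clamping to the final 32-minute interval once the schedule is exhausted.
func nextInterval(n int) time.Duration {
	if n >= len(retryIntervals) {
		return retryIntervals[len(retryIntervals)-1]
	}
	return retryIntervals[n]
}

func main() {
	for n := 0; n < 9; n++ {
		fmt.Printf("attempt %d: wait %s\n", n+1, nextInterval(n))
	}
}
```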

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test

Specify the platform(s) on which this was tested:

  • macOS

@fredcarle fredcarle added feature New feature or request area/p2p Related to the p2p networking system labels Oct 7, 2024
@fredcarle fredcarle added this to the DefraDB v0.14 milestone Oct 7, 2024
@fredcarle fredcarle requested a review from a team October 7, 2024 03:20
@fredcarle fredcarle self-assigned this Oct 7, 2024
@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 56d1f43 to 453f002 Compare October 7, 2024 03:23
Comment on lines +33 to +40
// exponential backoff retry intervals
time.Second * 30,
time.Minute,
time.Minute * 2,
time.Minute * 4,
time.Minute * 8,
time.Minute * 16,
time.Minute * 32,
Member

question: Interesting strat! Do we want the retry intervals to be configurable?

Collaborator Author
@fredcarle fredcarle Oct 7, 2024

I have a ticket to make the retry configurable: #3073. It is already configurable if devs use the Go API, though.
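For example, something along these lines might be possible from Go. Note that the option name `WithRetryInterval` and the constructor signature shown here are assumptions for illustration, not confirmed by this thread:

```go
// Hypothetical sketch only: the option name and constructor signature are
// assumptions; check the actual Go API before relying on this.
database, err := db.NewDB(
	ctx,
	rootstore,
	db.WithRetryInterval([]time.Duration{10 * time.Second, time.Minute}), // assumed custom schedule
)
if err != nil {
	return err
}
_ = database
```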

@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 453f002 to a64e3dc Compare October 7, 2024 03:30
Comment on lines +738 to +741
nodeIndex := i
if action.NodeID.HasValue() {
nodeIndex = action.NodeID.Value()
}
Member
@shahzadlone shahzadlone Oct 7, 2024

info: When I fix getNodes here (#3076), we won't have to do this hack anymore. I am guessing this is because the nodeIndex (0) is wrong when there is a non-zero nodeID specified, correct?

Collaborator Author

Yes, that's pretty much it.


codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 67.10875% with 124 lines in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (24a479f) to head (ccfbc27).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
internal/db/p2p_replicator.go 57.93% 78 Missing and 36 partials ⚠️
internal/core/key.go 75.76% 6 Missing and 2 partials ⚠️
http/client.go 0.00% 1 Missing ⚠️
http/client_tx.go 0.00% 1 Missing ⚠️
Additional details and impacted files


@@             Coverage Diff             @@
##           develop    #3107      +/-   ##
===========================================
- Coverage    80.12%   80.00%   -0.12%     
===========================================
  Files          353      353              
  Lines        28175    28466     +291     
===========================================
+ Hits         22574    22772     +198     
- Misses        4019     4081      +62     
- Partials      1582     1613      +31     
Flag Coverage Δ
all-tests 80.00% <67.11%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
client/db.go 91.30% <ø> (ø)
datastore/multi.go 100.00% <100.00%> (+8.00%) ⬆️
event/event.go 100.00% <ø> (ø)
internal/db/collection.go 73.44% <ø> (-0.05%) ⬇️
internal/db/config.go 100.00% <100.00%> (ø)
internal/db/db.go 69.88% <100.00%> (+3.42%) ⬆️
internal/db/errors.go 64.15% <ø> (ø)
internal/db/messages.go 100.00% <100.00%> (+5.56%) ⬆️
net/client.go 100.00% <100.00%> (ø)
net/peer.go 79.38% <100.00%> (-1.14%) ⬇️
... and 5 more

... and 12 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor
@AndrewSisley AndrewSisley left a comment

It looks good, but I have a handful of important todos for you whilst I continue my review.

func (s *server) pushLog(evt event.Update, pid peer.ID) error {
func (s *server) pushLog(evt event.Update, pid peer.ID) (err error) {
defer func() {
if err != nil && !evt.IsRetry {
Contributor

todo: Please document why we are not publishing if it is a retry event.

Member

thought: the failure event could be used to queue another retry instead of using a channel on the event

Collaborator Author

thought: the failure event could be used to queue another retry instead of using a channel on the event

The retry interval is set per peer, not per update. This is why the channel on the event is used to signal whether the retry was successful.
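A minimal sketch of the pattern being described here, with a channel on the event used to report the outcome back to the per-peer retry loop (the type and field names are illustrative, not the PR's actual definitions):

```go
// retryRequest is an illustrative stand-in for the retry event: the retry
// loop sends one per document, and the network layer reports the outcome
// back on Success so the per-peer backoff state can be advanced or reset.
type retryRequest struct {
	PeerID  string
	DocID   string
	Success chan bool
}

// retryDoc asks the network layer to re-push one document and waits for the result.
func retryDoc(requests chan<- retryRequest, peerID, docID string) bool {
	done := make(chan bool, 1)
	requests <- retryRequest{PeerID: peerID, DocID: docID, Success: done}
	return <-done // true if the push to the peer succeeded
}
```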

Creator: p.host.ID().String(),
Block: evt.Block,
}
if !evt.IsRetry {
Contributor

todo: Please document why we are not publishing if it is a retry event.

p.server.mu.Lock()
reps, exists := p.server.replicators[lg.SchemaRoot]
p.server.mu.Unlock()

if exists {
for pid := range reps {
// Don't push if pid is in the list of peers for the topic.
// It will be handled by the pubsub system.
if _, ok := peers[pid.String()]; ok {
Contributor

question: Why has this been removed?

Collaborator Author

Because the pubsub system offers fewer guarantees than the direct-to-peer replicator system. There are no real downsides to having both reach the receiving peer. Also, if we rely on the pubsub system for updates, it will increase the difficulty of keeping track of failures for retry.

Contributor

Makes sense, thanks for the explanation :)

net/server.go Outdated
if s.peer.ps == nil { // skip if we aren't running with a pubsub net
return nil
}
s.mu.Lock()
t, ok := s.topics[topic]
s.mu.Unlock()
if !ok {
err := s.addPubSubTopic(topic, false, nil)
subscribe := false
Contributor

suggestion: The below is a little simpler IMO.

subscribe := topic != req.SchemaRoot && !s.hasPubSubTopic(req.SchemaRoot)

Collaborator Author

I agree. It's left over from a change I had made and didn't clean up. Thanks for pointing it out.

@@ -585,3 +585,73 @@ func TestP2POneToOneReplicatorOrderIndependentDirectCreate(t *testing.T) {

testUtils.ExecuteTestCase(t, test)
}

func TestP2POneToOneReplicator_ManyDocsWithTargetNodeTemporarilyOffline_ShouldSucceed(t *testing.T) {
Contributor

praise: Thanks for this test, and the new utils stuff required to get it working; it is very easy to read.


retryIntervals []time.Duration
retryChan chan event.ReplicatorFailure
retryDone chan retryStatus
Contributor

suggestion: db is quite a busy type, it might be worth rolling these 3 props up into a new type.
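A sketch of what that grouping might look like, using the three fields quoted above (the type name is illustrative only):

```go
// retryState groups the replicator-retry plumbing behind a single field on db
// instead of three separate properties. Illustrative sketch only.
type retryState struct {
	intervals []time.Duration              // backoff schedule, capped at the last entry
	failures  chan event.ReplicatorFailure // incoming replicator failure notifications
	done      chan retryStatus             // results of completed retry attempts
}
```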

Collaborator Author

"github.com/sourcenetwork/defradb/internal/merkle/clock"
)

const (
retryLoopInterval = 2 * time.Second
retryTimeout = 10 * time.Second
Contributor

todo: Without chasing down the usage of these properties it is quite hard to guess their difference. Please add some documentation to them.
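For illustration, this is roughly the kind of documentation being asked for; the descriptions are guesses at each constant's purpose, not taken from the PR:

```go
const (
	// retryLoopInterval is (assumed) how often the retry loop wakes up to check
	// whether any inactive replicators are due for another retry attempt.
	retryLoopInterval = 2 * time.Second

	// retryTimeout is (assumed) the maximum time a single retry push to a peer
	// may take before it is treated as a failure.
	retryTimeout = 10 * time.Second
)
```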

Collaborator Author

}
}

func (db *db) handleReplicatorFailure(ctx context.Context, r event.ReplicatorFailure) error {
Contributor

todo: It looks like the stuff done in this function (and child calls) should be protected by an internal transaction, otherwise we will have partial successes.

suggestion: When introducing the txn, I suggest not hosting the child functions on db as it makes it quite easy to accidentally bypass the txn, especially in the short term when we have no tests protecting against this.

Same comments apply to retryReplicators, and the r.Success half of handleCompletedReplicatorRetry.
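A rough sketch of the transaction-scoping idea, with the child helpers taking the txn explicitly so they cannot bypass it. The helper names and the txn API shown here are assumptions, not the PR's actual code:

```go
// Sketch only: scope the failure handling to one transaction so the
// bookkeeping either fully persists or not at all. Names are assumed.
func (db *db) handleReplicatorFailure(ctx context.Context, r event.ReplicatorFailure) error {
	txn, err := db.NewTxn(ctx, false) // assumed internal txn constructor
	if err != nil {
		return err
	}
	defer txn.Discard(ctx)

	// Child helpers are free functions taking txn, per the suggestion above,
	// so callers cannot accidentally operate outside the transaction.
	if err := setReplicatorStatus(ctx, txn, r.PeerID, false); err != nil {
		return err
	}
	if err := createReplicatorRetryIfNotExists(ctx, txn, r.PeerID); err != nil {
		return err
	}
	return txn.Commit(ctx)
}
```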

if err != nil {
return err
}
if !exists {
Contributor

suggestion: Inverting this if would remove a level of indentation (and complexity) from the bulk of this function:

if exists {
  return
}
r := ...

Collaborator Author

rInfo := retryInfo{}
err = cbor.Unmarshal(result.Value, &rInfo)
if err != nil {
log.ErrorContextE(ctx, "Failed to unmarshal replicator retry info", err)
Contributor

thought: If this error is ever hit, it seems likely that it will not be the only record (programming error). I worry that if we do not delete the record in this block we will have a very rapidly growing database and log file.
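A sketch of the mitigation being suggested: delete the unreadable record so it is not reprocessed on every loop iteration (the Delete call and key construction are assumptions layered on the quoted snippet):

```go
rInfo := retryInfo{}
err = cbor.Unmarshal(result.Value, &rInfo)
if err != nil {
	log.ErrorContextE(ctx, "Failed to unmarshal replicator retry info", err)
	// Sketch: drop the corrupt record instead of leaving it to grow the
	// database and log file. The Delete call and key type are assumed.
	if delErr := db.Peerstore().Delete(ctx, ds.NewKey(result.Key)); delErr != nil {
		log.ErrorContextE(ctx, "Failed to delete corrupt replicator retry info", delErr)
	}
	continue
}
```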

Collaborator Author

Contributor
@AndrewSisley AndrewSisley left a comment

Sorry about the bother, overall the PR looks good but I think the retry logic needs a little bit of work/documentation before it can be merged.


retryIntervals []time.Duration
retryChan chan event.ReplicatorFailure
retryDone chan retryStatus
Contributor

todo: Please document the retry channels, it is not easy to understand how they work atm.

Collaborator Author

return db.Peerstore().Put(ctx, key.ToDS(), b)
}

func (db *db) retryReplicator(ctx context.Context, peerID string) {
Contributor

todo: The stuff within this function should be protected by a transaction scoped to this function, otherwise it can partially succeed.

thought: Partial success in this function may be desirable, in which case my preference is still to make that explicit through code, if not please document it.

log.ErrorContextE(ctx, "Failed to delete retry docID", err)
}
}
db.retryDone <- retryStatus{
Contributor

todo: I cannot guess why retryDone was handled via a channel - it looks like it is only ever written to from this function and atm could be changed to a simple function call.

If it has a good reason to be done like this please document it.

todo: It looks like you have a concurrency bug here. Because 'done' is processed in the same loop that also reads from the retryChan channel, you appear to have created a situation where retryDone is written to, then a new retryChan record for the same peer is written and processed (overwriting the retryInfo record), and then the retryDone item is processed, deleting the re-written retryInfo and preventing it from being queried and its docs retried.

switch active {
case true:
rep.Status = client.ReplicatorStatusActive
if rep.Status == client.ReplicatorStatusInactive {
Member

todo: this will never evaluate to true

Collaborator Author

Good catch.

}
case false:
rep.Status = client.ReplicatorStatusInactive
if rep.Status == client.ReplicatorStatusActive {
Member

todo: this will never evaluate to true
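For reference, one way to fix the ordering flagged in both branches is to capture the previous status before overwriting it (illustrative sketch only):

```go
// Sketch: remember the previous status so the "did it change?" check
// compares against the old value rather than the one just assigned.
wasActive := rep.Status == client.ReplicatorStatusActive

if active {
	rep.Status = client.ReplicatorStatusActive
	if !wasActive {
		// peer transitioned inactive -> active, e.g. clear retry bookkeeping
	}
} else {
	rep.Status = client.ReplicatorStatusInactive
	if wasActive {
		// peer transitioned active -> inactive, e.g. schedule a retry
	}
}
```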

func (s *server) pushLog(evt event.Update, pid peer.ID) error {
func (s *server) pushLog(evt event.Update, pid peer.ID) (err error) {
defer func() {
if err != nil && !evt.IsRetry {
Member

thought: the failure event could be used to queue another retry instead of using a channel on the event

@shahzadlone
Member

todo: you need to change the marked issue number in the description; it's linked to #3070, which is not accurate.

@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 5d16912 to c603799 Compare October 11, 2024 16:24
@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 01d7ff5 to ccfbc27 Compare October 11, 2024 17:25
Contributor
@AndrewSisley AndrewSisley left a comment

praise: I found this much easier to read, the retry stuff seems more linear, and the documentation is very clear.

It also looks like it will make it fairly straightforward to adjust the way new retries and retry processing are handled/queued in the future, should we choose to.

Thanks Fred :)

@fredcarle fredcarle merged commit 858f4f1 into sourcenetwork:develop Oct 11, 2024
42 of 43 checks passed
@fredcarle fredcarle deleted the fredcarle/feat/3072-replicator-retry branch October 11, 2024 17:42
Labels
area/p2p Related to the p2p networking system feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replicator basic retry
4 participants