
feat: Add replicator retry #3107

Merged

Conversation

fredcarle
Collaborator

@fredcarle fredcarle commented Oct 7, 2024

Relevant issue(s)

Resolves #3072

Description

This PR adds replicator retry functionality to the database. Retries use exponential backoff until the interval reaches 32 minutes, after which the database keeps retrying every 32 minutes until the inactive peer is removed from the list of replicators.
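As a rough illustration of the schedule, here is a minimal sketch (the intervals below are the ones added in this PR; the helper function and loop are illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"time"
)

// The backoff schedule described above: exponential up to 32 minutes, then
// every retry waits 32 minutes until the peer is removed from the replicators.
var retryIntervals = []time.Duration{
	30 * time.Second,
	time.Minute,
	2 * time.Minute,
	4 * time.Minute,
	8 * time.Minute,
	16 * time.Minute,
	32 * time.Minute,
}

// nextInterval returns the wait before retry attempt n (0-indexed),
// clamping to the final 32-minute interval once the schedule is exhausted.
func nextInterval(n int) time.Duration {
	if n >= len(retryIntervals) {
		return retryIntervals[len(retryIntervals)-1]
	}
	return retryIntervals[n]
}

func main() {
	for n := 0; n < 9; n++ {
		fmt.Printf("attempt %d: wait %s\n", n+1, nextInterval(n))
	}
}
```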

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test

Specify the platform(s) on which this was tested:

  • macOS

@fredcarle fredcarle added feature New feature or request area/p2p Related to the p2p networking system labels Oct 7, 2024
@fredcarle fredcarle added this to the DefraDB v0.14 milestone Oct 7, 2024
@fredcarle fredcarle requested a review from a team October 7, 2024 03:20
@fredcarle fredcarle self-assigned this Oct 7, 2024
@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 56d1f43 to 453f002 Compare October 7, 2024 03:23
Comment on lines +33 to +40
// exponential backoff retry intervals
time.Second * 30,
time.Minute,
time.Minute * 2,
time.Minute * 4,
time.Minute * 8,
time.Minute * 16,
time.Minute * 32,
Member

question: Interesting strat! Do we want the retry intervals to be configurable?

Collaborator Author
@fredcarle fredcarle Oct 7, 2024

I have a ticket to make the retry configurable: #3073. It is already configurable if devs use the Go API, though.
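For example, something along these lines might be possible from Go. Note that the option name `WithRetryInterval` and the constructor signature shown here are assumptions for illustration, not confirmed by this thread:

```go
// Hypothetical sketch only: the option name and constructor signature are
// assumptions; check the actual Go API before relying on this.
database, err := db.NewDB(
	ctx,
	rootstore,
	db.WithRetryInterval([]time.Duration{10 * time.Second, time.Minute}), // assumed custom schedule
)
if err != nil {
	return err
}
_ = database
```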

@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 453f002 to a64e3dc Compare October 7, 2024 03:30
Comment on lines +738 to +741
nodeIndex := i
if action.NodeID.HasValue() {
nodeIndex = action.NodeID.Value()
}
Member
@shahzadlone shahzadlone Oct 7, 2024

info: When I fix getNodes here (#3076), we won't have to do this hack anymore. I am guessing this is because the nodeIndex (0) is wrong when there is a non-zero nodeID specified, correct?

Collaborator Author

Yes, that's pretty much it.


codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 67.10875% with 124 lines in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (24a479f) to head (ccfbc27).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
internal/db/p2p_replicator.go 57.93% 78 Missing and 36 partials ⚠️
internal/core/key.go 75.76% 6 Missing and 2 partials ⚠️
http/client.go 0.00% 1 Missing ⚠️
http/client_tx.go 0.00% 1 Missing ⚠️
Additional details and impacted files


@@             Coverage Diff             @@
##           develop    #3107      +/-   ##
===========================================
- Coverage    80.12%   80.00%   -0.12%     
===========================================
  Files          353      353              
  Lines        28175    28466     +291     
===========================================
+ Hits         22574    22772     +198     
- Misses        4019     4081      +62     
- Partials      1582     1613      +31     
Flag Coverage Δ
all-tests 80.00% <67.11%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
client/db.go 91.30% <ø> (ø)
datastore/multi.go 100.00% <100.00%> (+8.00%) ⬆️
event/event.go 100.00% <ø> (ø)
internal/db/collection.go 73.44% <ø> (-0.05%) ⬇️
internal/db/config.go 100.00% <100.00%> (ø)
internal/db/db.go 69.88% <100.00%> (+3.42%) ⬆️
internal/db/errors.go 64.15% <ø> (ø)
internal/db/messages.go 100.00% <100.00%> (+5.56%) ⬆️
net/client.go 100.00% <100.00%> (ø)
net/peer.go 79.38% <100.00%> (-1.14%) ⬇️
... and 5 more

... and 12 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor
@AndrewSisley AndrewSisley left a comment

It looks good, but I have a handful of important todos for you whilst I continue my review.

func (s *server) pushLog(evt event.Update, pid peer.ID) error {
func (s *server) pushLog(evt event.Update, pid peer.ID) (err error) {
defer func() {
if err != nil && !evt.IsRetry {
Contributor

todo: Please document why we are not publishing if it is a retry event.

Member

thought: the failure event could be used to queue another retry instead of using a channel on the event

Collaborator Author

thought: the failure event could be used to queue another retry instead of using a channel on the event

The retry interval is set per peer, not per update. This is why the channel on the event is used to signal whether the retry was successful.
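A minimal sketch of the pattern being described here, with a channel on the event used to report the outcome back to the per-peer retry loop (the type and field names are illustrative, not the PR's actual definitions):

```go
// retryRequest is an illustrative stand-in for the retry event: the retry
// loop sends one per document, and the network layer reports the outcome
// back on Success so the per-peer backoff state can be advanced or reset.
type retryRequest struct {
	PeerID  string
	DocID   string
	Success chan bool
}

// retryDoc asks the network layer to re-push one document and waits for the result.
func retryDoc(requests chan<- retryRequest, peerID, docID string) bool {
	done := make(chan bool, 1)
	requests <- retryRequest{PeerID: peerID, DocID: docID, Success: done}
	return <-done // true if the push to the peer succeeded
}
```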

Creator: p.host.ID().String(),
Block: evt.Block,
}
if !evt.IsRetry {
Contributor

todo: Please document why we are not publishing if it is a retry event.

p.server.mu.Lock()
reps, exists := p.server.replicators[lg.SchemaRoot]
p.server.mu.Unlock()

if exists {
for pid := range reps {
// Don't push if pid is in the list of peers for the topic.
// It will be handled by the pubsub system.
if _, ok := peers[pid.String()]; ok {
Contributor

question: Why has this been removed?

Collaborator Author

Because the pubsub system offers fewer guarantees than the direct-to-peer replicator system. There are no real downsides to having both reach the receiving peer. Also, if we rely on the pubsub system for updates, it will increase the difficulty of keeping track of failures for retry.

Contributor

Makes sense, thanks for the explanation :)

net/server.go Outdated
if s.peer.ps == nil { // skip if we aren't running with a pubsub net
return nil
}
s.mu.Lock()
t, ok := s.topics[topic]
s.mu.Unlock()
if !ok {
err := s.addPubSubTopic(topic, false, nil)
subscribe := false
Contributor

suggestion: The below is a little simpler IMO.

subscribe := topic != req.SchemaRoot && !s.hasPubSubTopic(req.SchemaRoot)

Collaborator Author

I agree. It's left over from a change I had made and didn't clean up. Thanks for pointing it out.

@@ -585,3 +585,73 @@ func TestP2POneToOneReplicatorOrderIndependentDirectCreate(t *testing.T) {

testUtils.ExecuteTestCase(t, test)
}

func TestP2POneToOneReplicator_ManyDocsWithTargetNodeTemporarilyOffline_ShouldSucceed(t *testing.T) {
Contributor

praise: Thanks for this test, and the new utils stuff required to get it working; it is very easy to read.


retryIntervals []time.Duration
retryChan chan event.ReplicatorFailure
retryDone chan retryStatus
Contributor

suggestion: db is quite a busy type, it might be worth rolling these 3 props up into a new type.
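A sketch of what that grouping might look like, using the three fields quoted above (the type name is illustrative only):

```go
// retryState groups the replicator-retry plumbing behind a single field on db
// instead of three separate properties. Illustrative sketch only.
type retryState struct {
	intervals []time.Duration              // backoff schedule, capped at the last entry
	failures  chan event.ReplicatorFailure // incoming replicator failure notifications
	done      chan retryStatus             // results of completed retry attempts
}
```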

Collaborator Author

"github.com/sourcenetwork/defradb/internal/merkle/clock"
)

const (
retryLoopInterval = 2 * time.Second
retryTimeout = 10 * time.Second
Contributor

todo: Without chasing down the usage of these properties it is quite hard to guess their difference. Please add some documentation to them.
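For illustration, this is roughly the kind of documentation being asked for; the descriptions are guesses at each constant's purpose, not taken from the PR:

```go
const (
	// retryLoopInterval is (assumed) how often the retry loop wakes up to check
	// whether any inactive replicators are due for another retry attempt.
	retryLoopInterval = 2 * time.Second

	// retryTimeout is (assumed) the maximum time a single retry push to a peer
	// may take before it is treated as a failure.
	retryTimeout = 10 * time.Second
)
```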

Collaborator Author

}
}

func (db *db) handleReplicatorFailure(ctx context.Context, r event.ReplicatorFailure) error {
Contributor

todo: It looks like the stuff done in this function (and child calls) should be protected by an internal transaction, otherwise we will have partial successes.

suggestion: When introducing the txn, I suggest not hosting the child functions on db as it makes it quite easy to accidentally bypass the txn, especially in the short term when we have no tests protecting against this.

Same comments apply to retryReplicators, and the r.Success half of handleCompletedReplicatorRetry.
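A rough sketch of the transaction-scoping idea, with the child helpers taking the txn explicitly so they cannot bypass it. The helper names and the txn API shown here are assumptions, not the PR's actual code:

```go
// Sketch only: scope the failure handling to one transaction so the
// bookkeeping either fully persists or not at all. Names are assumed.
func (db *db) handleReplicatorFailure(ctx context.Context, r event.ReplicatorFailure) error {
	txn, err := db.NewTxn(ctx, false) // assumed internal txn constructor
	if err != nil {
		return err
	}
	defer txn.Discard(ctx)

	// Child helpers are free functions taking txn, per the suggestion above,
	// so callers cannot accidentally operate outside the transaction.
	if err := setReplicatorStatus(ctx, txn, r.PeerID, false); err != nil {
		return err
	}
	if err := createReplicatorRetryIfNotExists(ctx, txn, r.PeerID); err != nil {
		return err
	}
	return txn.Commit(ctx)
}
```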

if err != nil {
return err
}
if !exists {
Contributor

suggestion: Inverting this if would remove a level of indentation (and complexity) from the bulk of this function:

if exists {
  return
}
r := ...

Collaborator Author

rInfo := retryInfo{}
err = cbor.Unmarshal(result.Value, &rInfo)
if err != nil {
log.ErrorContextE(ctx, "Failed to unmarshal replicator retry info", err)
Contributor

thought: If this error is ever hit, it seems likely that it will not be the only record (programming error). I worry that if we do not delete the record in this block we will have a very rapidly growing database and log file.
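A sketch of the mitigation being suggested: delete the unreadable record so it is not reprocessed on every loop iteration (the Delete call and key construction are assumptions layered on the quoted snippet):

```go
rInfo := retryInfo{}
err = cbor.Unmarshal(result.Value, &rInfo)
if err != nil {
	log.ErrorContextE(ctx, "Failed to unmarshal replicator retry info", err)
	// Sketch: drop the corrupt record instead of leaving it to grow the
	// database and log file. The Delete call and key type are assumed.
	if delErr := db.Peerstore().Delete(ctx, ds.NewKey(result.Key)); delErr != nil {
		log.ErrorContextE(ctx, "Failed to delete corrupt replicator retry info", delErr)
	}
	continue
}
```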

Collaborator Author

Contributor
@AndrewSisley AndrewSisley left a comment

Sorry about the bother, overall the PR looks good but I think the retry logic needs a little bit of work/documentation before it can be merged.


retryIntervals []time.Duration
retryChan chan event.ReplicatorFailure
retryDone chan retryStatus
Contributor

todo: Please document the retry channels, it is not easy to understand how they work atm.

Collaborator Author

return db.Peerstore().Put(ctx, key.ToDS(), b)
}

func (db *db) retryReplicator(ctx context.Context, peerID string) {
Contributor

todo: The stuff within this function should be protected by a transaction scoped to this function, otherwise it can partially succeed.

thought: Partial success in this function may be desirable, in which case my preference is still to make that explicit through code, if not please document it.

log.ErrorContextE(ctx, "Failed to delete retry docID", err)
}
}
db.retryDone <- retryStatus{
Contributor

todo: I cannot guess why retryDone was handled via a channel - it looks like it is only ever written to from this function and atm could be changed to a simple function call.

If it has a good reason to be done like this please document it.

todo: It looks like you have a concurrency bug here. Because 'done' is processed in the same loop that also reads from the retryChan channel, you appear to have created a situation where retryDone is written to, then a new retryChan record for the same peer is written and processed (overwriting the retryInfo record), and then the retryDone item is processed, deleting the re-written retryInfo and preventing it from being queried and its docs retried.

switch active {
case true:
rep.Status = client.ReplicatorStatusActive
if rep.Status == client.ReplicatorStatusInactive {
Member

todo: this will never evaluate to true

Collaborator Author

Good catch.

}
case false:
rep.Status = client.ReplicatorStatusInactive
if rep.Status == client.ReplicatorStatusActive {
Member

todo: this will never evaluate to true
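For reference, one way to fix the ordering flagged in both branches is to capture the previous status before overwriting it (illustrative sketch only):

```go
// Sketch: remember the previous status so the "did it change?" check
// compares against the old value rather than the one just assigned.
wasActive := rep.Status == client.ReplicatorStatusActive

if active {
	rep.Status = client.ReplicatorStatusActive
	if !wasActive {
		// peer transitioned inactive -> active, e.g. clear retry bookkeeping
	}
} else {
	rep.Status = client.ReplicatorStatusInactive
	if wasActive {
		// peer transitioned active -> inactive, e.g. schedule a retry
	}
}
```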

func (s *server) pushLog(evt event.Update, pid peer.ID) error {
func (s *server) pushLog(evt event.Update, pid peer.ID) (err error) {
defer func() {
if err != nil && !evt.IsRetry {
Member

thought: the failure event could be used to queue another retry instead of using a channel on the event

@shahzadlone
Member

todo: you need to change the marked issue number in the description; it's linked to #3070, which is not accurate.

@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 5d16912 to c603799 Compare October 11, 2024 16:24
@fredcarle fredcarle force-pushed the fredcarle/feat/3072-replicator-retry branch from 01d7ff5 to ccfbc27 Compare October 11, 2024 17:25
Contributor
@AndrewSisley AndrewSisley left a comment

praise: I found this much easier to read, the retry stuff seems more linear, and the documentation is very clear.

It also looks like it will make it fairly straightforward to adjust the way new retries and retry processing are handled/queued in the future, should we choose to.

Thanks Fred :)

@fredcarle fredcarle merged commit 858f4f1 into sourcenetwork:develop Oct 11, 2024
42 of 43 checks passed
@fredcarle fredcarle deleted the fredcarle/feat/3072-replicator-retry branch October 11, 2024 17:42
Labels
area/p2p Related to the p2p networking system feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replicator basic retry
4 participants