Full rework of the BlockFetch logic for bulk sync mode #1179

Niols · 2024-07-03T06:38:11Z

Integrates a new implementation of the BulkSync mode, where blocks are downloaded from alternative peers as soon as the node has no more blocks to validate while there are longstanding requests in flight.

This PR depends on the new implementation of the BulkSync mode (IntersectMBO/ouroboros-network#4919). cabal.project is made to point to a back-port of the BulkSync implementation on ouroboros-network-0.16.1.1.

CSJ Changes

CSJ is involved because the new BulkSync mode requires changing the dynamo if it is also serving blocks and is not sending them promptly enough. The choice of dynamo influences which blocks BlockFetch chooses to download.

To this end, b93c379 makes it possible to order the ChainSync clients, so that the dynamo role can be rotated among them whenever BlockFetch requests it.

b1c0bf8 provides the implementation of the rotation operation.
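A minimal sketch of the rotation idea, not the code from b1c0bf8. It assumes the ChainSync clients are kept in an ordered list and that rotating demotes the current dynamo, moves it to the back, and promotes the first remaining peer. The names `Role`, `Peers`, and `rotateDynamoSketch` are hypothetical.

```haskell
-- Hypothetical model: peers in rotation order, each tagged with a role.
data Role = Dynamo | Jumper
  deriving (Eq, Show)

type Peers peer = [(peer, Role)]

-- | If @p@ is currently the dynamo, demote it to a jumper, move it to the
-- back of the order, and promote the first remaining peer. Otherwise leave
-- the collection unchanged.
rotateDynamoSketch :: Eq peer => peer -> Peers peer -> Peers peer
rotateDynamoSketch p peers
  | (p, Dynamo) `elem` peers =
      case filter ((/= p) . fst) peers of
        []              -> peers  -- nobody else to promote
        ((q, _) : rest) -> (q, Dynamo) : rest ++ [(p, Jumper)]
  | otherwise = peers
```

Keeping the order explicit is what makes the rotation fair in this model: a demoted peer only becomes dynamo again after every other peer has had a turn.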

BlockFetch tests

c4bfa37 allows tests to specify the order in which peers are started, which affects which peer is chosen as the initial dynamo.

c594c09 in turn adds a new BlockFetch test to show that syncing isn't slowed down by peers that don't send blocks.

Integration of BlockFetch changes

The collection of ChainSync client handles now needs to be passed between BlockFetch and ChainSync so dynamo rotations can be requested by BlockFetch.

The parameter bfcMaxConcurrencyBulkSync has been removed, since block downloads are not coordinated to run concurrently.

These changes are in 6926278.

ChainSel changes

BlockFetch now needs to detect when ChainSel has run out of blocks to validate. This motivates 73187ba, which implements a mechanism to measure whether ChainSel is waiting for more blocks (that is, whether it starves) and for how long.

The above change alone is not sufficient to measure starvation. The queue used to send blocks for validation previously held at most one block. This would interfere with measuring starvation, since BlockFetch would block waiting for the queue to become empty, and the queue would quickly become empty again after taking just one block. A larger queue capacity was needed to amortize download delays. This is why a fix to IntersectMBO/ouroboros-network#2721 was ported in 0d3fc28.
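A minimal sketch of starvation tracking as a pure state machine, assuming timestamps in seconds. This is an illustration of the idea, not the mechanism from 73187ba; `QueueEvent`, `Starvation`, `step`, and `starvedFor` are hypothetical names.

```haskell
-- | Hypothetical events observed on the validation queue, with timestamps
-- in seconds.
data QueueEvent
  = RanEmpty Double      -- ChainSel found the queue empty at this time
  | BlockArrived Double  -- a block was enqueued at this time

-- | Hypothetical starvation state.
data Starvation
  = NotStarved
  | StarvedSince Double  -- time at which the queue ran empty
  deriving (Eq, Show)

-- | Update the state on each event. An ongoing starvation keeps its
-- original start time; any arriving block ends it.
step :: Starvation -> QueueEvent -> Starvation
step NotStarved (RanEmpty t)  = StarvedSince t
step st         (RanEmpty _)  = st
step _          (BlockArrived _) = NotStarved

-- | How long ChainSel has been starved at time @now@.
starvedFor :: Double -> Starvation -> Double
starvedFor _   NotStarved       = 0
starvedFor now (StarvedSince t) = max 0 (now - t)
```

In this model, a one-block queue would emit `RanEmpty` after nearly every block, so the state would flip constantly and say little about download delays. A larger queue makes `RanEmpty` a meaningful signal.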

Miscellaneous fixes

CSJ jump size adjustment

When syncing from mainnet, we discovered that CSJ wouldn't sync the blocks from the Byron era. This was because the jump size was set to the length of the genesis window of the Shelley era, which is much larger than Byron's. When the jump size is larger than the genesis window, the dynamo will block on the forecast horizon before offering a jump that allows the chain selection to advance. In this case, CSJ and chain selection will deadlock.

For this reason, we set the default jump size to the size of Byron's genesis window in 028883a. This showed no impact on syncing time in our measurements. Future work (as part of deploying Genesis) might involve allowing the jump size to vary between eras.
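The arithmetic behind the two window sizes can be sketched as follows. The expression 3 * 2160 * 20 appears in this PR's Genesis.hs; the Byron window of 2 * 2160 slots is an assumption made here for illustration, and all names below are hypothetical.

```haskell
-- | Security parameter, taken from the 2160 in this PR's Genesis.hs comment.
k :: Integer
k = 2160

-- | Shelley-era genesis window in slots: 3 * 2160 * 20.
shelleyGenesisWindow :: Integer
shelleyGenesisWindow = 3 * k * 20

-- | Byron-era genesis window in slots. ASSUMPTION: 2k slots.
byronGenesisWindow :: Integer
byronGenesisWindow = 2 * k

-- | A jump size is safe for an era if it does not exceed that era's
-- genesis window; otherwise the dynamo blocks on the forecast horizon
-- before it can offer a jump.
jumpFitsWindow :: Integer -> Integer -> Bool
jumpFitsWindow jumpSize window = jumpSize <= window
```

Under these assumptions, a jump size equal to the Shelley window is 30 times larger than the Byron window, which is why syncing blocked in Byron.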

GDD rate limit

GDD evaluation showed an overhead of 10% when run every time a header arrives via ChainSync. Therefore, in b7fa122 we limited how often it can run, so that a single GDD evaluation can handle multiple header arrivals.
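A minimal sketch of the rate-limiting idea, not the code from b7fa122. It assumes header arrival times are given in ascending order (in seconds) and that an evaluation covers every arrival up to the moment it runs. `evaluationTimes` is a hypothetical name.

```haskell
-- | Given a minimum interval between evaluations and the sorted arrival
-- times of headers, compute the times at which an evaluation runs. Each
-- run happens on arrival, or as soon as the interval since the previous
-- run allows, and covers all arrivals up to and including that time.
evaluationTimes :: Double -> [Double] -> [Double]
evaluationTimes minInterval = go Nothing
  where
    go _ [] = []
    go lastRun (t : ts) =
      let runAt = maybe t (\l -> max t (l + minInterval)) lastRun
          rest  = dropWhile (<= runAt) ts
       in runAt : go (Just runAt) rest
```

For example, with an interval of 1 second and arrivals at 0, 0.25, 0.5, and 3, the sketch runs three evaluations instead of four. With an interval of 0 it runs once per arrival.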

Candidate fragment comparison in the ChainSync client

We stumbled upon a test case where the candidate fragments of the dynamo and an objector were no longer than the current selection (both peers were adversarial). This was problematic because BlockFetch would refuse to download blocks from these candidates, and ChainSync in turn would wait for the selection to advance in order to download more headers.

The fix in e27a73c is to have the ChainSync client disconnect a peer which is about to block on the forecast horizon if its candidate isn't better than the selection.
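The decision can be sketched as below. This is a simplified model, not the code from e27a73c: it compares chains by block count only, whereas the real client uses the protocol's chain ordering. `Action` and `decideAtHorizon` are hypothetical names.

```haskell
-- | Hypothetical outcomes for a ChainSync client.
data Action
  = Continue          -- keep downloading headers
  | WaitForSelection  -- block until the selection advances
  | Disconnect        -- drop the peer
  deriving (Eq, Show)

-- | Decide what to do given whether the client is about to block on the
-- forecast horizon, and the block counts of the candidate and selection.
decideAtHorizon :: Bool -> Int -> Int -> Action
decideAtHorizon aboutToBlock candidateBlocks selectionBlocks
  | not aboutToBlock                  = Continue
  | candidateBlocks > selectionBlocks = WaitForSelection
  | otherwise                         = Disconnect
```

Waiting is only useful when BlockFetch has a reason to download the candidate's blocks. If the candidate is not better than the selection, nothing will make the selection advance, so the peer is disconnected instead.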

Candidate fragment truncations

At the moment, it is possible for a candidate fragment to be truncated by CSJ when a jumper jumps to a point that is not younger than the tip of its current candidate fragment. We encountered tests where the jump point could be so old that it would fall behind the immutable tip, and GDD would ignore the peer when computing the Limit on Eagerness. This in turn would cause the selection to advance into potentially adversarial chains.

The fix in dc5f6f7 is to have GDD never drop candidates. When the candidate does not intersect the current selection, the LoE is not advanced. This is a situation guaranteed to be unblocked by the ChainSync client since it will either disconnect the peer or bring the candidate to intersect with the current selection.
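The difference between the two behaviours can be sketched with a much-simplified model, not the code from dc5f6f7. Each candidate is summarised by a `Maybe Int`: `Just n` means it agrees with the selection for `n` blocks past the immutable tip, and `Nothing` means it does not intersect the selection. All names are hypothetical.

```haskell
import Data.Maybe (catMaybes, isNothing)

-- | Behaviour described as problematic above: candidates that do not
-- intersect the selection are ignored, so the LoE may advance past a
-- point that such a peer disagrees with.
loeIgnoringCandidates :: Int -> [Maybe Int] -> Int
loeIgnoringCandidates selectionLen candidates =
  minimum (selectionLen : catMaybes candidates)

-- | Behaviour after the fix: no candidate is dropped. If any candidate
-- does not intersect the selection, the LoE is not advanced at all.
loeKeepingCandidates :: Int -> [Maybe Int] -> Int
loeKeepingCandidates selectionLen candidates
  | any isNothing candidates = 0
  | otherwise = minimum (selectionLen : catMaybes candidates)
```

In this model, a single non-intersecting candidate pins the LoE at the immutable tip until the ChainSync client either disconnects the peer or brings its candidate back to intersect the selection.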

amesgen · 2024-08-06T12:46:43Z

ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/ChainDB/Impl/Types.hs

+-- REVIEW: What about all the threads that are waiting to write in the queue and
+-- will write after the flush?!


Indeed, there could be a short window after the queue is flushed but before the ChainDB is closed (after which nothing new can be added to the ChainDB). I think this already exists on main.

I raised this on the IOG Slack, but I don't think we need to do anything about it in this PR.

dnadales

LGTM: please remove any pending REVIEW comments.

dnadales · 2024-09-09T12:07:49Z

...os-consensus-diffusion/src/ouroboros-consensus-diffusion/Ouroboros/Consensus/Node/Genesis.hs

+  , gcLoEAndGDDConfig          :: !(LoEAndGDDConfig LoEAndGDDParams)
+  } deriving stock (Eq, Generic, Show)
+
+-- | Genesis configuration flags and low-level args, as parsed from config file or CLI


It might help to add comments and examples of these flags and their effect on the node's behaviour. It will surely help whoever integrates this with the CLI/Node.

dnadales · 2024-09-09T12:10:05Z

...os-consensus-diffusion/src/ouroboros-consensus-diffusion/Ouroboros/Consensus/Node/Genesis.hs

+    defaultCapacity              = 100_000 -- number of tokens
+    defaultRate                  = 500 -- tokens per second leaking, 1/2ms
+    -- 3 * 2160 * 20 works in more recent ranges of slots, but causes syncing to
+    -- block in byron.


Suggested change

-    -- block in byron.
+    -- block in Byron.

dnadales · 2024-09-09T12:16:38Z

ouroboros-consensus-diffusion/test/consensus-test/Test/Consensus/Genesis/Setup/GenChains.hs

-    -- values carry no special meaning. Someone needs to think about what values
-    -- would make for interesting tests.
+    gtLoPBucketParams = LoPBucketParams { lbpCapacity = 50, lbpRate = 10 },
+    -- ^ REVIEW: Do we want to generate those randomly?


If we do not generate these randomly we should simply add a note that says that we could, but we should remove any pending REVIEWs from the code 🙏

dnadales · 2024-09-09T12:22:15Z

...nsensus/src/ouroboros-consensus/Ouroboros/Consensus/MiniProtocol/ChainSync/Client/Jumping.hs

+  ChainSyncClientHandleCollection peer m blk ->
+  peer ->
+  m ()
+  -- STM m (Maybe (peer, ChainSyncClientHandle m blk))


Dangling comment.

dnadales · 2024-09-09T12:25:54Z

...boros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/ChainDB/Impl/ChainSel.hs

-    -- Otherwise, the BlockFetch client would have to wait for
-    -- 'chainSelectionForFutureBlocks'.
+    --
+    -- Note: we call 'chainSelectionForFutureBlocks' in all branches instead of


Capitalize NOTE?

dnadales · 2024-09-09T12:27:09Z

ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/ChainDB/Impl/Types.hs

+-- REVIEW: What about all the threads that are waiting to write in the queue and
+-- will write after the flush?!


In that case I'd say: let's add a note and remove the REVIEW item.

* Run more repetitions of LoE, LoP, CSJ, and gdd tests
* Print timestamps for node restarts
* Disable boring timeouts in the node restart test
* Wait sufficiently long at the end of tests
* Expect CandidateTooSparse in gdd tests
* Add a notice about untracked delays in the node restart test
* Set the GDD rate limit to 0 in the peer simulator
* Have the peer simulator use the default grace period for chainsel starvations
* Relax expectations of test blockFetch in the BulkSync case
* Allow to run the decision logic once after the last tick in the blockfetch leashing attack
* Shift point schedule times before giving the schedules to tests
* Accomodate for separate decision loop intervals for fetch modes
* Accomodate for timer added in blockFetchLogic
* Switch peer simulator to `FetchModeBulkSync`
* Allow parameterizing whether chainsel starvation is handled
* Add some wiggle room for duplicate headers in CSJ tests
* Disable chainsel starvation in CSJ test

Niols added the Genesis label (PRs related to Genesis testing and implementation) Jul 3, 2024

amesgen force-pushed the blockfetch/milestone-1 branch from bab4a6a to 670ae32 August 5, 2024 12:34

amesgen changed the base branch from niols/blockfetch-leashing to main August 5, 2024 12:34

facundominguez force-pushed the blockfetch/milestone-1 branch 2 times, most recently from da90985 to b3434e8 August 5, 2024 21:05

amesgen reviewed Aug 6, 2024

amesgen mentioned this pull request Aug 6, 2024

BlockFetchConsensusInterface.readFetchedBlocks: report blocks as fetched as soon as they are enqueued #952

Open

facundominguez force-pushed the blockfetch/milestone-1 branch 3 times, most recently from 512f637 to 3d55d7e August 6, 2024 17:43

amesgen force-pushed the blockfetch/milestone-1 branch from 16e5690 to 150ff7e August 7, 2024 08:21

facundominguez force-pushed the blockfetch/milestone-1 branch from 150ff7e to 62605ad August 7, 2024 11:02

amesgen mentioned this pull request Aug 7, 2024

Cabal >=3.12: handle additional output of cabal path --store input-output-hk/actions#28

Merged

amesgen force-pushed the blockfetch/milestone-1 branch from dd259c8 to db26d58 August 7, 2024 12:29

facundominguez mentioned this pull request Aug 7, 2024

Update cardano-node after addition of the new BulkSync implementation IntersectMBO/cardano-node#5942

Draft

facundominguez force-pushed the blockfetch/milestone-1 branch from c07561d to dadb808 August 8, 2024 19:36

facundominguez mentioned this pull request Aug 9, 2024

Allow to rotate dynamos for blockfetch #1123

Closed

amesgen marked this pull request as ready for review August 9, 2024 13:55

amesgen requested review from nfrisby, jasagredo, fraser-iohk, dnadales and RenateEilers as code owners August 9, 2024 13:55

dnadales reviewed Sep 9, 2024

amesgen mentioned this pull request Sep 9, 2024

Add a BlockFetch leashing attack #1156

Closed

dnadales mentioned this pull request Sep 25, 2024

Release: cardano-node 10.0 IntersectMBO/cardano-node#5978

Closed

15 tasks

facundominguez and others added 4 commits November 18, 2024 02:20

Introduce a collection of chainsync handles that synchronizes a map and a queue

7b07f09

Implement a call to rotate dynamos in CSJ

47fb584

Specify the order in which to start the peers

9093056

Add a BlockFetch leashing attack test

e1ce58c

Niols and others added 22 commits November 18, 2024 02:20

Accomodate for changes to BlockFetch

702c704

* Addition of ChainSyncClientHandleCollection, grace period, and starvation event in BlockFetch
* Plug `rotateDynamo` into `BlockFetchConsensusInterface`
* Removal of `bfcMaxConcurrencyBulkSync`
* Changes in blockfetch decision tracing

Track the last time the ChainDB thread was starved

564afac

Add explicit tracing events for CSJ

d23556f

ChainDB: let the BlockFetch client add blocks asynchronously

5c08faf

Port of IntersectMBO/ouroboros-network#2721

Co-authored-by: Thomas Winant
Co-authored-by: Alexander Esgen

Update Genesis configuration

b5f8d13

* Move Genesis-specific BlockFetch config to GenesisConfig
* Introduce GenesisConfigFlags for interaction with config files/CLI
* Add missing instances for Genesis configuration

Set the jump size to smaller size for byron

9f32e7e

Limit the rate at which GDD is evaluated

c6625a4

Documentation edits for CSJ

5dcb752

* Mention that the objector also gets demoted
* Edit note on Interactions with the BlockFetch logic
* Expand the comments motivating DynamoInitState and ObjectorInitState

Co-authored-by: Nicolas “Niols” Jeannerod

ChainSync client: disconnect if stuck and not better than selection

90475f9

Don't let GDD drop candidates that do not intersect with the selection

f4023c8

Introduce peersOnlyAdversary and classify abnormal test peers as adversarial

4e6a3de

Document all tests that did not have documentation

28ab821

Depend on the ouroboros-network fork with the latest blockfetch

2bfb817

Add changelog fragments

cb6e892

Fix dropElemsAt implementation

59be05e

Adjust stalling test to have more kills by LoP

789e0e6

Co-authored-by: Nicolas Bacquey

Document prop_blockFetchLeashingAttack

f42c566

Disable blockfetch timeouts in uniform tests

f1b6eb1

Groom comments and counterexample messages.

94c2ac6

Drop random points from adversarial schedules in the time limited leashing attack

c2763cb

Update configuration after recovering BulkSync in ouroboros-network

b3aa800

amesgen force-pushed the blockfetch/milestone-1 branch from 13c93e6 to b3aa800 November 18, 2024 09:32

Niols commented Jul 3, 2024 • edited by facundominguez
