perf: single-pass snapshot reading #1838

Draft

segfault-magnet wants to merge 123 commits into master from feature/genesis_optimize_deriving

Contributor

segfault-magnet commented Apr 18, 2024 (edited)

closes: #1823

When importing, if a table is derivable, master currently re-reads the data from which the table is to be derived. This multiplies the IO, decompression, and decoding overhead.

This PR adapts the import logic so that it can write to both the on-chain and off-chain databases while keeping the cursor correct if either write fails.

The downside is that we first import a batch on-chain and only then, if that succeeds, off-chain (if any tables are to be derived from it). There is room for concurrency here in the future: the import task could split off sub-tasks for each batch, one for on-chain and another (or more) for off-chain.
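
The flow described above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `Batch`, `Db`, `fail_at`, and `import_single_pass` are hypothetical names, and the two databases are modeled as in-memory vectors.

```rust
/// Hypothetical stand-in for one decoded group of snapshot entries.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub index: usize,
    pub entries: Vec<u64>,
}

/// Hypothetical in-memory database that remembers the last committed batch.
#[derive(Debug, Default)]
pub struct Db {
    pub rows: Vec<u64>,
    pub last_idx: Option<usize>,
    /// Illustration-only hook: fail when asked to write this batch index.
    pub fail_at: Option<usize>,
}

impl Db {
    /// Writes the batch and records its index together, or does neither.
    pub fn commit(&mut self, batch: &Batch) -> Result<(), String> {
        if self.fail_at == Some(batch.index) {
            return Err(format!("write failed for batch {}", batch.index));
        }
        self.rows.extend_from_slice(&batch.entries);
        self.last_idx = Some(batch.index);
        Ok(())
    }
}

/// Reads each batch once and hands it to the on-chain database first,
/// then to the off-chain database. Stops at the first error, so each
/// `last_idx` only reflects batches that were actually committed.
pub fn import_single_pass(
    batches: impl IntoIterator<Item = Batch>,
    on_chain: &mut Db,
    off_chain: &mut Db,
) -> Result<(), String> {
    for batch in batches {
        on_chain.commit(&batch)?;
        off_chain.commit(&batch)?;
    }
    Ok(())
}
```

If the off-chain write of a batch fails in this sketch, the on-chain side is already one batch ahead, which is the situation the resume logic discussed in the review has to handle.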

segfault-magnet and others added 30 commits

March 30, 2024 02:30


          snapshot fragments for json work

ed3ab94


          parquet fragment support, cleanup pending

42e0763


          add tests for fragments

58b04be


          comment out denies

f440bb9


          Merge branch 'master' into feature/parallel_snapshot_writing

3fb1586


          snapshot generation uses concurrent workers

4936ed9


          task_manager used for import/export of snapshot

3a0b898


          cleanup imports

316b1c9


          use rayon in genesis importer

35449e6


          move files around

633dc22


          enable deny lints, fix errors in chain-config

b0fdfdc


          wip, investigating features

1f0b3c3


          feature gate imports, fix unused deps

1804a7d


          ci checks


          remove unused result

e01d204


          use rayon for exporter

892849f


          remove uuid dep

c8c28f7


          dry up fragments tests

90785f4


          deduplicate tests

58a4e25


          inline path

cb27956


          dedupe writer tests

4171e87


          restructure into import/export format

9d09510


          format and cargo sort

d84bd72


          update change log

ebe836b


          Merge branch 'master' into feature/parallel_snapshot_writing

6060bad


          entries filter

47e5432


          optimize

3c1ee01


          shorten bounds

b9c3906


          Merge branch 'master' into feature/parallel_snapshot_writing

992bcc0


          can cancel/resume regenesis, pending progress info and e2e tests

ec85160

segfault-magnet added 4 commits

April 23, 2024 19:31


          remove tokio rayon, implement suggestions

c1a9ee7


          fix unit tests

c1ff3d0


          inline when small number of groups


          fix unit tests

ad96955

segfault-magnet marked this pull request as draft

April 23, 2024 19:00

segfault-magnet added 3 commits

April 23, 2024 21:05


          Merge remote-tracking branch 'origin/feature/snapshot_generation_graceful_shutdown' into feature/genesis_optimize_deriving

d7bd144


          remove MultiCancellationToken in favor of a less general solution

dbbf132


          Merge remote-tracking branch 'origin/feature/snapshot_generation_graceful_shutdown' into feature/genesis_optimize_deriving

99ad6f8

Base automatically changed from feature/snapshot_generation_graceful_shutdown to master

April 24, 2024 07:18

segfault-magnet added 9 commits

April 24, 2024 09:18


          Merge remote-tracking branch 'origin/master' into feature/genesis_optimize_deriving

d626d44


          revert files that should not have been changed

270041a


          fix build

f628c88


          add check for unique table names

06ec673


          readability


          transactions not inserted in master

13110bc


          unneccessary changes

a16b316


          nits

b453447


          add changelog

segfault-magnet marked this pull request as ready for review

April 24, 2024 15:23

segfault-magnet added 3 commits

April 30, 2024 16:14


          Merge remote-tracking branch 'origin/master' into feature/genesis_optimize_deriving

3b0e7d4


          FuelBlockIdsToHeights is never exported as a table but rather derived

59879ea


          clippy

ec53cbf

MitchTurner reviewed


Member

MitchTurner left a comment

I've only made it part way through.

crates/fuel-core/src/service/genesis/importer.rs

-                          groups,
-                          db,
-                          progress_reporter,
+                      start_imports!(

Member

MitchTurner May 2, 2024

Are we going to get any kind of compiler error/test failure if we add new tables in the future, or are we just going to need to remember to add them here?

Contributor Author

segfault-magnet May 3, 2024

Not every table gets a snapshot file, for one of two reasons:

the table is not migrated at all,
or it is derived from an already present file.

So if a table is to be migrated, the developer should:

export the data (if not derived),
import it (by implementing ImportTable, or by changing the impl of an existing one in the case of derived tables).

crates/fuel-core/src/service/genesis/importer/import_task.rs

+                      }
+                      let group = group?;
+                      if Some(index) > on_chain_last_idx {

Member

MitchTurner May 2, 2024

I think we should avoid doing a comparison of Option here. I wasn't sure what the behavior was, so I went and tried it out. It's not completely intuitive.

Also, reading through this, I mistook it for an if let Some statement.

Perhaps we could change it to:

index > on_chain_last_idx.unwrap_or(0)

I guess the case where index==0 and on_chain_last_idx==None it would have different behavior, since Some(0) > None but 0 !> 0. Would that be acceptable?
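
For reference, the standard library orders `None` before every `Some(_)` and compares two `Some` values by their contents. The sketch below puts the two candidate expressions side by side; the function names are illustrative and do not appear in the PR.

```rust
/// Comparison as written in the diff: relies on `None < Some(_)`.
pub fn option_cmp(index: usize, last_idx: Option<usize>) -> bool {
    Some(index) > last_idx
}

/// The suggested `unwrap_or(0)` variant.
pub fn unwrap_or_cmp(index: usize, last_idx: Option<usize>) -> bool {
    index > last_idx.unwrap_or(0)
}
```

The two functions agree whenever `last_idx` is `Some(_)`, and also when it is `None` and `index > 0`. They differ only for `index == 0` with `last_idx == None`, where `option_cmp` returns `true` and `unwrap_or_cmp` returns `false`.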

Contributor Author

segfault-magnet May 3, 2024

If no batch has been handled, the database has no entry for that table, i.e. a select would return 0 rows. Here that is given as None from the db.

When a batch is handled, its index (the parquet group number) is recorded.

So an entry of 0 in the database means the first batch was handled successfully.

So the resume logic here is identical to the comparison of options. If there is no last recorded idx for the on-chain database, then no batch was imported, so proceed to import.

If there is an entry in the db, check that this index is greater, so you don't reimport the same batch.

But unwrap_or(0) would cause problems for the first batch because of the difference in behavior you mentioned.

I agree the options comparison is unclear if you're coming across it for the first time. Maybe the second time as well :D.

The alternative would be to unpack the behavior:

let comparison_result = if let Some(chain_idx) = on_chain_last_idx {
    index > chain_idx
} else {
    true
};
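
Another way to spell the same check, offered only as a sketch, is `Option::map_or`, which keeps the "nothing recorded yet means import" case explicit without the `if let`. The name `should_import` is hypothetical.

```rust
/// Returns true when the batch at `index` still has to be imported.
/// `last_idx` is the index of the last committed batch, or `None` if
/// nothing has been committed yet.
pub fn should_import(index: usize, last_idx: Option<usize>) -> bool {
    // `None` => nothing imported yet, so import.
    // `Some(last)` => import only batches after the last committed one.
    last_idx.map_or(true, |last| index > last)
}
```

This returns the same result as `Some(index) > last_idx` for every input, including the first batch (`index == 0`, `last_idx == None`).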

crates/fuel-core/src/service/genesis/importer/import_task.rs

-                      if is_cancelled {
-                          bail!("Import cancelled")
+                      if Some(index) > off_chain_last_idx {

Member

MitchTurner May 2, 2024

Same question here.

crates/fuel-core/src/service/genesis/importer/import_task.rs

+                  #[derive(Default, Clone)]
+                  struct Spy {
+                      on_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,
+                      off_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,

Member

MitchTurner May 2, 2024

These names confuse me.

Something has called the on_chain data? And this is what it "called" it "with"? That's my best guess.

Can we change this to something clearer, along with the getters below?

Contributor Author

segfault-magnet May 3, 2024

A spy records arguments, ret values etc.

This Spy constructs TestTableImporters that implement ImportTable.

ImportTable has two methods: a handler for on_chain db changes and a handler for off_chain db changes.

So the spy records what the on_chain was called_with and what the off_chain was called_with.
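
To make the pattern concrete, here is a minimal sketch of such a spy. The `ImportTable` trait below is a simplified, hypothetical version with made-up signatures (the real trait in the PR is not shown here), and `SpyImporter` stands in for the test importers mentioned above.

```rust
use std::sync::{Arc, Mutex};

/// Simplified, hypothetical handler trait: one method per database.
pub trait ImportTable {
    fn on_chain(&mut self, group: Vec<u64>) -> Result<(), String>;
    fn off_chain(&mut self, group: Vec<u64>) -> Result<(), String>;
}

/// Records the arguments each handler was called with.
/// Clones share the same recordings through the `Arc`s.
#[derive(Default, Clone)]
pub struct Spy {
    on_chain_called_with: Arc<Mutex<Vec<Vec<u64>>>>,
    off_chain_called_with: Arc<Mutex<Vec<Vec<u64>>>>,
}

impl Spy {
    /// Builds an importer whose calls are recorded by this spy.
    pub fn importer(&self) -> SpyImporter {
        SpyImporter { spy: self.clone() }
    }

    /// Groups passed to the on-chain handler, in call order.
    pub fn on_chain_called_with(&self) -> Vec<Vec<u64>> {
        self.on_chain_called_with.lock().unwrap().clone()
    }

    /// Groups passed to the off-chain handler, in call order.
    pub fn off_chain_called_with(&self) -> Vec<Vec<u64>> {
        self.off_chain_called_with.lock().unwrap().clone()
    }
}

/// Importer that only records what it receives.
pub struct SpyImporter {
    spy: Spy,
}

impl ImportTable for SpyImporter {
    fn on_chain(&mut self, group: Vec<u64>) -> Result<(), String> {
        self.spy.on_chain_called_with.lock().unwrap().push(group);
        Ok(())
    }

    fn off_chain(&mut self, group: Vec<u64>) -> Result<(), String> {
        self.spy.off_chain_called_with.lock().unwrap().push(group);
        Ok(())
    }
}
```

A test can hand `spy.importer()` to the code under test and afterwards assert on `spy.on_chain_called_with()` and `spy.off_chain_called_with()`.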

crates/fuel-core/src/service/genesis/importer/import_task.rs (outdated, resolved)

crates/fuel-core/src/service/genesis/importer/import_task.rs

+                  #[test_case::test_case(Some(1), Some(0); "off chain reverted")]
+                  #[test_case::test_case(Some(3), Some(1); "off chain reverted multiple times")]
+                  #[test_case::test_case(Some(1), Some(3); "on chain reverted multiple times")]
+                  fn can_recover_when_both_tx_dont_succeed_together(

Member

MitchTurner May 2, 2024

I think this test could have a clearer name.

What "can_recover"? What are the "tx"s here? I don't see anything called "tx" in this test. I feel like the name should have more to do with reconciling the groups being processed.

Contributor Author

segfault-magnet May 3, 2024

What "can_recover"?

Importing/genesis/regenesis after the process exits due to an error.

What are the "tx"s here?

Database transactions. Two of them: one for the on-chain database and another one for the off-chain database. The import progress is incremented only when they are successfully committed, and they are committed only if nothing failed during the import.

I feel like the name should have more to do with reconciling the groups being processed.

Maybe synchronizes_imports_upon_previous_failure? That would say that even if a batch succeeds on one of the databases and fails on the other, the next time the import is run the imports are first going to be synchronized.

Thing is I don't want to commit to synchronizes because that is not a requirement. If somebody parallelized this they can very well have both databases continue importing even though they're not handling the same batch at the same time.

I guess failed batches will be retried after failure. Something along those lines?
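
The scenario this test covers can be sketched as follows, under the assumption that each database keeps its own last committed batch index. `Store` and `resume_import` are hypothetical names, and the error handling of the real import is left out.

```rust
/// Hypothetical in-memory database with its own import cursor.
#[derive(Debug, Default, PartialEq)]
pub struct Store {
    pub rows: Vec<u64>,
    pub last_idx: Option<usize>,
}

impl Store {
    /// Imports the group unless it was already committed in an earlier run.
    fn import_if_pending(&mut self, index: usize, group: &[u64]) {
        if Some(index) > self.last_idx {
            self.rows.extend_from_slice(group);
            self.last_idx = Some(index);
        }
    }
}

/// Walks all groups once; each database skips what it already has.
pub fn resume_import(groups: &[Vec<u64>], on_chain: &mut Store, off_chain: &mut Store) {
    for (index, group) in groups.iter().enumerate() {
        on_chain.import_if_pending(index, group);
        off_chain.import_if_pending(index, group);
    }
}
```

Starting from an on-chain cursor of `Some(1)` and an off-chain cursor of `Some(0)`, a rerun over four groups imports group 1 only off-chain and groups 2 and 3 on both sides, so no batch is written twice and both databases end up with the same rows.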

segfault-magnet and others added 2 commits

May 3, 2024 19:13


          Merge branch 'master' into feature/genesis_optimize_deriving

5787c06


          rename

47b241d

segfault-magnet requested a review from MitchTurner

May 3, 2024 18:22

segfault-magnet and others added 3 commits

May 6, 2024 14:44


          Merge branch 'master' into feature/genesis_optimize_deriving

68d6371


          Merge branch 'master' into feature/genesis_optimize_deriving

6677b76


          Merge branch 'master' into feature/genesis_optimize_deriving

e7d5755

xgreenx marked this pull request as draft

August 27, 2024 09:43


Reviewers

MitchTurner (code owner): awaiting requested review

xgreenx (code owner): review will be requested when the pull request is marked ready for review

Dentosal (code owner): review will be requested when the pull request is marked ready for review

At least 2 approving reviews are required to merge this pull request.

Labels

None yet