perf: single-pass snapshot reading #1838
base: master
Conversation
…eful_shutdown' into feature/genesis_optimize_deriving
I've only made it partway through.
```rust
groups,
db,
progress_reporter,
start_imports!(
```
Are we going to get any kind of compiler error/test failure if we add new tables in the future, or are we just going to need to remember to add them here?
Not all tables get a snapshot file, for one of two reasons:
- the table is not migrated at all,
- or it is derived from an already present file.

So if a table is to be migrated, the developer should:
- export the data (if not derived),
- import it (by implementing `ImportTable`, or by changing the impl of an existing one in the case of derived tables).
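To illustrate the shape of such an implementation (illustrative only — the trait name is from this PR, but the signature, the `CoinsImporter` type, and the entry type here are simplified assumptions, not the repo's actual API):

```rust
// Hypothetical, simplified version of an `ImportTable`-style trait.
// The real trait in the repo has a different signature.
trait ImportTable {
    type Entry;
    fn import(&mut self, batch: Vec<Self::Entry>) -> Result<(), String>;
}

// Hypothetical importer for one table.
struct CoinsImporter {
    imported: Vec<u64>,
}

impl ImportTable for CoinsImporter {
    type Entry = u64;
    fn import(&mut self, batch: Vec<u64>) -> Result<(), String> {
        // A real implementation would write the batch to the database;
        // here we just record it.
        self.imported.extend(batch);
        Ok(())
    }
}

fn main() {
    let mut importer = CoinsImporter { imported: vec![] };
    importer.import(vec![1, 2, 3]).unwrap();
    assert_eq!(importer.imported, vec![1, 2, 3]);
    println!("ok");
}
```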
```rust
}
let group = group?;

if Some(index) > on_chain_last_idx {
```
I think we should avoid doing a comparison of `Option` here. I wasn't sure what the behavior was, so I went and tried it out. It's not completely intuitive.

Also, reading through this, I mistook it for an `if let Some` statement.

Perhaps we could change it to:

```rust
index > on_chain_last_idx.unwrap_or(0)
```

I guess in the case where `index == 0` and `on_chain_last_idx == None` it would have different behavior, since `Some(0) > None` is true but `0 > 0` is not. Would that be acceptable?
If no batch is handled then the database has no entry for that table, i.e. a select would return 0 rows. Here that shows up as `None` from the db.

When a batch is handled, its index (parquet group number) is recorded. So an entry of `0` in the database means the first batch was handled successfully.

So the resume logic here is identical to the comparison of options: if there is no last recorded idx for the on-chain database, then no batch was imported, so proceed to import. If there is an entry in the db, check that this index is greater so you don't re-import the same batch.

But `unwrap_or(0)` would cause problems for the first batch because of the difference in behavior you mentioned.

I agree the options comparison is unclear if you're coming across it for the first time. Maybe the second time as well :D.

The alternative would be to unpack the behavior:

```rust
let comparison_result = if let Some(chain_idx) = on_chain_last_idx {
    index > chain_idx
} else {
    true
};
```
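For reference, this works because `Option<T>` derives `Ord` with `None` ordered before every `Some(_)`; a minimal sketch (the `should_import` helper is hypothetical, named here only for the demo):

```rust
// `Option<T>` derives `Ord`: `None` sorts before any `Some(_)`,
// and `Some(a)` vs `Some(b)` falls back to comparing `a` vs `b`.
fn should_import(index: usize, last_idx: Option<usize>) -> bool {
    Some(index) > last_idx
}

fn main() {
    // Nothing recorded yet: even batch 0 must be imported.
    assert!(should_import(0, None));
    // Batch 0 already recorded: do not re-import it...
    assert!(!should_import(0, Some(0)));
    // ...but do import the next one.
    assert!(should_import(1, Some(0)));

    // The suggested `unwrap_or(0)` would wrongly skip batch 0 on a fresh run:
    let fresh: Option<usize> = None;
    assert!(!(0 > fresh.unwrap_or(0)));
    println!("ok");
}
```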
```rust
if is_cancelled {
    bail!("Import cancelled")
```

```rust
if Some(index) > off_chain_last_idx {
```
Same question here.
```rust
#[derive(Default, Clone)]
struct Spy {
    on_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,
    off_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,
```
These names confuse me.

Something has called the `on_chain` data? And this is what it "called" it "with"? That's my best guess.

Can we change this to something clearer, along with the getters below?
A spy records arguments, return values, etc. This `Spy` constructs `TestTableImporter`s that implement `ImportTable`.

`ImportTable` has two methods: a handler for on-chain db changes and a handler for off-chain db changes. So the spy records what the on-chain handler was `called_with` and what the off-chain handler was `called_with`.
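The pattern boils down to cloning each handler's argument into a shared `Vec` so the test can assert on it later. A generic sketch (the handler names and `u64` entry type are placeholders, not the PR's actual `ImportTable` methods or `TableEntry<Coins>`):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical spy: records every batch each handler was called with.
#[derive(Default, Clone)]
struct Spy {
    on_chain_called_with: Arc<Mutex<Vec<Vec<u64>>>>,
    off_chain_called_with: Arc<Mutex<Vec<Vec<u64>>>>,
}

impl Spy {
    fn handle_on_chain(&self, batch: Vec<u64>) {
        self.on_chain_called_with.lock().unwrap().push(batch);
    }
    fn handle_off_chain(&self, batch: Vec<u64>) {
        self.off_chain_called_with.lock().unwrap().push(batch);
    }
}

fn main() {
    let spy = Spy::default();
    // Clones of the spy share the same `Arc`-backed storage,
    // so the importer can own a clone while the test keeps the original.
    let importer_side = spy.clone();
    importer_side.handle_on_chain(vec![1, 2]);
    importer_side.handle_off_chain(vec![3]);

    assert_eq!(*spy.on_chain_called_with.lock().unwrap(), vec![vec![1, 2]]);
    assert_eq!(*spy.off_chain_called_with.lock().unwrap(), vec![vec![3]]);
    println!("ok");
}
```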
```rust
#[test_case::test_case(Some(1), Some(0); "off chain reverted")]
#[test_case::test_case(Some(3), Some(1); "off chain reverted multiple times")]
#[test_case::test_case(Some(1), Some(3); "on chain reverted multiple times")]
fn can_recover_when_both_tx_dont_succeed_together(
```
I think this test could have a clearer name.
What "can_recover"? What are the "tx"s here? I don't see anything called "tx" in this test. I feel like the name should have more to do with reconciling the groups being processed.
> What "can_recover"?

Importing/genesis/regenesis after the process exits due to an error.

> What are the "tx"s here?

Database transactions. Two of them: one for the on-chain database and another for the off-chain database. The import progress is incremented only once they are successfully committed, and they are committed only if nothing failed during the import.

> I feel like the name should have more to do with reconciling the groups being processed.

Maybe `synchronizes_imports_upon_previous_failure`? That would say that even if a batch succeeds on one of the databases and fails on the other, the next time the import is run the imports are first going to be synchronized.

Thing is, I don't want to commit to "synchronizes" because that is not a requirement. If somebody parallelized this, they could very well have both databases continue importing even though they're not handling the same batch at the same time.

I guess "failed batches will be retried after failure"? Something along those lines.
closes: #1823
When importing, if a table is derivable we currently (in master) re-read the data from which the table is to be derived. This multiplies the overhead in terms of IO, decompression, and decoding.
This PR adapts the import logic so that it can write to both the on-chain and off-chain databases while keeping the cursor correct in case either fails.
The downside is that we first import a batch on-chain and then, if successful, off-chain (if any tables are to be derived from it). There is opportunity for concurrency here in the future -- the import task can split off sub-tasks for each batch, one for on-chain and another (or more) for off-chain.
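The resume behavior discussed above can be sketched roughly as follows (all names and types are hypothetical simplifications, not the PR's actual code): each database keeps its own last-imported group index, and a group is replayed on a database only if that database has not yet recorded it.

```rust
// Hypothetical sketch of single-pass import with per-database cursors.
fn run_import(
    groups: &[Vec<u64>],                    // decoded parquet groups
    on_chain_last_idx: &mut Option<usize>,  // persisted cursor, on-chain db
    off_chain_last_idx: &mut Option<usize>, // persisted cursor, off-chain db
) {
    for (index, group) in groups.iter().enumerate() {
        // Import on-chain first, then derive off-chain from the same group,
        // skipping whatever a previous (possibly failed) run already committed.
        if Some(index) > *on_chain_last_idx {
            import_on_chain(group);
            *on_chain_last_idx = Some(index); // committed with the batch
        }
        if Some(index) > *off_chain_last_idx {
            derive_off_chain(group);
            *off_chain_last_idx = Some(index);
        }
    }
}

fn import_on_chain(_group: &[u64]) { /* write to on-chain db */ }
fn derive_off_chain(_group: &[u64]) { /* derive & write to off-chain db */ }

fn main() {
    // Simulate a resume where on-chain got further than off-chain:
    // batch 1 is replayed only off-chain, then both proceed to batch 2.
    let groups = vec![vec![1], vec![2], vec![3]];
    let (mut on, mut off) = (Some(1), Some(0));
    run_import(&groups, &mut on, &mut off);
    assert_eq!(on, Some(2));
    assert_eq!(off, Some(2));
    println!("ok");
}
```

Note how the two cursors catch up independently, which is what makes the "reconciling" naming suggestion in the review apt.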