Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: improve flush upload queue parallelism #10096

Open
Tracked by #9624
erikgrinaker opened this issue Dec 11, 2024 · 3 comments · May be fixed by #10144 or #10218
Open
Tracked by #9624

pageserver: improve flush upload queue parallelism #10096

erikgrinaker opened this issue Dec 11, 2024 · 3 comments · May be fixed by #10144 or #10218
Assignees
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver

Comments

@erikgrinaker
Copy link
Contributor

Currently, when flushing delta layers, we upload the layer and then update the index. But index updates act as an upload queue parallelism barrier. This means that we're effectively uploading layers one at a time.

Instead, we should flush a batch of layer files (something like 100-1000 MB) for each index update, allowing us to upload these in parallel.

Also note that the flush loop backpressure from #8550 will also prevent parallelism. We'll need to remove this first, see #10095.

@erikgrinaker erikgrinaker added a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver labels Dec 11, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 11, 2024
@skyzh
Copy link
Member

skyzh commented Dec 11, 2024

In theory if we put multiple upload tasks into the queue, they will be fired in parallel.

let can_run_now = match next_op {
UploadOp::UploadLayer(..) => {
// Can always be scheduled.
true
}
UploadOp::UploadMetadata { .. } => {
// These can only be performed after all the preceding operations
// have finished.
upload_queue.inprogress_tasks.is_empty()
}
UploadOp::Delete(..) => {
// Wait for preceding uploads to finish. Concurrent deletions are OK, though.
upload_queue.num_inprogress_deletions == upload_queue.inprogress_tasks.len()
}
UploadOp::Barrier(_) | UploadOp::Shutdown => {
upload_queue.inprogress_tasks.is_empty()
}
};

I think compaction already utilized this code path?

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 11, 2024

Yes. The problem is that when we flush a delta layer, we schedule both an UploadLayer and an UploadMetadata for every layer sequentially. As you can see in the code you quoted, UploadMetadata acts as a barrier for parallel uploads, since it waits for all preceding layers to finish uploading.

We want to flush a batch of layers and schedule UploadLayer for them all, and then schedule an UploadMetadata barrier for the entire batch.

for layer in layers_to_upload {
self.remote_client.schedule_layer_file_upload(layer)?;
}
self.remote_client
.schedule_index_upload_for_metadata_update(&update)?;

// Schedule remote uploads that will reflect our new disk_consistent_lsn
self.schedule_uploads(disk_consistent_lsn, layers_to_upload)
.map_err(|e| FlushLayerError::from_anyhow(self, e))?;

// Normal case, write out a L0 delta layer file.
// `create_delta_layer` will not modify the layer map.
// We will remove frozen layer and add delta layer in one atomic operation later.
let Some(layer) = self
.create_delta_layer(&frozen_layer, None, ctx)
.await
.map_err(|e| FlushLayerError::from_anyhow(self, e))?
else {
panic!("delta layer cannot be empty if no filter is applied");
};
(
// FIXME: even though we have a single image and single delta layer assumption
// we push them to vec
vec![layer.clone()],
Some(layer),
)

@erikgrinaker erikgrinaker linked a pull request Dec 13, 2024 that will close this issue
@erikgrinaker
Copy link
Contributor Author

There's a prototype PR for deferred index uploads during flush in #10144. However, this approach has a fair number of issues -- we have to wait for "some time" in case further layers are flushed, and this both breaks caller expectations that index uploads are scheduled immediately, and it can still cause index barriers if the previous index isn't uploaded before the next layer comes in.

Instead, let's try to reorder layer uploads ahead of index uploads in the upload queue:

// If we cannot launch this task, don't look any further.
//
// In some cases, we could let some non-frontmost tasks to "jump the queue" and launch
// them now, but we don't try to do that currently. For example, if the frontmost task
// is an index-file upload that cannot proceed until preceding uploads have finished, we
// could still start layer uploads that were scheduled later.
if !can_run_now {
break;
}

A few things to keep in mind:

  • We want to upload indexes as soon as the necessary layers are uploaded.
  • We want to coalesce indexes where we can.
  • We don't want to reorder operations on the same layer filename.
  • We don't want to reorder wrt. deletes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver
Projects
None yet
2 participants