Add design for velero backup performance improvements #7628
/kind changelog-not-required
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #7628      +/-   ##
==========================================
- Coverage   61.88%   58.66%    -3.23%
==========================================
  Files         266      344       +78
  Lines       29462    28731      -731
==========================================
- Hits        18232    16854     -1378
- Misses       9938    10448      +510
- Partials     1292     1429      +137
```
### Per-backup worker pool

The current design makes use of a permanent worker pool, started at backup controller startup time. With this design, when we later run multiple backups in parallel, the same set of workers will take ItemBlock inputs from more than one backup. Another approach that was initially considered was a temporary worker pool, created while processing a backup and deleted upon backup completion.
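A minimal sketch of the permanent-pool approach (the `ItemBlockInput` and `StartWorkerPool` names here are illustrative stand-ins, not Velero's actual types): workers are started once, outlive any single backup, and drain a channel shared by every backup, with each block carrying its own response channel so results route back to the backup that submitted it.

```go
package main

import (
	"fmt"
	"sync"
)

// ItemBlockInput is a hypothetical envelope pairing an ItemBlock with a
// response channel, so each backup gets its own results back.
type ItemBlockInput struct {
	Backup   string      // which backup submitted this block
	BlockID  int
	RespChan chan string // per-backup response channel
}

// StartWorkerPool starts a permanent pool: workers consume blocks submitted
// by every backup sharing the `in` channel.
func StartWorkerPool(workers int, in <-chan ItemBlockInput) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for input := range in {
				// Stand-in for the real BackupItemBlock work.
				input.RespChan <- fmt.Sprintf("%s/block-%d", input.Backup, input.BlockID)
			}
		}()
	}
	return wg
}

func main() {
	in := make(chan ItemBlockInput, 10)
	wg := StartWorkerPool(3, in)

	// Two concurrent backups share the same permanent pool.
	respA := make(chan string, 4)
	respB := make(chan string, 4)
	for i := 0; i < 4; i++ {
		in <- ItemBlockInput{Backup: "backup-a", BlockID: i, RespChan: respA}
		in <- ItemBlockInput{Backup: "backup-b", BlockID: i, RespChan: respB}
	}
	close(in)
	wg.Wait()
	close(respA)
	close(respB)

	countA, countB := 0, 0
	for range respA {
		countA++
	}
	for range respB {
		countB++
	}
	fmt.Println(countA, countB) // each backup receives exactly its own 4 responses
}
```

The per-backup response channel is what lets a shared pool serve interleaved blocks from multiple backups without mixing up results.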
what are your thoughts on API server throttling due to this parallelism?
From an APIServer point of view, I think permanent vs. temporary worker pools should be basically equivalent. Also, from an APIServer point of view, Restore is more likely to hit throttling than backup, since every item restore requires at least two APIServer calls. On the backup side, the Item Collector uses list calls to grab the GVRs included in the backup, so individual calls are only needed for items returned as AdditionalItems that weren't in the original list or by BIA plugins that make their own API calls.
All of this is to say that I suspect even with parallel processing, most backups will make less frequent APIServer calls than an equivalent restore. It doesn't guarantee that we won't run into throttling issues, but in cases where we do, we're probably already hitting them on restore. As for what we need to do, I'm not 100% sure. But since that's as much a potential issue with current restore code, would it make sense to address that for both Backup and Restore as a separate effort?
- Before the loop iteration, create a new `itemsInBlock` map of type `map[velero.ResourceIdentifier]bool` which represents the set of items already included in a block.
- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load the item from the ItemCollector file. Close/remove the file after loading (whether or not an error is returned, possibly with an anonymous func similar to the current implementation).
We need to postpone the item load to the time when the item is actually backed up. Loading all items into memory will consume a large amount of memory if the total number of items is huge.
We should keep the item data on disk until we really back it up. If we need to visit the item multiple times during collection and backup, we should use some cache mechanism.
This is worth considering. Let me go back and look at the structs again, but it may be that rather than passing full unstructured content to the goroutine, we should pass the struct from the item collector which contains the file path instead.
Actually, thinking further about this, there's a tradeoff here. We already need to load the full item content from file in order to call GetAdditionalItems, since plugin logic on which items are needed is often based on item content and not just type/name. So if we pass in the file path instead, we'll end up loading every item from file twice. Also, since we're using a buffered channel to communicate with the workers, the number of blocks in memory at any given time will be limited. For example, if the worker count is 10 and the buffer size is 10, then at most 20 blocks will be in memory at a time, even if the backup has thousands of items.
The worker pool will be started by calling `RunItemBlockWorkers` in `backupReconciler.SetupWithManager`, passing in the worker count and reconciler context.
`SetupWithManager` will also add the input channel to the `itemBackupper` so that it will be available during backup processing.
The func `RunItemBlockWorkers` will create the `ItemBlockWorkerPool` with a shared buffered input channel (fixed buffer size) and start `workers` goroutines which will each call `processItemBlocksWorker`.
The buffer size is critical to the performance right? So should we make the size configurable?
We should probably make it configurable. It's not nearly as important as the worker pool size, but it does make sense to make it configurable as well.
Which type of performance are we talking about?
It is critical, as Scott pointed out above, to know the amount of stuff that is "waiting" and what is loaded into memory.
It is not critical in the sense that a larger buffer will make the threads work faster. The only time that could be the case is if some items are very fast to process, faster than the producer can enqueue them, while your workers can stall on a hard item and then drain the buffer faster than the pusher can refill it. I honestly don't think we have this problem.
With this said, I don't see how we explain this option and what different values mean for the end user's performance besides memory in use. I would suggest that we don't make this configurable, and once implemented, we run performance testing to determine if different values here do have an impact. If they do, we will be able to tell the user what that is.
Yes, I just meant performance vs. memory usage. If CPU is sufficient, the faster the processing, the more memory will be used. Since we don't know the CPU in users' environments, we cannot decide the best value.
I agree with @shawn-hurley, we should do some performance testing, but we can only create guidance, since the CPU and memory in our environment differ from users'.
Furthermore, here we cache each item's data in the buffered channel, and item sizes vary from item to item, which also cannot be determined by testing in our own environment.
I finally got a chance to read the proposal and it looks pretty good. However, I have a few comments. What is the impact of the "orderedResources" setting (in the Backup/Schedule CR) on the concurrency? It seems to me that if users specify a particular order for resources, say PVCs, backup of the item blocks that contain the ordered resources must be serialized.

Also, I think overall backup performance is determined by how fast CSI snapshot operations and data mover operations are done, rather than by the concurrency of the resource manifest backup. In that sense, the concurrency proposed in this document will only benefit if there are multiple item blocks, each with its own set of PVCs. If that is not the case (say, you are backing up a single namespace with a pod that has 5 PVCs), users won't see much benefit even if the worker count is set high.

My point in general is that this proposal should be studied from the point of view of its impact on PVC backup. Apart from what I mentioned above, the concurrency can result in more CSI snapshot API calls and more data mover operations taking place. Perhaps it is worth explicitly calling this out in the document.
This is actually a great point. I don't think that's accounted for right now. Current design takes the item collector list and generates item blocks from that. orderedResources are moved to the beginning of the list, but to take a specific example, say we have item1-item5 from ordered resources, none of which add additional items via plugins, if we have a worker pool of size 5, that means with the current design, all 5 items will be processed concurrently. I think your proposal would be that everything that comes via orderedResources should be forced to be executed sequentially. If we decide to take this approach, this means (at a minimum), everything that's in orderedResources must belong to the same ItemBlock -- that guarantees that these items will be processed in order, in the same thread. But that may not be sufficient. I think we may also want to make sure that we don't process any other items until orderedResources are complete. Is this correct? If yes, then I think the following design modification would resolve it:
@draghuram does the above resolve this appropriately?
For fs-backup scenarios, with multiple pods, this design will probably help us. It will also help in cases where there are a great many CSI snapshot operations. One case we ran into recently was one where there were a large number of VMs being backed up where the storage capacity for each was relatively small, but the synchronous component of CSI operations added up to many hours for the whole backup -- if we had a pool with 10 workers, that synchronous component would have been greatly reduced.
datamover operations have their own concurrency limitations -- the number of node agent pods × the per-node-agent concurrency. In backups where that's the limiting factor, this design won't do much. Where that is not the limiting factor, I'd expect some improvement here. Also note that there are two out-of-scope (see Non Goals) enhancements that need parts of this design to build on top of to meet other future needs:
@draghuram @shawn-hurley Actually, a slight modification to this suggestion above probably makes sense: "instead of one loop passing over the full item list, we have 2 loops. The first loop stops iterating when it hits an item not identified in 1) as an ordered resource. If there are no ordered resources, this means we don't process anything here. We do the usual ItemBlock processing, except instead of one block per item, all items identified as ordered resources are appended together into a single ItemBlock. We now back up this one ItemBlock consisting of all OrderedResources synchronously. Once it's done, we then go into the item processing loop described in the current design (with ordered resources already having been removed/handled)."

Rather than doing this initial OrderedResources ItemBlock completely synchronously, we should probably send it to the worker channel just like we'll do later for the other ItemBlocks -- it's just that we won't send any additional ItemBlocks until this one has completed. Otherwise, once we get to multiple backups at once, the worker count for "how many items to back up at the same time" will be off by one while this synchronous ItemBlock backup is happening. Instead, we should probably just send this first block to the channel, wait for it to complete, and then do the second pass through everything as proposed above.
The k8s resources are already downloaded from the API server to the local disk by the item collection process before the concurrent processing proposed by this design, so it seems the concurrency isn't very necessary. And I agree with @draghuram that the overall backup performance is determined by how fast the data is handled, not the k8s resources.
@sseago I was thinking that you would still create multiple item blocks spanning ordered resources, but you would process them in order, rather than sending them to the workers concurrently. Your proposal of using one item block would work too, but the difference is that other item-block-level operations such as hooks may be affected by putting all the resources in one item block. By that I mean, we may be executing hooks for a group of pods that may not be strictly related. To be sure, I don't know if this is such a big concern. I haven't used the "orderedResources" feature myself a great deal. Also, I agree with you that the proposal helps with performance by parallelizing operations such as CSI and other time-consuming operations. BTW, were you referring to the KubeVirt plug-in when you said "VM operations"? And yes, item block functionality would come in handy when it is time to implement volume group snapshots.
@draghuram Ahh, yes, you're right. If pods are included, we'd want to treat them as separate blocks (since hooks run before/after ItemBlocks) -- so basically, we'd still need to identify which items in the collector output come from ordered resources. We first process the ordered resources, creating ItemBlocks normally, but passing one at a time to be processed, waiting for the response before going to the next. Once we hit the first ItemBlock without ordered resources, we can start processing them in parallel. @ywk253100 See the comment above, but we've seen scenarios where, with a large number of volumes, the backup took many hours due to the synchronous first part of the CSI backup process -- that's one example where processing items in parallel should shorten overall backup time. Also note the other two use cases which this design enables:
// responses from BackupItemBlock calls are in responses
```

The ItemBlock processing loop described above will be split into two separate iterations. For the first iteration, velero will only process those items at the beginning of the loop identified as `orderedResources` -- when the groups generated from these resources are passed to the worker channel, velero will wait for the response before moving on to the next ItemBlock. This ensures that the ordered resources are processed in the required order. Once the last ordered resource is processed, the remaining ItemBlocks will be processed and sent to the worker channel without waiting for a response, in order to allow these ItemBlocks to be processed in parallel.
Does it mean the items identified as `orderedResources` are always processed sequentially?
@Lyndon-Li Yes, that is the intent. Since the meaning of this field is "please back up these resources in this exact order", they must be backed up in a single thread -- so we'll process them normally, but will only dispatch one at a time to the worker pool to ensure that they aren't backed up out of order.
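A sketch of the "dispatch one block at a time" behavior described here (hypothetical helper names; the worker pool is reduced to a bare channel, and the per-block `done` channel stands in for waiting on the worker's response): ordered-resource blocks are sent and awaited one by one, while the remaining blocks are sent without waiting so they run in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

type itemBlock struct {
	name    string
	ordered bool
	done    chan struct{} // closed by a worker once the block is "backed up"
}

// dispatch sends orderedResources blocks one at a time, waiting for each to
// finish before sending the next; remaining blocks are sent without waiting.
func dispatch(blocks []itemBlock, workers int) []string {
	var mu sync.Mutex
	var completed []string // records the order in which blocks finished

	in := make(chan itemBlock)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range in {
				mu.Lock()
				completed = append(completed, b.name)
				mu.Unlock()
				close(b.done)
			}
		}()
	}

	var pending []chan struct{}
	for i := range blocks {
		blocks[i].done = make(chan struct{})
		in <- blocks[i]
		if blocks[i].ordered {
			<-blocks[i].done // serialize: wait before dispatching the next block
		} else {
			pending = append(pending, blocks[i].done)
		}
	}
	for _, d := range pending {
		<-d
	}
	close(in)
	wg.Wait()
	return completed
}

func main() {
	order := dispatch([]itemBlock{
		{name: "ordered-1", ordered: true},
		{name: "ordered-2", ordered: true},
		{name: "ordered-3", ordered: true},
		{name: "pod-a"},
		{name: "pod-b"},
	}, 5)
	fmt.Println(order[:3]) // ordered blocks always finish in submission order
}
```

Even with 5 workers available, the first three blocks complete strictly in order, because the dispatcher refuses to send the next ordered block until the previous one's response arrives.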
@sseago
2. Volume group snapshot
3. Large number of volumes backup
4. The CSI snapshot corner case --- the CSI driver retries for a long time before the CSI handle appears
5. Parallel backups

Secondly, let me clarify the major concern from us:
Therefore, we suggest the adjustment below:
@Lyndon-Li I spent some time discussing this with @shawn-hurley today, and I think we may be able to drop the "everything must be v3" requirement, as long as we document clearly for plugin authors that for any pod-related dependencies that would break if the deps were done in parallel, they need to provide the v3 plugin. Effectively, the only use case that we know would break here would be a scenario with multiple pods that need to be backed up together. But that's a use case that's already somewhat broken, since inter-pod dependencies and hooks aren't handled at all right now.
On your second concern, we're adding the new call after velero already has the full content in the current code, which we need anyway to back up the item. The only extra memory usage, then, is related to the buffer size of the channel -- if we have a buffer size of 10, then 10 ItemBlocks have content in memory, which is a relatively small amount of memory, all things considered, especially since Kubernetes metadata is limited in size for each resource. You cannot have arbitrarily large k8s resources.

Also, since we already have the logic in the existing BIAs to identify needed items (see the pod_action upgrade example in the design), we're just moving logic around to reuse it. If we were to move this into discovery, we'd need to replicate that logic elsewhere and modify the item collector output to be block-aware. It also would not be extensible at all, but we really do need to do this in a plugin-aware way (more on that below).

The main objection we have to hard-coding the logic around pod->PVC dependencies and the multi-pod dependencies for VolumeGroup support is that it prevents people from extending this concept via plugins. There are other use cases that require pods to be backed up together that internal Velero logic won't know about. For example, if I'm maintaining an operator that I want my users to be able to back up with Velero, then with the plugin approach I can write a BackupItemAction which operates on pods in my operator and returns the other pods in the GetAdditionalItems method, and this will ensure that 1) pod hooks are run for my operator pods before and after the group of pods is backed up, and 2) these pods and their associated PVCs will be backed up in a single worker, since they're in one ItemBlock.
To summarize my last comment, regarding the two major points of disagreement:
Or an even shorter summary: on point 1 above, we agree with you now and will update the design, but on point 2, we're going to keep the BIAv3 as-is, since it enables some use cases that we need which aren't handled without it.
A little more detail on the concerns around calling the new plugin methods after item collection is done. First, as for the objection that we need the full object in memory -- we already need it in memory to process it a bit later when it gets to the worker thread, so the worst-case scenario is that we're holding on to a few of these longer than we would be without ItemBlocks. How much memory are we talking about? Kubernetes resource metadata is limited in size -- somewhere around 1MB max. So we're talking about resourceSize × bufferSize × itemsPerBlock -- if we assume a buffer size of 10 and 5 items per block (I expect most items will be in blocks of one, and a few items will be in blocks larger than 5), this means we're holding an extra 50MB of item data in memory, at most. As for the suggestion of doing this during item collection, there are a couple of problems with that. Currently the item collector grabs one GroupResource at a time via a list call.
By processing the ItemBlock at the end of item collection, the items we need will already be loaded from the cluster and referenced in the items list, so when the current item returns "item 2" as a needed item, we can just grab that item from the list, mark it as "in a block already" (so later it won't be added to another one), and then call this item's additional items method, etc. This can all be done with minimal disruption to the rest of the workflow, and we're limiting the number of items in memory to the size of the input channel's buffer. This also preserves the ability for plugin authors to create their own ItemBlocks as necessary for their backup needs.
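The memory figure above is just a product of three bounds. As a sanity check (the 1MB cap, buffer size of 10, and 5 items per block are the assumed values from the comment, not measured figures):

```go
package main

import "fmt"

func main() {
	const (
		maxResourceBytes = 1 << 20 // ~1MB cap on a k8s resource (approximate)
		bufferSize       = 10      // buffered-channel capacity
		itemsPerBlock    = 5       // pessimistic average block size
	)
	// Worst case: every buffered block is full of maximum-size resources.
	worstCase := maxResourceBytes * bufferSize * itemsPerBlock
	fmt.Printf("%d MB\n", worstCase>>20) // 50 MB
}
```

The real figure would be smaller in practice, since most blocks hold a single item and most resources are far under the size cap.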
I've updated the design based on the conversations earlier this week:
### Concerns about the design from review and meeting comments

At this point, there seem to be two primary objections to the design in the review (and community meeting) comments:
### Responses to concern 1

### Responses to concern 2
### Specific use cases that would require the proposed plugin

#### Cassandra (cass-operator)

For users of Cassandra, cass-operator creates a StatefulSet with three pods, each of which has its own mounted volume. With a plugin API to associate related entities, we can make sure that all three pods are included in the same ItemBlock. The internal-to-velero pod plugin will make sure that the volumes for each pod are included. This will make sure that pre-backup hooks on all three Cassandra pods are run first, followed by CSI snapshot creation, followed by the post-backup hooks.

### Alternate design considerations

#### New plugin type ItemBlockAction

On the last community meeting, Daniel suggested we might consider creating a new ItemBlockAction plugin type instead of extending BackupItemAction. While this will require more upfront work than extending BIA, once the plugin API is added, there's very little in the overall design here that's changed. I expect that it will work just as well (no better, no worse) as the proposed BIAv3 design. The work required for plugin authors is also approximately the same. Because of this, while I personally have a mild preference for the BIAv3 design, I don't have any strong objections to the ItemBlockAction approach. This has been added to the "Alternatives considered" section of the design.
@sseago
Finally, we want to share our suggestions below:
I think this distinction is incorrect. All of the groupings are for logically associated items. The distinction is as follows:
In particular, the external groupings are related to item dependencies/associations, just like the internal ones. Also, these aren't provided by end users but by application or infrastructure developers. End users are not expected to know or understand these dependencies at the individual resource level.
This can't come from user input -- it comes from application/operator developers, which means even if we did it during item collection, it would need to be a plugin of some sort. That plugin would need to take in a resource (with content) and return a list of items that need to be associated with it. So doing this in the ItemCollector won't be any simpler than defining a new ItemBlockAction plugin which takes in the exact same thing and returns the same list of resources. Doing it in the item collector would be more complicated, though, since then we'd need to change the whole output structure to return groups of items rather than just a list, and on top of that, since the associated items may not have been listed from the cluster yet, we would need to pull them out individually with a Get call, and then we'd need to check every future item as we collect it to make sure it's not already in a separate group. Essentially we'd be taking all of the item block processing logic that the design proposes post-item-collection and doing it during item collection, but since we don't have the full item list at that point, it's much less efficient. Is there any disagreement here?
We don't have a fixed number of ItemBlocks because, as we process the list we get from the collector, we end up adding items from further down into the current block. By the time we have the full list of item blocks, we are almost done with processing. More to the point, items returned by the IBA plugin API call don't create new ItemBlocks; they add later items to the current block.
Hey folks, I want to add a small summary to try and see if we can get agreement on some points. The goal here is to lay out some high-level concepts to make sure that we agree with what is proposed and that it solves the problem.

Points:
I hope that we can gain agreement on these three points; if we can see what we agree and disagree on, the disagreements will be clear from here. I am worried that it isn't clear which points are still under discussion and which we agree on, so a level set at this level will be helpful.
#### Proto definition (compiled into golang by protoc)

The ItemBlockAction plugin type is defined as follows:
All code blocks are Go code examples, right?
I would add a language to these code blocks for syntax highlighting, to improve reading: change each opening ` ``` ` fence to ` ```go `.
@mateusoliveira43 most of them, yes. I'll add it where appropriate. The one exception is the protobuf API spec section.
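Since the generated code isn't reproduced in this thread, here is a rough sketch of what the Go-side IBA interface could look like. All types here (`ResourceSelector`, `ResourceIdentifier`, `Unstructured`) are simplified stand-ins, not the actual generated API, and the real methods likely also take the Backup CR as a parameter:

```go
package main

import "fmt"

// ResourceSelector is a stand-in for the plugin framework's applies-to selector.
type ResourceSelector struct {
	IncludedResources []string
}

// ResourceIdentifier is a stand-in identifying one cluster resource.
type ResourceIdentifier struct {
	GroupResource string
	Namespace     string
	Name          string
}

// Unstructured stands in for runtime.Unstructured item content.
type Unstructured map[string]interface{}

// ItemBlockAction sketches the proposed IBA plugin interface: declare which
// resources it applies to, and return the items that must share an ItemBlock.
type ItemBlockAction interface {
	AppliesTo() (ResourceSelector, error)
	GetRelatedItems(item Unstructured) ([]ResourceIdentifier, error)
}

// podAction is a toy IBA that groups a pod with the PVCs it mounts.
type podAction struct{}

func (podAction) AppliesTo() (ResourceSelector, error) {
	return ResourceSelector{IncludedResources: []string{"pods"}}, nil
}

func (podAction) GetRelatedItems(item Unstructured) ([]ResourceIdentifier, error) {
	var related []ResourceIdentifier
	ns, _ := item["namespace"].(string)
	if pvcs, ok := item["pvcs"].([]string); ok {
		for _, pvc := range pvcs {
			related = append(related, ResourceIdentifier{
				GroupResource: "persistentvolumeclaims", Namespace: ns, Name: pvc,
			})
		}
	}
	return related, nil
}

func main() {
	var a ItemBlockAction = podAction{}
	items, _ := a.GetRelatedItems(Unstructured{"namespace": "app", "pvcs": []string{"data-0"}})
	fmt.Println(len(items), items[0].Name) // the pod's PVC joins its ItemBlock
}
```

The key difference from BIA `Execute()` is that `GetRelatedItems` only reports associations; it never mutates the item or creates new resources.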
## High-Level Design

### Phase 1: ItemBlock processing
I still wish to suggest that a section be added in the High-Level design to highlight what an "ItemBlock" is, what constraints it has, and how velero will handle such ItemBlocks.
This will improve the readability of this design and help other adopters determine whether they want to implement an "IBA" and what will be put into an ItemBlock.
Good idea. I'll add that.
A new PluginKind, `ItemBlockActionV1`, will be created, and the backup process will be modified to use this plugin kind.

For any BIA plugins which return additional items from `Execute()` that need to be backed up at the same time or sequentially in the same worker thread as the current items, a new IBA plugin should be added to return these same items (minus any which won't exist before BIA `Execute()` is called).
Why is it needed? If the BIA is called in a thread which backs up items in an itemBlock, these additional items will be handled within `BackupItemBlock`, right?
This is needed because it is the only way to guarantee that the items get backed up in the same worker thread at the same time. If we don't have an IBA action which adds these items to the item block, then all of the items will be considered as belonging to separate ItemBlocks. In some scenarios things will work just fine, but in others they may not.
As an example, let's take a backup with Pods, PVCs, and PVs and nothing else -- each Pod contains one mounted PVC bound to a PV. If we create IBA plugins for Pod and PVC, then when we create ItemBlocks, each block will contain one pod, one PVC, and one PV. If there's just one pod, there's one ItemBlock. If there are 100 pods, there are 100 ItemBlocks. If we don't create the new IBA plugins for Pod and PVC, then Velero will just have isolated ItemBlocks with Pods, then PVCs, then PVs, since the IBA action is what groups items into blocks.
Now let's look at two scenarios -- in the first, things will work as you expect (additional items will be handled within BackupItemBlock), and in the second they may not.
Scenario 1: Number of pods is much larger than number of ItemBlock workers.
In this case, say we have 100 pods and 5 workers. While the pre-calculated ItemBlocks will only have one item each, the Pod blocks will be first, followed by the PVC blocks, then the PV blocks. So the first 5 pod blocks get sent to the 5 workers. As each is processed, the Execute() call pulls in the additional items, so we get each pod backed up with its PVC and PV, and all 3 items are marked as already backed up. As one finishes, another Pod block is pulled off the queue. By the time the first PVC block gets pulled in, 96 out of 100 pods (along with their PVCs) have already been backed up, so this PVC has a 96% chance of already being backed up -- the first thing Velero does in BackupItem is to see whether this item has already been backed up, and if it has, it exits immediately. The block is done, pull the next, etc. It's possible that all 100 PVCs will have already been handled before their PVC blocks arrive in the worker pool -- at most, 4 of these may not have, in which case PVC backup and pod hooks may run out-of-order -- which is the risk when we don't have an IBA plugin defined if there are interdependencies.
Scenario 2: There are many more worker threads than pods
In this case, let's assume there are only 2 pods but 10 workers. Here, both pods, both PVCs, and both PVs are all sent to workers at the same time. So the likelihood is high that by the time the Pod plugin returns that PVC as an additional item, it has already started being backed up in another worker, so it will be skipped here. This means, especially if it's a large volume, that the post hook is likely to run before the PVC snapshot has been created.
Creating the IBA plugin eliminates this problem entirely. This is the whole reason we have the IBA plugin -- it enables us to create ItemBlocks based on item interdependencies. If we don't have this, then we can't create meaningful ItemBlocks.
Thanks for the explanation. I didn't consider "scenario 2".
But in regards to the question of whether a BIA plugin returning `additionalItems` should have an IBA to group these items earlier, IMO the answer is still "that depends", because sometimes the additional items do not have to be in the same ItemBlock as the "original item". Another possibility is that some of these additional items may be created within the `Execute()` function of the BIA.
@reasonerjt "Another possibility is that some of these additional items may be created within the Execute() function of BIA." -- yes, this is called out explicitly as a case where the items will not be in the ItemBlock. Actually, your "that depends" is also covered in the text that we're responding to now: "Any BIA plugins which return additional items from `Execute()` that need to be backed up at the same time or sequentially in the same worker thread as the current items should add a new IBA plugin to return these same items" -- the key part here is "that need to be backed up at the same time or sequentially in the same worker thread" -- so if the additional items do not have to be backed up in the same ItemBlock with the original item, then they do not need to be returned by the IBA. What you said in your comment and what's in the design say the same thing, just in different words.
- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with similar anonymous func to current impl)
- Get matching IBA plugins for item, call `GetRelatedItems` for each. For each item returned, get full item content from ItemCollector (if present in item list, pulling from file, removing file when done) or from cluster (if not present in item list), add item to the current block, add item to `itemsInBlock` map, and then recursively apply current step to each (i.e. call IBA method, add to block, etc.)
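A minimal Go sketch of the recursive grouping step above, under the assumption that `relatedItems` stands in for the combined `GetRelatedItems` results of all matching IBA plugins (all names here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// relatedItems stands in for the aggregate result of calling
// GetRelatedItems on every matching ItemBlockAction plugin.
var relatedItems = map[string][]string{
	"pod-1": {"pvc-1"},
	"pvc-1": {"pv-1", "pod-1"}, // PVC points back at its mounting pod(s) and its PV
}

// addToBlock implements the recursive step: skip items already claimed
// by a block, otherwise add the item and recurse into its related items.
func addToBlock(item string, block *[]string, itemsInBlock map[string]bool) {
	if itemsInBlock[item] {
		return // already processed into this (or an earlier) block
	}
	itemsInBlock[item] = true
	*block = append(*block, item)
	for _, rel := range relatedItems[item] {
		addToBlock(rel, block, itemsInBlock)
	}
}

func main() {
	itemsInBlock := map[string]bool{}
	var block []string
	addToBlock("pod-1", &block, itemsInBlock)
	fmt.Println(block) // [pod-1 pvc-1 pv-1]
}
```

Note how the back-reference from `pvc-1` to `pod-1` terminates cleanly because of the `itemsInBlock` check.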
(i.e. call BIA method, add to block, etc.)
Do you mean call "BIA" method or "IBA" method?
@reasonerjt good catch. This was originally BIA, but we recently changed the design to use the new plugin type and I forgot to change this one. I'll fix it.
In the `BackupWithResolvers` func, the current Velero implementation iterates over the list of items for backup returned by the Item Collector. For each item, Velero loads the item from the file created by the Item Collector, we call `backupItem`, update the GR map if successful, remove the (temporary) file containing item metadata, and update progress for the backup.
#### Modifications to the loop over ItemCollector results
The `kubernetesResource` struct used by the item collector will be modified to add an `orderedResource` bool which will be set true for all of the resources moved to the beginning of the list as a result of being ordered resources. |
Could you explain the purpose of adding the `orderedResource` field to `kubernetesResource`? Please specify that the purpose is to respect the setting in the spec of the backup CR, if that's the case. Please also clarify whether all these "orderedResources" items will be put into one ItemBlock.
I see there are several comments about this topic; IMO they should also be reflected in this design.
@reasonerjt the reason we need this is that the existing OrderedResource field on the velero spec allows a user to say "we need these specific resources backed up in this order, before anything else is backed up" -- the only way we can do this with the new design is to back up all of the ordered resources first, waiting for each to be done before the next is backed up. While the ItemCollector currently places the ordered resources first in the list, it doesn't provide us any indication as to which resources are "ordered" resources and the remainder which are not, so this "isOrderedResource" bool will tell Velero that this is one to back up synchronously before moving on to the async ones. I'll add some more text to the design to clarify this.
Thanks. Or shall we just simplify it by not introducing multi-threading when `OrderedResource` is not nil in the spec?
I think that sort of simplification is unnecessary -- a user adds a single "back this up first" item and then we can't back up in parallel no matter how large the backup is. All we need here is to know which resources are the "ordered resources", we can back those up one at a time, and then after that we can use all the workers. It's a small amount of complexity for a potentially large benefit.
After backing up the items in the block, we now execute post hooks using the same filtered item list we used for pre hooks (again taking the logic from `itemBackupper.backupItemInternal`).
#### `itemBackupper.backupItemInternal` cleanup |
Will `backupItem` be removed?
@reasonerjt No. BackupItemBlock calls BackupItem in turn on each item in the ItemBlock, so that will remain. It will just be called from a different place than it is now -- and it will also be called within BackupItemInternal on the returned additional items.
Sorry, this is related to another comment of mine. I was thinking `backupItemBlock` will be part of `kubernetesBackupper` and that it will replace `kubernetesBackupper.backupItem`.
```
type ItemBlock struct {
	log           logrus.FieldLogger
	itemBackupper *itemBackupper
```
Do we need different itemBackupper instances in different ItemBlocks?
There are trackers in the itemBackupper struct which collect information throughout the backup procedure; if we allow different itemBackuppers we'll also need to consolidate the info collected by each tracker.
I also have doubts about having the logger within ItemBlock, but this is trivial...
The ItemBackupper is connected to the backup, so we won't want a different instance for each block. The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access. This is the same reason we need the logger -- so that when the worker thread calls BackupItemBlock it can pass in the correct logger for the current backup so that the logs get uploaded to the BSL, etc. Basically everything that's currently passed in to BackupItem needs to be made available in the ItemBlock struct.
The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access.
Thanks for the reply. IMO we need to ensure the constraint that different ItemBlocks belonging to one backup share one itemBackupper instance. For example, add a reference to the BackupRequest to the ItemBlock struct, or at least add a comment to highlight that constraint in the definition of the ItemBlock struct.
I think adding the comment is probably sufficient here -- there's only one place in the code where ItemBlocks are created, and that will always pass a reference to the existing itemBackupper for the backup.
Method signature for new func `backupItemBlock` is as follows:
```
func backupItemBlock(block ItemBlock) []schema.GroupResource
```
Is it `func (kb *kubernetesBackupper) backupItemBlock(...)`?
I'm not quite sure. It's not an interface method for the Backupper interface, and we're not calling it from one of the specific KubernetesBackupper interface method implementations -- since it's being called from the worker threads, it's probably better as a top-level func.
IMO this should be a part of `kubernetesBackupper`, but we can double-check during the implementation.
Hmm. So if we added this to kubernetesBackupper, then we'd need to pass a reference to that, in addition to the ItemBlock, into the worker thread -- it might complicate things too much. But I think this is a detail we can work through during the implementation -- if it makes more sense to do it that way, we can consider it.
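For illustration, a minimal Go sketch of the top-level-func option, where the ItemBlock carries its own logger and itemBackupper reference so a worker needs nothing else passed in (types are simplified and invented for this sketch, not Velero's actual definitions):

```go
package main

import "fmt"

// itemBackupper and the log func stand in for Velero's per-backup
// machinery; the block carries references to both.
type itemBackupper struct{ backupName string }

type itemBlock struct {
	log           func(string)
	itemBackupper *itemBackupper
	items         []string
}

// backupItemBlock is sketched as a top-level func: everything it needs
// travels inside the block, so a pool of workers can process blocks
// from many different backups without extra context.
func backupItemBlock(block itemBlock) int {
	for _, it := range block.items {
		block.log("backing up " + it + " for " + block.itemBackupper.backupName)
	}
	return len(block.items)
}

func main() {
	ib := &itemBackupper{backupName: "backup-1"}
	block := itemBlock{
		log:           func(s string) { fmt.Println(s) },
		itemBackupper: ib,
		items:         []string{"pod-1", "pvc-1"},
	}
	fmt.Println(backupItemBlock(block)) // 2
}
```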
## Detailed Design
### Phase 1: ItemBlock processing |
Does ItemBlock only impact the `Backup` func of `kubernetesBackupper`? Will it also impact `FinalizeBackup`? Please clarify this in the design.
For finalize, I don't think we need to deal with ItemBlocks, since we're just updating items based on completed async actions. I'll update the design to make this clear.
After implementing backup hooks in `backupItemBlock`, hook processing should be removed from `itemBackupper.backupItemInternal`.
### Phase 2: Process ItemBlocks for a single backup in multiple threads |
I think more details should be added to this section, like how to process the returned value from `returnChan`. @sseago @shawn-hurley
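One possible shape for this, sketched in Go with invented names (`processBlocks`, `blockResult`) -- this is an assumption about the eventual design, not the implementation: workers drain an input channel of blocks and report completions on a return channel, which the backup loop consumes to tally progress.

```go
package main

import (
	"fmt"
	"sync"
)

// blockResult is a guess at the kind of value a worker might send back
// on the return channel: which block finished and any error.
type blockResult struct {
	block string
	err   error
}

// processBlocks feeds ItemBlocks (modeled as strings) to a worker pool
// over inputChan and consumes results from returnChan, returning the
// number of blocks that completed without error.
func processBlocks(blocks []string, workers int) int {
	inputChan := make(chan string)
	returnChan := make(chan blockResult)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range inputChan {
				// a real worker would call backupItemBlock here
				returnChan <- blockResult{block: b}
			}
		}()
	}
	go func() { // close returnChan once every worker has drained inputChan
		wg.Wait()
		close(returnChan)
	}()
	go func() { // feed the queue and signal that no more blocks are coming
		for _, b := range blocks {
			inputChan <- b
		}
		close(inputChan)
	}()

	done := 0
	for res := range returnChan { // process returned values: count completions
		if res.err == nil {
			done++
		}
	}
	return done
}

func main() {
	fmt.Println(processBlocks([]string{"block-1", "block-2", "block-3"}, 2)) // 3
}
```

With this shape, waiting on `returnChan` until it closes doubles as the "all ItemBlocks for this backup are finished" barrier.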
@Lyndon-Li That is a good question, actually. First, to clarify, a PV can only be bound by one PVC, even if it's RWX, so your scenario above would be "Pod1->PVC1->PV1 and Pod2->PVC1->PV1" -- the same PVC mounted by both pods. This is an example of something that we don't actually handle well right now at all, since we process everything sequentially, so with current Velero 1.14, Pod1 pre hooks run, we back up PVC1, PV1, then Pod1 post hooks, then Pod 2 gets backed up, pre-hooks run (but PV has already been backed up). I think this can be resolved without any changes to the design. We just need the PVC IBA plugins to return dependencies in both directions. So we'd have the following IBA plugins:
Signed-off-by: Scott Seago <[email protected]>
@sseago So could you also add some details on how to create and maintain an ItemBlock in various cases to the document?
It sounds like what's missing is the specific example of an RWX volume shared by multiple pods -- I can add this to the ItemBlock examples in the document -- but to be clear, the only thing necessary to completely handle this use case is an IBA for PVCs which returns the list of mounting pods. I'll make sure the document makes that clear as well.
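As a rough illustration of such an IBA, here is a simplified Go sketch (the real plugin API operates on unstructured Kubernetes objects and different method signatures; these types and names are invented for illustration only):

```go
package main

import "fmt"

// pod is a simplified stand-in for the Kubernetes object an
// ItemBlockAction plugin would receive as unstructured content.
type pod struct {
	name   string
	mounts []string // names of PVCs mounted by this pod
}

// relatedItemsForPVC sketches a PVC ItemBlockAction: it returns the
// PVC's bound PV plus every pod that mounts the PVC, so an RWX volume
// shared by several pods lands in one ItemBlock with all of them.
func relatedItemsForPVC(pvcName, boundPV string, allPods []pod) []string {
	related := []string{boundPV}
	for _, p := range allPods {
		for _, m := range p.mounts {
			if m == pvcName {
				related = append(related, p.name)
			}
		}
	}
	return related
}

func main() {
	pods := []pod{
		{name: "pod-1", mounts: []string{"pvc-1"}},
		{name: "pod-2", mounts: []string{"pvc-1"}}, // RWX: same PVC in both pods
	}
	fmt.Println(relatedItemsForPVC("pvc-1", "pv-1", pods)) // [pv-1 pod-1 pod-2]
}
```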
Thank you for contributing to Velero!
This PR adds a design document to enhance single backup performance with multithreaded item backup. This design will also implement some of the prerequisites for future support of VolumeGroupSnapshots.
Does your change fix a particular issue?
Fixes #7474
Please indicate you've done the following:
- Added `/kind changelog-not-required` as a comment on this pull request.
- Updated the corresponding documentation in `site/content/docs/main`.
.