
Add design for velero backup performance improvements #7628

Open
wants to merge 3 commits into main from backup-perf-design

Conversation

sseago
Collaborator

@sseago sseago commented Apr 5, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This PR adds a design document to enhance single backup performance with multithreaded item backup. This design will also implement some of the prerequisites for future support of VolumeGroupSnapshots.

Does your change fix a particular issue?

Fixes #7474

Please indicate you've done the following:

  • [x] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@github-actions github-actions bot added the Area/Design Design Documents label Apr 5, 2024
@sseago
Collaborator Author

sseago commented Apr 5, 2024

/kind changelog-not-required

@github-actions github-actions bot added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Apr 5, 2024
@sseago sseago closed this Apr 5, 2024
@sseago sseago reopened this Apr 5, 2024
@sseago sseago force-pushed the backup-perf-design branch 2 times, most recently from e3ccdb2 to b6a814f Compare April 5, 2024 18:03

codecov bot commented Apr 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.66%. Comparing base (f85f877) to head (1893a1d).
Report is 159 commits behind head on main.

Current head 1893a1d differs from pull request most recent head 3eb6804

Please upload reports for the commit 3eb6804 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7628      +/-   ##
==========================================
- Coverage   61.88%   58.66%   -3.23%     
==========================================
  Files         266      344      +78     
  Lines       29462    28731     -731     
==========================================
- Hits        18232    16854    -1378     
- Misses       9938    10448     +510     
- Partials     1292     1429     +137     


@ywk253100 ywk253100 requested a review from Lyndon-Li April 8, 2024 06:53

### Per-backup worker pool

The current design makes use of a permanent worker pool, started at backup controller startup time. With this design, when we follow on with running multiple backups in parallel, the same set of workers will take ItemBlock inputs from more than one backup. Another approach that was initially considered was a temporary worker pool, created while processing a backup, and deleted upon backup completion.
Collaborator

What are your thoughts on API server throttling due to this parallelism?

Collaborator Author

From an APIServer point of view, I think permanent vs. temporary worker pools should be basically equivalent. Also, from an APIServer point of view, Restore is more likely to hit throttling than backup, since every item restore requires at least two APIServer calls. On the backup side, the Item Collector uses list calls to grab the GVRs included in the backup, so individual calls are only needed for items returned as AdditionalItems that weren't in the original list or by BIA plugins that make their own API calls.

All of this is to say that I suspect even with parallel processing, most backups will make less frequent APIServer calls than an equivalent restore. It doesn't guarantee that we won't run into throttling issues, but in cases where we do, we're probably already hitting them on restore. As for what we need to do, I'm not 100% sure. But since that's as much a potential issue with current restore code, would it make sense to address that for both Backup and Restore as a separate effort?

- Before loop iteration, create a new `itemsInBlock` map of type `map[velero.ResourceIdentifier]bool` which represents the set of items already included in a block.
- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with similar anonymous func to current impl)
Contributor

We need to postpone loading the item until the time we back it up. Loading the items into memory will consume a lot of memory if the total number of items is huge.

We should keep the item data on disk until we actually back it up. If we need to visit an item multiple times during collection and backup, we should use some caching mechanism.

Collaborator Author

This is worth considering. Let me go back and look at the structs again, but it may be that instead of passing full unstructured content to the goroutine we should pass the struct from the item collector, which contains the file path, instead.

Collaborator Author

Actually, thinking about this further, there's a tradeoff here. We already need to load the full item content from file in order to call GetAdditionalItems, since plugin logic for determining which items are needed is often based on item content and not just type/name. So if we pass in the file path instead, we'll end up loading every item from file twice. Also, since we're using a buffered channel to communicate with the workers, the number of blocks in memory at any given time will be limited. For example, if the worker count is 10 and the buffer size is 10, then at most 20 blocks will be in memory at a time, even if the backup has thousands of items.


The worker pool will be started by calling `RunItemBlockWorkers` in `backupReconciler.SetupWithManager`, passing in the worker count and reconciler context.
`SetupWithManager` will also add the input channel to the `itemBackupper` so that it will be available during backup processing.
The func `RunItemBlockWorkers` will create the `ItemBlockWorkerPool` with a shared buffered input channel (fixed buffer size) and start `workers` goroutines which will each call `processItemBlocksWorker`.
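
To make the wiring concrete, here is a minimal sketch of the worker pool startup described above. The helper types (`ItemBlockInput`, `ItemBlockReturn`, `BackupItemBlock`) and their fields are placeholders for illustration, not the final implementation:

```go
package backup

import "context"

// BackupItemBlock stands in for a group of items that must be backed up
// together in a single worker.
type BackupItemBlock struct {
	// items, backup reference, logger, etc. would live here
}

// ItemBlockReturn stands in for the per-block result reported back to the
// dispatching backup.
type ItemBlockReturn struct {
	Err error
}

// ItemBlockInput pairs an ItemBlock with the channel used to report its result.
type ItemBlockInput struct {
	ItemBlock  *BackupItemBlock
	ReturnChan chan ItemBlockReturn
}

// ItemBlockWorkerPool wraps the shared buffered input channel.
type ItemBlockWorkerPool struct {
	InputChannel chan ItemBlockInput
}

// RunItemBlockWorkers creates the pool with a fixed-size buffered channel and
// starts `workers` goroutines, each running processItemBlocksWorker until the
// reconciler context is cancelled.
func RunItemBlockWorkers(ctx context.Context, workers, bufferSize int) *ItemBlockWorkerPool {
	pool := &ItemBlockWorkerPool{InputChannel: make(chan ItemBlockInput, bufferSize)}
	for i := 0; i < workers; i++ {
		go processItemBlocksWorker(ctx, pool.InputChannel)
	}
	return pool
}

// processItemBlocksWorker pulls ItemBlocks off the shared channel and backs
// them up until the context is cancelled.
func processItemBlocksWorker(ctx context.Context, input chan ItemBlockInput) {
	for {
		select {
		case in := <-input:
			in.ReturnChan <- ItemBlockReturn{Err: backupItemBlock(in.ItemBlock)}
		case <-ctx.Done():
			return
		}
	}
}

// backupItemBlock is a stub for the real per-block backup logic.
func backupItemBlock(block *BackupItemBlock) error { return nil }
```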
Contributor

The buffer size is critical to performance, right? So should we make the size configurable?

Collaborator Author

We should probably make it configurable. It's not nearly as important as the worker pool size, but it does make sense to make it configurable as well.

Contributor

Which type of performance are we talking about?

It is critical, as Scott pointed out above, to know how much work is "waiting" and what is loaded into memory.

It is not critical in the sense that a larger buffer will make the threads work faster. The only time that could be the case is if some items are very fast to process (faster than the producer can put them in the buffer) and the workers can stall on a hard item and then drain the buffer faster than the producer can refill it. I honestly don't think we have this problem.

With that said, I don't see how we explain this option and what different values mean for the end user's performance beyond memory in use. I would suggest that we not make this configurable and, once implemented, run performance testing to determine whether different values here have an impact. If they do, we will be able to tell the user what that is.

Contributor

Yes, I just meant performance vs. memory usage. If CPU is sufficient, the faster the processing, the more memory will be used. Since we don't know the CPU in users' environments, we cannot decide the best value.
I agree with @shawn-hurley that we should do some performance testing, but we can only produce guidance, since the CPU and memory in our environment differ from users'.
Furthermore, here we cache each item's data in the buffered channel, and the size of the items varies from item to item, which also cannot be determined by testing in our own environment.

@draghuram
Contributor

I finally got a chance to read the proposal and it looks pretty good. However, I have a few comments.

What is the impact of "orderedResources" setting (in Backup/Schedule CR) on the concurrency? It seems to me that if users specify a particular order for resources - say PVCs - backup of the item blocks that contain the ordered resources must be serialized.

Also, I think overall backup performance is determined by how fast CSI snapshot operations and data mover operations are done, rather than by the concurrency of the resource manifest backup. In that sense, the concurrency proposed in this document will only provide a benefit if there are multiple item blocks, each with its own set of PVCs. If that is not the case (say, you are backing up a single namespace with a pod that has 5 PVCs), users won't see much benefit even if the worker count is set high.

My point in general is that this proposal should be studied from the point of view of its impact on PVC backup. Apart from what I mentioned above, the concurrency can result in more CSI snapshot API calls and more data mover operations taking place. Perhaps it is worth explicitly calling this out in the document.

@sseago
Collaborator Author

sseago commented Apr 15, 2024

I finally got a chance to read the proposal and it looks pretty good. However, I have a few comments.

What is the impact of "orderedResources" setting (in Backup/Schedule CR) on the concurrency? It seems to me that if users specify a particular order for resources - say PVCs - backup of the item blocks that contain the ordered resources must be serialized.

This is actually a great point. I don't think that's accounted for right now. The current design takes the item collector list and generates item blocks from it. orderedResources are moved to the beginning of the list, but to take a specific example: say we have item1-item5 from ordered resources, none of which add additional items via plugins. If we have a worker pool of size 5, then with the current design all 5 items will be processed concurrently. I think your proposal would be that everything that comes via orderedResources should be forced to execute sequentially. If we decide to take this approach, this means (at a minimum) that everything in orderedResources must belong to the same ItemBlock -- that guarantees that these items will be processed in order, in the same thread.

But that may not be sufficient. I think we may also want to make sure that we don't process any other items until orderedResources are complete. Is this correct? If yes, then I think the following design modification would resolve it:

  1. change the item collector return struct to include a boolean indicating whether each resource in question is part of orderedResources. With the current design, they're currently first, but we don't know when we're done processing them. This would resolve that.
  2. instead of one loop passing over the full item list, we have 2 loops. First loop stops iterating when it hits an item not identified in 1) as an ordered resource. If there are no ordered resources, this means we don't process anything here. We do the usual itemBlock processing, except instead of one block per item, all items identified as ordered resources are appended together into a single ItemBlock. We now back up this one ItemBlock consisting of all OrderedResources synchronously. Once it's done, we then go into the item processing loop described in the current design (with ordered resources already having been removed/handled).

@draghuram does the above resolve this appropriately?
@shawn-hurley what do you think?

Also, I think overall backup performance is determined by how fast CSI snapshot operations and data mover operations are done, rather than by the concurrency of the resource manifest backup. In that sense, the concurrency proposed in this document will only benefit if there are multiple item blocks - each with its own set of PVCs. If that is not the case (say, you are backing up a single namespace with a pod that has 5 PVCs), users won't see much benefit even if the worker count is set to high.

For fs-backup scenarios with multiple pods, this design will probably help us. It will also help in cases where there are a great many CSI snapshot operations. One case we ran into recently involved a large number of VMs being backed up, where the storage capacity of each was relatively small but the synchronous component of the CSI operations added up to many hours for the whole backup -- if we had a pool with 10 workers, that synchronous component would have been greatly reduced.

My point in general is that this proposal should be studied from the point of view of its impact on the PVC backup. Apart from what I mentioned above, the concurrency can result in more CSI snapshot API calls, more data mover operations to take place. Perhaps, it is worth explicitly calling it out in the document.

datamover operations have their own concurrency limitations -- (number of node agent pods) x (per-node-agent concurrency). In backups where that's the limiting factor, this design won't do much. In backups where that is not the limiting factor, I'd expect some improvement here.

Also note that there are two out-of-scope enhancements (see Non Goals) that will build on parts of this design to meet other future needs:

  1. VolumeGroupSnapshot support will require the ItemBlock refactoring along with the BIAv3 support designed here to properly group pods which mount volumes in the same VolumeGroup.
  2. Running multiple backups in parallel will make use of the same worker pool defined here to back up ItemBlocks, but with that enhancement, if we have a worker pool of 10, instead of all 10 workers backing up ItemBlocks from a single backup, they'll be backing up ItemBlocks from more than one backup.

@sseago
Collaborator Author

sseago commented Apr 16, 2024

@draghuram @shawn-hurley Actually, a slight modification to this suggestion above probably makes sense:

"instead of one loop passing over the full item list, we have 2 loops. First loop stops iterating when it hits an item not identified in 1) as an ordered resource. If there are no ordered resources, this means we don't process anything here. We do the usual itemBlock processing, except instead of one block per item, all items identified as ordered resources are appended together into a single ItemBlock. We now back up this one ItemBlock consisting of all OrderedResources synchronously. Once it's done, we then go into the item processing loop described in the current design (with ordered resources already having been removed/handled)."

Rather than doing this initial OrderedResources ItemBlock completely synchronously, we should probably send it to the worker channel just like we'll do later for the other ItemBlocks -- it's just we won't send any additional ItemBlocks until this one has completed, otherwise once we get to multiple backups at once, the worker count for "how many items to back up at the same time" will be off by one while this synchronous ItemBlock backup is happening. Instead, we should probably just send this first block to the channel, wait for it to complete, and then do the second pass through everything as proposed above.

@ywk253100
Contributor

The k8s resources are already downloaded from the API server to the local disk by the item collection process before the concurrent processing proposed by this design, so it seems the concurrency isn't strictly necessary.

And I agree with @draghuram that the overall backup performance is determined by how fast the data is handled, not the k8s resources.

@draghuram
Contributor

@sseago I was thinking that you would still create multiple item blocks spanning the ordered resources, but you would process them in order rather than sending them to the workers concurrently. Your proposal of using one item block would work too, but the difference is that other item-block-level operations such as hooks may be affected by putting all the resources in one item block. By that I mean we may be executing hooks for a group of pods that may not be strictly related. To be sure, I don't know if this is such a big concern. I haven't used the "orderedResources" feature myself a great deal.

Also, I agree with you that the proposal helps with performance by parallelizing operations such as CSI and other time-consuming operations. BTW, were you referring to the KubeVirt plugin when you said "VM operations"? And yes, the item block functionality will come in handy when it is time to implement volume group snapshots.

@sseago
Collaborator Author

sseago commented Apr 16, 2024

@draghuram Ahh, yes, you're right. If pods are included, we'd want to treat them as separate blocks (since hooks run before/after ItemBlocks). So basically, we'd still need to identify which items in the collector output come from ordered resources; we'd first process the ordered resources, creating ItemBlocks normally but passing one at a time to be processed and waiting for the response before going to the next. Once we hit the first ItemBlock without ordered resources, we can start processing them in parallel.

@ywk253100 See the comment above, but we've seen scenarios where, with a large number of volumes, the backup took many hours due to the synchronous first part of the CSI backup process -- that's one example where processing items in parallel should shorten overall backup time. Also note the other two use cases which this design enables:

  1. Running multiple backups concurrently (since the worker pool created in this implementation will be shared across backups when the follow-on work is done, building on top of this).
  2. Supporting VolumeGroupSnapshots will require the Phase 1 work here (ItemBlock refactor), but not the concurrency.




The ItemBlock processing loop described above will be split into two separate iterations. For the first iteration, velero will only process those items at the beginning of the loop identified as `orderedResources` -- when the groups generated from these resources are passed to the worker channel, velero will wait for the response before moving on to the next ItemBlock. This is to ensure that the ordered resources are processed in the required order. Once the last ordered resource is processed, the remaining ItemBlocks will be processed and sent to the worker channel without waiting for a response, in order to allow these ItemBlocks to be processed in parallel.
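
As an illustration of the two-pass dispatch described above, here is a rough sketch. The `itemBlock`/`itemBlockInput`/`itemBlockReturn` types and the single shared worker-input channel are simplified assumptions, not the actual implementation:

```go
package backup

// Simplified stand-ins for the real types.
type itemBlock struct {
	orderedResource bool // true if this block was built from orderedResources items
	// items ...
}

type itemBlockReturn struct {
	err error
}

type itemBlockInput struct {
	block      itemBlock
	returnChan chan itemBlockReturn
}

// dispatchItemBlocks assumes blocks is already sorted so that all blocks built
// from orderedResources come first (as the item collector guarantees).
func dispatchItemBlocks(blocks []itemBlock, workerInput chan itemBlockInput) []itemBlockReturn {
	responses := make([]itemBlockReturn, 0, len(blocks))
	i := 0

	// Pass 1: ordered-resource blocks go to the worker pool one at a time,
	// waiting for each response so they are backed up in the requested order.
	for ; i < len(blocks) && blocks[i].orderedResource; i++ {
		ret := make(chan itemBlockReturn, 1)
		workerInput <- itemBlockInput{block: blocks[i], returnChan: ret}
		responses = append(responses, <-ret)
	}

	// Pass 2: the remaining blocks are sent without waiting, so the workers
	// can process them in parallel; responses are collected afterwards.
	ret := make(chan itemBlockReturn, len(blocks)-i)
	sent := 0
	for ; i < len(blocks); i++ {
		workerInput <- itemBlockInput{block: blocks[i], returnChan: ret}
		sent++
	}
	for j := 0; j < sent; j++ {
		responses = append(responses, <-ret)
	}
	return responses
}
```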
Contributor

Does it mean the items identified as orderedResources are always processed sequentially?

Collaborator Author

@Lyndon-Li Yes, that is the intent. Since the meaning of this field is "please back up these resources in this exact order", they must be backed up in a single thread -- so we'll process them normally, but will only dispatch one at a time to the worker pool to ensure that they aren't backed up out of order.

@Lyndon-Li
Contributor

@sseago
Firstly, let's go back to the requirements and see where the current design helps:
1. Large number of Kubernetes resource items to back up
The current design cannot solve this bottleneck. The bottleneck in this case lies in the resource item enumeration and caching (with the current behavior of writing a single small file for each item). However, the current design focuses on the workflow after the items are cached.

2. Volume group snapshot
The current design helps in this case; specifically, the idea of ItemBlock helps.
However, we need to discuss the priority of this requirement, since the upstream feature is in Beta, and even once it is GA, it will take time for storage providers to support it.

3. Large number of volumes to back up
If there are a large number of volumes in one backup and we need to take a snapshot of each of them, it will take a long time, so this design (worker thread pool + item buffer channel) helps in this case.

4. The CSI snapshot corner case --- the CSI driver retries for a long time before the CSI handle appears
The current design helps in this case, though we identify it as a corner case.

5. Parallel backups
This design may help with the parallel backups feature; however, we also believe that this is not the only way to solve it. So whether we use the ItemBlock from this design or design it separately is still TBD. cc @reasonerjt

Secondly, let me clarify our major concern:
We think the introduction of BIAv3 and the new constraints it brings to the open architecture of BIA is a big problem:

  1. It is not practical. With the current design, once there is one BIAv2/BIAv1 plugin registered, BIAv3 will degrade to BIAv2. Apart from the BIA plugins managed by Velero, most BIA plugins are managed by vendors or the community, so if we go this way, we are afraid BIAv3 won't be fully functional for a very long time.
  2. The current GetAdditionalItems interface may not be the best idea. It requires knowing the full content of each item, which makes it almost the same as the backup process; ideally, it would be a discovery/collection process that could be done during the backup's item collection or pre-backup.

Therefore, we suggest the adjustments below:

  1. Narrow down the scope -- we only consider the grouping of volumes and the parallelism of volume groups. We can still keep the name ItemBlock, but we only include pods and PVCs/PVs.
  2. We don't introduce BIAv3; instead, we implement this in the current backup workflow of the Velero server.
  3. The idea of GetAdditionalItems is applied only to pods (or some well-known resources that link to volumes) --- at the time of backing up a pod, we interpret its content to get the volumes and then create the ItemBlock dynamically.

@sseago
Collaborator Author

sseago commented Apr 23, 2024

Secondly, let me clarify our major concern: We think the introduction of BIAv3 and the new constraints it brings to the open architecture of BIA is a big problem:

  1. It is not practical. With the current design, once there is one BIAv2/BIAv1 plugin registered, BIAv3 will degrade to BIAv2. Apart from the BIA plugins managed by Velero, most BIA plugins are managed by vendors or the community, so if we go this way, we are afraid BIAv3 won't be fully functional for a very long time.

@Lyndon-Li I spent some time discussing this with @shawn-hurley today, and I think we may be able to drop the "everything must be v3" requirement, as long as we document clearly for plugin authors that for any pod-related dependencies that would break if the deps were done in parallel, they need to provide the v3 plugin. Effectively, the only use case that we know would break here would be a scenario with multiple pods that need to be backed up together. But that's a use case that's already somewhat broken, since inter-pod dependencies and hooks aren't handled at all right now.

  1. The current GetAdditionalItems interface may not be the best idea. It requires knowing the full content of each item, which makes it almost the same as the backup process; ideally, it would be a discovery/collection process that could be done during the backup's item collection or pre-backup.

On your second concern, we're adding the new call after velero already has the full content in the current code, which we need anyway to back up the item. The only extra memory usage, then, is related to the buffer size of the channel -- if we have a buffer size of 10, then 10 ItemBlocks have content in-memory, which is a relatively small amount of memory, all things considered, especially since Kubernetes metadata is limited in size for each resource. You cannot have arbitrarily large k8s resources.

Also, since we already have the logic in the existing BIAs to identify items needed (see the pod_action upgrade example in the design), we're just moving logic around to reuse it. If we were to move this into discovery, we'd need to replicate that logic elsewhere and modify the item collector output to be block-aware. It also would not be extensible at all, but we really do need to do this in a plugin-aware way (more on that below).

The main objection we have to hard-coding the logic around pod->PVC dependencies and the multi-pod dependencies for VolumeGroup support is that this prevents people from extending this concept via plugins. There are other use cases that require pods to be backed up together that internal Velero logic won't know about. For example, if I'm maintaining an operator that I want my users to be able to back up with Velero, then with the plugin approach I can write a BackupItemAction which operates on my operator's pods and returns the other pods from the GetAdditionalItems method. This will ensure that 1) pod hooks are run for my operator's pods before and after the group of pods is backed up, and 2) these pods and their associated PVCs will be backed up in a single worker, since they're in one ItemBlock.

@sseago
Collaborator Author

sseago commented Apr 23, 2024

To summarize my last comment, regarding the two major points of disagreement:

  1. BIA fallback to v2 if not all plugins are v3: at this point, we're willing to make that change as long as we document clearly the sorts of cases that plugin authors need to keep in mind.

  2. Need for GetAdditionalItems BIAv3 plugin API method: Here, I think we need to stick with it. Your objection to implementing it is mostly that Velero would use more memory than you'd like. My objection to skipping it is that if we hard-code the ItemBlock grouping dependencies, we prevent plugin authors from being able to identify resources that need to be backed up together in a single thread (and have coordinated hooks), and if we create a completely new ItemCollectorPlugin to manage the grouping elsewhere, we add significant complexity vs. just adding one new method to an existing infrastructure, as well as complicating the process for plugin authors, who would now be required to maintain a separate set of plugins.

Or an even shorter summary: on point 1 above, we agree with you now and will update the design, but on point 2, we're going to keep the BIAv3 as-is, since it enables some use cases that we need which aren't handled without it.

@sseago
Collaborator Author

sseago commented Apr 23, 2024

A little more detail on the concerns around calling the new plugin methods after item collection is done. First, as for the objection that we need the full object in memory -- we already need it in memory to process it a bit later when it gets to the worker thread, so the worst-case scenario is that we're holding on to a few of these longer than we would be without ItemBlocks. How much memory are we talking about? Kubernetes resource metadata is limited in size -- somewhere around 1 MB max. So we're talking about resourceSize * bufferSize * itemsPerBlock -- if we assume a buffer size of 10 and 5 items per block (I expect most items will be in blocks of one, and a few items will be in blocks larger than 5), this means we're holding an extra 50 MB of item data in memory, at most.

As for the suggestion of doing this during item collection, there are a couple of problems with that. Currently the item collector grabs one GroupResource at a time via a List call, then dumps each item to disk, then moves on to the next. While on the surface it might make sense to process item dependencies here, there are some problems:

  1. Things are not ordered yet -- both the OrderedResources field and the general ordering of pods, then PVCs, then PVs need to be applied before grouping the items, since the Pods pull in their PVCs and the PVCs pull in their PVs -- done out of order, this may not happen properly.
  2. Once we get the related items for the current item, we need to get the related items for those items, but in the item collector we're just making the first pass to grab things from the cluster and store them in the indexed temp files, so the associated items may not have even been loaded yet -- there's a good chance we don't have the data we need to process the full item block.

By processing the ItemBlock at the end of item collection, the items we need will already be loaded from the cluster and referenced in the items list, so when the current item returns "item 2" as a needed item, we can just grab that item from the list, mark it as "in a block already" (so later it won't be added to another one), and then call this item's additional items method, etc. This can all be done with minimal disruption to the rest of the workflow, and we're limiting the number of items in memory to the size of the input channel's buffer. This also preserves the ability for plugin authors to create their own ItemBlocks as necessary for their backup needs.

@sseago
Collaborator Author

sseago commented Apr 24, 2024

I've updated the design based on the conversations earlier this week:

  1. I've removed the requirement that all registered BIAs implement v3 in order to make use of the new functionality.
  2. I've added a section to "Alternatives considered" to discuss the alternate approach of adding a new ItemBlockAction plugin instead of extending BIA to v3. It changes a few things in the overall design, but most things are unaffected.

@sseago
Collaborator Author

sseago commented Apr 29, 2024

Concerns about the design from review and meeting comments

At this point, there seem to be two primary objections to the design in the review (and community meeting) comments:

  1. Creating a new plugin API for identifying ItemBlock contents is too complicated when this can be done for internal velero use cases in a simpler manner by hard-coding the logic directly into the velero backup workflow.
  2. External use cases which would require a plugin approach don't exist today -- they're hypothetical future use cases that we don't understand well enough to design it into a plugin now. We can add the new plugin API in the future once we have a real use case.

Responses to concern 1

  • There is already too much special-case logic in the backup workflow. Adding more conditional statements like `if groupResource == kuberesource.Pods` into the code to handle all of the known cases for Pod->PVC, PVC->PV and (soon) Pod->Pod (where we need to find all the pods which mount volumes in the same VolumeGroup) will add significant complexity to the workflow logic, especially if we attempt to add new use cases as they come up. Using a plugin will encapsulate this complexity elsewhere, allowing the backup workflow to treat all resources in exactly the same way when generating ItemBlocks. This will make the flow much more readable.
  • In addition, if we do this via plugin, then no additional work is needed to handle the external use cases, as we will have already provided a plugin interface to do it.
  • If we take the internal/hard-coded approach, when we do eventually add the plugin API changes in a future release, we will either have something even more complicated than we started with, having to mix the internal inline logic with the newly-added plugin logic, or we'll have to re-implement the internal logic using the plugin APIs, meaning we'll have to implement this logic twice. If we just did the plugin work initially, we'll be done.

Responses to concern 2

  • The plugin API proposed is generic enough to handle any external use case where the items associated with the current item can be determined from the content of the current item. The API method takes in the current item content (and backup CR), and returns a list of items. This should be sufficient to handle even those use cases which we don't know about yet. That's why the API proposed is generic and not tied to a specific use case, since the whole point of plugin interfaces is to allow use cases that are not built into the application directly.
  • All of the potential uses we've been able to come up with so far are easily handled with the proposed API.
  • Even the not-yet-existing implementation of VolumeGroupSnapshot support would be handled by the proposed plugin API, which is an important data point suggesting that the API as proposed is a useful one.
  • See below for an actual real-world use case for an application that people are already deploying that would need additional plugin APIs to back up properly with hooks (Cassandra).
  • We need the plugin to handle use cases we don't have yet, because that allows users who run into these things to have a way to back up their workloads now, rather than having to create an issue and wait a few months for Velero to support it in the next release.

Specific use cases that would require the proposed plugin

Cassandra (cass-operator)

For users of Cassandra, cass-operator creates a StatefulSet with three pods, each of which has its own mounted volume. With a plugin API to associate related entities, we can make sure that all three pods are included in the same ItemBlock. The internal-to-velero pod plugin will make sure that the volumes for each pod are included. This will make sure that pre-backup hooks on all three Cassandra pods are run first, followed by CSI snapshot creation, followed by the post-backup hooks.
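
To illustrate, a hypothetical ItemBlockAction for cass-operator pods might look roughly like the sketch below. The interface shape (`AppliesTo`/`GetRelatedItems`) follows the proposal in this design rather than a shipped Velero API, and the label selector and hard-coded sibling pod names are made-up examples; a real plugin would discover the sibling pods from the cluster:

```go
package cassandra

import (
	corev1api "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/kuberesource"
	"github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// CassandraItemBlockAction is a hypothetical IBA for cass-operator pods.
type CassandraItemBlockAction struct {
	// a kube client for listing the datacenter's other pods would be injected here
}

// AppliesTo limits this action to pods carrying an example cass-operator label.
func (a *CassandraItemBlockAction) AppliesTo() (velero.ResourceSelector, error) {
	return velero.ResourceSelector{
		IncludedResources: []string{"pods"},
		LabelSelector:     "app.kubernetes.io/managed-by=cass-operator",
	}, nil
}

// GetRelatedItems returns the sibling Cassandra pods, so that all of them (and,
// via the pod-level IBA, their PVCs/PVs) land in one ItemBlock and their hooks
// run around a single grouped backup.
func (a *CassandraItemBlockAction) GetRelatedItems(item runtime.Unstructured, backup *velerov1api.Backup) ([]velero.ResourceIdentifier, error) {
	pod := new(corev1api.Pod)
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(item.UnstructuredContent(), pod); err != nil {
		return nil, err
	}

	// A real plugin would list pods in pod.Namespace sharing the same
	// datacenter label; the sibling names are hard-coded here for brevity.
	siblings := []string{"cluster1-dc1-default-sts-0", "cluster1-dc1-default-sts-1", "cluster1-dc1-default-sts-2"}

	related := []velero.ResourceIdentifier{}
	for _, name := range siblings {
		if name == pod.Name {
			continue
		}
		related = append(related, velero.ResourceIdentifier{
			GroupResource: kuberesource.Pods,
			Namespace:     pod.Namespace,
			Name:          name,
		})
	}
	return related, nil
}
```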

Alternate design considerations

New plugin type ItemBlockAction

On the last community meeting, Daniel suggested we might consider creating a new ItemBlockAction plugin type instead of extending BackupItemAction. While this will require more upfront work than extending BIA, once the plugin API is added, there's very little in the overall design here that's changed. I expect that it will work just as well (no better, no worse) as the proposed BIAv3 design. The work required for plugin authors is also approximately the same. Because of this, while I personally have a mild preference for the BIAv3 design, I don't have any strong objections to the ItemBlockAction approach. This has been added to the "Alternatives considered" section of the design.

@Lyndon-Li
Contributor

Lyndon-Li commented May 7, 2024

@sseago
@blackpiglet and I discussed this design further, and we have some shared thoughts/questions, as below:

  1. We have at least two item grouping requirements --- a. user-provided grouping, which gives no indication of why the items are grouped together; b. automatic grouping of logically associated items, e.g., pod-PVC pairs.
  2. For user-provided grouping, we cannot solve all the scenarios with the BIA and GetAdditionalItem method. E.g., as a valid performance-improvement case, I want to group every 100 items even though these items have no relationship with each other. How do we handle this case?
  3. For logically associated item grouping, the backup item collection is in alphabetical order, so we cannot ensure the higher-level item is visited before the lower-level item. E.g., for a pod-PVC pair, it is possible that the PVC is visited first, so we may create an item block for the PVC; then we visit the pod, and we may create another item block for the pod and then get the PVC via the GetAdditionalItem method. Finally, we have two item blocks which should be merged together. How do we handle this case?
  4. Why do we need to have all the items in the item blocks initially? We think we only need to group the "top level" items; the lower-level items then go into the same item block naturally since they will be returned by the BIA's Execute. If so, the GetAdditionalItem method is not required. If the only reason to do this is to avoid conflicts between two item blocks (items being in more than one item block), we don't think it is a good bargain, as it changes things heavily and makes them much more complex.

Finally, we want to share our suggestions as below:

  1. We split the collection/creation of item blocks into two steps.
  2. For the first step, which happens in the existing item collection code, we don't use BIA. We create the item blocks according to user input (for user-provided grouping) for the top-level items. Top-level items means the well-known items with no dependencies. If we don't know whether an item is a top-level item, we don't support grouping it. After this step, we've created all the item blocks.
  3. If point 4 above is necessary, we do the second step -- we call the GetAdditionalItem method for the existing items in each block, but during this step, no new item block is created. This can be done in parallel since we have a fixed number of item blocks.

@sseago sseago force-pushed the backup-perf-design branch 2 times, most recently from 04e7449 to 1893a1d Compare May 8, 2024 21:28
@sseago
Collaborator Author

sseago commented May 13, 2024

@sseago @blackpiglet and I discussed this design further, and we have some shared thoughts/questions, as below:

  1. We have at least two item grouping requirements --- a. user-provided grouping, which gives no indication of why the items are grouped together; b. automatic grouping of logically associated items, e.g., pod-PVC pairs.

I think this distinction is incorrect. All of the groupings are for logically associated items. The distinction is as follows:

  1. Related item associations known to Velero developers and applies to core kubernetes resources (pod-pvc-pv, etc.)
  2. Related item associations not known to velero developers or only apply to external types or applications (see the cass-operator example given)

In particular, the external groupings are related to item dependencies/associations, just like the internal ones. Also, these aren't provided by end users but by application or infrastructure developers. End users are not expected to know or understand these dependencies at the individual resource level.

  1. For user-provided grouping, we cannot solve all the scenarios with the BIA and GetAdditionalItem method. E.g., as a valid performance-improvement case, I want to group every 100 items even though these items have no relationship with each other. How do we handle this case?
  1. Let's make a distinction here: Why do we need ItemBlocks? To allow us to make performance improvements such as backing up items in parallel. What belongs to a particular ItemBlock? That is determined by resource interdependencies and the need for data integrity of the backup. In other words, what goes into a particular ItemBlock relates to data integrity concerns, so "I want to group every 100 items" is irrelevant here. If we determined that this sort of grouping actually made things faster (and I don't think it will), then that would be resolved by backing up 100 ItemBlocks together, not by shoving 100 unrelated items into an ItemBlock.
  2. These aren't "user-provided" but "application/infrastructure developer provided", so requiring end users to define them per-backup does not meet the requirements.
  3. The ItemBlockAction(IBA)/GetRelatedItems enhancement proposed does solve all the scenarios. It's flexible enough to allow us to handle any interdependencies that can be determined from the item itself.
  1. For logically associated item grouping, the backup item collection is in alphabetical order, so we cannot ensure the higher-level item is visited before the lower-level item. E.g., for a pod-PVC pair, it is possible that the PVC is visited first, so we may create an item block for the PVC; then we visit the pod, and we may create another item block for the pod and then get the PVC via the GetAdditionalItem method. Finally, we have two item blocks which should be merged together. How do we handle this case?
  1. For pod/pvc/pv cases in particular, this is handled by the ItemCollector by ordering Pods, PVCs, and PVs (in that order) before the alphabetical grouping of everything else.
  2. For any solution requiring plugins that is also pod-centric (for example the cass-operator case described already), the existing order, starting with pods, already handles it. Basically, for any cass-operator pod, the registered cass-operator plugin will return the other pods for any pod belonging to a cass-operator StatefulSet, so whichever order the pods are processed in, the first one to be processed will group all of them into the same ItemBlock.
  3. For any external plugin situation not involving pods, the plugin author should express the dependency in both directions -- if the block includes Item1 and Item2, then Item1's GetRelatedItems returns Item2, and Item2's GetRelatedItems returns Item1. Whichever is processed first will add both to the ItemBlock, so we are guaranteed that both items will be in the same block.
  1. Why do we need to have all the items in the item blocks initially? We think we only need to group the "top level" items; the lower-level items then go into the same item block naturally since they will be returned by the BIA's Execute. If so, the GetAdditionalItem method is not required. If the only reason to do this is to avoid conflicts between two item blocks (items being in more than one item block), we don't think it is a good bargain, as it changes things heavily and makes them much more complex.
  1. We need them initially because otherwise two ItemBlocks could run concurrently and end up backing up two items that should be in the same ItemBlock separately instead. For example:
    • Let's say that we're processing Pods from the item collector, so we put each pod in an ItemBlock. If we're not calling GetRelatedItems, then we don't yet know that they may have PVCs that belong in the same block.
    • If we only have one pod, that one goes into block1; add it to the worker pool input channel and move to the next item in the collector output.
    • Next up is a PVC, so we put it into its own block (block2) and add it to the worker channel.
    • Two workers grab these two items and start processing them. The Pod and the PVC (which should have been in the same block) start processing at the same time.
    • block1 runs pod hooks, then calls BIAs on the pod; the existing BIAv1 Execute returns the PVC as an additional item.
    • Parallel to the above, block2 starts backing up the PVC, starting snapshot creation -- this may start before the pre hooks complete.
    • block1 processing takes the PVC additional item return and calls backupItem on it, which returns immediately, since this PVC is already being backed up and is listed in the BackedUpItems map.
    • Post hooks run on the pod (snapshot creation isn't done yet, since that's being done in the wrong worker thread).
    • block1 is done.
    • block2 completes.
  2. Basically, if we don't have all the items in the block, we gain no value from the ItemBlock concept. The ability to have a complete block (except for items that don't actually exist yet, such as VolumeSnapshots) was the motivation to create this in the first place -- it's how we prevent out-of-order processing like in the above example.
  3. "we don't think it is a good bargain, as it changes things heavily and makes them much more complex" --
    handling dependencies is inherently more complex than not handling them. If we don't introduce new complexities, then we will be unable to resolve the provided use cases. Is there any disagreement that we must solve the use cases in front of us, even if it introduces some new complexities?

Finally, we want to share our suggestions as below:

  1. We split the collection/creation of item blocks into two steps.
  2. For the first step, which happens in the existing item collection code, we don't use BIA. We create the item blocks according to user input (for user-provided grouping) for the top-level items. Top-level items means the well-known items with no dependencies. If we don't know whether an item is a top-level item, we don't support grouping it. After this step, we've created all the item blocks.

This can't come from user input -- it comes from application/operator developers, which means even if we did it during item collection, it would need to be a plugin of some sort. That plugin would need to take in a resource (with content) and return a list of items that need to be associated with it. So doing this in the ItemCollector won't be any simpler than defining a new ItemBlockAction plugin which takes in the exact same thing and returns the same list of resources. Doing it in the item collector would be more complicated, though, since then we'd need to change the whole output structure to return groups of items rather than just a list, and on top of that, since the associated items may not have been listed from the cluster yet, we would need to pull them out individually with a Get call, and then we'd need to check every future item as we collect it to make sure it's not already in a separate group.

Essentially we'd be taking all of the item block processing logic that the design proposes post-item-collection and doing it during item collection, but since we don't have the full item list at this point, it's much less efficient. Is there any disagreement here?

  1. If point 4 above is necessary, we do the second step -- we call the GetAdditionalItem method for the existing items in each block, but during this step, no new item block is created. This can be done in parallel since we have a fixed number of item blocks.

We don't have a fixed number of ItemBlocks because as we process the list we get from the collector, we end up adding items from further down the list into the current block. By the time we have the full list of item blocks, we are almost done with processing. More to the point, items returned by the IBA plugin API call don't create new ItemBlocks; they add later items to the current block.

@shawn-hurley
Contributor

Hey folks,

I want to add a small summary to try and see if we can get agreement on some points. The goal here is to lay out some high-level concepts to make sure that we agree with what is proposed and that it solves the problem.

Points

  1. Item blocks solve two problems
    1. They allow for the grouping of resources that must be run in a way where all pod hooks are run, then backed up, and then post hooks are run. This allows us to get closer to crash consistency. While Volume Group Snapshot (VGS) is the only way to do this with 100% certainty, it might not be available for every CSI driver. This helps us fix long-standing hidden issues in the current backup flow.
    2. This allows for the safe processing of parts of a backup in parallel. @sseago has laid out why we need to be concerned about this.
  2. Creating Item Blocks should happen after item collection with a plugin.
    1. The first reason is exposing this for other cluster components and/or operators to move their applications to more crash consistency.
    2. A generic way to determine this grouping will make the code simpler as there will be one time and place it happens, instead of hard-coded logic potentially all over the place.
    3. Doing this in item collection will make it harder to implement both the plugin and the logic to create blocks. It will also make the whole process slower because once a block is created, we will not be able to send it to start the backup when we start processing in parallel.
  3. Processing backup items in parallel will make the system faster.
    1. Processing 1 item in a thread with a buffer will make the system faster. The only reason to group items for processing in a thread is safety, as outlined above. Grouping items for any other reason will cause a slowdown.
    2. Having threads handle item blocks instead of individual items will speed up the system considerably.
    3. While there is a potential for some extra memory usage, we are talking about a single struct with HOPEFULLY read-only pointers to things already in memory. Because of this, I suspect that memory usage should be negligible.

I hope that we can gain agreement on these three points; if we can see what we agree and don't agree on, the disagreements become clear from here. I am worried that it isn't clear which points are still under discussion and which we agree on, so a level-set at this level will be helpful.

#### Proto definition (compiled into golang by protoc)

The ItemBlockAction plugin type is defined as follows: […]
Contributor

all code blocks are Go code examples, right?

would add language to these code blocks for syntax highlighting, to improve reading

i.e., add "go" after each opening fence so the code blocks render with Go syntax highlighting.

Collaborator Author

@mateusoliveira43 most of them, yes. I'll add it where appropriate. The one exception is the protobuf API spec section.


## High-Level Design

### Phase 1: ItemBlock processing
Contributor

I still wish to suggest that a section be added in the High-Level Design to highlight what an "ItemBlock" is, what constraints it has, and how velero will handle such ItemBlocks.
This will improve the readability of this design and help other adopters determine whether they want to implement an "IBA" and what will be put into an ItemBlock.

Collaborator Author

Good idea. I'll add that.


A new PluginKind, `ItemBlockActionV1`, will be created, and the backup process will be modified to use this plugin kind.

Any BIA plugin which returns additional items from `Execute()` that need to be backed up at the same time (or sequentially, in the same worker thread) as the current item should also register a new IBA plugin which returns these same items (minus any which won't exist before BIA `Execute()` is called).
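
For reference, the Go interface behind the new `ItemBlockActionV1` PluginKind might look something like the sketch below, based on the method names used in this design discussion (`AppliesTo`, `GetRelatedItems`); the exact signatures are an assumption, not the merged API:

```go
package itemblockaction

import (
	"k8s.io/apimachinery/pkg/runtime"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// ItemBlockAction (IBA) groups related items into a single ItemBlock so they
// are backed up together in one worker, with coordinated hooks.
type ItemBlockAction interface {
	// Name returns the name of this IBA plugin.
	Name() string

	// AppliesTo returns information about which resources this action should
	// be invoked for.
	AppliesTo() (velero.ResourceSelector, error)

	// GetRelatedItems returns the items that must be backed up in the same
	// ItemBlock as the given item (only items that already exist at
	// ItemBlock-creation time; items created during BIA Execute() cannot be
	// included).
	GetRelatedItems(item runtime.Unstructured, backup *velerov1api.Backup) ([]velero.ResourceIdentifier, error)
}
```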
Contributor

Why is it needed? If the BIA is called in a thread which backs up items in an itemBlock, these additional items will be handled within BackupItemBlock right?

Collaborator Author

This is needed because this is the only way we make sure we guarantee that the items get backed up in the same worker thread at the same time. If we don't have an IBA action which adds these items to the item block, then all of the items will be considered as belonging to separate ItemBlocks. In some scenarios things will work just fine, but in others they may not.

As an example, let's take a backup with Pods, PVCs, and PVs and nothing else -- each Pod contains one mounted PVC bound to a PV. If we create IBA plugins for Pod and PVC, then when we create ItemBlocks, each block will contain one pod, one PVC, and one PV. If there's just one pod, there's one item block. If there are 100 pods, there are 100 ItemBlocks. If we don't create the new IBA plugins for Pod and PVC, then Velero will just have isolated ItemBlocks with Pods, then PVCs, then PVs, since the IBA action is what groups items into blocks.

Now let's look at two scenarios -- the first will work as you expect (additional items will be handled within BackupItemBlock), and in the second they may not.

Scenario 1: Number of pods is much larger than number of ItemBlock workers.
In this case, say we have 100 pods and 5 workers. While the pre-calculated ItemBlocks will only have one item each, the Pod blocks will be first, followed by the PVC blocks, then the PV blocks. So the first 5 pod blocks get sent to the 5 workers. As each is processed, the Execute() call pulls in the additional items, so we get each pod backed up with its PVC and PV, and all 3 items are marked as already backed up. As one finishes, another Pod block is pulled off the queue. By the time the first PVC block gets pulled in, 96 out of 100 pods (along with their PVCs) have already been backed up, so this PVC has a 96% chance of already being backed up -- the first thing Velero does in BackupItem is to see whether this item has already been backed up, and if it has, it exits immediately. The block is done, pull the next, etc. It's possible that all 100 PVCs will have already been handled before their PVC blocks arrive in the worker pool -- at most, 4 of these may not have, in which case PVC backup and pod hooks may run out-of-order -- which is the risk when we don't have an IBA plugin defined if there are interdependencies.

Scenario 2: There are many more worker threads than pods
For this scenario, let's assume there are only 2 pods but 10 workers. In this case, both pods, both PVCs, and both PVs are all sent to workers at the same time. So the likelihood is high that by the time the Pod plugin gets that PVC additional item return, the PVC has already started being backed up in another worker, so it will be skipped here. This means, especially if it's a large volume, that the post hook is likely to run before the PVC snapshot has been created.

Creating the IBA plugin eliminates this problem entirely. This is the whole reason we have the IBA plugin -- it enables us to create ItemBlocks based on item interdependencies. If we don't have this, then we can't create meaningful ItemBlocks.

Contributor

Thanks for the explanation. I didn't consider "scenario 2".
But in regards to the question of whether a BIA plugin returning additionalItems should have an IBA to group these items earlier, IMO the answer is still "that depends", because sometimes the additional items do not have to be in the same ItemBlock as the "original item". Another possibility is that some of these additional items may be created within the Execute() function of the BIA.

Collaborator Author

@reasonerjt "Another possibility is that some of these additional items may be created within the Execute() function of BIA." -- yes, this is called out explicitly as a case where the items will not be in the ItemBlock. Actually, your "that depends" is also covered in the text that we're responding to now. "For any BIA plugins which return additional items from Execute() that need to be backed up at the same time or sequentially in the same worker thread as the current items should add a new IBA plugin to return these same items" -- the key part here is "that need to be backed up at the same time or sequentially in the same worker thread" -- so "sometimes the additional items do not have to be backed up in the same ItemBlock with the original item", in that case they do not need to be in the IBA. What you said in your comment and what's in the design say the same thing, just in different words.

- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with similar anonymous func to current impl)
- Get matching IBA plugins for item, call `GetRelatedItems` for each. For each item returned, get full item content from ItemCollector (if present in item list, pulling from file, removing file when done) or from cluster (if not present in item list), add item to the current block, add item to `itemsInBlock` map, and then recursively apply current step to each (i.e. call IBA method, add to block, etc.)
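A rough, self-contained sketch of this grouping loop, using plain string keys and a toy `mapAction` in place of the real item structs and the IBA plugin interface (all names here are illustrative, and the file-loading details above are omitted):

```
package main

import "fmt"

// Toy stand-ins for the item collector output and ItemBlockAction plugins;
// real Velero types and plugin interfaces differ.
type itemBlockAction interface {
	GetRelatedItems(key string) []string
}

type mapAction map[string][]string

func (m mapAction) GetRelatedItems(key string) []string { return m[key] }

// groupItemBlocks walks the collected items in order, starting a new block
// for each item not yet assigned and recursively pulling in related items
// returned by the registered actions (itemsInBlock mirrors the design's bookkeeping).
func groupItemBlocks(items []string, actions []itemBlockAction) [][]string {
	itemsInBlock := map[string]bool{}
	var blocks [][]string

	var add func(block *[]string, key string)
	add = func(block *[]string, key string) {
		if itemsInBlock[key] {
			return // already assigned to some ItemBlock
		}
		itemsInBlock[key] = true
		*block = append(*block, key)
		for _, a := range actions {
			for _, related := range a.GetRelatedItems(key) {
				add(block, related)
			}
		}
	}

	for _, it := range items {
		if itemsInBlock[it] {
			continue // already pulled into an earlier block
		}
		var block []string
		add(&block, it)
		blocks = append(blocks, block)
	}
	return blocks
}

func main() {
	// Pod mounts PVC, PVC is bound to PV; the actions mirror the pod/pvc IBA
	// plugins discussed in this design.
	actions := []itemBlockAction{mapAction{
		"pods/ns1/pod1":                   {"persistentvolumeclaims/ns1/pvc1"},
		"persistentvolumeclaims/ns1/pvc1": {"persistentvolumes/pv1", "pods/ns1/pod1"},
	}}
	fmt.Println(groupItemBlocks(
		[]string{"pods/ns1/pod1", "persistentvolumeclaims/ns1/pvc1", "persistentvolumes/pv1"},
		actions))
	// Output: [[pods/ns1/pod1 persistentvolumeclaims/ns1/pvc1 persistentvolumes/pv1]]
}
```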
Contributor

(i.e. call BIA method, add to block, etc.)

Do you mean call "BIA" method or "IBA" method?

Collaborator Author

@reasonerjt good catch. This was originally BIA, but we recently changed the design to use the new plugin type and I forgot to change this one. I'll fix it.

In the `BackupWithResolvers` func, the current Velero implementation iterates over the list of items for backup returned by the Item Collector. For each item, Velero loads it from the file created by the Item Collector, calls `backupItem`, updates the GR map if successful, removes the (temporary) file containing item metadata, and updates progress for the backup.

#### Modifications to the loop over ItemCollector results
The `kubernetesResource` struct used by the item collector will be modified to add an `orderedResource` bool, which will be set to true for all resources moved to the beginning of the list because they are ordered resources.
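For illustration, the change might look roughly like this (the pre-existing fields shown are abbreviated and may not match the current code exactly):

```
// kubernetesResource is the struct the item collector produces for each collected item.
type kubernetesResource struct {
	groupResource schema.GroupResource
	preferredGVR  schema.GroupVersionResource
	namespace     string
	name          string
	path          string // temp file holding the collected item

	// orderedResource is true for items moved to the front of the list because
	// they appear in the backup spec's OrderedResources; these are backed up
	// synchronously before parallel ItemBlock processing begins.
	orderedResource bool
}
```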
Contributor

Could you explain the purpose of adding the orderedResource field to kubernetesResource?
Please specify that the purpose is to respect the setting in the backup CR's spec, if that's the case. Please also clarify whether all of these "orderedResources" items will be put into one ItemBlock.


I see there are several comments about this topic, IMO they should be also reflected in this design.

Collaborator Author

@reasonerjt the reason we need this is that the existing OrderedResources field on the Velero backup spec allows a user to say "we need these specific resources backed up in this order, before anything else is backed up" -- the only way we can honor this with the new design is to back up all of the ordered resources first, waiting for each to be done before the next is backed up. While the ItemCollector currently places the ordered resources first in the list, it doesn't provide any indication as to which resources are "ordered" resources and which are not, so this "orderedResource" bool will tell Velero that this is one to back up synchronously before moving on to the async ones. I'll add some more text to the design to clarify this.

Contributor

Thanks, or shall we just simplify it by not introducing multi-threading when the OrderedResource is not nil in the spec?

Collaborator Author

I think that sort of simplification is unnecessary -- a user adds a single "back this up first" item and then we can't back up in parallel no matter how large the backup is. All we need here is to know which resources are the "ordered resources", we can back those up one at a time, and then after that we can use all the workers. It's a small amount of complexity for a potentially large benefit.
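A minimal sketch of that approach, assuming each ItemBlock records whether it was built from an ordered resource and that non-ordered blocks are handed to workers over a channel (names like `inputChan` are assumptions):

```
// The item collector places ordered resources at the front of the list, so
// iterating in order and processing those blocks synchronously guarantees
// they complete before any parallel work starts.
for _, block := range itemBlocks {
	if block.orderedResource {
		backupItemBlock(block) // wait for completion before continuing
		continue
	}
	inputChan <- block // remaining blocks are processed in parallel by workers
}
```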


After backing up the items in the block, we now execute post hooks using the same filtered item list we used for pre hooks, again taking the logic from `itemBackupper.backupItemInternal`.
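The intended ordering inside `backupItemBlock` is roughly as follows; the helpers (`itemsWithHooks`, `runPreHooks`, `runPostHooks`, and the exact `backupItem` signature) are placeholders for illustration, not the final API:

```
func backupItemBlock(block ItemBlock) []schema.GroupResource {
	// 1. Run pre hooks for any pods in the block before touching volumes.
	pods := itemsWithHooks(block) // illustrative helper returning the filtered item list
	for _, pod := range pods {
		runPreHooks(block, pod)
	}

	// 2. Back up every item in the block; BackupItem exits early for items
	//    that have already been backed up elsewhere.
	var backedUp []schema.GroupResource
	for _, item := range block.items {
		if gr, ok := backupItem(block, item); ok {
			backedUp = append(backedUp, gr)
		}
	}

	// 3. Run post hooks on the same filtered list once the block's items
	//    (including any volume snapshots started in step 2) have been processed.
	for _, pod := range pods {
		runPostHooks(block, pod)
	}
	return backedUp
}
```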

#### `itemBackupper.backupItemInternal` cleanup
Contributor

will backupItem be removed?

Collaborator Author

@reasonerjt No. BackupItemBlock calls BackupItem in turn on each item in the ItemBlock, so that will remain. It will just be called from a different place than it is now -- and it will also be called within BackupItemInternal on the returned additional items.

Contributor

Sorry this is related to another comment of mine. I was thinking backupItemBlock will be part of kubernetesBackupper and it will replace kubernetesBackupper.backupItem.

```
type ItemBlock struct {
	log           logrus.FieldLogger
	// Note: every ItemBlock in a given backup shares the same itemBackupper instance.
	itemBackupper *itemBackupper
	// (other fields, e.g. the items in the block, omitted here)
}
```
Contributor

Do we need different itemBackupper instances in different ItemBlocks?
There are trackers in the itemBackupper struct which collect information throughout the backup procedure; if we allow different itemBackuppers, we'll also need to consolidate the info collected by each tracker.

I also have doubts about having the logger within ItemBlock, but this is trivial...

Collaborator Author

The ItemBackupper is connected to the backup, so we won't want a different instance for each block. The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access. This is the same reason we need the logger -- so that when the worker thread calls BackupItemBlock it can pass in the correct logger for the current backup so that the logs get uploaded to the BSL, etc. Basically everything that's currently passed into BackupItem needs to be made available in the ItemBlock struct.

Contributor

> The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access.

Thanks for the reply. IMO we need to enforce the constraint that all ItemBlocks belonging to one backup share one itemBackupper instance -- for example, add a reference to the BackupRequest to the ItemBlock struct and a reference to the itemBackupper to the BackupRequest, or at least add a comment to highlight this in the definition of the ItemBlock struct.

Collaborator Author

I think adding the comment is probably sufficient here -- there's only one place in the code where ItemBlocks are created, and that will always pass a reference to the existing itemBackupper for the backup.


Method signature for new func `backupItemBlock` is as follows:
```
func backupItemBlock(block ItemBlock) []schema.GroupResource
```
Contributor

is it func(kb *kubernetesBackupper) backupItemBlock(....)?

Collaborator Author

I'm not quite sure. It's not an interface method for the Backupper interface, and we're not calling it from one of the specific KubernetesBackupper interface method implementations -- since it's being called from the worker threads, it's probably better as a top-level func.

Contributor

IMO this should be a part of kubernetesBackupper, but we can double check during the implementation.

Collaborator Author

Hmm. So if we added this to kubernetesBackupper, then we'd need to pass a reference to that, in addition to the ItemBlock, into the worker thread -- it might complicate things too much. But I think this is a detail we can work through during the implementation -- if it makes more sense to do it that way, we can consider it.


## Detailed Design

### Phase 1: ItemBlock processing
Contributor

Does ItemBlock only impact the Backup func of kubernetesBackupper?

Will it also impact FinalizeBackup?

Please clarify it in this design.

Collaborator Author

For finalize, I don't think we need to deal with ItemBlocks, since we're just updating items based on completed async actions. I'll update the design to make this clear.


After implementing backup hooks in `backupItemBlock`, hook processing should be removed from `itemBackupper.backupItemInternal`.

### Phase 2: Process ItemBlocks for a single backup in multiple threads
Contributor

I think more details should be added to this section, like how to process the returned value from returnChan
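As a rough illustration of how the worker pool and the `returnChan` handling mentioned here could fit together (type and channel names are assumptions, not the final design):

```
// ItemBlockInput pairs an ItemBlock with the channel its result is sent back
// on, so a shared worker pool can later serve multiple concurrent backups.
type ItemBlockInput struct {
	itemBlock  ItemBlock
	returnChan chan ItemBlockReturn
}

type ItemBlockReturn struct {
	itemBlock ItemBlock
	resources []schema.GroupResource
}

// One of N workers started with the backup controller; each loops over the
// shared input channel until the controller context is cancelled.
func runItemBlockWorker(ctx context.Context, inputChan chan ItemBlockInput) {
	for {
		select {
		case in := <-inputChan:
			grs := backupItemBlock(in.itemBlock)
			in.returnChan <- ItemBlockReturn{itemBlock: in.itemBlock, resources: grs}
		case <-ctx.Done():
			return
		}
	}
}
```

On the backup side, the main loop would send each block to `inputChan` and then drain `returnChan` as results arrive, using them to update the backed-up-resources map and the backup's progress.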

Lyndon-Li commented May 21, 2024

@sseago @shawn-hurley
One more question about a concrete case:

  • We have Pod1->PVC1->PV1 and Pod2->PVC2->PV1, that is PV1 is a ReadWriteMany/ReadOnlyMany PV mounted to both Pod1 and Pod2.
  • What happens with this case? Ideally, Pod1, Pod2, PVC1, PVC2 and PV1 should be in the same itemBlock since we need the hooks for both Pod1 and Pod2 to be executed before backing up the PV. Can we achieve this and how?

sseago commented May 22, 2024

@Lyndon-Li That is a good question, actually. First, to clarify, a PV can only be bound by one PVC, even if it's RWX, so your scenario above would be "Pod1->PVC1->PV1 and Pod2->PVC1->PV1" -- the same PVC mounted by both pods. This is an example of something that we don't actually handle well right now at all, since we process everything sequentially, so with current Velero 1.14, Pod1 pre hooks run, we back up PVC1, PV1, then Pod1 post hooks, then Pod 2 gets backed up, pre-hooks run (but PV has already been backed up).

I think this can be resolved without any changes to the design. We just need the PVC IBA plugins to return dependencies in both directions.

So we'd have the following IBA plugins:

  1. pod plugin: returns a list of mounted PVCs
  2. pvc plugin: returns a list of bound PVs and (new addition to handle RWX/ROX) a list of pods that this PVC belongs to.
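For illustration only, a sketch of what such a PVC plugin's `GetRelatedItems` could look like -- the method signature, the receiver type, and the `podLister` lookup are assumptions, not the final API:

```
// Illustrative only: groups the bound PV and any pods mounting this PVC into
// the same ItemBlock (the pod lookup covers the RWX/ROX shared-PVC case).
func (a *PVCItemBlockAction) GetRelatedItems(item runtime.Unstructured, backup *velerov1api.Backup) ([]velero.ResourceIdentifier, error) {
	pvc := new(corev1.PersistentVolumeClaim)
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(item.UnstructuredContent(), pvc); err != nil {
		return nil, err
	}
	var related []velero.ResourceIdentifier
	if pvc.Spec.VolumeName != "" {
		related = append(related, velero.ResourceIdentifier{
			GroupResource: kuberesource.PersistentVolumes,
			Name:          pvc.Spec.VolumeName,
		})
	}
	// Find pods in the PVC's namespace that mount this claim.
	pods, err := a.podLister.Pods(pvc.Namespace).List(labels.Everything()) // assumed lister field
	if err != nil {
		return nil, err
	}
	for _, pod := range pods {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == pvc.Name {
				related = append(related, velero.ResourceIdentifier{
					GroupResource: kuberesource.Pods,
					Namespace:     pod.Namespace,
					Name:          pod.Name,
				})
			}
		}
	}
	return related, nil
}
```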

Lyndon-Li commented May 23, 2024

@sseago
Let's say the IBA is executed in order of Pod1, Pod2, PVC1

  1. After IBA of Pod1, an ItemBlock (ItemBlock1) is created with Pod1, PVC1, PV1.
  2. After IBA of Pod2, some more additional items are returned: PVC1, PV1. What do we do at this point? Will we create another ItemBlock, or merge Pod2 into the existing ItemBlock1? We may not merge them at this point, right? Otherwise, we would not need the IBA for PVC to return its associated pods.
  3. After IBA for PVC1, from what you mentioned, the additional items are Pod1, Pod2, PV1, so it looks like we need to consolidate all the items together, right? What algorithm do we use to do this?

So could you also add some details for how to create, maintain an itemBlock in various cases into the document?

sseago commented May 28, 2024

> @sseago Let's say the IBA is executed in order of Pod1, Pod2, PVC1

> 1. After IBA of Pod1, an ItemBlock (ItemBlock1) is created with Pod1, PVC1, PV1.

Note that as described above, PVC1's IBA returns Pod1 and Pod2, meaning that ItemBlock1 will be created with Pod1, PVC1, PV1, Pod2.

> 2. After IBA of Pod2, some more additional items are returned: PVC1, PV1. What do we do at this point? Will we create another ItemBlock, or merge Pod2 into the existing ItemBlock1? We may not merge them at this point, right? Otherwise, we would not need the IBA for PVC to return its associated pods.

No merging is needed, since Pod2 was already added to ItemBlock1 in the first place; when the ItemBlock creation loop gets to Pod2, it is skipped because Pod2 already belongs to an ItemBlock.

> 3. After IBA for PVC1, from what you mentioned, the additional items are Pod1, Pod2, PV1, so it looks like we need to consolidate all the items together, right? What algorithm do we use to do this?

The algorithm as described in the design does this. We have the 2 IBAs as defined above:

  1. pod plugin: returns a list of mounted PVCs
  2. pvc plugin: returns a list of bound PVs and a list of pods that this PVC belongs to.

Let's follow the logic defined in the design here:
  • We have a list of items from the item collector: Pod1, Pod2, PVC1, PV1
  • First loop iteration
    • Add Pod1 to ItemBlock1.
    • Call registered IBAs for Pod1: PVC1 returned.
    • Add PVC1 to ItemBlock1
    • Call registered IBAs for PVC1: PV1, Pod1, Pod2 returned.
    • Add PV1 to ItemBlock1 (no IBAs to run for PV1)
    • Pod1 already in an ItemBlock, nothing to do for it.
    • Add Pod2 to ItemBlock1
    • Call registered IBAs for Pod2: PVC1 returned
    • PVC1 already in an ItemBlock, nothing to do for it
  • Next loop iteration: Pod2 is already in an ItemBlock, continue
  • Next loop iteration: PVC1 is already in an ItemBlock, continue
  • Next loop iteration: PV1 is already in an ItemBlock, continue

> So could you also add some details for how to create, maintain an itemBlock in various cases into the document?

It sounds like what's missing is the specific example of an RWX volume shared by multiple pods -- I can add this to the ItemBlock examples in the document -- but to be clear, the only thing necessary to completely handle this use case is an IBA for PVCs which returns the list of mounting pods. I'll make sure the document makes that clear as well.
