
[Experiment] Explore running multiple containers in a shared VM #3658

Draft: wants to merge 18 commits into base: master
Conversation

eriknordmark (Contributor) commented Dec 9, 2023

For sidecar containers it would be useful to be able to run them in the same VM as the main container.
This is an experiment to see whether that can be done without any API changes by looking for multiple OCI-based volumes for a single app instance, and kicking off the EntryPoint for each one of them.

If this works, it might be a useful stepping stone toward a more complete standard runtime for multiple containers in one VM.

With this PR I can create an app instance which has two OCI images (in my example, the unmodified nginx and sshd containers from docker.io). The sshd and nginx containers run with chroot isolation and otherwise share everything, which matches the intended use case of close cooperation and trust between a sidecar container and the main container.

Note that the commits in this PR need to be cleaned up and squashed.
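The launch pattern described above (one rootfs per OCI volume, mounted at /mnt0, /mnt1, and so on, with each EntryPoint started under chroot) could be sketched roughly as below. This is an illustration only, not the PR's actual code; the function name and paths are assumptions, and the real work happens in the init-initrd scripts in pkg/xen-tools:

```go
package main

import "fmt"

// launchPlan builds, for each per-container mount point, the command an
// init script inside the VM might run: chroot into that container's
// rootfs and start its entrypoint in the background. All containers
// share the same kernel, network, and devices inside the one VM; only
// the filesystem view differs via chroot.
func launchPlan(mounts []string, entrypoints []string) []string {
	var cmds []string
	for i, m := range mounts {
		cmds = append(cmds, fmt.Sprintf("chroot %s/rootfs %s &", m, entrypoints[i]))
	}
	return cmds
}

func main() {
	// Hypothetical two-image app instance: nginx as the main
	// container plus sshd as a sidecar.
	mounts := []string{"/mnt0", "/mnt1"}
	entrypoints := []string{"/docker-entrypoint.sh", "/usr/sbin/sshd -D"}
	for _, c := range launchPlan(mounts, entrypoints) {
		fmt.Println(c)
	}
}
```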


codecov bot commented Dec 9, 2023

Codecov Report

Attention: 109 lines in your changes are missing coverage. Please review.

Comparison is base (3db87c6) 19.86% compared to head (6b09142) 19.87%.

Files Patch % Lines
pkg/pillar/hypervisor/xen.go 0.00% 45 Missing ⚠️
pkg/pillar/hypervisor/kvm.go 34.04% 31 Missing ⚠️
pkg/pillar/cmd/domainmgr/domainmgr.go 0.00% 26 Missing ⚠️
pkg/pillar/hypervisor/containerd.go 0.00% 6 Missing ⚠️
pkg/pillar/containerd/oci.go 90.90% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3658   +/-   ##
=======================================
  Coverage   19.86%   19.87%           
=======================================
  Files         231      231           
  Lines       51063    51160   +97     
=======================================
+ Hits        10143    10167   +24     
- Misses      40179    40253   +74     
+ Partials      741      740    -1     


deitch (Contributor) commented Dec 11, 2023

This is an experiment to see whether that can be done without any API changes by looking for multiple OCI-based volumes for a single app instance, and kicking off the EntryPoint for each one of them

How did you manage this? I see you switched away from the assumption that type DomainStatus is a single container with OCIConfigDir; instead it contains a list of containers, ContainerList []DomainContainerStatus, each of which has OCIConfigDir, ContainerIndex and FileLocation.

What I don't get is:

  • how did you manage to populate that without changing the AppInstanceConfig which has a single VmConfig? I guess it does have repeated Drive, but you still would need the entrypoint for each, as well as other info?
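The structural change being described here, with DomainStatus carrying a list of per-container statuses instead of a single OCIConfigDir, might look roughly like this sketch. The field names are taken from the comment above; everything else (types, paths) is assumed for illustration and does not reflect the actual definitions in pkg/pillar/types:

```go
package main

import "fmt"

// DomainContainerStatus describes one OCI container within a domain
// (VM), per the fields mentioned in this thread.
type DomainContainerStatus struct {
	OCIConfigDir   string // where the resolved OCI runtime config lives
	ContainerIndex int    // position among the app instance's OCI volumes
	FileLocation   string // path to the container's rootfs/image
}

// DomainStatus carries one entry per OCI volume rather than a single
// OCIConfigDir field.
type DomainStatus struct {
	ContainerList []DomainContainerStatus
}

func main() {
	ds := DomainStatus{ContainerList: []DomainContainerStatus{
		{OCIConfigDir: "/run/oci/0", ContainerIndex: 0, FileLocation: "/mnt0"},
		{OCIConfigDir: "/run/oci/1", ContainerIndex: 1, FileLocation: "/mnt1"},
	}}
	fmt.Println("containers:", len(ds.ContainerList))
}
```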

deitch (Contributor) commented Dec 11, 2023

Separately, what is the effort of this vs seeing if kata or k3s would be less effort?

uncleDecart (Contributor) commented Dec 11, 2023

Separately, what is the effort of this vs seeing if kata or k3s would be less effort?

@deitch from what I saw, kata supports KVM, but I didn't see support for other type 1 hypervisors (like Xen). Even if we accept something that is only supported on KVM, kata creates a VM instance with which it communicates via virtio, so we would still need at least one VM to run containers.
An alternative I also explored was unikernels like unikraft. In that case you need to support a pipeline for creating that specific VM and worry about porting libraries to it when you need them. Fair point that for a sidecar container we don't need much, but we would still have to maintain that. I believe the easiest way to reduce the footprint of running more containers is to run them in one VM; the others are good options, but will require significantly more time to research and develop.

Edit: there are also microVMs in KVM, which might reduce the footprint

# Shared scratch and IPC mounts set up inside the guest rootfs:
mount -t tmpfs -o nodev,nosuid,noexec,size=20% shm "$MNT"/rootfs/dev/shm
mount -t tmpfs -o nodev,nosuid,size=20% tmp "$MNT"/rootfs/tmp
mount -t mqueue -o nodev,nosuid,noexec none "$MNT"/rootfs/dev/mqueue
ln -s /proc/self/fd "$MNT"/rootfs/dev/fd
A contributor commented on this hunk:

Wouldn't sharing all these descriptors with all containers mess things up? Without a mux we will get mixed output on stdout, which might not be critical for now, but what about stdin? Do we care about it?

eriknordmark (Contributor Author) replied:

The use case for this experiment is when some sharing is ok, but I'll look at the list and see what makes sense to separate.
If the direction in this PR is useful (with its limitations) it might be a useful stepping stone to running a collection of containers (a pod) in a VM using something existing like kata or k3s.

deitch (Contributor) commented Dec 11, 2023

kata creates VM instance with which it communicates via virtio. So we will have to have at least one VM to run containers.

If we have no choice, then ok. I am just wary of yet again creating something that looks a lot like some other OSS project or library, but that we do just a little bit differently.

@github-actions github-actions bot requested a review from rene December 12, 2023 00:20
eriknordmark (Contributor Author) commented:
  • how did you manage to populate that without changing the AppInstanceConfig which has a single VmConfig? I guess it does have repeated Drive, but you still would need the entrypoint for each, as well as other info?

The current API allows specifying any number of virtual disks, whether they are OCI or images.
We currently don't do anything with the EntryPoint etc in anything but the first OCI image.
This experiment uses that.
So the single vmconfig still has to specify the CPU, memory, adapters, direct attach, etc.
But each OCI image has its own environment, user/group, etc.
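In other words, one VmConfig covers the VM-level resources while each OCI image contributes its own process-level settings. A rough sketch of that split follows; the type and field names here are illustrative assumptions, not the actual pillar API types:

```go
package main

import "fmt"

// VmConfig holds the per-VM resources that remain singular in the API:
// CPU, memory, adapters, direct attach, etc.
type VmConfig struct {
	VCPUs    int
	MemoryMB int
}

// OCISettings holds what each OCI image contributes individually:
// entrypoint, environment, and user/group.
type OCISettings struct {
	EntryPoint []string
	Env        []string
	User       string
}

// describe renders how one VM definition fans out into several
// per-container process configurations.
func describe(vm VmConfig, images []OCISettings) []string {
	var out []string
	for i, img := range images {
		out = append(out, fmt.Sprintf("container %d: user=%s entry=%v (VM: %d vCPU, %d MB)",
			i, img.User, img.EntryPoint, vm.VCPUs, vm.MemoryMB))
	}
	return out
}

func main() {
	vm := VmConfig{VCPUs: 2, MemoryMB: 512}
	images := []OCISettings{
		{EntryPoint: []string{"/docker-entrypoint.sh"}, User: "root"},
		{EntryPoint: []string{"/usr/sbin/sshd", "-D"}, User: "sshd"},
	}
	for _, line := range describe(vm, images) {
		fmt.Println(line)
	}
}
```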

deitch (Contributor) commented Dec 12, 2023

The current API allows specifying any number of virtual disks, whether they are OCI or images.
We currently don't do anything with the EntryPoint etc in anything but the first OCI image.
This experiment uses that.
So the single vmconfig still has to specify the CPU, memory, adapters, direct attach, etc.
But each OCI image has its own environment, user/group, etc

That is how you did it. Now I see it inside. That is rather nicely done. It feels a bit like swimming against the stream, since Kubernetes has a native "multiple containers together" (i.e. Pod) concept, but we do as we need.

eriknordmark (Contributor Author) commented:

That is how you did it. Now I see it inside. That is rather nicely done. It feels a bit like swimming against the stream, since Kubernetes has a native "multiple containers together" (i.e. Pod) concept, but we do as we need.

From an implementation perspective I expect this to go away (together with the rest of the init-initrd scripts in pkg/xen-tools) once we find and integrate a standard runtime for all of this. And that should presumably give us the ability to specify e.g., volume and network resources for the containers inside the pod VM. So a stepping stone from a functional perspective, and a limited amount of throw-away code.

Add support for /mnt%d per OCI

Signed-off-by: eriknordmark <[email protected]>
shjala (Member) commented Jan 29, 2024

I have some security concerns about running multiple apps in one VM without mapping each container to a separate user, and about the fact that we are not setting up namespaces like mount and others. I'll write a longer comment soon.

eriknordmark (Contributor Author) commented:

I have some security concerns about running multiple apps in one VM without mapping each container to a separate user, and the fact that we are not setting up namespaces like mount and other. I'll write a longer comment soon.

@shjala
The assumption is that this will only be used by sidecar containers which do need full access, hence no namespace isolation from the main container. I don't know whether we can enforce that by checking the content of the container; need to look at the test case we used.

5 participants