Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor BPF recorder to enable simultaneous profile recording #2260

Merged
merged 3 commits into from
May 17, 2024

Conversation

mhils
Copy link
Contributor

@mhils mhils commented May 14, 2024

What type of PR is this?

/kind bug
/kind cleanup
/kind feature

What this PR does / why we need it:

This PR:

  1. Separates the BPF recorder from Seccomp and AppArmor logic and puts those parts into bpfrecorder_seccomp.go and bpfrecorder_apparmor.go respectively.
  2. Updates the BPF recorder to use LSM and tracepoint/sched hooks instead of hooking syscalls. This increases recorder coverage (for example, we previously missed some lockfiles, which caused incomplete profiles) and enables us to use bpf_d_path() to get absolute file paths in all cases. It also fixes tracking of child-child-processes. This makes the AppArmor recorder work flawlessly with several real-world workloads.
  3. Adds --type all for the recorder to simultaneously record AppArmor and Seccomp profiles.

Which issue(s) this PR fixes:

Fixes #2255

Does this PR have test?

Yes.

Special notes for your reviewer:

The good news is that we remove significantly more lines than we add, the bad news is that this PR is still quite large. This thing emerged from a rather iterative integration hell, let me know if we need to split it up further. :)

Does this PR introduce a user-facing change?

`spoc` can now record Seccomp and AppArmor profiles simultaneously.
The AppArmor recorder is now significantly more robust

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 14, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mhils. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 14, 2024
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 14, 2024
@saschagrunert
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2024
@codecov-commenter
Copy link

codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 51.57068% with 185 lines in your changes are missing coverage. Please review.

Project coverage is 41.82%. Comparing base (11d77f4) to head (0706fb8).
Report is 212 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2260      +/-   ##
==========================================
- Coverage   45.50%   41.82%   -3.68%     
==========================================
  Files          79      109      +30     
  Lines        7782    15534    +7752     
==========================================
+ Hits         3541     6497    +2956     
- Misses       4099     8561    +4462     
- Partials      142      476     +334     

@mhils mhils force-pushed the integration-hell branch 3 times, most recently from 9873650 to 6b36a61 Compare May 14, 2024 12:12
Copy link
Contributor

@ccojocar ccojocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a great refactoring! I didn't spot any issues. 😄

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 14, 2024
@ccojocar
Copy link
Contributor

@mhils I think there are some lint error to fix.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2024
mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 14, 2024
mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 14, 2024
mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 14, 2024
@mhils
Copy link
Contributor Author

mhils commented May 14, 2024

Thank you for the super quick review, @ccojocar - you folks are fantastic! 😃 🍰

CI has unfortunately uncovered a new issue I wasn't aware of before: Modern kernels all have BPF LSM support, but Debian/Ubuntu has not enabled it by default (see lockc-project/lockc#159 and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2054810). This means the AppArmor recorder won't work on these systems properly as these hooks are silently ignored (Exhibit A: test / e2e-spoc fails in CI). Unfortunately there are no good workarounds. We do really want bpf_d_path to get absolute paths, the previous implementation was pretty broken. But the non-LSM hooks allowlisted to use bpf_d_path are not comprehensive enough to cover our needs. So we have to use LSM hooks for recording.

I've updated AppArmorRecorder.Load to error at runtime if we detect that LSM hooks are disabled. This unfortunately means we cannot run all the spoc tests in CI currently, but I've made it so that we still cover the AppArmor parts that do not rely on file paths.

mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 14, 2024
mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 14, 2024
@ccojocar
Copy link
Contributor

I've updated AppArmorRecorder.Load to error at runtime if we detect that LSM hooks are disabled. This unfortunately means we cannot run all the spoc tests in CI currently, but I've made it so that we still cover the AppArmor parts that do not rely on file paths.

Isn't possible to enable it into the integration tests for Ubuntu? Does it require to recompile the kernel with the LSM hook flag enabled? Not sure what kernel version is used by the integration tests, but it would be great to have a test for this to avoid future breakage. Maybe you can deploy a dedicated VM image.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2024
@saschagrunert
Copy link
Member

The profile merging test seems to fail now:

TestSuite/TestSecurityProfilesOperator/cluster-wide:_profile_merging

@mhils
Copy link
Contributor Author

mhils commented May 15, 2024

Isn't possible to enable it into the integration tests for Ubuntu? Does it require to recompile the kernel with the LSM hook flag enabled? Not sure what kernel version is used by the integration tests, but it would be great to have a test for this to avoid future breakage. Maybe you can deploy a dedicated VM image.

My understanding is that it requires a recent Kernel (which we have in CI), but also boot-time configuration which is not set on the GitHub-hosted runners. The libbpf folks are running custom kernels via QEMU to work around this. That would be an alternative, but also comes with quite a lot of CI infrastructure that we would need to maintained (they have their own repo just for the GitHub Actions - https://github.com/libbpf/ci/tree/main).

I have updated test/spoc/e2e_spoc_test.go so that we always test AppArmor socket and capability recording, even if BPF LSM support is missing on the host. This ensures that the AppArmor subsystem is working, which I think should cover most things that could possibly break. I feel it's probably ok to accept the risk for now. Eventually https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2054810 should be fixed and we get this fully covered in CI again without spending any effort on it. Does that sound reasonable?

Will take a look at the remaining CI failures later today. :)

mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 15, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2024
@mhils
Copy link
Contributor Author

mhils commented May 15, 2024

CI failures fixed in 8d28fc6. 😃

mhils added a commit to mhils/security-profiles-operator that referenced this pull request May 15, 2024
@mhils
Copy link
Contributor Author

mhils commented May 16, 2024

I have GHA tests passing for the same code at mhils#4, so I assume the previous failures were just flaky tests. :)

@mhils mhils force-pushed the integration-hell branch 2 times, most recently from 7fa5b22 to 15b5954 Compare May 16, 2024 13:13
@ccojocar
Copy link
Contributor

My understanding is that it requires a recent Kernel (which we have in CI), but also boot-time configuration which is not set on the GitHub-hosted runners. The libbpf folks are running custom kernels via QEMU to work around this. That would be an alternative, but also comes with quite a lot of CI infrastructure that we would need to maintained (they have their own repo just for the GitHub Actions - https://github.com/libbpf/ci/tree/main).

We use the ubuntu-22.04 as an action image and on top of it runs vagrant with different images such as ubunut, fedora and flatcar where we run the e2e tests. This provides us some flexibility since the LSM needs to be enabled on vagrant image. I think we can add a dedicated e2e job for debian with LSM enabled (LSM should be enabled by default in Debian as I understand from ubuntu bug, or a custom ubuntu image with LSM enabled.

In case the e2e tests are taking too long, we can think to phase out the flatcar e2e test. I am not sure how much is used these days.

I would add this e2e test as a follow up pull request. I guess you can follow the same approach we have for fedora..

@saschagrunert What do you think?

Copy link
Contributor

@ccojocar ccojocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great contribution!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 17, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ccojocar, mhils, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ccojocar,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9a15b1e into kubernetes-sigs:main May 17, 2024
27 checks passed
k8s-ci-robot pushed a commit that referenced this pull request May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Allow spoc to simultaneously record different profile types.
5 participants