Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Image Security Scanning RFC #136

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Image Security Scanning RFC #136

wants to merge 6 commits into from

Conversation

silenteh
Copy link
Contributor

This PR contains an RFC to discuss the implementation of Trow image (and OCI artifacts in general) security scanning.
Please use this pull request for requesting changes or in general give feedback regarding this specifc RFC

Copy link
Contributor

@amouat amouat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot. Thanks - you've put a lot of thought into this.

I wonder if we can simplify it a bit or at least break it into stages, so that it doesn't all have to happen at once.

For example, the first stage could have no storage or updating and just call out to the scanner each time a report is requested.

## Abstract
This document describes the introduction of container images scanning to Trow thus providing additional security
to a container based infrastructures.
Container image scanning is currently supported by all the major container registries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean the cloud provider registries like ECR etc? Wasn't it removed from Docker Hub? Certainly a common feature though.

Copy link
Contributor Author

@silenteh silenteh Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I meant cloud registries. Will adjust the wording



## 1. Objectives
* [ ] Security scanners integrate via the [Pluggable Image Vulnerability Scanning](https://github.com/goharbor/community/blob/master/proposals/pluggable-image-vulnerability-scanning_proposal.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this change to be a clear goal i.e. Implement the bla proposal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point

Trow web UI to consume the vulnerabilities data
* [ ] If the security scanning fails or is not configured properly, Trow **runtime** functionalities are not impacted
* [ ] **Deployment** of Trow should not be impacted by the availability of image scanners like Clair or Trivy
* [ ] Secure by default. Whenever a security scanner is available all images are automatically scanned for vulnerabilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what is intended here. If we add a security scanning service to an existing registry, it should immediately begin scanning all images?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to have Trow to be secure by default. In other words:

  • when a security scanner is available all images will be scanned (we don't add the possibility to skip the security scanning for specific images)
  • when a security scanner is available existing images should be immediately scanned and then the next run should happen based on the scheduling interval. This is how Harbor behaved 3 years ago :)
    As as an Admin/Ops I would expect that once the security scanner is added I don't have to do anything else to have my images scanned.
    This is of course my personal opinion. I think we need to check how heavy the security scanning is in order to have a better understanding of how Trow should behave once a new security scanner is added. It probably also depends on the amount of stored artifacts in Trow.
    We should also probably fix the tokio runtime creation for each route before allowing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense to me as long as it's just a default.

* [ ] If the security scanning fails or is not configured properly, Trow **runtime** functionalities are not impacted
* [ ] **Deployment** of Trow should not be impacted by the availability of image scanners like Clair or Trivy
* [ ] Secure by default. Whenever a security scanner is available all images are automatically scanned for vulnerabilities.
* [ ] New vulnerabilities are released every day. Trow should schedule image security scanning on a daily basis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is daily a common default?

The way security scanning normally works is with a first step that builds a list of packages, libs etc installed in an image, before figuring out which of those packages have known vulnerabilities. If you store the list of installed software and the images it relates to, you don't need to rescan - if a new vulnerability is found that affects one of the packages in the list you can automatically update the report. But that implies the scanner needs to keep track of package lists and is somehow able to update the reports...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to allow images to be scanned both regularly and also when they are pushed to the registry.
I definitely worded it badly. I wanted to say that Trow should support scanning of images based on a intervals and at push. Daily should be mentioned as an example only. We could use the cron format to specify the interval.

Regarding the security scanning process you mentioned, I personally think it is the responsibility of the security scanners (Trivy, Clair) to behave like that. I don't think this is something Trow should take care of. Instead Trow gets the result of the scan as a JSON file and just stores it as an artifact.
The Harbor Security Spec is nothing more than an HTTP API communication channel between a registry and a security scanner. The API calls available only return the full result of the scan, not that incremental ones, as far as I understand it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's fine by me. Keep it simple if we can :)

* [ ] Allow to run a scan on demand by a user triggered action (api call and/or web ui)
* [ ] Expose a webhook to subscribe to notifications regarding vulnerable images
* [ ] API endpoint exposing information about vulnerable images MUST be protected by default by authentication and authorisation methods
* [ ] Trow admission controller MUST support blocking the pulling of images in case they have SEVERE vulnerabilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reword this one a bit? Perhaps something like:

"The Trow Admission Controller SHOULD add controls for blocking pulling of images with vulnerabilities. There MUST be support for configuring what level or number of vulnerabilities result in blocking and it MUST be configurable on an image or namespace level. "?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep good point!

### a. Components
This is a breakdown of all components and their descriptions.

#### Trow Security Scanner registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - is this what Harbor does?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we can simplify this at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand what you are referring to. Github shows me that your comments are related to the #### Trow Security Scanner registry title. Were the comments meant for a different line ? If not then could you please clarify: is this what Harbor does ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the whole section :)

Could we just support a single scanner to begin with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah apologies, I did not get the fact you meant the whole section.
Harbor does have the notion of a security scanner registry and a controller so yes the architecture is similar.
I personally liked their architecture and I believe Harbor's devs must have put quite a lot of thoughts on it already.

The list of HTTP api entry points is a copy and paste of the current spec Harbor has defined.
I am not sure whether and how the Harbor spec will be updated.
So I wanted to make sure we have a stable reference to work against and as a documentation of the current implementation.
Inevitably at some point the two implementations will diverge. For instance when Harbor implements a new spec Trow will inevitably be behind with its own implementation. Or might even diverge...

I have added the ordered list of steps to implement this RFC as a comment to the whole PR.

We do not need a config manifest for the vulnerability scanning report, so the mediaType is set to: `vnd.unknown.config.v1+json`
See Azure container [documentation](https://docs.microsoft.com/en-us/azure/container-registry/container-registry-oci-artifacts)

* Step 3: Add an entry to the `index` file of the image which was scanned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will work. I don't think all images will have indexes (we should check this). And what about images with multiple different tag names?

I think we have to be able to look up a report based on the image digest somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand either the role of index, but after re-reading it this morning, now I think I do.
The OCI Index spec is useful at the moment only for multi platform artifacts.
They explain it in the documentation like this:
The goal is to allow multi-architecture images, through a "fat manifest" which references image manifests for platform-specific versions of an image

So basically there is one index per image:tag. So for example let's assume we have one image tag nginx:1.19 which has images for both amd64 and arm.
The client pulls the index first and then pulls the specific OCI image via its digest, based on the platform the client runs on. (For arm it will pull the arm image, for amd64 the respective amd64 image)

{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "size": 7143,
      "digest": "sha256:e692418e4cbaf90ca69d05a66403747baa33ee08806650b51fab815ad7fc331f",
      "platform": {
        "architecture": "arm",
        "os": "linux"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "size": 7682,
      "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    }
  ]
}

So the index in our case could be useful if the client wants to pull the security scanner reports for specific platforms. For example: pull the security scanner report for nginx:1.19 for arm.
I agree this is something we should not be focusing on at the moment.
This means also that I will need to adjust a bit how we store the image security report and instead of using the index, just use a manifest and the store the json as a file and place the manifest and the file inside the digest folder of the image/artifact.
So we can use the image digest to lookup the security report.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used the OCI Index (previously called manifest list) a bit, as I played with multi-arch stuff (and we have a GitHub action to build a multi-arch image). I think it only exists for images that are multi-arch - most images won't have one.

I know it's an extra step, but I think I'd rather keep the reports in a separate directory structure - so just have a scans directory with multiple directories inside (each one named after a digest) which store the reports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about keeping it in a different root folder.
One thing however which might complicate things further down the path is when (and if) we add artifacts replication to Trow.
At that point we need to make sure we replicate both the artifacts (root folder) and the scan reports (root folder).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I think that's ok. I can see use cases where people want the images but not the reports, or even vice versa.

```

#### Trow WebHook
Once an image is found to have vulnerabilities we can allow pushing this information out to third party HTTP/s end points.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. Does Harbor do the same thing? What are the use cases?

I wonder if rather than send out the report it would be better just to say "this image has an updated vulnerability report" and leave it to them to go and pull the report if interested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of Harbor doing this, but I find it useful for simplifying third party integrations and in general to build SIEM systems.
We can either send the full report (which as long as the client implements the signature verification it's secure) or as you proposed just send a simpler JSON (this image has an updated vulnerability report).
The reason why I was suggesting to send the full report is that, as a developer, I prefer to keep the integration steps to a minimum. So basically if we send only the information regarding this image has an updated vulnerability report then as a developer I will need to:

  • verify the signature of the webhook (which requires a shared secret key between both the client and the server)
  • add API key to my http client to authenticate the next request I need to make to Trow API to pull the report
    So I end up with having to maintain 2 secrets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it might be easiest to send the whole report. I'm just thinking they might be quite large in some cases, but it's text that compresses well.

Regarding security, is it enough to say that if you can pull the image you can get the security report? After all they could run the security scan themselves.


#### Future potential improvements

The usage of HMAC for signing files could be an interesting security feature of Trow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By files, you mean the vulnerability reports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think that we can sign both security reports and even artifacts.
HMAC has the property of making sure nobody has tempered with local files.
Here an example:

Currently we use sha256 to make sure that an image has not been tempered with. However if the client pulls images via TAG rather than image digests, as an attacker I could modify the image at the registry level and the recalculate the sha256 and the client would not be aware of the changes.
By signing locally every file with HMAC the attacker would need to gain access not only to the registry, but also to the secret key used to signed the files via HMAC.
Ideally the secret key lives either in Vault or an HSM or cliud HSM.
This is something we can explore later because it complicates things quite a bit. The good part is that we would guarantee stronger security for Trow without any OCI changes needed. It would just be an additional security feature of Trow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had some thoughts about a merkle-tree style approach. We could create a hash of all the digests in the registry, which gives us a master hash for the current state of the registry. I'm not really sure that helps much, but auditing and security is definitely one of the use-cases I want to explore with Trow.

#### Future potential improvements

The usage of HMAC for signing files could be an interesting security feature of Trow.
HMAC in fact guarantees that nobody has tempered with the local files, as long as the attacker gains access to the secret key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: doesn't gain access

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it. Will fix it

@amouat
Copy link
Contributor

amouat commented Aug 12, 2020

Also, is it common to have more than one scanning service? There's a fair bit of complexity associated with supporting mulitple services. Could we just skip and say you can register one for the time being?

@silenteh
Copy link
Contributor Author

silenteh commented Aug 13, 2020

We should definitely allow only 1 scanning service to begin with. The difference in terms of complexity to support more however is not that big.
I agree we should do it in stages.
However while writing down the steps I realized that if we allow only scanning of images with push, without storing the result then how can a Trow user check the scan report ? We should then implement the Webhook I suppose or an api.
Again the API needs to be queried, so we would need to store the reports somewhere...
So here my suggestion:

  1. Add the possibility to configure a scanner
  2. Scan images only when images are pushed without storage
  3. Add webhook
  4. Add storage
  5. Add Trow API to retrieve scan reports
  6. Allow scheduling of security scans

What do you think ?

@amouat
Copy link
Contributor

amouat commented Aug 13, 2020

We should definitely allow only 1 scanning service to begin with. The difference in terms of complexity to support more however is not that big.
I agree we should do it in stages.
However while writing down the steps I realized that if we allow only scanning of images with push, without storing the result then how can a Trow user check the scan report ? We should then implement the Webhook I suppose or an api.
Again the API needs to be queried, so we would need to store the reports somewhere...
So here my suggestion:

1. Add the possibility to configure a scanner

2. Scan images only when images are pushed without storage

3. Add webhook

4. Add storage

5. Add Trow API to retrieve scan reports

6. Allow scheduling of security scans

What do you think ?

What about only scanning when the report is requested? So a user would send something like:

GET /vuln-scans/1234def

Where 1234def is the digest of the image they are interested in. When Trow gets the request, it will kick off scanning the image and return a not ready yet response to the client. The client can then poll at a regular interval. When the scan completes, Trow can store it in temporary storage and return to the client on it's next request.

It might be a bit simpler for the first pass. If the scans are deleted, it doesn't matter as they will get re-created on request.

@amouat amouat changed the base branch from master to main April 21, 2021 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants