Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(scanner): detection of files without curation #9487

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kikofernandez
Copy link

Based on #9435, it was stated that we could simply apply curations to files not listed in the license findings. The current PR is one such design but breaks backwards compatibility. I would like some feedback on how it would be preferred to deal with this.

Below I state some of the approaches I have thought of. Any feedback is welcomed.

PR Approach

  • detects files without license and these are not added to the output of the scanner (as requested in Include unlicensed files in scanner results #9435. As such, the evaluator reports errors for files without license. (breaks backwards compatibility)
  • if we add curations, curations are applied to the unlicensed files (this is simply because of the curation mechanism)
  • unlicensed files use the default value NONE when created as part of a LicenseFinding.
  • Not complete, missing tests, looking for feedback

Design Option

Approach
I have tried to look for ways to changes to the FindingCurationMatcher.kt (recommended by @sschuberth ). There are a bunch of methods in this file, but none of them have available the files needed from ScanResults. I do not mind to "patch" all call sites that pass findings:

  • FindingCurationMatcher().matches(finding, curation)
  • FindingCurationMatcher().apply(finding, curation)
  • FindingCurationMatcher().applyAll(finding, curation)

and add to the findings all files without licenses as LicenseFinding(license="NONE"), location=...). In this way, a curation matcher with a glob pattern on a folder and detected_license: NONE can apply the curation.

Questions/Comments
We apply curations to files not listed in the scanner results.
To me, this may seem counter-intuitive, since I do not think ORT has ever dealt with files not listed in the scanner. The idea is to apply curations to unlicensed files, but I think it is more uniform if unlicensed files are listed in the scanner. Thoughts?

Design Option 2

Add to the .ort.yml an option (or to the cli) to state that unlicensed files are part of the scanner and should be shown there.

Use case

  • Opt-in to curate unlicensed files. AFAIK, unlicensed files are not considered in the evaluator.
    From my point of view, we need to analyse Erlang/OTP and would like warnings/errors for any files without license. Design option 2 makes explicit this option, and the scanner shows the truth result of scanned files.

@fviernau
Copy link
Member

I believe we should first create a proposal how this could be done, before working on an actual implementation and decide for a specific approach.

Copy link

codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.02%. Comparing base (8bc47a4) to head (bdb6401).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9487      +/-   ##
============================================
- Coverage     68.03%   68.02%   -0.01%     
- Complexity     1287     1288       +1     
============================================
  Files           249      249              
  Lines          8826     8830       +4     
  Branches        920      921       +1     
============================================
+ Hits           6005     6007       +2     
- Misses         2432     2433       +1     
- Partials        389      390       +1     
Flag Coverage Δ
funTest-docker 64.96% <ø> (ø)
test-ubuntu-24.04 35.81% <100.00%> (-0.01%) ⬇️
test-windows-2022 35.79% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kikofernandez
Copy link
Author

Question1: Should I create a proposal as a comment inside of #9435?
Question2: how does one usually create a proposal? (apart from this small PR with suggestions and asking for feedback on a new approach)?

In any case, the proposals that I have thought of are written in this PR:

  • Current PR proposal (breaks backwards compatibility)
  • Option 1
  • Option 2
  • Option 3? Different approach or more guidance?

Question3 Should I try to detail more the proposals written here? (I am new to the project and do not yet understand all possible outcomes/impact from the approaches outlined in this PR)

@sschuberth
Copy link
Member

The idea is to apply curations to unlicensed files, but I think it is more uniform if unlicensed files are listed in the scanner. Thoughts?

I'd still have a preference for not including unlicensed files / files without findings to the scan results to keep it smaller. Instead, as outlined elsewhere, I'd look into changing the license finding curation matcher logic to accept a detectedLicense of NONE in if path matches a file from the list list.

In a way, what you want is the opposite of "deleting" a finding by setting the concludedLicense to NONE, you want to "invent" findings.

Somehow these invented findings then need to make it into the license info resolver. I have not thought about how to do that yet.

@sschuberth
Copy link
Member

Question2: how does one usually create a proposal?

I good way would be to join our weekly community meeting to discuss such ideas.

@kikofernandez
Copy link
Author

Question2: how does one usually create a proposal?

I good way would be to join our weekly community meeting to discuss such ideas.

Oh, that would great! I would like to see what the community is doing, the roadmap, and learn. I will drop by for sure next Thursday!

@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch 5 times, most recently from 248cfa0 to fad230f Compare December 4, 2024 16:35
@kikofernandez
Copy link
Author

As discussed in the weekly meeting last week:

Contribution

  • To make the scanner able to display all files with their license findings, where files without licenses have license NONE.
  • Goal the scanner reports license findings, including files without license
  • Implication the curation mechanism can override files without licenses; this is important in many open-source projects, where a single license file applies to a whole folder / project (e.g., Elixir lang and Django among many others)

Summary
I have added a flag includeUnlicensed to the scanner configuration. Its default is false. When set to true, the scanner adds files without license to a ScanResult. These unlicensed files have their LicenseFindings set to NONE.

Missing

  • I am struggling to test this addition and could not find other tests for the scanner.scan, apart from integration tests. Any guidance on testing is welcomed :)

@kikofernandez kikofernandez marked this pull request as ready for review December 4, 2024 16:59
@kikofernandez kikofernandez requested a review from a team as a code owner December 4, 2024 16:59
@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch 2 times, most recently from 58b863b to 3d012e2 Compare December 5, 2024 08:35
@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch from 3d012e2 to 29f56ba Compare December 12, 2024 13:10
scanner/src/main/kotlin/Scanner.kt Fixed Show fixed Hide fixed
scanner/src/main/kotlin/Scanner.kt Fixed Show fixed Hide fixed
scanner/src/main/kotlin/Scanner.kt Fixed Show fixed Hide fixed
scanner/src/main/kotlin/Scanner.kt Fixed Show fixed Hide fixed
@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch from 29f56ba to a9fd947 Compare December 13, 2024 09:58
@kikofernandez
Copy link
Author

I have addressed detected problems: fixed the implicits and the commit lint :)

@fviernau
Copy link
Member

It was a real large effort to introduce file lists (complete listing of all the files) in a non-redundant way.
I strongly believe we should not introduce redundant data into any serialized model data structure with the implementation here, not even if it's behind a toggle. There is logic which maps the data on-the-fly, where this might fit in, but I may impact performance.

@kikofernandez
Copy link
Author

kikofernandez commented Dec 13, 2024

It was a real large effort to introduce file lists (complete listing of all the files) in a non-redundant way. I strongly believe we should not introduce redundant data into any serialized model data structure with the implementation here, not even if it's behind a toggle. There is logic which maps the data on-the-fly, where this might fit in, but I may impact performance.

@fviernau thanks for the feedback.
I think I do not fully understand the meaning of "we should not introduce redundant data into any serialized model data structure".

Could you provide some guidance to improve this PR with a more appropriate implementation?
Greatly appreciated
Thanks!

@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch 2 times, most recently from 3a0497b to bdb6401 Compare December 13, 2024 15:34
@kikofernandez kikofernandez changed the title detection of files without curation feat(scanner): detection of files without curation Dec 13, 2024
Add flag `includeUnlicensed` to the scanner configuration. Its default
is `false`. When set to `true`, the scanner add to a `ScanResult` files
without license as LicenseFindings with license set to `NONE`.

This contribution makes possible to the scanner to display all files as
license findings. The ultimate goal is that any file without license is
catched by the scanner, so that curation mechanism can override files
without licenses in cases where a license applies to a whole folder.

Signed-off-by: Kiko Fernandez-Reyes <kiko@erlang.org>
@kikofernandez kikofernandez force-pushed the kiko/add-unlicense-files-for-curation/GH-9435 branch from bdb6401 to 3fc2797 Compare December 19, 2024 14:03
/**
* A flag to indicate whether the scanner should add files without license to the scanner results.
*/
val includeUnlicensed: Boolean = false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my taste, this sounds too much like https://spdx.org/licenses/Unlicense.html. To avoid any confusion, I propose to rename this to includeFilesWithoutFindings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, property-ordering-wise I don't like this to go between the two "skip..." properties. Between skipExcluded and archive makes more sense to me.

@@ -207,6 +207,40 @@ class Scanner(
}
}

val filteredScanResults = filterScanResultsByVcsPaths(controller.getAllScanResults(), vcsPathsForProvenances)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, a few random starting thoughts / comments:

  • This code gets complex and deserves a few more code comments that explain the algorithm / intention.

  • For the default case ideally no unnecessary copy should be done (like is done now in line 217). To archive that, I'd leave val filteredScanResults as-is in original line 195, and instead introduce a new variable like

      val scanResults = if (scannerConfig.includeUnlicensed) {
          // Do the work.
      } else {
          filteredScanResults
      }
    
      // ....
    
      return ScannerRun.EMPTY.copy(
          config = scannerConfig,
          provenances = provenances,
          scanResults = scanResults,
          files = files,
          scanners = scanners
      )
    

Comment on lines +219 to +240
// Adds files without license to the scanned results
val scanSummary =
controller.getAllFileLists()[scanResult.provenance]?.files
.orEmpty().asSequence().mapNotNull { fileEntry ->
if (fileEntry.path in licenseFiles) {
null
} else {
fileEntry.path
}
}.toSet().let { fileEntryLicenses ->
(fileEntryLicenses subtract licenseFiles).mapTo(mutableSetOf()) { newFinding ->
LicenseFinding(license = "NONE", location = TextLocation(newFinding, 1))
}.let {
val allFindings = scanResult.summary.licenseFindings union it
scanResult.summary.copy(licenseFindings = allFindings)
}
}

scanResult.copy(
provenance = scanResult.provenance.alignRevisions(),
summary = scanSummary
)
Copy link
Member

@sschuberth sschuberth Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce nesting and make the code more readable, I propose something like this (completely untested; this goes to // Do the work. from my comment above):

        filteredScanResults.mapTo(mutableSetOf()) { scanResult ->
            val allPaths = controller.getAllFileLists()[scanResult.provenance]?.files?.map { it.path }.orEmpty()
            val pathsWithFindings = scanResult.summary.licenseFindings.map { it.location.path }
            val pathsWithoutFindings = allPaths - pathsWithFindings

            val findingsThatAreNone = pathsWithoutFindings.map {
                LicenseFinding(SpdxConstants.NONE, TextLocation(it, UNKNOWN_LINE))
            }

            scanResult.copy(
                summary = scanResult.summary.copy(
                    licenseFindings = scanResult.summary.licenseFindings + findingsThatAreNone
                )
            )
        }

Due to the introduced variables, this maybe is even clear enough to not require any code comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm using the SpdxConstants.NONE constant instead of "NONE" here, and deliberately an unknown / invalid line range. I have not checked whether this might cause any problems with some later sanity checks or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants