Improve quality and tracing of license detection in Debian copyright files #2390

Open
pombredanne opened this issue Feb 9, 2021 · 6 comments

Comments

@pombredanne
Member

pombredanne commented Feb 9, 2021

  1. we should be able to recover from mostly OK but not correct copyright files such as this one: https://metadata.ftp-master.debian.org/changelogs//main/p/pulseaudio/pulseaudio_14.2-1_copyright (this may be a ticket for the debian-inspector debut library though)
    See debian-inspector#6: Recover parsing from almost machine-readable copyright files

  2. we should have the ability to trace the intermediate detection results for each paragraph of a copyright file (see also #2389, Package license data structure: Improve tracing of license detection in package manifests)

  3. we could establish a mapping of declared License "ids"

  4. there is an implicit notion of primary vs. secondary licenses in a copyright file and we should leverage this: a paragraph with "Files: *" applies to the package as a whole. This may mean a system-wide model change to track primary vs. secondary licenses, or the ability to track that in a license expression. See debian-inspector#8: Determine the primary license from a copyright file
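
As a rough illustration of item 4, here is a minimal sketch that separates the license declared in the "Files: *" paragraph from the licenses declared in other paragraphs. It uses only the standard library rather than debian-inspector; the function names and the sample file are hypothetical, and the parser handles only simple, well-formed input.

```python
# Hypothetical sample in the machine-readable copyright format.
SAMPLE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example

Files: *
Copyright: 2004-2020 Example Authors
License: LGPL-2.1+

Files: debian/*
Copyright: 2010 Example Packager
License: GPL-2+

Files: src/rpc/*
Copyright: 1984 Example Corp
License: MIT
"""


def parse_paragraphs(text):
    """Split a copyright file into paragraphs, each a dict of field -> value.

    Continuation lines (starting with whitespace) are appended to the
    previous field of the same paragraph.
    """
    paragraphs, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
            current, last_field = {}, None
        elif line[0].isspace() and last_field:
            current[last_field] += "\n" + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        paragraphs.append(current)
    return paragraphs


def split_primary_secondary(text):
    """Return (primary, secondary) lists of declared license names.

    Assumption: a paragraph whose Files field is exactly "*" is primary;
    every other Files paragraph is secondary.
    """
    primary, secondary = [], []
    for para in parse_paragraphs(text):
        if "Files" not in para or "License" not in para:
            continue
        # The first line of the License field holds the declared short name.
        lines = para["License"].splitlines()
        declared = lines[0].strip() if lines else ""
        target = primary if para["Files"].split() == ["*"] else secondary
        if declared and declared not in target:
            target.append(declared)
    return primary, secondary


primary, secondary = split_primary_secondary(SAMPLE)
print("primary:", primary)
print("secondary:", secondary)
```

Keeping the two lists separate, instead of joining everything with AND, is one possible way to carry the primary vs. secondary distinction without changing the license expression syntax.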

@pombredanne
Member Author

From a chat with @chinyeungli

btw, just a note that these massive license_expression values may contain irrelevant info: some of the gpl-2.0 detections happen because the copyright file states that the Debian packaging is under gpl-2.0, while the primary component may not contain any GPL code (for instance, https://changelogs.ubuntu.com/changelogs/pool/universe/s/signon/signon_8.59+17.10.20170606-0ubuntu1/copyright).
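
One way to picture the fix is to skip paragraphs that only cover the packaging itself. The sketch below assumes paragraphs are already parsed into dicts with "Files" and "License" values; the helper names, the `with_debian_packaging` parameter as used here, and the sample data are hypothetical and not taken from the signon file.

```python
def is_debian_packaging(files_field):
    """Return True if every pattern in a Files field is under debian/."""
    patterns = files_field.split()
    return bool(patterns) and all(
        p == "debian" or p.startswith("debian/") for p in patterns
    )


def declared_licenses(paragraphs, with_debian_packaging=False):
    """Return unique declared license names in order of first appearance.

    Paragraphs that only cover debian/ packaging files are skipped unless
    with_debian_packaging is True.
    """
    seen = []
    for para in paragraphs:
        files = para.get("Files", "")
        if not with_debian_packaging and is_debian_packaging(files):
            continue
        name = para.get("License", "").strip()
        if name and name not in seen:
            seen.append(name)
    return seen


# Hypothetical paragraphs: upstream code under one license, packaging under another.
paragraphs = [
    {"Files": "*", "License": "LGPL-2.1"},
    {"Files": "debian/*", "License": "GPL-2"},
]
print(declared_licenses(paragraphs))
print(declared_licenses(paragraphs, with_debian_packaging=True))
```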

@pombredanne
Member Author

pombredanne commented Feb 15, 2021

From a chat with @JonoYang based on scanning an Ubuntu-based Docker image in https://github.com/nexB/scancode.io/ that contained https://packages.ubuntu.com/bionic-updates/gcc-7

the package [email protected]~18.04 has the license expression of:

agpl-3.0 AND amd-historical AND artistic-2.0 AND bsd-new AND bsd-no-disclaimer AND bsd-no-disclaimer-unmodified AND bsd-original AND bsd-original-uc AND bsd-original-uc-1986 AND bsd-simplified AND bsla AND d-zlib AND delorie-historical AND flex-2.5 AND gfdl-1.2 AND gpl-1.0-plus AND gpl-2.0 AND gpl-2.0-plus AND gpl-3.0 AND gpl-3.0-plus AND gpl-3.0-plus WITH gcc-exception-3.1 AND hs-regexp AND intel-osl-1989 AND intel-osl-1993 AND lgpl-2.0 AND lgpl-2.0-plus AND lgpl-2.1 AND lgpl-2.1-plus AND lgpl-3.0-plus WITH cygwin-exception-lgpl-3.0-plus AND mit AND newlib-historical AND nilsson-historical AND osf-1990 AND other-copyleft AND other-permissive AND public-domain AND sunpro AND tex-exception AND uoi-ncsa AND viewflow-agpl-3.0-exception AND warranty-disclaimer AND wide-license AND wtfpl-1.0 AND x11-hanson AND x11-lucent AND zlib AND zlib-acknowledgement AND (commercial-license OR proprietary-license)

I'm not sure how the agpl-3.0 detection happened. I looked in scanpipe/scancode.io results for the Resources associated to the package gcc-7-base and I did not find any Resources attached to this package. I downloaded the copyright file for this package from ubuntu (http://changelogs.ubuntu.com/changelogs/pool/main/g/gcc-7/gcc-7_7.5.0-3ubuntu1~18.04/copyright), scanned it, and agpl-3.0 is not detected as a license.
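
To compare the package-level expression with what a direct scan of the copyright file reports, a quick helper like the one below can list the keys that appear only on the package side. This is a sketch using plain string splitting, not a real license expression parser, and the function names are hypothetical.

```python
import re

# Operator keywords that can appear in a license expression.
OPERATORS = {"AND", "OR", "WITH"}


def license_keys(expression):
    """Return the set of license keys in an expression, ignoring operators and parentheses."""
    tokens = re.findall(r"[^\s()]+", expression)
    return {t for t in tokens if t.upper() not in OPERATORS}


def unexpected_keys(package_expression, file_expression):
    """Return sorted keys present in the package expression but not in the file scan."""
    return sorted(license_keys(package_expression) - license_keys(file_expression))


# Shortened, hypothetical expressions for illustration.
package_expr = "agpl-3.0 AND gpl-3.0-plus WITH gcc-exception-3.1 AND (mit OR bsd-new)"
file_expr = "gpl-3.0-plus WITH gcc-exception-3.1 AND mit AND bsd-new"
print(unexpected_keys(package_expr, file_expr))
```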

@pombredanne
Member Author

From a chat with @mjherzog based on scanning an Ubuntu-based Docker image in https://github.com/nexB/scancode.io/

We have a problem of license "proliferation" for some packages that we need to fix, especially for Debian system packages found in a Docker scan. One example is where we have the license expression:

agpl-3.0 AND agpl-3.0-plus AND bloomberg-blpapi AND gpl-1.0-plus AND gpl-2.0 AND gpl-2.0-plus AND lgpl-2.0-plus AND lgpl-2.1 AND lgpl-2.1-plus AND lgpl-3.0 AND mit AND other-permissive AND sun-rpc AND warranty-disclaimer

... for six files from pulseaudio (www.pulseaudio.org in Homepage URL).

I researched the Debian Copyright file from https://metadata.ftp-master.debian.org/changelogs//main/p/pulseaudio/pulseaudio_5.0-13_copyright and found:

  • Overall license is lgpl-2.1-plus (also what we have in DejaCode) and most Copyright entries say: "License: LGPL-2.1+"
  • I also see bloomberg-blpapi, mit, sun-rpc and warranty-disclaimer plus one file under gpl-2.0-plus

My guess is that there may be some sort of license detection bug behind the agpl detections and the other gpl and lgpl versions.

pombredanne added a commit that referenced this issue Feb 24, 2021
pombredanne added a commit that referenced this issue Feb 24, 2021
We now test with and without dedup of licenses and copyrights.

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member Author

See aboutcode-org/scancode.io#103 (comment) for a detailed description of the problems

pombredanne added a commit that referenced this issue Apr 8, 2021
pombredanne added a commit that referenced this issue Apr 8, 2021
The current test for debian copyright files was wrong and misleading.
This corrects the problem by having proper values in plain expected
files and in detailed files.

There was also a problem of test name masking where both detailed and
non-detailed test methods had the same name and therefore were
not running correctly at all.

As a result all expected YAML files have been regenerated too.

Signed-off-by: Philippe Ombredanne <[email protected]>
pombredanne added a commit that referenced this issue Apr 8, 2021
This is the set of files found in a recent debian-unstable-slim Docker
image. The expectations have been regenerated as-is but not yet
reviewed.

See also:
- aboutcode-org/scancode.io#128
- aboutcode-org/scancode.io#103

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member Author

To improve the tracing, I think we could take this simple approach:

  1. decouple entirely the processing of the whole copyright file data (and copyright statements) so that we have a function that deals only with license detection and returns license matches.
  2. expose a new option in license detection such as --debian-copyright and some arg such as_debian_copyright that would treat *copyright files as if these were debian copyright files.

This way we can get regular license detection results from just copyright files, irrespective of being in the context of a package or not.
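
A minimal sketch of what step 1 could look like: a function that only runs detection per paragraph and returns traceable results with line positions. All names here are hypothetical, and `toy_detect` with its small name-to-key table is a stand-in for real license detection, not the ScanCode API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParagraphTrace:
    """Detection results for one paragraph, with 1-based line positions."""
    start_line: int
    end_line: int
    text: str
    matches: List[str] = field(default_factory=list)


def iter_paragraphs(text):
    """Yield (start_line, end_line, paragraph_text) for blank-line separated paragraphs."""
    lines = text.splitlines()
    start = None
    for number, line in enumerate(lines, 1):
        if line.strip():
            if start is None:
                start = number
        elif start is not None:
            yield start, number - 1, "\n".join(lines[start - 1:number - 1])
            start = None
    if start is not None:
        yield start, len(lines), "\n".join(lines[start - 1:])


def trace_license_detection(text, detect):
    """Run the detect callable on each paragraph and keep one trace per paragraph."""
    return [
        ParagraphTrace(start, end, para, list(detect(para)))
        for start, end, para in iter_paragraphs(text)
    ]


def toy_detect(paragraph_text):
    """Stand-in detector: map the declared License name to a key, else "unknown"."""
    keys = {"LGPL-2.1+": "lgpl-2.1-plus", "GPL-2+": "gpl-2.0-plus", "MIT": "mit"}
    found = []
    for line in paragraph_text.splitlines():
        if line.startswith("License:"):
            name = line.partition(":")[2].strip()
            found.append(keys.get(name, "unknown"))
    return found


sample = "Files: *\nLicense: LGPL-2.1+\n\nFiles: debian/*\nLicense: GPL-2+\n"
for trace in trace_license_detection(sample, toy_detect):
    print(trace.start_line, trace.end_line, trace.matches)
```

Because the detector is passed in as a callable, the same tracing function would work whether or not the copyright file is processed as part of a package.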

@pombredanne
Member Author

@AyanSinhaMahapatra FYI ^

AyanSinhaMahapatra added a commit that referenced this issue Apr 28, 2021
Refactor debian copyright detection to add DebianCopyrightDetector class,
and make changes to facilitate better copyright file parsing.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Apr 28, 2021
Fix bug in unstructured copyright file parsing, which always treated
copyright files as structured, and regenerate test files.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Apr 28, 2021
Remove `unique` and `simplify_licenses` to have non-unique and
non-simplified copyright and license information. Use with_debian_packaging
instead of using with_details and skip_debian_packaging.
Regenerates tests to update expectations.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Apr 28, 2021
Refactor and improve structured debian copyright file
parsing.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
Modify EnhancedDebianCopyright to be a DebianCopyright wrapper function
and modify flags used for filtering and reporting. Separate structured
and unstructured parsing into different classes having the same base class
and main methods.
Also modify file to follow black standards.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
Updates get_installed_packages to directly call parse_copyright_file function
and get an object depending on structured/unstructured copyright file and
then call functions with filtering flags to get detections as required.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
Add tests for EnhancedDebianCopyright class and also modify test functions
to adopt the new API.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
This makes declared_license also report declared license in the license
paragraph of debian copyright files. Updates test expectations.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
Modify get_copyrights to have unique copyrights when the
unique_copyrights flag is set to True.

Refer to #2390

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue May 28, 2021
Regenerate test expectations after upgrading to latest debian-inspector
to parse paragraphs after double empty lines correctly, as the latest
version fixes this issue.

Refer to #2390
Refer to aboutcode-org/debian-inspector#17

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Jun 1, 2021
Instead of adding a general `unknown_debian_license` rule, create
a synthetic UnknownRule object and a LicenseMatch object out of the
unknown license text. Updates test expectations.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Jun 1, 2021
Instead of adding a general `unknown_debian_license` rule, create
a synthetic UnknownRule object and a LicenseMatch object out of the
unknown license text. Also updates test expectations after reindexing
licenses with new rules added from develop branch.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Jun 1, 2021
Instead of adding a general `unknown_debian_license` rule, create
a synthetic UnknownRule object and a LicenseMatch object out of the
unknown license text. Also updates test expectations after reindexing
licenses with new rules added from develop branch.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Jun 2, 2021
Update requirements and setup.cfg files to install the latest
debian-inspector version 21.5.25 to fix the following issue:
aboutcode-org/debian-inspector#17

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Jun 2, 2021
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
pombredanne pushed a commit that referenced this issue Jun 3, 2021
Instead of adding a general `unknown_debian_license` rule, create
a synthetic UnknownRule object and a LicenseMatch object out of the
unknown license text. Also updates test expectations after reindexing
licenses with new rules added from develop branch.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
pombredanne added a commit that referenced this issue Jun 3, 2021
pombredanne added a commit that referenced this issue Jul 2, 2021