Failing to validate GPL licenses #70

Open
ivanayov opened this issue Apr 20, 2022 · 4 comments

@ivanayov

licensing.validate() reports an 'Unknown license key(s)' error for GPL licenses, e.g. 'LGPLv2.1', 'GPLv2', 'GPL2'.

Side note: some images also have several licenses, e.g. MIT, GPL2 and others.
When they are listed separated by spaces, e.g. 'MIT GPL2', it's okay: validation just fails with errors ('Unknown license key(s)').
But when they are listed with commas instead, e.g. 'MIT,GPL2', it throws an exception for invalid characters.
In some cases, e.g. photon:3.0, the licenses come in this form.
The latter can be easily resolved, but I wonder whether it would be better for those use cases to be
handled within the validate method instead?

@pombredanne
Member

pombredanne commented Apr 21, 2022

@ivanayov Thanks for the report. license-expression is not exactly a license detection library; for detection you want to use ScanCode. Alternatively, you can feed license-expression with known license symbols and how they map to SPDX or ScanCode license keys.
Let me illustrate this with a snippet:

>>> from license_expression import Licensing
>>> expression = 'LGPLv2.1 and GPLv2 or GPL2'

I can parse any expression with a valid syntax:

>>> Licensing().parse(expression)
OR(AND(LicenseSymbol('LGPLv2.1', is_exception=False), 
LicenseSymbol('GPLv2', is_exception=False)), 
LicenseSymbol('GPL2', is_exception=False))

But this expression will not validate as I did not specify what my license symbols are:

>>> Licensing().validate(expression)
ExpressionInfo(
    original_expression='LGPLv2.1 and GPLv2 or GPL2',
    normalized_expression=None,
    errors=['Unknown license key(s): LGPLv2.1, GPLv2, GPL2'],
    invalid_symbols=['LGPLv2.1', 'GPLv2', 'GPL2']
)

If I feed the Licensing with license symbols (here a simple list of strings), then things will validate alright:

>>> symbols = ['LGPLv2.1', 'GPLv2', 'GPL2']
>>> Licensing(symbols=symbols).parse(expression)
OR(AND(LicenseSymbol('LGPLv2.1', is_exception=False), 
LicenseSymbol('GPLv2', is_exception=False)), 
LicenseSymbol('GPL2', is_exception=False))
>>> Licensing(symbols=symbols).validate(expression)
ExpressionInfo(
    original_expression='LGPLv2.1 and GPLv2 or GPL2',
    normalized_expression='(LGPLv2.1 AND GPLv2) OR GPL2',
    errors=[],
    invalid_symbols=[]
)

and unknown symbols will be reported:

>>> Licensing(symbols=symbols).parse('GPL2 and foobar')
AND(LicenseSymbol('GPL2', is_exception=False), LicenseSymbol('foobar', is_exception=False))
>>> Licensing(symbols=symbols).validate('GPL2 and foobar')
ExpressionInfo(
    original_expression='GPL2 and foobar',
    normalized_expression=None,
    errors=['Unknown license key(s): foobar'],
    invalid_symbols=['foobar']
)

Based on your message above, I assume that you want to get proper detected and normalized license from RPM packages in VMware photon?

If so, the right solution would be a combination of:

  • establish a mapping of the individual license keys used in RPMs to ScanCode keys (which means also implicitly to SPDX keys)
  • normalize the RPM-side expression syntax to a valid expression syntax. For instance, replace a comma with an 'AND'; if there is no 'AND' or 'OR' in the original expression, replace the spaces with an 'AND'; and if there are symbols that contain spaces, normalize them so they do not contain spaces (though this library can handle symbols with spaces too)
  • create a licensing with the symbols mapping
  • do a first lightweight parsing of the expression and check for any unknown symbols with validate
  • if some symbols do not exist, run ScanCode license detection on them, replace them in the parsed expression with the detected expression, and validate again
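
The normalization and mapping steps above could be sketched roughly as follows. This is a minimal, stdlib-only illustration and not what ScanCode actually does: the `RPM_TO_KEY` table and the `normalize_rpm_license` helper are hypothetical names made up for this example, the mapped keys are assumptions, and the sketch deliberately ignores parentheses and symbols that contain spaces.

```python
# Hypothetical mapping from RPM-side license names to ScanCode-style keys.
# These entries are illustrative assumptions, not an authoritative table.
RPM_TO_KEY = {
    "MIT": "mit",
    "GPL2": "gpl-2.0",
    "GPLv2": "gpl-2.0",
    "LGPLv2.1": "lgpl-2.1",
}

OPERATORS = {"AND", "OR", "WITH"}


def normalize_rpm_license(text, key_map=RPM_TO_KEY):
    """Return (expression, unknown_names) for an RPM-style license string.

    Hypothetical helper: commas become AND, and a bare space-separated
    list with no operator is joined with AND.
    """
    # Treat every comma as an AND, then split on whitespace.
    tokens = text.replace(",", " AND ").split()

    # If the string has no AND/OR/WITH at all, join the names with AND.
    if not any(token.upper() in OPERATORS for token in tokens):
        tokens = " AND ".join(tokens).split()

    normalized, unknown = [], []
    for token in tokens:
        if token.upper() in OPERATORS:
            normalized.append(token.upper())
        elif token in key_map:
            normalized.append(key_map[token])
        else:
            # Keep unmapped names as-is and report them to the caller.
            normalized.append(token)
            unknown.append(token)
    return " ".join(normalized), unknown


print(normalize_rpm_license("MIT,GPL2"))
print(normalize_rpm_license("MIT GPL2"))
print(normalize_rpm_license("LGPLv2.1 and GPLv2 or foobar"))
```

The returned expression can then be passed to Licensing(symbols=...).validate(...) as in the snippets above, and any names left in the second return value are candidates for ScanCode license detection.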

Eventually this should be what https://github.com/nexB/scancode-toolkit/blob/4be4ba976d8d732538e72db97b311af39ca81432/src/packagedcode/rpm.py#L381 does, and there is an attempt by @adii21-Ux in aboutcode-org/scancode-toolkit#2894 to improve this.

The general case is in https://github.com/nexB/scancode-toolkit/blob/4be4ba976d8d732538e72db97b311af39ca81432/src/packagedcode/licensing.py#L109 and https://github.com/nexB/scancode-toolkit/blob/4be4ba976d8d732538e72db97b311af39ca81432/src/licensedcode/match_spdx_lid.py

The main issue to track RPM-related license detection is in aboutcode-org/scancode-toolkit#2412 "Improve license detection of declared RPM licenses"

So in conclusion, this is something that would benefit from some love... Can I interest you in helping make this work for RPM packages in general and photon packages in particular? If we do it in ScanCode, this would be available to everyone, including any tool that uses ScanCode (such as tern, which may be of direct interest to you since you mentioned images above).

@pombredanne
Member

@ivanayov gentle ping... did my explanation make sense?

@ivanayov
Author

Thank you very much @pombredanne! It was a very helpful and detailed explanation.
I'd need to do an estimate first, and would then be happy to help on the issue, depending on how long it would take.

@pombredanne
Member

gentle ping
