meta issue: Improve Debian package reported license #103
See also #128
We have many levels of problems:

1. finding the copyright file of a package

There are cases where the copyright file of a package is a symlink to the copyright file of another package, and we fail to get the copyright file in this case.

2. dealing with copyright formats

2.1 not machine-readable

We do not partition files that are not machine-readable, and this may impact license detection accuracy. There are several opportunities to improve this, for instance with a heuristic that would split text regions into paragraph-like chunks based on the presence of some typical statements or even license rules such as:

On Debian systems, the complete text of the GNU General Public

Also, in some almost-structured files, we could split on lines starting with "License:" or "Copyright:" or "Copyright notice:" such as:

2.2 structured copyright files

When we detect licenses in structured copyright files, we do not correctly handle whether a license is a known common license or not. Only known common license symbols, as used in the first line of a license declaration, have a meaning. Other symbols (even when they look like an SPDX license id such as BSD-2-Clause) should be interpreted first based on the detection of the license text in the license paragraph they point to.

3. incorrect license simplification

We have incorrect license simplification that is applied to the detected license expressions. We should not apply simplification for now and rather fix it in the license_expression library. See aboutcode-org/license-expression#49

4. Inaccurate license detection proper

We have incorrect license detection on multiple levels:

4.1 Incorrect mapping of common debian licenses

We do not have a correct mapping for known license symbols of common licenses when we are trying to detect a license as an expression. The set of these is limited to the ones found in `/usr/share/common-licenses/`. For instance:

And also the symbols with a trailing +

4.2 we do not detect correctly some license expression syntax from the declared license

For instance, this weird "academic free license >= 2.1, modified bsd license" where using a mapping in

Though we may be able to apply heuristics where we could replace a comma by " AND " before parsing a license declaration line as an expression.

Because of 4.2 and 4.3 we return way too many unknown licenses.

4.3 we are missing license detection rules to detect accurately the licenses

This is a matter of adding new license rules.

5. diagnosing detection errors is hard

We cannot easily diagnose and fix license detection issues because the details of the detection are not returned. For instance, we cannot easily use scancode-analyzer to help spot and fix issues.
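The splitting heuristic suggested in 2.1 can be sketched as follows. This is only an illustration and not how scancode does it: the function name `split_copyright_text` and the `CHUNK_START` pattern are hypothetical, and the marker list is just the three prefixes quoted above plus blank lines as paragraph breaks.

```python
import re

# Hypothetical markers (assumption): line prefixes that often open a new
# chunk in loosely structured Debian copyright files, as quoted in 2.1.
CHUNK_START = re.compile(
    r"^(License|Copyright|Copyright notice)\s*:", re.IGNORECASE
)


def split_copyright_text(text):
    """Split text into paragraph-like chunks.

    A new chunk starts after a blank line, or at a line that begins with
    one of the CHUNK_START markers.
    """
    chunks, current = [], []
    for line in text.splitlines():
        is_blank = not line.strip()
        if is_blank or CHUNK_START.match(line):
            # Close the chunk collected so far, if any.
            if current:
                chunks.append("\n".join(current))
                current = []
            if is_blank:
                continue
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk could then be passed to license detection separately instead of the whole file at once. Indented continuation lines stay attached to the chunk that precedes them, since they do not start with a marker.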
This is the set of files found in a recent debian-unstable-slim Docker image. The expectations have been regenerated as-is but not yet reviewed.

See also:
- aboutcode-org/scancode.io#128
- aboutcode-org/scancode.io#103

Signed-off-by: Philippe Ombredanne <[email protected]>
From aboutcode-org/scancode-toolkit#2518, detailing the improvements made in each level of the problem.

On the specific issue reported in #128, we have the unstructured copyright file of

The updated debian copyright system has a complete overhaul of the license detection and fixes for certain bugs, which made the improvement here possible:

Before Changes vs After Changes

Now, there are still minor inaccuracies here which are being fixed.

On the progress made in the specific levels of issues discussed in this comment above:

In 2. Dealing with copyright formats:

2.1 Machine Readable Copyrights:

Status: Some critical bugs were fixed; this is now sent directly into scancode license detection as a whole, getting much better results.

WIP: Break this file into parts of text using common paragraph separators seen in debian copyright files, for better detection.

2.2 structured Copyright Files

Status: Mostly done, now working on handling rare cases by running tests on a dataset of collected debian copyright samples: Debian (320K from 2019-11) and Ubuntu (200K files from 2020-06).

In debian copyright files, there are license paragraphs with license text and a license name after

These licenses are then referenced in File and Header paragraphs in license-expression-like strings, which refer to the license texts by name.

Now we fully parse these names and resolve the references to the license texts (instead of having a hand-crafted mapping), and even resolve unparsable expressions if these are also present as names of license texts.

Filters are also added when reporting license detections, to summarize detections based on the Primary License Paragraph and Debian paragraphs, and to return only unique license detections. The option to simplify will be added after aboutcode-org/license-expression#53 is merged and released.

This significantly improves license detection in structured copyright files.

In 3. incorrect license simplification

Status: This is fixed at license-expression, in the process of being merged.

In 4. Inaccurate License Detection proper

4.1 Common Licenses present in

Status: These are now handled correctly.

4.2 we do not detect correctly some license expression syntax from the declared license

Status: We can now parse the debian license expressions correctly, with cleaning and some specialized parsing of commas, according to the debian guidelines.

Previously in debian_licenses.txt there was a mapping of all seen debian license expression present after

Now, instead of having a mapping, these are handled by cleaning up symbols which aren't supported by nexB/license-expression, and then parsing them as proper license expressions.

4.3 we are missing license detection rules to detect accurately the licenses

Status: WIP, this has been made possible by making the license detection diagnosable [in 5. Diagnosing License Detection Problems]

New rules are added for common license detections, and more rules are being added based on the added debian test files. Even more rules can then be added by running nexB/scancode-analyzer on more debian copyright files.

In 5. Diagnosing License Detection Problems:

Status: License detections are now fully diagnosable.

Previously, the license detection in a debian copyright file had as its output only a license-expression string carrying all the detections, and hence it was hard to diagnose license detection problems.

Now the license and copyright detection function returns a DebianDetector object with a list of LicenseDetection objects, which has the original LicenseMatch objects created by scancode LicenseDetection.

This makes it possible to diagnose the root cause of license detection issues, and also to plug the results from license detections in debian copyright files directly into https://github.com/nexB/scancode-analyzer for unique issue detection.

In 1. Bug in symlinks

Status: This is yet to be fixed.
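As a rough illustration of the name resolution described under 2.2 above, the sketch below looks up each name in a declaration among the names of the stand-alone license paragraphs. It is not the scancode-toolkit implementation: `resolve_declaration` and its `license_paragraphs` argument (a plain dict of paragraph name to license text) are hypothetical, and commas are treated only as separators, so operator precedence is ignored.

```python
import re

# Separators between license names in a declaration: a comma (optionally
# followed by "and"/"or"), or a standalone "and"/"or" keyword.
SEPARATORS = re.compile(
    r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+", re.IGNORECASE
)


def resolve_declaration(declaration, license_paragraphs):
    """Map each name in a declaration to its license paragraph text.

    license_paragraphs is a hypothetical dict {paragraph name: text}.
    Names without a matching paragraph map to None.
    """
    by_name = {name.lower(): text for name, text in license_paragraphs.items()}
    whole = declaration.strip()
    # An otherwise unparsable declaration may itself be a paragraph name.
    if whole.lower() in by_name:
        return {whole: by_name[whole.lower()]}
    names = [n for n in SEPARATORS.split(whole) if n]
    return {name: by_name.get(name.lower()) for name in names}
```

Names that resolve to None would be the candidates for the common licenses lookup or for an unknown license; that step is left out here.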
@AyanSinhaMahapatra @pombredanne anything else coming on this one or are we ready to close?
@AyanSinhaMahapatra @pombredanne gentle ping, what's the latest status on this one?
See these issues for details: