Compress data files to save space #94

jamestwebber · 2024-05-10T15:20:03Z

This PR was motivated by a discussion about PEP 639 which might recommend using this package in build tools. In that context, package size is a big concern.

The package is about 1.2 MB installed, and most of that comes from scancode-licensedb-index.json. I gzipped the data file and modified the code to read it; the JSON compresses to <10% of its original size and all the tests pass.
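
For illustration, here is a minimal sketch of reading a gzipped JSON data file with only the standard library. The function name `load_index` and the idea of a `.json.gz` path are assumptions, not necessarily what this PR's diff uses.

```python
import gzip
import json


def load_index(path):
    """Load a gzip-compressed JSON file and return the decoded object."""
    # "rt" opens the stream in text mode so json.load receives str, not bytes.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

The decompression happens once at load time, so the cost is paid only when the index is first read.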

pombredanne

Thanks!
The zip compression used in wheels is indeed pretty weak.
I wonder if we can do even better with lzma, which has been built in since Python 3.3?
Also, I would prefer to avoid having the compressed JSON in Git if at all possible, to keep proper diffs and keep the repo as small as it can be.

What about this:

- Modify the code to accept either the JSON or the lzma-compressed input.
- Add the compressed version to .gitignore.
- Update the build to use flot (https://github.com/aboutcode-org/flot) with a small prebuild script that does the compression as part of the build.
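
A rough sketch of the first and third steps, using only the standard library. The function names `compress_index` and `load_index` and the `.xz` suffix convention are hypothetical, and how the prebuild step would be wired into flot is not shown here.

```python
import json
import lzma
from pathlib import Path


def compress_index(json_path):
    """Prebuild step: write an .xz copy next to the plain JSON file."""
    json_path = Path(json_path)
    xz_path = json_path.with_name(json_path.name + ".xz")
    # preset=9 trades build time for the smallest output.
    xz_path.write_bytes(lzma.compress(json_path.read_bytes(), preset=9))
    return xz_path


def load_index(json_path):
    """Load the index from plain JSON if present, else from the .xz copy."""
    json_path = Path(json_path)
    if json_path.exists():
        # Git checkout: the plain, diffable JSON is the source of truth.
        with json_path.open(encoding="utf-8") as f:
            return json.load(f)
    # Built wheel (assumed to ship only the compressed file).
    xz_path = json_path.with_name(json_path.name + ".xz")
    with lzma.open(xz_path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Checking for the plain JSON first means a stale, git-ignored `.xz` file left over from an earlier build cannot shadow edits made in a checkout.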

jamestwebber · 2024-05-10T15:52:00Z

That all sounds reasonable, but I don't have time at the moment (this version was super easy 😅). I can try to make those changes next week, or someone else can take over.

pombredanne · 2024-05-10T16:07:06Z

Actually in the context of https://discuss.python.org/t/pep-639-round-3-improving-license-clarity-with-better-package-metadata/53020/1 I think we can do better.

We can build a minimal license-expression-mini wheel that would contain a subset of the license data, say just the essential license keys, stored as a list of tuples without field names.

$ wget https://raw.githubusercontent.com/nexB/license-expression/c20b3f605daefc7cd9e4dc7b34e95280f206def3/src/license_expression/data/scancode-licensedb-index.json
$ ll
total 868
drwxrwxr-x  2 foobar foobar   4096 May 10 17:56 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ python
Python 3.10.13 (main, Jan  6 2024, 18:44:10) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> j=json.load(open("scancode-licensedb-index.json"))
>>> mini=[]
>>> for l in j:
...  l.pop("json")
...  l.pop("yaml")
...  l.pop("html")
...  l.pop("license")
...  mini.append(list(l.values()))
... 
>>> with open("mini.json", "w") as o:
...  o.write(json.dumps(mini, separators=(',', ':')))
... 
>>> 
$ ll
total 1056
drwxrwxr-x  2 foobar foobar   4096 May 10 18:00 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ xz -z -k -9 mini.json 
$ ll
total 1080
drwxrwxr-x  2 foobar foobar   4096 May 10 18:01 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar  23704 May 10 18:00 mini.json.xz
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json

It would be down to 23K of compressed data :)
I would still want to use flot to generate multiple wheels from the same repo and keep the current wheel as-is.
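
One thing a values-only format needs is an agreed field order, since `list(l.values())` only lines up across rows if every record carries the same keys in the same order, which the session above does not check. A possible sketch of packing and unpacking with an explicit field list follows; the names in `FIELDS` are placeholders chosen for illustration and may not match the real index, and all four function names are hypothetical.

```python
import json
import lzma

# Hypothetical field order; the real index's field names may differ.
FIELDS = ("license_key", "spdx_license_key", "is_exception")


def pack(records, fields=FIELDS):
    """Turn a list of dicts into rows of bare values in a fixed order."""
    # dict.get keeps every row the same length even if a record lacks a field.
    return [[rec.get(name) for name in fields] for rec in records]


def unpack(rows, fields=FIELDS):
    """Rebuild dicts from positional rows written by pack()."""
    return [dict(zip(fields, row)) for row in rows]


def dump_mini(records, fields=FIELDS):
    """Serialize packed rows as compact JSON and compress with LZMA."""
    text = json.dumps(pack(records, fields), separators=(",", ":"))
    return lzma.compress(text.encode("utf-8"), preset=9)


def load_mini(blob, fields=FIELDS):
    """Inverse of dump_mini(): decompress, parse, and restore field names."""
    return unpack(json.loads(lzma.decompress(blob)), fields)
```

Keeping the field list in code rather than in the data file is what lets the file drop its keys; the trade-off is that the reader and the build script have to share the same `FIELDS` definition.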

gzipped data files

f5e9e6d

pombredanne requested changes May 10, 2024
