GitHub - kajuberdut/jerome: Jerome is a compression library for novels

Jerome

A collection of functions that illustrate several techniques useful in text compression.

About The Project

Jerome is a library of string functions written in pure Python. The library's name is taken from St. Jerome of Stridon, who is considered the patron saint of archivists.

Zero dependencies¹
100% test coverage

Getting Started

To get a local copy up and running, follow these simple steps.

Installing with pip

pip install jerome

For information about cloning and dev setup see: Contributing

Usage

Here is an example showing basic usage.

from datetime import datetime

from jerome import (SymbolKeeper, common, forward_bw, replacer, reverse_bw,
                    runlength_decode, runlength_encode)
from augustine_text.sample_text import words


# 75K words of procedurally generated text
# This is about the length of a novel.
text = words(75000)
text_length = len(text)

compression_start = datetime.now()
# SymbolKeeper is used to portion out un-used symbols
k = SymbolKeeper(
    reserved=set(text)
)  # These appear in our text so we don't want to use them as placeholders

# common is a utility function for finding commonly occurring words
# We're using k from above to create a dictionary where each key is a word
#  and the value is a single symbol replacement for that word
replacements = {word: next(k) for word in common(text, min_length=4)}
# {'dolore': '\x00', 'elit,': '\x02', 'labore': '\x03', ...

# Run replacements
replaced = replacer(text, replacements)
# Burrows-Wheeler transform the text to improve the run-length result
transformed = forward_bw(replaced)
# Runlength encode
runcoded = runlength_encode(transformed)

print(
    f"""| step | result |
| ---- | ------ |
| Original Text length | {text_length} |
| With words replaced | {len(replaced)} |
| Encoded | {(rlen := len(runcoded))} |
| Reasonable length | {(dlen := len(str([(k,v) for k,v in replacements.items()])))} |
| Compressed size % | {round(((rlen+dlen)/text_length)*100, 2)} |
"""
)
compression_end = datetime.now()


# Reverse the whole thing
assert (unruncoded := runlength_decode(runcoded)) == transformed
assert (untransformed := reverse_bw(unruncoded)) == replaced
assert replacer(untransformed, replacements, reverse=True) == text
print(
    f"| Compression time |  {round((compression_end-compression_start).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Decompression time |  {round((datetime.now()-compression_end).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Total time |  {round((datetime.now()-compression_start).total_seconds() * 1000.0)} ms |"
)

Example compression of randomized text:

| step | result |
| ---- | ------ |
| Original Text length | 402270 |
| With words replaced | 198228 |
| Encoded | 74160 |
| Reasonable length | 5724 |
| Compressed size % | 19.86 |
| Compression time | 1045 ms |
| Decompression time | 172 ms |
| Total time | 1217 ms |

NOTE: Times were measured on a Ryzen 3600X @ 3.9 GHz.

This is only an example of how the text functions in Jerome work. Python's built-in bz2 module is both many times faster and many times better at compressing than the above example.
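For a rough baseline to compare against, here is a minimal sketch that measures the compressed size percentage using only the standard library's `bz2` module. The helper name `bz2_ratio` and the repetitive stand-in text are assumptions made for illustration; the example above generates its text with augustine_text, so the numbers will not match.

```python
import bz2


def bz2_ratio(text: str) -> float:
    """Return the bz2-compressed size as a percentage of the UTF-8 size."""
    raw = text.encode("utf-8")
    packed = bz2.compress(raw)
    # Round-trip check: decompressing must give back the original bytes
    assert bz2.decompress(packed) == raw
    return round(len(packed) / len(raw) * 100, 2)


# Repetitive stand-in text (an assumption; not the augustine_text sample)
sample = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 2000
print(f"bz2 compressed size %: {bz2_ratio(sample)}")
```

The percentage is computed the same way as the "Compressed size %" row above, except that bz2 needs no separate replacement dictionary.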

Additional Documentation

Burrows Wheeler Transform
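To give a feel for why the transform is applied before run-length encoding, here is a naive sketch written from the textbook definition. It is not Jerome's `forward_bw` or `reverse_bw`; the names `naive_bwt`, `naive_inverse_bwt`, `run_lengths`, and the `EOS` marker are hypothetical, and sorting every rotation is far too slow for novel-sized input.

```python
from itertools import groupby

EOS = "\x03"  # hypothetical end marker; must not occur in the input


def naive_bwt(s: str) -> str:
    """Sort all rotations of s + EOS and keep the last column."""
    s += EOS
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def naive_inverse_bwt(t: str) -> str:
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(ch + row for ch, row in zip(t, table))
    # The original string is the rotation that ends with the marker
    original = next(row for row in table if row.endswith(EOS))
    return original[:-1]


def run_lengths(s: str) -> list:
    """Collapse consecutive repeats into (character, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


text = "banana"
transformed = naive_bwt(text)
assert naive_inverse_bwt(transformed) == text
print(run_lengths(transformed))
```

The transform tends to group identical characters together, so the transformed string usually has fewer, longer runs than the input, which is what the run-length step benefits from.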

Roadmap

In place BWT
Huffman Coding
Additional examples

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Add tests; we aim for 100% test coverage (see Using Coverage)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

Cloning / Development setup

Clone the repo and install

git clone https://github.com/kajuberdut/Jerome.git
cd Jerome
pipenv install --dev

Run tests
```
pipenv shell
py.test
```

For more about pipenv see: Pipenv GitHub

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Patrick Shechet - [email protected]

Project Link: https://github.com/kajuberdut/Jerome

¹ Examples use augustine_text to generate material to compress.
