A collection of functions that illustrate several techniques useful in text compression.
Jerome is a library of string functions written in pure Python. The library's name is taken from St. Jerome of Stridon, who is considered the patron saint of archivists.
- Zero dependencies [1]
- 100% test coverage
To get a local copy up and running, follow these simple steps.
pip install jerome
For information about cloning and dev setup see: Contributing
Here is an example showing basic usage.
from datetime import datetime
from jerome import (SymbolKeeper, common, forward_bw, replacer, reverse_bw,
                    runlength_decode, runlength_encode)
from augustine_text.sample_text import words
# 75K words of procedurally generated text
# This is about the length of a novel.
text = words(75000)
text_length = len(text)
compression_start = datetime.now()
# SymbolKeeper is used to portion out un-used symbols
k = SymbolKeeper(
    reserved=set(text)
)  # These characters appear in our text, so we don't want to use them as placeholders
# common is a utility function for finding commonly occurring words
# We're using k from above to create a dictionary where each key is a word
# and the value is a single symbol replacement for that word
replacements = {word: next(k) for word in common(text, min_length=4)}
# {'dolore': '\x00', 'elit,': '\x02', 'labore': '\x03', ...
# Run replacements
replaced = replacer(text, replacements)
# Burrows Wheeler transform the text to improve runlength result
transformed = forward_bw(replaced)
# Runlength encode
runcoded = runlength_encode(transformed)
print(
    f"""| step | result |
| ---- | ------ |
| Original Text length | {text_length} |
| With words replaced | {len(replaced)} |
| Encoded | {(rlen := len(runcoded))} |
| Reasonable length | {(dlen := len(str([(k,v) for k,v in replacements.items()])))} |
| Compressed size % | {round(((rlen+dlen)/text_length)*100, 2)} |
"""
)
compression_end = datetime.now()
# Reverse the whole thing
assert (unruncoded := runlength_decode(runcoded)) == transformed
assert (untransformed := reverse_bw(unruncoded)) == replaced
assert replacer(untransformed, replacements, reverse=True) == text
print(
    f"| Compression time | {round((compression_end-compression_start).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Decompression time | {round((datetime.now()-compression_end).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Total time | {round((datetime.now()-compression_start).total_seconds() * 1000.0)} ms |"
)
| step | result |
| ---- | ------ |
| Original Text length | 402270 |
| With words replaced | 198228 |
| Encoded | 74160 |
| Reasonable length | 5724 |
| Compressed size % | 19.86 |
| Compression time | 1045 ms |
| Decompression time | 172 ms |
| Total time | 1217 ms |
NOTE: Timings were measured on a Ryzen 3600X @ 3.9 GHz.
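The "Compressed size %" row counts both the run-length encoded output and the replacement dictionary, since both are needed to restore the text. As a quick check, the arithmetic can be reproduced from the figures in the table above (the variable names here are only illustrative):

```python
# Figures copied from the results table above.
original_length = 402270   # "Original Text length"
encoded_length = 74160     # "Encoded": length of the run-length encoded output
dictionary_length = 5724   # "Reasonable length": length of the stringified replacements

# Same formula as in the example: (rlen + dlen) / text_length * 100
compressed_pct = round(
    (encoded_length + dictionary_length) / original_length * 100, 2
)
print(f"| Compressed size % | {compressed_pct} |")  # 19.86
```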
This is only an example of how the text functions in Jerome work. Python's built-in bz2 module is both many times faster and many times better at compressing than the above example.
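For comparison, here is a minimal sketch of the same measurement using only the standard library's bz2 module. The sample text is a hypothetical stand-in (the example above uses augustine_text); because it is highly repetitive, it compresses far better than real prose would, so the printed ratio is not comparable to the table above.

```python
import bz2

# Hypothetical stand-in text, NOT the procedurally generated text used above.
text = "lorem ipsum dolor sit amet, consectetur adipiscing elit " * 2000

raw = text.encode("utf-8")
compressed = bz2.compress(raw)

# The round trip must give back exactly the original bytes.
assert bz2.decompress(compressed) == raw

pct = round(len(compressed) / len(raw) * 100, 2)
print(f"| bz2 compressed size % | {pct} |")
```

To compare on equal terms, replace the stand-in with the same text passed to the Jerome functions.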
- In place BWT
- Huffman Coding
- Additional examples
See the open issues for a list of proposed features (and known issues).
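For readers unfamiliar with the transform named in the list above, here is a naive sketch of a Burrows-Wheeler transform followed by run-length encoding. This is not Jerome's implementation: the function names (naive_bwt, naive_inverse_bwt, rle_pairs, rle_expand) and the "\x03" end marker are assumptions made for illustration. The forward transform builds every rotation of the input in memory, so it needs quadratic space, which is presumably the cost an in-place variant would avoid.

```python
from itertools import groupby


def naive_bwt(s: str, end: str = "\x03") -> str:
    """Return the last column of the sorted rotations of s + end."""
    assert end not in s, "the end marker must not occur in the input"
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def naive_inverse_bwt(t: str, end: str = "\x03") -> str:
    """Rebuild the input by repeatedly prepending t and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(ch + row for ch, row in zip(t, table))
    # Exactly one row ends with the unique end marker: the original string.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


def rle_pairs(s: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into (char, count)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_expand(pairs: list[tuple[str, int]]) -> str:
    """Inverse of rle_pairs."""
    return "".join(ch * count for ch, count in pairs)


transformed = naive_bwt("banana")
pairs = rle_pairs(transformed)

# Reverse both steps and confirm the round trip.
assert naive_inverse_bwt(rle_expand(pairs)) == "banana"
```

The transform tends to group identical characters together, which is why the example above applies it before run-length encoding.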
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Add tests, we aim for 100% test coverage (Using Coverage)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Clone the repo and install
git clone https://github.com/kajuberdut/Jerome.git
cd Jerome
pipenv install --dev
- Run tests
pipenv shell
py.test
For more about pipenv see: Pipenv Github
Distributed under the MIT License. See LICENSE for more information.
Patrick Shechet - [email protected]
Project Link: https://github.com/kajuberdut/Jerome
Footnotes
[1] Examples use augustine_text to generate material to compress.