Jerome

A collection of functions that illustrate several techniques useful in text compression.

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

Jerome is a library of string functions written in pure Python. The library's name is taken from St. Jerome of Stridon, who is considered the patron saint of archivists.

  • Zero dependencies [1]
  • 100% test coverage

Getting Started

To get a local copy up and running, follow these simple steps.

Installing with pip

pip install jerome

For information about cloning and dev setup see: Contributing

Usage

Here is an example showing basic usage.

from datetime import datetime

from jerome import (SymbolKeeper, common, forward_bw, replacer, reverse_bw,
                    runlength_decode, runlength_encode)
from augustine_text.sample_text import words


# 75K words of procedurally generated text
# This is about the length of a novel.
text = words(75000)
text_length = len(text)

compression_start = datetime.now()
# SymbolKeeper is used to portion out un-used symbols
k = SymbolKeeper(
    reserved=set(text)
)  # These appear in our text so we don't want to use them as placeholders

# common is a utility function for finding commonly occurring words
# We're using k from above to create a dictionary where each key is a word
#  and the value is a single symbol replacement for that word
replacements = {word: next(k) for word in common(text, min_length=4)}
# {'dolore': '\x00', 'elit,': '\x02', 'labore': '\x03', ...

# Run replacements
replaced = replacer(text, replacements)
# Burrows-Wheeler transform the text to improve the run-length result
transformed = forward_bw(replaced)
# Run-length encode
runcoded = runlength_encode(transformed)

print(
    f"""| step | result |
| ---- | ------ |
| Original Text size | {text_length} |
| With words replaced | {len(replaced)} |
| Encoded | {(rlen := len(runcoded))} |
| Reasonable length | {(dlen := len(str([(k,v) for k,v in replacements.items()])))} |
| Compressed size % | {round(((rlen+dlen)/text_length)*100, 2)} |
"""
)
compression_end = datetime.now()


# Reverse the whole thing
assert (unruncoded := runlength_decode(runcoded)) == transformed
assert (untransformed := reverse_bw(unruncoded)) == replaced
assert replacer(untransformed, replacements, reverse=True) == text
print(
    f"| Compression time |  {round((compression_end-compression_start).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Decompression time |  {round((datetime.now()-compression_end).total_seconds() * 1000.0)} ms |"
)
print(
    f"| Total time |  {round((datetime.now()-compression_start).total_seconds() * 1000.0)} ms |"
)
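
To illustrate why the Burrows-Wheeler step comes before run-length encoding in the example above, here is a minimal, self-contained sketch of both techniques. It is not Jerome's implementation: the function names (bwt_forward, bwt_reverse, rle_encode, rle_decode) and the end marker are assumptions made for this illustration, and the naive rotation-sorting approach is only practical for short strings.

```python
from itertools import groupby

# Assumed end-of-text marker for this sketch; it must not occur in the input.
EOF = "\x03"


def bwt_forward(s):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    s += EOF
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_reverse(t):
    """Invert the transform by repeatedly prepending the column and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    # The row that ends with the marker is the original text plus the marker.
    original = next(row for row in table if row.endswith(EOF))
    return original[:-1]


def rle_encode(s):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(pairs):
    """Expand (char, count) pairs back into a string."""
    return "".join(ch * count for ch, count in pairs)


transformed = bwt_forward("banana")
print(repr(transformed))  # 'annb\x03aa'
print(len(rle_encode("banana")), "runs before,", len(rle_encode(transformed)), "runs after")

# Both steps are lossless, so the round trip restores the input.
assert bwt_reverse(rle_decode(rle_encode(transformed))) == "banana"
```

The transform groups identical characters next to each other, so the run-length encoder has fewer, longer runs to record.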

Example compression of randomized text:

| step | result |
| ---- | ------ |
| Original Text length | 402270 |
| With words replaced | 198228 |
| Encoded | 74160 |
| Reasonable length | 5724 |
| Compressed size % | 19.86 |
| Compression time | 1045 ms |
| Decompression time | 172 ms |
| Total time | 1217 ms |
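
The "Compressed size %" figure can be reproduced from the other rows: it is the encoded length plus the length of the stringified replacement pairs, divided by the original length. A small check using the numbers reported above (the variable names are chosen for this illustration only):

```python
original_length = 402270   # "Original Text length"
encoded_length = 74160     # "Encoded"
dictionary_length = 5724   # "Reasonable length": the stringified replacement pairs

# Same formula as in the usage script: (rlen + dlen) / text_length * 100
compressed_percent = round((encoded_length + dictionary_length) / original_length * 100, 2)
print(compressed_percent)  # 19.86
```

The replacement pairs are counted because they are needed to reverse the word substitution during decompression.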

NOTE: Timings were measured on a Ryzen 3600X @ 3.9 GHz.

This is only an example of how the text functions in Jerome work. Python's built-in bz2 module is both many times faster and many times better at compressing than the above example.
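
For comparison, a minimal sketch of the same round trip with the standard library's bz2 module. The sample text here is a hypothetical placeholder, not the augustine_text output used above, so the resulting percentage is not comparable to the table.

```python
import bz2

# Hypothetical, highly repetitive sample text (an assumption for this sketch).
text = "the quick brown fox jumps over the lazy dog " * 2000

raw = text.encode("utf-8")
compressed = bz2.compress(raw)

# Decompression restores the original bytes exactly.
assert bz2.decompress(compressed) == raw

percent = round(len(compressed) / len(raw) * 100, 2)
print(f"Original: {len(raw)} bytes, compressed: {len(compressed)} bytes ({percent}%)")
```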

Additional Documentation

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Add tests; we aim for 100% test coverage (see: Using Coverage)
  4. Commit your Changes (git commit -m 'Add some AmazingFeature')
  5. Push to the Branch (git push origin feature/AmazingFeature)
  6. Open a Pull Request

Cloning / Development setup

  1. Clone the repo and install
    git clone https://github.com/kajuberdut/Jerome.git
    cd Jerome
    pipenv install --dev
  2. Run tests
    pipenv shell
    py.test

For more about pipenv see: Pipenv GitHub

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Patrick Shechet - [email protected]

Project Link: https://github.com/kajuberdut/Jerome

Footnotes

  1. Examples use augustine_text to generate material to compress
