LZ78 ASCII file compression
This implementation of LZ78 emulates the LZ78 algorithm (a variant of gzip's algorithm) and is for educational purposes only. Documentations are written throughout the methods, feel free to read them.
Clearly, this repository wasn't created for actual usage but as a for-fun project. DO NOT use this program on any files that are important, as it hasn't been tested enough yet. Do so at your own risk.
Simply put, the input file is first traslated into token (c, k) pairs, where c represents the 'new' character unknown at that time, and k the index to look up before outputting c. This allows repetitive phrases to be encoded efficiently as you only need the index k to look up the rest of the characters recursively.
Stopping here already allows efficient compression on non-trivial sized files, but encoding k in a minimum of 8 or 16 bits (1 or 2 char sized) is sometimes wasteful because k might be < 2^8 or 2^16. This calls for encoding in binary. This implies that the files can be further compressed using log2(max(k)) bits for each k.
Since Java doesn't support bit writing, a wrapper class acts as a buffer that keeps track of the current bit and periodically flushes the data into our binary file when enough bytes are accumulated. Similarly, the decoder wrapper class reads in bits to decode them back into our factors (c, k). The tokens are then re-translated back into sentences.
The older implementation of this project(v1.0.0) uses byte as the smallest unit. The code for it still exists within the repository for browsing.
If you're interested in running the program, you will require Java's JDK version 8.0 or later.
Check the releases page for binary downloads (Jar files).
If you're interested in compiling from sources, there's a gradle build file that automates that too.
That is,
- download gradle
- from the root directory, run
gradle build
For both compiled and downloaded binaries, the usage are as follows:
- Compressing:
java -jar LZ78-X.X.X.jar -c inputfile.[extension]
which produces a inputfile.[extension].lz78
file
- Decompressing:
java -jar LZ78-X.X.X.jar -x inputfile.[extension].lz78
which produces a inputfile.[extension]
file
- Deal with outliers (some files have special characters)
- Using a compressed Trie (instead of a HashMap)
- Translating to C++ (a C++ Trie is already implemented, waiting for the compression program to be written)