docs: update design docs
junkurihara committed Jul 3, 2023
1 parent 8d284a8 commit 670170f
Showing 5 changed files with 77 additions and 61 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -17,5 +17,5 @@ Cargo.lock

_*

obsolete
.vscode
12 changes: 6 additions & 6 deletions Cargo.toml
@@ -13,13 +13,13 @@ categories = ["compression", "algorithms", "mathematics"]
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
anyhow = "1.0.71"
async-trait = "0.1.69"
bitvec = "1.0.1"
futures = "0.3.28"
hashlink = "0.8.3"
libecc = { path = "src/libecc", version = "0.2.2" }
tokio = { version = "1.29.1", features = ["rt", "macros", "rt-multi-thread"] }

[dev-dependencies]
rand = "0.8.5"
110 changes: 63 additions & 47 deletions DESIGN.md
@@ -1,93 +1,109 @@

# Design of GD based on Reed-Solomon codes over `GF(256)`
# Design of GD based on Reed-Solomon codes over $\mathrm{GF}(2^8)$

Unlike Hamming codes, RS codes are *non-perfect* codes, which means the simple *sphere packing* approach cannot be directly applied to employ GD. This implies that for an arbitrary given chunk $c = [c_1,\dots,c_n] \in \mathrm{GF}(q)^n$, the following does **NOT** always hold in an $(n, k)$ RS code $\mathcal{C}$ [^rs].

**The necessary condition of perfect codes:**

$$\left|\mathop{\mathrm{argmin}}\, \{ d_H(c, v) : v \in \mathcal{C} \}\right| = 1,$$

where $d_H(c, v)$ is the Hamming distance between $c \in \mathrm{GF}(q)^n$ and $v \in \mathrm{GF}(q)^n$. In other words, there may exist more than one codeword nearest to the chunk. If it is unique, an arbitrary chunk $c \in \mathrm{GF}(q)^n$ can be uniquely converted to a tuple of (a codeword $w \in \mathcal{C}$ as a *base*, a virtual additive error $e \in \mathrm{GF}(q)^n$ as *deviation*), i.e., $c = w + e$. However, this does not always hold for RS codes. Thus, for RS codes, we need a certain rule to *forcibly and deterministically* map a given chunk to a codeword even if there exist two or more candidates for such a codeword.

[^rs]: An $(n,k)$ linear code $\mathcal{C}$ over $\mathrm{GF}(q)$ means a linear subspace $\mathcal{C} \subseteq \mathrm{GF}(q)^n$ of dimension $k$.
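To see why the nearest codeword can fail to be unique, the following sketch uses a hedged toy example: a tiny binary linear code (not an RS code, chosen only because the tie is easy to verify by hand). The chunk $[0,0,1]$ is at Hamming distance 1 from both codewords of $\mathcal{C} = \{000, 011\}$.

```rust
// Toy illustration of non-uniqueness of the nearest codeword.
// C = {000, 011} is a binary linear code of dimension 1; the chunk
// [0, 0, 1] ties at distance 1 with BOTH codewords, so |argmin| = 2.
fn hamming_distance(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn main() {
    let code: [[u8; 3]; 2] = [[0, 0, 0], [0, 1, 1]];
    let chunk = [0u8, 0, 1];
    let dists: Vec<usize> = code.iter().map(|w| hamming_distance(&chunk, w)).collect();
    let min = *dists.iter().min().unwrap();
    // Two codewords attain the minimum distance, violating the
    // necessary condition of perfect codes stated above.
    assert_eq!(dists.iter().filter(|&&d| d == min).count(), 2);
}
```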

To this end, we take the following rule in our implementation.

---

## 1. Simple split of given chunks, and assumption on virtual error location

Let $c = [c_1,\dots,c_n] \in \mathrm{GF}(q)^n = \mathrm{GF}(2^8)^n$ be a given data chunk, and let $c_i$ be a byte, i.e., an element of $\mathrm{GF}(2^8)$. We assume an $(n, k)$ RS code $\mathcal{C}$ is employed.

In our implementation, $c$ is first simply split into two subvectors: the left-most $k$ bytes

$$
c_l = [c_1,c_2,\dots,c_k] \in \mathrm{GF}(2^8)^k,
$$

and the right-most $n-k$ bytes

$$
c_r = [c_{k+1},\dots,c_n] \in \mathrm{GF}(2^8)^{n-k}.
$$

Here we regard $c_r$ as a part containing errors and $c_l$ as error-free. This means **we fix virtual error locations at the right-most $n-k$ symbols of the given chunk**.
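The split above can be sketched as follows; this is a minimal illustration with hypothetical names, not this crate's actual API.

```rust
// Split an n-byte chunk into c_l (left-most k bytes, assumed error-free)
// and c_r (right-most n - k bytes, assumed to carry the virtual errors).
fn split_chunk(c: &[u8], k: usize) -> (&[u8], &[u8]) {
    assert!(k <= c.len(), "k must not exceed the chunk length n");
    c.split_at(k) // returns (c_l, c_r)
}

fn main() {
    // (n, k) = (10, 8): an arbitrary 10-byte chunk for illustration.
    let chunk: Vec<u8> = (1..=10).collect();
    let (cl, cr) = split_chunk(&chunk, 8);
    assert_eq!(cl, &[1, 2, 3, 4, 5, 6, 7, 8]); // c_l in GF(2^8)^8
    assert_eq!(cr, &[9, 10]);                  // c_r in GF(2^8)^2
}
```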

## 2. Derivation of a corresponding codeword of $\mathcal{C}$ only from the first $k$ symbols of a chunk $c$

We see that $(n,k)$ RS codes are maximum distance separable (MDS), and hence a codeword can be reconstructed from any error-free $k$ of its $n$ symbols. Thus, as we fixed the virtual error positions above at $c_r$, we can identify $c_l$ with a codeword (i.e., base) $w \in \mathcal{C}$. In other words, there exists an isomorphism $\phi: \mathrm{GF}(2^8)^k \rightarrow \mathcal{C}$ with $\phi(c_l) = w$.

In the deduplication process, we then obtain the codeword $w$ uniquely corresponding to the chunk $c$ from $c_l$; that is, we have $w = \phi(c_l)$. Here, we suppose this bijection is expressed by a **systematic generator matrix** $G = \left[ \begin{array}{cc} I & P \end{array} \right] \in \mathrm{GF}(2^8)^{k \times n}$ of $\mathcal{C}$, where $I$ is a $k \times k$ identity matrix. Namely, we have the codeword $w$ by the following calculation:

$$
w = c_l G
= c_l \left[ \begin{array}{cc} I & P \end{array} \right]
= \left[ \begin{array}{cc} c_l & c_l P \end{array} \right] \in \mathcal{C}.
$$

Then, for the codeword $w$, the deviation, i.e., virtual additive error, $e$ is easily computed in such a way that $c = w + e$ is satisfied:

$$
e = \left[0,\dots,0, c_r - c_l P \right].
$$

This means that the error part, i.e., deviation, is aligned to the right-most $n-k$ symbols of the chunk. Thus the deviation can be expressed as a vector of $n-k$ bytes.
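The derivation of the base and deviation above can be sketched in code. This is a hedged toy example, not this crate's API: the $\mathrm{GF}(2^8)$ multiplication uses the AES polynomial $x^8+x^4+x^3+x+1$ (the actual implementation may use a different primitive polynomial), and the $(n,k)=(4,2)$ parity matrix $P$ is arbitrary.

```rust
// Carry-less "Russian peasant" multiplication in GF(2^8), reducing
// modulo the AES polynomial 0x11b (an assumption for this sketch).
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 { p ^= a; }
        let hi = a & 0x80;
        a <<= 1;
        if hi != 0 { a ^= 0x1b; } // reduce modulo the primitive polynomial
        b >>= 1;
    }
    p
}

// Parity part of the codeword: c_l * P, where P is a k x (n-k) matrix.
fn parity(cl: &[u8], p: &[Vec<u8>]) -> Vec<u8> {
    let nk = p[0].len();
    (0..nk)
        .map(|j| cl.iter().zip(p).fold(0u8, |acc, (&c, row)| acc ^ gf_mul(c, row[j])))
        .collect()
}

fn main() {
    // Toy (n, k) = (4, 2) code with an arbitrary parity matrix P.
    let p = vec![vec![0x02, 0x03], vec![0x01, 0x01]];
    let chunk = [0x10u8, 0x20, 0xaa, 0xbb]; // c = [c_l | c_r]
    let (cl, cr) = chunk.split_at(2);

    // Base: w = [c_l, c_l P]; deviation: e = [0, 0, c_r - c_l P]
    // (subtraction is XOR in characteristic 2).
    let clp = parity(cl, &p);
    let w: Vec<u8> = cl.iter().chain(clp.iter()).copied().collect();
    let dev: Vec<u8> = cr.iter().zip(&clp).map(|(&r, &q)| r ^ q).collect();

    // Check c = w + e on the last n - k symbols: (c_l P) + (c_r - c_l P) = c_r.
    let rec: Vec<u8> = w[2..].iter().zip(&dev).map(|(&a, &b)| a ^ b).collect();
    assert_eq!(rec, cr);
}
```

Note that `dev` (the $n-k$ bytes $c_r - c_l P$) is exactly what Section 3 stores as the deviation.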

## 3. Representation of base and deviation in deduplicated output and dictionary

In the deduplicated data stream and the dictionary of deduplication instances, the base $w$ and the deviation $e$ are expressed as $c_l \in \mathrm{GF}(2^8)^k$ and $c_r - c_l P \in \mathrm{GF}(2^8)^{n-k}$, respectively. This is because $c_l$ can be identified with $w \in \mathcal{C}$ as mentioned above. Also, $c_r - c_l P$ is identified with $e$ thanks to its special structure due to the systematic generator matrix and the fixed positions of virtual errors.

---

## Rationale and more reasonable approach for generic data based on *error-alignment*

Observe that in GD, we can arbitrarily assume *positions of virtual additive errors* in a given chunk to calculate the deviation. In the above method, we simply suppose that the message part $c_l = [c_1,...,c_k]$ of chunk $c$ is error-free. Thus, we can uniquely fix the base, i.e., $w = c_l G \sim c_l$, and the deviation $e = [0,...,0, c_r- c_l P] \sim c_r - c_l P$ as well. Thus we can execute GD by applying this constraint.

However, *since virtual additive errors are fluctuations in given data around its centered base, they would not always be contained in the right-most $n-k$ symbols of an $n$-byte chunk $c$*, even if we carefully choose the parameters $n$ and $k$ according to the given data type/structure. Thus, in order to reasonably apply an $(n,k)$ RS code to more generic data types/structures in GD, **we should also configure the positions of virtual additive errors.**

To this end, we can take the approach of *"error alignment"*, or *"pushing errors aside"*, on a given chunk by precoding data chunks (by applying the `GD.align_error(m: Vec<Vec<u8>>)` method).

The very basic idea of error-alignment is given in the following paper in terms of *reordering high entropy bits*:

> Vestergaard, Rasmus, Daniel E. Lucani, and Qi Zhang. "Generalized deduplication: Lossless compression for large amounts of small IoT data." European Wireless 2019; 25th European Wireless Conference. VDE, 2019.

In our concept, the idea is generalized a bit further by employing a *linear transformation* instead of reordering (permutation). In particular, for a specific data type, we first fix a linear transformation $T: \mathrm{GF}(2^8)^n \rightarrow \mathrm{GF}(2^8)^n$, i.e., multiplication by a nonsingular $n \times n$ matrix $T \in \mathrm{GF}(2^8)^{n \times n}$. Note that the simplest $T$ is typically a permutation matrix that aligns the error symbols to the last positions, as given in the above paper. We then execute the precoding on a given chunk $c$ as follows.

$$
[x_l, x_r] = cT \in \mathrm{GF}(2^8)^n,
$$

where $x_l \in \mathrm{GF}(2^8)^k$ and $x_r \in \mathrm{GF}(2^8)^{n-k}$. Then, the base $w$ and the deviation $e$ are calculated on $[x_l, x_r]$ instead of $[c_l, c_r]$ by the above approach, as follows:

$$
w = x_l G
= [x_l, x_l P],
$$

and

$$
e = [0,...,0, x_r - x_l P],
$$

and $x_l$ is recorded as a base and $x_r - x_l P$ is regarded as a deviation in the deduplicated data stream and the GD dictionary.

We should note that *the linear transformation $T$ pushes the virtual errors contained in $c$ of the specific data form to the right-most $n-k$ symbols of a transformed chunk of length $n$.*

The above operations are simply concatenated into the following:

$$
T = \left[\begin{array}{cc} T_l & T_r \end{array} \right], \quad
w = c T_l G = [c T_l,\; c T_l P], \quad
e = [0,\dots,0,\; c T_r - c T_l P],
$$

where $T_l \in \mathrm{GF}(2^8)^{n \times k}$ and $T_r \in \mathrm{GF}(2^8)^{n \times (n-k)}$.
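For the simplest case where $T$ is a permutation matrix, the precoding $x = cT$ can be sketched as below. This is an assumed illustration of the idea, not the internals of `GD.align_error`; the noisy positions are hypothetical.

```rust
// Apply a permutation (the simplest nonsingular T): position perm[i]
// of the chunk c is moved to position i of the precoded vector x = c T.
fn permute(c: &[u8], perm: &[usize]) -> Vec<u8> {
    perm.iter().map(|&i| c[i]).collect()
}

fn main() {
    // n = 5, k = 3: suppose positions 1 and 3 are expected to fluctuate,
    // so T pushes them into the right-most n - k = 2 slots.
    let perm = [0, 2, 4, 1, 3];
    let chunk = [10u8, 99, 20, 77, 30]; // fluctuations at indices 1 and 3
    let x = permute(&chunk, &perm);
    assert_eq!(x, vec![10, 20, 30, 99, 77]); // x = [x_l | x_r]
}
```

After this precoding, the base/deviation split of the previous sections is applied to $[x_l, x_r]$ instead of $[c_l, c_r]$.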

Since it is known that virtual error positions must be properly configured to achieve better deduplication performance, prior work has considered dynamically adjusting the code length and error positions by splitting a chunk into subchunks that form a specific pattern of fluctuations. In contrast, the error-alignment approach simply aligns the errors in a data chunk to the last positions, so that the chunk can be processed by a single GD instance with a single code parameter.

**The most important factor to achieve better deduplication rate in GD is the estimation of fluctuation/virtual-error patterns contained in given data chunks**.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Jun Kurihara

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
10 changes: 5 additions & 5 deletions README.md
@@ -36,7 +36,7 @@ use rust_gd::*

**NOTE: The compression rate strongly depends on the data alignment and data structure. So you should carefully choose the parameters according to the characteristics of given data**.

### GD with Reed-Solomon code over $\mathrm{GF}(2^8)$

```rust:
use rust_gd::*;
@@ -64,7 +64,7 @@ println!("> Duped size {} bytes", y.len();
assert_eq!(duped, words);
```

In GD with RS codes, an approach of **error-alignment** can be employed by

```rust:
// Linear transformation matrix used for error-alignment. This must be nonsinglar.
@@ -115,11 +115,11 @@ println!("> Duped size {} bytes", y.len();

## Codes in our implementation

Currently, our GD implementation is based only on Hamming and Reed-Solomon (RS) codes. The RS-based GD processes data chunks as a *byte stream*, whereas the Hamming-based GD treats data chunks as a *bit stream*.

For the GD implementation using Hamming codes, the Hamming code of degree $m = 3$, i.e., of code length $n = 2^m - 1 = 7$, works in the internal `libecc` library of error-correcting codes. However, the Hamming code of $m = 3$ cannot be employed as the underlying linear code of Hamming-based GD. This is because the code length of $n = 7$ bits is not sufficient to deduplicate "byte"-based data: to reasonably deduplicate byte-based data, *byte alignment* is needed. So we omit $m = 3$ and consider only the parameters $m \geq 4$.

**Byte alignment**: Our implementation employs an encoding method that chunks message sequences in units of bytes. For example, if the $(15, 11)$ Hamming code is employed, a 2-byte message is divided into two one-byte (= 8-bit) sequences, and $15-8=7$ zero bits are padded to each sequence so that it can be treated as a 15-bit codeword of the Hamming code.
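The byte alignment described above can be sketched as follows; the exact bit ordering in `libecc` may differ, so this is an assumed layout for illustration only.

```rust
// Byte alignment for a (15, 11) Hamming-based GD instance: each 8-bit
// byte is placed in a 15-bit block, with the remaining 15 - 8 = 7 bits
// zero-padded (assumed low-bit-first layout, not the exact libecc order).
fn byte_align(msg: &[u8]) -> Vec<[bool; 15]> {
    msg.iter()
        .map(|&byte| {
            let mut block = [false; 15];
            for i in 0..8 {
                // The low 8 positions carry the byte; the rest stay zero.
                block[i] = (byte >> i) & 1 == 1;
            }
            block
        })
        .collect()
}

fn main() {
    let blocks = byte_align(&[0xff, 0x01]); // a 2-byte message
    assert_eq!(blocks.len(), 2); // one 15-bit block per byte
    assert_eq!(blocks[0].iter().filter(|&&b| b).count(), 8); // 0xff -> 8 ones
    assert_eq!(blocks[1].iter().filter(|&&b| b).count(), 1); // 0x01 -> 1 one
}
```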

## TODO

