docs: update design docs
junkurihara committed Jul 3, 2023
1 parent 8d284a8 commit 670170f
Showing 5 changed files with 77 additions and 61 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -17,5 +17,5 @@ Cargo.lock

_*

obsolete
.vscode
12 changes: 6 additions & 6 deletions Cargo.toml
@@ -13,13 +13,13 @@ categories = ["compression", "algorithms", "mathematics"]
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
anyhow = "1.0.71"
async-trait = "0.1.69"
bitvec = "1.0.1"
futures = "0.3.28"
hashlink = "0.8.3"
libecc = { path = "src/libecc", version = "0.2.2" }
tokio = { version = "1.29.1", features = ["rt", "macros", "rt-multi-thread"] }

[dev-dependencies]
rand = "0.8.5"
110 changes: 63 additions & 47 deletions DESIGN.md
@@ -1,93 +1,109 @@

# Design of GD based on Reed-Solomon codes over `GF(256)`
# Design of GD based on Reed-Solomon codes over $\mathrm{GF}(2^8)$

Unlike Hamming codes, RS codes are *non-perfect* codes, which means the simple *sphere packing* approach cannot be directly applied to employ GD. This implies that for an arbitrary given chunk $c = [c_1,\dots,c_n] \in \mathrm{GF}(q)^n$, the following does **NOT** always hold in an $(n, k)$ RS code $\mathcal{C}$ [^rs].

**The necessary condition of perfect codes:**

$$\left|\mathop{\mathrm{argmin}}\, \{ d_H(c, v) : v \in \mathcal{C} \}\right| = 1,$$

where $d_H(c, v)$ is the Hamming distance between $c \in \mathrm{GF}(q)^n$ and $v \in \mathrm{GF}(q)^n$. In other words, there may exist more than one codeword nearest to the chunk. If it is unique, an arbitrary chunk $c \in \mathrm{GF}(q)^n$ can be uniquely converted to a tuple of (a codeword $w \in \mathcal{C}$ as a *base*, a virtual additive error $e \in \mathrm{GF}(q)^n$ as *deviation*), i.e., $c = w + e$. However, this does not always hold for RS codes. Thus, for RS codes, we need a certain rule to *forcibly and deterministically* map a given chunk to a codeword even if there exist two or more candidates for such a codeword.

[^rs]: An $(n,k)$ linear code $\mathcal{C}$ over $\mathrm{GF}(q)$ means a linear subspace $\mathcal{C} \subseteq \mathrm{GF}(q)^n$ of dimension $k$.
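To see why the nearest codeword can fail to be unique, the following sketch uses a hedged toy example: a tiny binary linear code (not an RS code, chosen only because the tie is easy to verify by hand). The chunk $[0,0,1]$ is at Hamming distance 1 from both codewords of $\mathcal{C} = \{000, 011\}$.

```rust
// Toy illustration of non-uniqueness of the nearest codeword.
// C = {000, 011} is a binary linear code of dimension 1; the chunk
// [0, 0, 1] ties at distance 1 with BOTH codewords, so |argmin| = 2.
fn hamming_distance(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn main() {
    let code: [[u8; 3]; 2] = [[0, 0, 0], [0, 1, 1]];
    let chunk = [0u8, 0, 1];
    let dists: Vec<usize> = code.iter().map(|w| hamming_distance(&chunk, w)).collect();
    let min = *dists.iter().min().unwrap();
    // Two codewords attain the minimum distance, violating the
    // necessary condition of perfect codes stated above.
    assert_eq!(dists.iter().filter(|&&d| d == min).count(), 2);
}
```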

To this end, we take the following rule in our implementation.

---

## 1. Simple split of given chunks, and assumption on virtual error location

Let $c = [c_1,\dots,c_n] \in \mathrm{GF}(q)^n = \mathrm{GF}(2^8)^n$ be a given data chunk, and let $c_i$ be a byte, i.e., an element of $\mathrm{GF}(2^8)$. We assume an $(n, k)$ RS code $\mathcal{C}$ is employed.

In our implementation, $c$ is first simply split into two subvectors: the left-most $k$ bytes

$$
c_l = [c_1,c_2,\dots,c_k] \in \mathrm{GF}(2^8)^k,
$$

and the right-most $n-k$ bytes

$$
c_r = [c_{k+1},\dots,c_n] \in \mathrm{GF}(2^8)^{n-k}.
$$

Here we regard $c_r$ as a part containing errors and $c_l$ as error-free. This means **we fix virtual error locations at the right-most $n-k$ symbols of the given chunk**.
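The split above can be sketched as follows; this is a minimal illustration with hypothetical names, not this crate's actual API.

```rust
// Split an n-byte chunk into c_l (left-most k bytes, assumed error-free)
// and c_r (right-most n - k bytes, assumed to carry the virtual errors).
fn split_chunk(c: &[u8], k: usize) -> (&[u8], &[u8]) {
    assert!(k <= c.len(), "k must not exceed the chunk length n");
    c.split_at(k) // returns (c_l, c_r)
}

fn main() {
    // (n, k) = (10, 8): an arbitrary 10-byte chunk for illustration.
    let chunk: Vec<u8> = (1..=10).collect();
    let (cl, cr) = split_chunk(&chunk, 8);
    assert_eq!(cl, &[1, 2, 3, 4, 5, 6, 7, 8]); // c_l in GF(2^8)^8
    assert_eq!(cr, &[9, 10]);                  // c_r in GF(2^8)^2
}
```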

## 2. Derivation of a corresponding codeword of $\mathcal{C}$ only from the first $k$ symbols of a chunk $c$

We see that $(n,k)$ RS codes are maximum distance separable (MDS), and hence a codeword can be reconstructed from any error-free $k$ of its $n$ symbols. Thus, as we fixed the virtual error positions above at $c_r$, we can identify $c_l$ with a codeword (i.e., base) $w \in \mathcal{C}$. In other words, there exists an isomorphism $\phi: \mathrm{GF}(2^8)^k \rightarrow \mathcal{C}$ with $\phi(c_l) = w$.

In the deduplication process, we then obtain the codeword $w$ uniquely corresponding to the chunk $c$ from $c_l$; that is, we have $w = \phi(c_l)$. Here, we suppose this bijection is expressed by a **systematic generator matrix** $G = \left[ \begin{array}{cc} I & P \end{array} \right] \in \mathrm{GF}(2^8)^{k \times n}$ of $\mathcal{C}$, where $I$ is a $k \times k$ identity matrix. Namely, we have the codeword $w$ by the following calculation:

$$
w = c_l G
= c_l \left[ \begin{array}{cc} I & P \end{array} \right]
= \left[ \begin{array}{cc} c_l & c_l P \end{array} \right] \in \mathcal{C}.
$$

Then, for the codeword $w$, the deviation, i.e., virtual additive error, $e$ is easily computed in such a way that $c = w + e$ is satisfied:

$$
e = \left[0,\dots,0, c_r - c_l P \right].
$$

This means that the error part, i.e., deviation, is aligned to the right-most $n-k$ symbols of the chunk. Thus the deviation can be expressed as a vector of $n-k$ bytes.
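The derivation of the base and deviation above can be sketched in code. This is a hedged toy example, not this crate's API: the $\mathrm{GF}(2^8)$ multiplication uses the AES polynomial $x^8+x^4+x^3+x+1$ (the actual implementation may use a different primitive polynomial), and the $(n,k)=(4,2)$ parity matrix $P$ is arbitrary.

```rust
// Carry-less "Russian peasant" multiplication in GF(2^8), reducing
// modulo the AES polynomial 0x11b (an assumption for this sketch).
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 { p ^= a; }
        let hi = a & 0x80;
        a <<= 1;
        if hi != 0 { a ^= 0x1b; } // reduce modulo the primitive polynomial
        b >>= 1;
    }
    p
}

// Parity part of the codeword: c_l * P, where P is a k x (n-k) matrix.
fn parity(cl: &[u8], p: &[Vec<u8>]) -> Vec<u8> {
    let nk = p[0].len();
    (0..nk)
        .map(|j| cl.iter().zip(p).fold(0u8, |acc, (&c, row)| acc ^ gf_mul(c, row[j])))
        .collect()
}

fn main() {
    // Toy (n, k) = (4, 2) code with an arbitrary parity matrix P.
    let p = vec![vec![0x02, 0x03], vec![0x01, 0x01]];
    let chunk = [0x10u8, 0x20, 0xaa, 0xbb]; // c = [c_l | c_r]
    let (cl, cr) = chunk.split_at(2);

    // Base: w = [c_l, c_l P]; deviation: e = [0, 0, c_r - c_l P]
    // (subtraction is XOR in characteristic 2).
    let clp = parity(cl, &p);
    let w: Vec<u8> = cl.iter().chain(clp.iter()).copied().collect();
    let dev: Vec<u8> = cr.iter().zip(&clp).map(|(&r, &q)| r ^ q).collect();

    // Check c = w + e on the last n - k symbols: (c_l P) + (c_r - c_l P) = c_r.
    let rec: Vec<u8> = w[2..].iter().zip(&dev).map(|(&a, &b)| a ^ b).collect();
    assert_eq!(rec, cr);
}
```

Note that `dev` (the $n-k$ bytes $c_r - c_l P$) is exactly what Section 3 stores as the deviation.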

## 3. Representation of base and deviation in deduplicated output and dictionary

In the deduplicated data stream and the dictionary of deduplication instances, the base $w$ and the deviation $e$ are expressed as $c_l \in \mathrm{GF}(2^8)^k$ and $c_r - c_l P \in \mathrm{GF}(2^8)^{n-k}$, respectively. This is because $c_l$ can be identified with $w \in \mathcal{C}$ as mentioned above. Also, $c_r - c_l P$ is identified with $e$ thanks to its special structure due to the systematic generator matrix and the fixed positions of virtual errors.

---

## Rationale and more reasonable approach for generic data based on *error-alignment*

Observe that in GD, we can arbitrarily assume *positions of virtual additive errors* in a given chunk to calculate the deviation. In the above method, we simply suppose that the message part $c_l = [c_1,...,c_k]$ of chunk $c$ is error-free. Thus, we can uniquely fix the base, i.e., $w = c_l G \sim c_l$, and the deviation $e = [0,...,0, c_r- c_l P] \sim c_r - c_l P$ as well. Thus we can execute GD by applying this constraint.

However, *since virtual additive errors are fluctuations in given data around its centered base, they would not always be contained in the right-most $n-k$ symbols of an $n$-byte chunk $c$*, even if we carefully choose the parameters $n$ and $k$ according to the given data type/structure. Thus, in order to reasonably apply an $(n,k)$ RS code to more generic data types/structures in GD, **we should also configure the positions of virtual additive errors.**

To this end, we can take the approach of *"error alignment"*, or *"pushing errors aside"*, on a given chunk by precoding data chunks (by applying the `GD.align_error(m: Vec<Vec<u8>>)` method).

The very basic idea of error-alignment is given in the following paper in terms of *reordering high entropy bits*:

> Vestergaard, Rasmus, Daniel E. Lucani, and Qi Zhang. "Generalized deduplication: Lossless compression for large amounts of small IoT data." European Wireless 2019; 25th European Wireless Conference. VDE, 2019.

In our concept, the idea is generalized a bit further by employing a *linear transformation* instead of reordering (permutation). In particular, for a specific data type, we first fix a linear transformation $T: \mathrm{GF}(2^8)^n \rightarrow \mathrm{GF}(2^8)^n$, i.e., multiplication by a nonsingular $n \times n$ matrix $T \in \mathrm{GF}(2^8)^{n \times n}$. Note that the simplest $T$ is typically a permutation matrix that aligns the error symbols to the last positions, as given in the above paper. We then execute the precoding on a given chunk $c$ as follows.

$$
[x_l, x_r] = cT \in \mathrm{GF}(2^8)^n,
$$

where $x_l \in \mathrm{GF}(2^8)^k$ and $x_r \in \mathrm{GF}(2^8)^{n-k}$. Then, the base $w$ and the deviation $e$ are calculated on $[x_l, x_r]$ instead of $[c_l, c_r]$ by the above approach, as follows:

$$
w = x_l G
= [x_l, x_l P],
$$

and

$$
e = [0,...,0, x_r - x_l P],
$$

and $x_l$ is recorded as a base and $x_r - x_l P$ is regarded as a deviation in the deduplicated data stream and the GD dictionary.

We should note that *the linear transformation $T$ pushes the virtual errors contained in $c$ of the specific data form to the right-most $n-k$ symbols of a transformed chunk of length $n$.*

The above operations are simply concatenated into the following:

$$
T = \left[\begin{array}{cc} T_l & T_r \end{array} \right], \quad
w = c T_l G = [c T_l,\; c T_l P], \quad
e = [0,\dots,0,\; c T_r - c T_l P],
$$

where $T_l \in \mathrm{GF}(2^8)^{n \times k}$ and $T_r \in \mathrm{GF}(2^8)^{n \times (n-k)}$.
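For the simplest case where $T$ is a permutation matrix, the precoding $x = cT$ can be sketched as below. This is an assumed illustration of the idea, not the internals of `GD.align_error`; the noisy positions are hypothetical.

```rust
// Apply a permutation (the simplest nonsingular T): position perm[i]
// of the chunk c is moved to position i of the precoded vector x = c T.
fn permute(c: &[u8], perm: &[usize]) -> Vec<u8> {
    perm.iter().map(|&i| c[i]).collect()
}

fn main() {
    // n = 5, k = 3: suppose positions 1 and 3 are expected to fluctuate,
    // so T pushes them into the right-most n - k = 2 slots.
    let perm = [0, 2, 4, 1, 3];
    let chunk = [10u8, 99, 20, 77, 30]; // fluctuations at indices 1 and 3
    let x = permute(&chunk, &perm);
    assert_eq!(x, vec![10, 20, 30, 99, 77]); // x = [x_l | x_r]
}
```

After this precoding, the base/deviation split of the previous sections is applied to $[x_l, x_r]$ instead of $[c_l, c_r]$.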

Since it is known that virtual error positions must be properly configured to achieve better deduplication performance, prior work has considered dynamically adjusting the code length and error positions by splitting a chunk into subchunks that form a specific pattern of fluctuations. In contrast, the error-alignment approach simply aligns the errors in a data chunk to the last positions, so that the chunk can be processed by a single GD instance with a single code parameter.

**The most important factor to achieve better deduplication rate in GD is the estimation of fluctuation/virtual-error patterns contained in given data chunks**.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Jun Kurihara

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
10 changes: 5 additions & 5 deletions README.md
@@ -36,7 +36,7 @@ use rust_gd::*

**NOTE: The compression rate strongly depends on the data alignment and data structure. So you should carefully choose the parameters according to the characteristics of given data**.

### GD with Reed-Solomon code over $\mathrm{GF}(2^8)$

```rust:
use rust_gd::*;
@@ -64,7 +64,7 @@ println!("> Duped size {} bytes", y.len();
assert_eq!(duped, words);
```

In GD with RS codes, an approach of **error-alignment** can be employed by

```rust:
// Linear transformation matrix used for error-alignment. This must be nonsinglar.
@@ -115,11 +115,11 @@ println!("> Duped size {} bytes", y.len();

## Codes in our implementation

Currently, our GD implementation is based only on Hamming and Reed-Solomon (RS) codes. The RS-based GD processes data chunks as a *byte stream*, whereas the Hamming-based GD treats data chunks as a *bit stream*.

For the GD implementation using Hamming codes, the Hamming code of degree $m = 3$, i.e., of code length $n = 2^m - 1 = 7$, works in the internal `libecc` library of error-correcting codes. However, the Hamming code of $m = 3$ cannot be employed as the underlying linear code of Hamming-based GD. This is because the code length of $n = 7$ bits is not sufficient to deduplicate "byte"-based data: to reasonably deduplicate byte-based data, *byte alignment* is needed. So we omit $m = 3$ and consider only the parameters $m \geq 4$.

**Byte alignment**: Our implementation employs an encoding method that chunks message sequences in units of bytes. For example, if the $(15, 11)$ Hamming code is employed, a 2-byte message is divided into two one-byte (= 8-bit) sequences, and $15-8=7$ zero bits are padded to each sequence so that it can be treated as a 15-bit codeword of the Hamming code.
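The byte alignment described above can be sketched as follows; the exact bit ordering in `libecc` may differ, so this is an assumed layout for illustration only.

```rust
// Byte alignment for a (15, 11) Hamming-based GD instance: each 8-bit
// byte is placed in a 15-bit block, with the remaining 15 - 8 = 7 bits
// zero-padded (assumed low-bit-first layout, not the exact libecc order).
fn byte_align(msg: &[u8]) -> Vec<[bool; 15]> {
    msg.iter()
        .map(|&byte| {
            let mut block = [false; 15];
            for i in 0..8 {
                // The low 8 positions carry the byte; the rest stay zero.
                block[i] = (byte >> i) & 1 == 1;
            }
            block
        })
        .collect()
}

fn main() {
    let blocks = byte_align(&[0xff, 0x01]); // a 2-byte message
    assert_eq!(blocks.len(), 2); // one 15-bit block per byte
    assert_eq!(blocks[0].iter().filter(|&&b| b).count(), 8); // 0xff -> 8 ones
    assert_eq!(blocks[1].iter().filter(|&&b| b).count(), 1); // 0x01 -> 1 one
}
```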

## TODO

