Poor compression & quality for difficult-to-compress data #189
Comments
Just a friendly reminder to look into this issue when you get a chance.
Apologies for the delay. I have begun working on this and have reproduced the problem. I'm guessing there is some issue with the coefficient quantization. Looking into it now. In the meantime, I have a question about accuracy gain.
Under what circumstances "should" …
When R = 0, no compressed data is stored, so the best one can hope for is that the compressor (with no knowledge of the input data) by luck reconstructs the decompressed data as a constant x = μ for all x, where μ is the mean of the input data. In this case E = σ, so α = 0; any other constant x yields α < 0. For large R, either E = 0 or E converges to some small positive value, e.g., due to roundoff error in the compression scheme. Hence, α either approaches +∞ (E = 0) or -∞ (E > 0) as R grows.

In between these limit behaviors, we would expect α to rise initially when R is small, indicating that the data is effectively being compressed, until it reaches a plateau where α is constant. On this plateau, each increase in R by one bit/value results in a halving of E, so α remains constant. The height of this plateau is the number of per-value bits of information that the compressor has inferred and that need not be coded; it is a measure of how well the data has been compressed (higher is better). Each additional bit coded is at this point random and so cannot be compressed (one bit of output stored for each bit of input consumed). Eventually, α either goes to infinity or drops linearly.

The plot below illustrates this behavior for some compressors (in particular, SPERR and zfp) using the rather smooth Miranda viscosity data from SDRBench, while other compressors (e.g., MGARD, SZ, TTHRESH) do not reach a stable plateau. Perhaps MGARD would if it weren't for this apparent bug. We have diagnosed the TTHRESH bug and are working on a fix.

Of course, for random data, we do not expect any bits to be inferred, so the plateau usually sits at or below zero; it sits below zero because of overhead (e.g., metadata) and the pigeonhole principle: some outputs must be larger than the inputs in any compression scheme. That's what we're seeing for all compressors in the plot above.

Think of accuracy gain as simply "rotating" (or, more accurately, "shearing") a PSNR vs. R plot so that the expected 6.02 dB/bit slope turns into a slope of zero.
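The quantity described above is easy to compute; here is a minimal Python sketch (the helper name `accuracy_gain` is mine, not part of any of the tools discussed):

```python
import math

def accuracy_gain(sigma, E, R):
    """Accuracy gain alpha = log2(sigma / E) - R.

    sigma: standard deviation of the input data
    E:     L2 error of the reconstruction
    R:     rate in bits per value
    """
    if E == 0:
        return math.inf  # lossless reconstruction
    return math.log2(sigma / E) - R

# On the plateau, each extra bit of rate halves E, leaving alpha unchanged:
sigma = 1.0
a1 = accuracy_gain(sigma, E=0.25, R=2)   # log2(4) - 2 = 0
a2 = accuracy_gain(sigma, E=0.125, R=3)  # log2(8) - 3 = 0
```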
Hi @lindstro. Apologies again for how long this is taking. Fixing the Huffman code ended up entailing a lot of work; see #196. Running on that branch, the accuracy gain plot looks a bit better. Comparison with what you were seeing:
Here's a summary of how the compression happens:
As the error tolerance decreases, the quantized coefficients become larger, and as a result more of them fall outside the range. I think we can say that this range parameter has some effect on the accuracy gain, but it's not the whole story. How's all this looking to you?
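The effect of shrinking the tolerance can be illustrated with a toy uniform scalar quantizer (a simplified sketch only, not MGARD's actual quantizer; the coefficients here are made up):

```python
def quantize(coefficients, tolerance):
    # Uniform scalar quantization with bin width equal to the tolerance.
    # Halving the tolerance doubles the quantized magnitudes, so more of
    # them exceed any fixed representable range.
    return [round(c / tolerance) for c in coefficients]

coeffs = [0.8, -0.32]
q_coarse = quantize(coeffs, 0.1)   # [8, -3]
q_fine = quantize(coeffs, 0.05)    # [16, -6]: magnitudes roughly doubled
```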
Thanks, @ben-e-whitney, for all your work in addressing this issue. The accuracy gain plot looks a bit odd, but at least it doesn't go off a cliff, so that's a welcome improvement. Let me play with this a bit and see how MGARD does on a few data sets. Unfortunately, I cannot get the branch to build.
It looks like my compiler is letting me get away with something a little improper. Once the branch is ready to be merged, I'll have someone with a Mac try building it and iron out any issues that come up. In the meantime, I think it'll compile with …
Thanks, but this leads to another issue:
It would be great if you could ensure MGARD builds on Mac. I've heard from others that there might be linker issues as well.
@ben-e-whitney It's been some time since I saw progress on this issue or #186. Do you have any updates to share?
@ben-e-whitney Any progress on this issue? |
Can someone on the MGARD team please comment on the status of this issue? It appears unresolved in the latest 1.3 release. |
I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map xᵢ₊₁ = 4 xᵢ (1 − xᵢ):
We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between L2 error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as
α = log₂(σ / E) - R.
Then each increment in storage, R, by one bit should result in a halving of E, so that α is essentially constant. The limit behavior is slightly different as R → 0 or E → 0, but over a large range α ought to be constant.
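For anyone wanting to reproduce the setup, here is a sketch that generates the logistic-map data and checks that σ ≈ 1/√8 (the sample size and seed are arbitrary choices of mine; the issue uses a 256³ array):

```python
import math

def logistic_map(n, x0=0.3):
    # x_{i+1} = 4 x_i (1 - x_i); for almost every seed in (0, 1) the iterates
    # follow the arcsine distribution on (0, 1), which has mean 1/2 and
    # standard deviation 1/sqrt(8) ~ 0.3536.
    xs = [x0]
    for _ in range(n - 1):
        x = xs[-1]
        xs.append(4.0 * x * (1.0 - x))
    return xs

# A smaller sample than the 256^3 array, just to check the statistics.
data = logistic_map(100_000)
mean = sum(data) / len(data)
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
```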
Below is a plot of α(R) for MGARD 1.2.0 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here I used a smoothness parameter of 0, which should result in an L2 optimal reconstruction:
```
mgard compress --smoothness 0 --tolerance tolerance --datatype double --shape 256x256x256 --input input.bin --output output.mgard
```

The tolerance was halved for each subsequent data point, starting with tolerance = 1. The plot suggests an odd relationship between R and α, where α is far from stable when R > 17. Is this perhaps a bug in MGARD? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).
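The sweep can be scripted by halving the tolerance at each step; here is a sketch that just assembles the command lines (the file names and step count are illustrative):

```python
def sweep_commands(steps, start=1.0, shape="256x256x256"):
    # Build the mgard invocations for the sweep described above: the
    # tolerance is halved at each step, starting from `start`.
    cmds, tol = [], start
    for _ in range(steps):
        cmds.append(
            "mgard compress --smoothness 0"
            f" --tolerance {tol:g} --datatype double"
            f" --shape {shape} --input input.bin --output output.mgard"
        )
        tol /= 2.0
    return cmds

for cmd in sweep_commands(4):
    print(cmd)
```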