Poor compression & quality for difficult-to-compress data #189
Comments
Just a friendly reminder to look into this issue when you get a chance.
Apologies for the delay. I have begun working on this and have reproduced the problem. I'm guessing there is some issue with the coefficient quantization. Looking into it now. In the meantime, I have a question about accuracy gain.
Under what circumstances "should" …
When R = 0, no compressed data is stored, so the best one can hope for is that the compressor (with no knowledge of the input data) by luck reconstructs the decompressed data as a constant x = μ for all x, where μ is the mean of the input data. In this case E = σ, so α = 0; any other constant x yields α < 0. For large R, either E = 0 or E converges to some small positive value, e.g., due to roundoff error in the compression scheme. Hence, α either approaches +∞ (E = 0) or -∞ (E > 0) as R grows.

In between these limit behaviors, we would expect α to rise initially when R is small, indicating that the data is effectively being compressed, until it reaches a plateau where α is constant. On this plateau, each increase in R by one bit/value results in a halving of E, so α remains constant. The height of this plateau is the number of per-value bits of information that the compressor has inferred and that need not be coded; it is a measure of how well the data has been compressed (higher is better). Each additional bit coded is at this point random and so cannot be compressed (one bit of output stored for each bit of input consumed). Eventually, α either goes to infinity or drops linearly.

The plot below illustrates this behavior for some compressors (in particular, SPERR and zfp) using the rather smooth Miranda viscosity data from SDRBench, while other compressors (e.g., MGARD, SZ, TTHRESH) do not reach a stable plateau. Perhaps MGARD would if it weren't for this apparent bug. We have diagnosed the TTHRESH bug and are working on a fix.

Of course, for random data, we do not expect any bits to be inferred, so the plateau usually sits at or below zero; it sits below zero because of overhead (e.g., metadata) and the pigeonhole principle: some outputs must be larger than the inputs in any compression scheme. That's what we're seeing for all compressors in the plot above.

Think of accuracy gain as simply "rotating" (or, more accurately, "shearing") a PSNR vs. R plot so that the expected 6.02 dB/bit slope turns into a slope of zero.
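The quantity described above is easy to compute; here is a minimal Python sketch (the helper name `accuracy_gain` is mine, not part of any of the tools discussed):

```python
import math

def accuracy_gain(sigma, E, R):
    """Accuracy gain alpha = log2(sigma / E) - R.

    sigma: standard deviation of the input data
    E:     L2 error of the reconstruction
    R:     rate in bits per value
    """
    if E == 0:
        return math.inf  # lossless reconstruction
    return math.log2(sigma / E) - R

# On the plateau, each extra bit of rate halves E, leaving alpha unchanged:
sigma = 1.0
a1 = accuracy_gain(sigma, E=0.25, R=2)   # log2(4) - 2 = 0
a2 = accuracy_gain(sigma, E=0.125, R=3)  # log2(8) - 3 = 0
```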
Hi @lindstro. Apologies again for how long this is taking. Fixing the Huffman code ended up entailing a lot of work; see #196. Running on that branch, the accuracy gain plot looks a bit better. Comparison with what you were seeing:
Here's a summary of how the compression happens:
As the error tolerance decreases, the quantized coefficients become larger, and as a result more of them fall outside the range. I think we can say that this range parameter has some effect on the accuracy gain, but it's not the whole story. How's all this looking to you?
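The effect of shrinking the tolerance can be illustrated with a toy uniform scalar quantizer (a simplified sketch only, not MGARD's actual quantizer; the coefficients here are made up):

```python
def quantize(coefficients, tolerance):
    # Uniform scalar quantization with bin width equal to the tolerance.
    # Halving the tolerance doubles the quantized magnitudes, so more of
    # them exceed any fixed representable range.
    return [round(c / tolerance) for c in coefficients]

coeffs = [0.8, -0.32]
q_coarse = quantize(coeffs, 0.1)   # [8, -3]
q_fine = quantize(coeffs, 0.05)    # [16, -6]: magnitudes roughly doubled
```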
Thanks, @ben-e-whitney, for all your work in addressing this issue. The accuracy gain plot looks a bit odd, but at least it doesn't go off a cliff, so that's a welcome improvement. Let me play with this a bit and see how MGARD does on a few data sets. Unfortunately, I cannot get the branch to build.
It looks like my compiler is letting me get away with something a little improper. Once the branch is ready to be merged, I'll have someone with a Mac try building it and iron out any issues that come up. In the meantime, I think it'll compile with …
Thanks, but this leads to another issue:
It would be great if you could ensure MGARD builds on Mac. I've heard from others that there might be linker issues as well.
@ben-e-whitney It's been some time since I saw progress on this issue or #186. Do you have any updates to share?
@ben-e-whitney Any progress on this issue? |
Can someone on the MGARD team please comment on the status of this issue? It appears unresolved in the latest 1.3 release. |
I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map xᵢ₊₁ = 4 xᵢ (1 − xᵢ):
We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between L2 error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as
α = log₂(σ / E) - R.
Then each increment in storage, R, by one bit should result in a halving of E, so that α is essentially constant. The limit behavior is slightly different as R → 0 or E → 0, but over a large range α ought to be constant.
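For anyone wanting to reproduce the setup, here is a sketch that generates the logistic-map data and checks that σ ≈ 1/√8 (the sample size and seed are arbitrary choices of mine; the issue uses a 256³ array):

```python
import math

def logistic_map(n, x0=0.3):
    # x_{i+1} = 4 x_i (1 - x_i); for almost every seed in (0, 1) the iterates
    # follow the arcsine distribution on (0, 1), which has mean 1/2 and
    # standard deviation 1/sqrt(8) ~ 0.3536.
    xs = [x0]
    for _ in range(n - 1):
        x = xs[-1]
        xs.append(4.0 * x * (1.0 - x))
    return xs

# A smaller sample than the 256^3 array, just to check the statistics.
data = logistic_map(100_000)
mean = sum(data) / len(data)
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
```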
Below is a plot of α(R) for MGARD 1.2.0 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here I used a smoothness parameter of 0, which should result in an L2 optimal reconstruction:
```
mgard compress --smoothness 0 --tolerance tolerance --datatype double --shape 256x256x256 --input input.bin --output output.mgard
```

The tolerance was halved for each subsequent data point, starting with tolerance = 1. The plot suggests an odd relationship between R and α, where α is far from stable when R > 17. Is this perhaps a bug in MGARD? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).
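The sweep can be scripted by halving the tolerance at each step; here is a sketch that just assembles the command lines (the file names and step count are illustrative):

```python
def sweep_commands(steps, start=1.0, shape="256x256x256"):
    # Build the mgard invocations for the sweep described above: the
    # tolerance is halved at each step, starting from `start`.
    cmds, tol = [], start
    for _ in range(steps):
        cmds.append(
            "mgard compress --smoothness 0"
            f" --tolerance {tol:g} --datatype double"
            f" --shape {shape} --input input.bin --output output.mgard"
        )
        tol /= 2.0
    return cmds

for cmd in sweep_commands(4):
    print(cmd)
```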