Throughput and compression ratio of the high-level API #187

Open
hengjiew opened this issue Mar 17, 2022 · 4 comments

@hengjiew

Hello, I am testing the high-level APIs on a V100 GPU (Summit) with a very simple benchmark. The input data is generated from random numbers in (0, 1). I have a few questions, and it would be very helpful if you could shed some light on them.

  1. I get ~0.1 GB/s for both compression and decompression throughput. I am not sure what a typical throughput for MGARD would be, but does this seem low?
  2. The API takes a host/managed pointer. I guess the host-device copies (assuming compression/decompression happens on the GPU) might lower the throughput. Is there a way to pass a device pointer directly and do all the work on the GPU?
  3. With the ABS error bound, if I set the tolerance below 1.0e-4, the data is not compressed but inflated, i.e., the compression ratio drops below 1.0. May I ask what causes this? Is there a lower bound on the tolerance?

Below is the test I am using. Thank you so much!

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <random>
#include <vector>
#include "mgard/compress_x.hpp"

const double eps = std::numeric_limits<double>::epsilon();

int main()
{
  mgard_x::SIZE ni = 128;
  mgard_x::SIZE nj = 128;
  mgard_x::SIZE nk = 16;
  mgard_x::SIZE nCell = ni * nj * nk;
  std::vector<mgard_x::SIZE> shape({ni, nj, nk});

  std::random_device rd;
  std::default_random_engine eng(rd());
  std::uniform_real_distribution<double> gen(0.0, 1.0);

  double *arr_h = new double [nCell];
  for (int i=0; i<nCell; ++i) arr_h[i] = gen(eng);

  mgard_x::Config config;
  config.dev_type = mgard_x::device_type::CUDA;
  config.lossless = mgard_x::lossless_type::Huffman;
  config.uniform_coord_mode = 1;
  config.timing = true;

  void*  compArr = nullptr;
  size_t compSz;
  mgard_x::compress(3, mgard_x::data_type::Double, shape, 1.0e-6, 0.0,
                    mgard_x::error_bound_type::ABS, arr_h, compArr,
                    compSz, config, false);

  double ratio = (double)(nCell*sizeof(double)) / compSz;
  std::cout << "ratio " << ratio << "\n";

  void* decompArr;
  mgard_x::decompress(compArr, compSz, decompArr, config, false);

  double  maxabs = 0.0, avgabs = 0.0;
  double  maxrel = 0.0, avgrel = 0.0;
  //double* output = decompArr;
  for (int i=0; i<nCell; ++i) {
    double err = fabs(arr_h[i] - ((double*)decompArr)[i]);
    maxabs  = std::max(err, maxabs);
    avgabs += err;
    maxrel  = std::max(err/(fabs(arr_h[i])+eps), maxrel);
    avgrel += err / (fabs(arr_h[i]) + eps);
  }
  avgabs /= nCell;
  avgrel /= nCell;
  std::cout << "max abs err " << maxabs << " avg abs err " << avgabs << "\n";
  std::cout << "max rel err " << maxrel << " avg rel err " << avgrel << "\n";

  delete [] arr_h;
  return 0;
}
@ben-e-whitney added the question label on Mar 18, 2022
@JieyangChen7
Collaborator

@hengjiew Sorry about the late reply.
128*128*16*8 bytes (~2 MB) is a small dataset, which makes it hard to fully saturate the GPU and achieve high throughput. Usually, you will need hundreds of megabytes of data to saturate the GPU for compression.
If you want to achieve the best performance and you don't need MGARD to handle metadata, I recommend that you directly use the low-level APIs, which can take device buffers as input. The current version has some issues when you directly call the low-level APIs, but they have been fixed in PR #188. I have also added new examples of using the low-level APIs. You can check out PR #188 and give it a try. Thanks!
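For reference, a rough sketch of that kind of larger-scale timing run, reusing the same high-level mgard_x::compress call from the snippet above on an assumed 512x512x256 grid of doubles (~0.5 GB) and timing it with std::chrono; the grid size, placeholder data, and tolerance here are illustrative assumptions, not recommended settings:

#include <chrono>
#include <iostream>
#include <vector>
#include "mgard/compress_x.hpp"

int main()
{
  // Assumed grid size: 512 x 512 x 256 doubles, roughly 0.5 GB of input.
  std::vector<mgard_x::SIZE> shape({512, 512, 256});
  const size_t nCell = 512ull * 512ull * 256ull;

  double *arr_h = new double[nCell];
  for (size_t i = 0; i < nCell; ++i)
    arr_h[i] = static_cast<double>(i % 1000) / 1000.0;  // smooth-ish placeholder data

  mgard_x::Config config;
  config.dev_type = mgard_x::device_type::CUDA;

  void  *compArr = nullptr;
  size_t compSz  = 0;

  // Note: the first call includes device initialization, so a warm-up run may be
  // needed before measuring a steady-state number.
  const auto t0 = std::chrono::steady_clock::now();
  mgard_x::compress(3, mgard_x::data_type::Double, shape, 1.0e-3, 0.0,
                    mgard_x::error_bound_type::ABS, arr_h, compArr,
                    compSz, config, false);
  const auto t1 = std::chrono::steady_clock::now();

  const double seconds   = std::chrono::duration<double>(t1 - t0).count();
  const double gigabytes = static_cast<double>(nCell * sizeof(double)) / 1.0e9;
  std::cout << "compression throughput ~" << gigabytes / seconds << " GB/s\n";

  delete [] arr_h;
  return 0;
}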

@hengjiew
Author

@JieyangChen7 Thanks for the reply. I will test with that PR. Besides this, is there any guidance on setting the error tolerance? Why does compression stop being effective when I set the tolerance below 1.0e-4? Thanks!

@JieyangChen7
Collaborator

@hengjiew Besides the compressed data itself, the returned buffer also stores information needed to decompress the data. In the GPU parallel implementation, that information can be as large as hundreds of KB to a few MB. So when the input dataset is small, the overhead of storing that information is likely high, which limits the overall compression ratio. When the input data is large, such overhead is negligible.
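A rough back-of-the-envelope illustration of this effect; the ~1 MB of per-buffer metadata and the 10x payload compressibility used below are assumed placeholder values, not measured figures:

#include <cstdio>

int main()
{
  const double metadata_bytes = 1.0e6;  // assumed fixed bookkeeping overhead (~1 MB)
  const double payload_ratio  = 10.0;   // assumed compressibility of the payload alone
  const double inputs_bytes[] = {2.0e6, 2.0e8};  // ~2 MB (the test above) vs. ~200 MB

  for (const double input : inputs_bytes) {
    // Stored size = compressed payload plus the fixed metadata overhead.
    const double stored = input / payload_ratio + metadata_bytes;
    std::printf("input %6.0f MB -> effective ratio %.2f\n",
                input / 1.0e6, input / stored);
  }
  return 0;
}

With these assumed numbers the ~2 MB input caps out below 2x even though the payload shrinks 10x, while the ~200 MB input keeps nearly the full 10x ratio.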

@ben-e-whitney
Collaborator

Another issue: since your data is pointwise random, there's very little 'compressible structure' for MGARD to take advantage of. The algorithm can't do much with noise. You should get a better compression ratio if your data is smoother. Try a random combination of sines and cosines.

for (std::size_t i = 0; i < ni; ++i) {
  const double x = static_cast<double>(i) / ni;
  for (std::size_t j = 0; j < nj; ++j) {
    const double y = static_cast<double>(j) / nj;
    for (std::size_t k = 0; k < nk; ++k) {
      const double z = static_cast<double>(k) / nk;
      // Set `f` to be a function with some smoothness.
      arr_h[(nj * i + j) * nk + k] = f(x, y, z);
    }
  }
}
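
For concreteness, one hypothetical choice of f (not anything prescribed by MGARD, just an assumed smooth test function built from low-frequency sines and cosines):

#include <cmath>

// Hypothetical smooth test function: a low-frequency combination of sines and cosines.
double f(const double x, const double y, const double z) {
  const double pi = 3.14159265358979323846;
  return std::sin(2.0 * pi * x) * std::cos(2.0 * pi * y)
       + 0.5 * std::cos(4.0 * pi * z)
       + 0.25 * std::sin(2.0 * pi * (x + y + z));
}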
