Throughput and compression ratio of the high-level API #187

Open
hengjiew opened this issue Mar 17, 2022 · 4 comments

@hengjiew

Hello, I am testing the high-level APIs on a V100 GPU (Summit) with a very simple benchmark. The input data is generated from random numbers in (0, 1). I have a few questions, and it would be very helpful if you could shed some light on them.

  1. I get ~0.1 GB/s for both compression and decompression throughput. I am not sure what a typical throughput for MGARD would be, but does this seem low?
  2. The API takes a host/managed pointer. I guess the host-device copies (assuming compression/decompression happens on the GPU) might lower the throughput. Is there a way to pass a device pointer directly and do all the work on the GPU?
  3. With the ABS error bound, if I set the tolerance below 1.0e-4, the data is not compressed but inflated, i.e., the compression ratio drops below 1.0. May I ask what causes this? Is there a lower bound on the tolerance?

Below is the test I am using. Thank you so much!

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <random>
#include <vector>
#include "mgard/compress_x.hpp"

const double eps = std::numeric_limits<double>::epsilon();

int main()
{
  mgard_x::SIZE ni = 128;
  mgard_x::SIZE nj = 128;
  mgard_x::SIZE nk = 16;
  mgard_x::SIZE nCell = ni * nj * nk;
  std::vector<mgard_x::SIZE> shape({ni, nj, nk});

  std::random_device rd;
  std::default_random_engine eng(rd());
  std::uniform_real_distribution<double> gen(0.0, 1.0);

  double *arr_h = new double [nCell];
  for (int i=0; i<nCell; ++i) arr_h[i] = gen(eng);

  mgard_x::Config config;
  config.dev_type = mgard_x::device_type::CUDA;
  config.lossless = mgard_x::lossless_type::Huffman;
  config.uniform_coord_mode = 1;
  config.timing = true;

  void*  compArr = nullptr;
  size_t compSz;
  mgard_x::compress(3, mgard_x::data_type::Double, shape, 1.0e-6, 0.0,
                    mgard_x::error_bound_type::ABS, arr_h, compArr,
                    compSz, config, false);

  double ratio = (double)(nCell*sizeof(double)) / compSz;
  std::cout << "ratio " << ratio << "\n";

  void* decompArr;
  mgard_x::decompress(compArr, compSz, decompArr, config, false);

  double  maxabs = 0.0, avgabs = 0.0;
  double  maxrel = 0.0, avgrel = 0.0;
  //double* output = decompArr;
  for (int i=0; i<nCell; ++i) {
    double err = fabs(arr_h[i] - ((double*)decompArr)[i]);
    maxabs  = std::max(err, maxabs);
    avgabs += err;
    maxrel  = std::max(err/(fabs(arr_h[i])+eps), maxrel);
    avgrel += err / (fabs(arr_h[i]) + eps);
  }
  avgabs /= nCell;
  avgrel /= nCell;
  std::cout << "max abs err " << maxabs << " avg abs err " << avgabs << "\n";
  std::cout << "max rel err " << maxrel << " avg rel err " << avgrel << "\n";

  delete [] arr_h;
  return 0;
}
@ben-e-whitney added the question label on Mar 18, 2022
@JieyangChen7
Collaborator

@hengjiew Sorry about the late reply.
128*128*16*8 bytes (~2 MB) is a small dataset, which makes it hard to fully saturate the GPU and achieve high throughput. Usually, you will need hundreds of megabytes of data to saturate the GPU for compression.
If you want to achieve the best performance and you don't need MGARD to handle metadata, I recommend that you directly use the low-level APIs, which can take device buffers as input. The current version has some issues when you directly call the low-level APIs, but they have been fixed in PR #188. I have also added new examples of using the low-level APIs. You can check out PR #188 and give it a try. Thanks!
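For reference, a rough sketch of that kind of larger-scale timing run, reusing the same high-level mgard_x::compress call from the snippet above on an assumed 512x512x256 grid of doubles (~0.5 GB) and timing it with std::chrono; the grid size, placeholder data, and tolerance here are illustrative assumptions, not recommended settings:

#include <chrono>
#include <iostream>
#include <vector>
#include "mgard/compress_x.hpp"

int main()
{
  // Assumed grid size: 512 x 512 x 256 doubles, roughly 0.5 GB of input.
  std::vector<mgard_x::SIZE> shape({512, 512, 256});
  const size_t nCell = 512ull * 512ull * 256ull;

  double *arr_h = new double[nCell];
  for (size_t i = 0; i < nCell; ++i)
    arr_h[i] = static_cast<double>(i % 1000) / 1000.0;  // smooth-ish placeholder data

  mgard_x::Config config;
  config.dev_type = mgard_x::device_type::CUDA;

  void  *compArr = nullptr;
  size_t compSz  = 0;

  // Note: the first call includes device initialization, so a warm-up run may be
  // needed before measuring a steady-state number.
  const auto t0 = std::chrono::steady_clock::now();
  mgard_x::compress(3, mgard_x::data_type::Double, shape, 1.0e-3, 0.0,
                    mgard_x::error_bound_type::ABS, arr_h, compArr,
                    compSz, config, false);
  const auto t1 = std::chrono::steady_clock::now();

  const double seconds   = std::chrono::duration<double>(t1 - t0).count();
  const double gigabytes = static_cast<double>(nCell * sizeof(double)) / 1.0e9;
  std::cout << "compression throughput ~" << gigabytes / seconds << " GB/s\n";

  delete [] arr_h;
  return 0;
}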

@hengjiew
Author

@JieyangChen7 Thanks for the reply. I will test with that PR. Besides this, is there any guidance on setting the error tolerance? Why does compression stop being effective when I set the tolerance below 1.0e-4? Thanks!

@JieyangChen7
Collaborator

@hengjiew Besides the compressed data itself, the returned buffer also stores information needed to decompress the data. In the GPU parallel implementation, that information can be as large as hundreds of KB to a few MB. So when the input dataset is small, the overhead of storing that information is likely high, which limits the overall compression ratio. When the input data is large, such overhead is negligible.
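A rough back-of-the-envelope illustration of this effect; the ~1 MB of per-buffer metadata and the 10x payload compressibility used below are assumed placeholder values, not measured figures:

#include <cstdio>

int main()
{
  const double metadata_bytes = 1.0e6;  // assumed fixed bookkeeping overhead (~1 MB)
  const double payload_ratio  = 10.0;   // assumed compressibility of the payload alone
  const double inputs_bytes[] = {2.0e6, 2.0e8};  // ~2 MB (the test above) vs. ~200 MB

  for (const double input : inputs_bytes) {
    // Stored size = compressed payload plus the fixed metadata overhead.
    const double stored = input / payload_ratio + metadata_bytes;
    std::printf("input %6.0f MB -> effective ratio %.2f\n",
                input / 1.0e6, input / stored);
  }
  return 0;
}

With these assumed numbers the ~2 MB input caps out below 2x even though the payload shrinks 10x, while the ~200 MB input keeps nearly the full 10x ratio.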

@ben-e-whitney
Collaborator

Another issue: since your data is pointwise random, there's very little 'compressible structure' for MGARD to take advantage of. The algorithm can't do much with noise. You should get a better compression ratio if your data is smoother. Try a random combination of sines and cosines.

for (std::size_t i = 0; i < ni; ++i) {
  const double x = static_cast<double>(i) / ni;
  for (std::size_t j = 0; j < nj; ++j) {
    const double y = static_cast<double>(j) / nj;
    for (std::size_t k = 0; k < nk; ++k) {
      const double z = static_cast<double>(k) / nk;
      // Set `f` to be a function with some smoothness.
      arr_h[(nj * i + j) * nk + k] = f(x, y, z);
    }
  }
}
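
For concreteness, one hypothetical choice of f (not anything prescribed by MGARD, just an assumed smooth test function built from low-frequency sines and cosines):

#include <cmath>

// Hypothetical smooth test function: a low-frequency combination of sines and cosines.
double f(const double x, const double y, const double z) {
  const double pi = 3.14159265358979323846;
  return std::sin(2.0 * pi * x) * std::cos(2.0 * pi * y)
       + 0.5 * std::cos(4.0 * pi * z)
       + 0.25 * std::sin(2.0 * pi * (x + y + z));
}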
