Reference
- Professional CUDA C Programming chapter 8
- Libraries
library | domain |
---|---|
cuFFT | FFT |
cuBLAS | BLAS 1,2,3 |
CULA | Linear Algebra |
cuSPARSE | Sparse Linear Algebra |
CUSP | Sparse Linear Algebra and Graph Computations |
cuRAND | Random Number Generation |
Thrust | Parallel Algorithms and Data Structures |
- Common workflow for calling a CUDA library
  - Create a library handle.
    - The handle contains contextual library information such as the format of data structures used, the devices used for computation, and other environmental data.
    - You must allocate and initialize the handle before making any library calls.
  - Allocate device memory for the input.
  - Convert the input to a library-supported format.
  - Copy host data to device memory for the library.
    - This is analogous to cudaMemcpy, though in many cases a library-specific function is used.
    - When transferring a vector from the host to the device in a cuBLAS-based application, cublasSetVector should be used. Under the hood it makes strided calls to cudaMemcpy.
  - Configure the library.
    - Configure through function parameters.
    - Configure through the library handle.
  - Execute the library call.
  - Retrieve the result from device memory.
  - Convert the result back to the original format.
  - Release CUDA resources (the handle).
    - There is some overhead in allocating and releasing resources, so it is better to reuse resources across multiple invocations of CUDA library calls when possible.
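
The strided transfer mentioned for cublasSetVector can be pictured with a host-only model. The following is a minimal sketch in plain C++, not cuBLAS code: `copy_strided` and `gather_component` are hypothetical helpers introduced here for illustration, and they only mimic how stride arguments select elements. No device memory is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of a strided vector copy (not a cuBLAS call).
// Copies n elements so that dst[i * incy] = src[i * incx] for i in [0, n).
template <typename T>
void copy_strided(std::size_t n, const T* src, std::size_t incx,
                  T* dst, std::size_t incy) {
    assert(incx > 0 && incy > 0);  // this sketch only handles positive strides
    for (std::size_t i = 0; i < n; ++i) {
        dst[i * incy] = src[i * incx];
    }
}

// Pulls one component out of interleaved records, e.g. all y values from
// x0 y0 z0 x1 y1 z1 ... (components = 3, which = 1).
std::vector<float> gather_component(const std::vector<float>& interleaved,
                                    std::size_t components,
                                    std::size_t which) {
    assert(components > 0 && which < components);
    assert(interleaved.size() % components == 0);
    const std::size_t n = interleaved.size() / components;
    std::vector<float> out(n);
    if (n == 0) {
        return out;  // nothing to copy
    }
    // Source stride = record width, destination stride = 1 (packed output).
    copy_strided(n, interleaved.data() + which, components,
                 out.data(), std::size_t{1});
    return out;
}
```

Calling `gather_component` on `{1, 2, 3, 4, 5, 6}` with three components and `which = 1` picks elements 1 and 4 of the buffer, which is the kind of non-contiguous access a strided transfer routine has to handle.
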
cuSPARSE includes a range of general-purpose sparse linear algebra routines.
- Level 1 functions operate exclusively on dense and sparse vectors.
- Level 2 functions operate on sparse matrices and dense vectors.
- Level 3 functions operate on sparse matrices and dense matrices.
- function call
- data format
- data format conversion
- Common pitfalls
  - Ensure proper matrix and vector formatting.
    - An incorrect format can cause a segfault or a validation error.
  - Check that the format conversion succeeded.
    - Automated full-dataset verification might be possible by performing the inverse format conversion back to the native data format and verifying that the twice-converted values are equivalent to the original values.
  - Scalar parameters are passed by reference.
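
The round-trip check described above can be sketched on the host. This example assumes CSR (compressed sparse row) as the sparse format and a row-major dense matrix as the native format. The `Csr` struct and the three functions are hypothetical names for illustration, not cuSPARSE APIs; in a real application the conversions would be done by the library's own routines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage with zero-based indices.
struct Csr {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<float> values;  // nonzero values, stored row by row
    std::vector<int> col_idx;   // column index of each nonzero
    std::vector<int> row_ptr;   // row r spans [row_ptr[r], row_ptr[r + 1])
};

// Converts a row-major dense matrix to CSR, dropping exact zeros.
Csr dense_to_csr(const std::vector<float>& dense,
                 std::size_t rows, std::size_t cols) {
    assert(dense.size() == rows * cols);
    Csr m;
    m.rows = rows;
    m.cols = cols;
    m.row_ptr.reserve(rows + 1);
    m.row_ptr.push_back(0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = dense[r * cols + c];
            if (v != 0.0f) {
                m.values.push_back(v);
                m.col_idx.push_back(static_cast<int>(c));
            }
        }
        m.row_ptr.push_back(static_cast<int>(m.values.size()));
    }
    return m;
}

// Inverse conversion: expands CSR back to a row-major dense matrix.
std::vector<float> csr_to_dense(const Csr& m) {
    assert(m.row_ptr.size() == m.rows + 1);
    std::vector<float> dense(m.rows * m.cols, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (int k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k) {
            const std::size_t pos = static_cast<std::size_t>(k);
            const std::size_t c = static_cast<std::size_t>(m.col_idx[pos]);
            dense[r * m.cols + c] = m.values[pos];
        }
    }
    return dense;
}

// Verifies that the twice-converted matrix equals the original.
bool round_trip_matches(const std::vector<float>& dense,
                        std::size_t rows, std::size_t cols) {
    return csr_to_dense(dense_to_csr(dense, rows, cols)) == dense;
}
```

Because the comparison uses exact equality, this check suits format conversions that only move values around. A matrix containing NaN would fail it even when the conversion is correct.
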
cuBLAS includes CUDA ports of all functions in the standard Basic Linear Algebra Subprograms (BLAS) library for Levels 1, 2, and 3.
For compatibility with the original Fortran BLAS library, cuBLAS uses column-major storage.
- cuBLAS Level 1 contains vector-only operations like vector addition.
- cuBLAS Level 2 contains matrix-vector operations like matrix-vector multiplication.
- cuBLAS Level 3 contains matrix-matrix operations like matrix-matrix multiplication.
Of the two APIs, the legacy cuBLAS API is deprecated; use the current cuBLAS API.
- Data transfer
  - Use the cuBLAS-specific routines cublasSetVector/cublasGetVector and cublasSetMatrix/cublasGetMatrix to transfer data between the host and device. Although you can think of these specialized functions as wrappers around cudaMemcpy, they are well optimized to transfer both strided and unstrided data.
- Note
  - If you commonly use row-major programming languages (such as C or C++), development with cuBLAS can require extra attention to detail.
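
The row-major versus column-major difference comes down to how a (row, column) pair maps to a linear offset. The sketch below is host-only C++ with hypothetical helper names (`idx_col_major`, `row_to_col_major`); it shows the column-major offset formula and one way to repack a row-major matrix before handing it to a column-major library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (row, col) in column-major storage.
// ld is the leading dimension: the number of rows allocated per column.
inline std::size_t idx_col_major(std::size_t row, std::size_t col,
                                 std::size_t ld) {
    return col * ld + row;
}

// Repacks a row-major rows x cols matrix into column-major order.
std::vector<float> row_to_col_major(const std::vector<float>& a,
                                    std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<float> out(a.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Row-major source offset is r * cols + c.
            out[idx_col_major(r, c, rows)] = a[r * cols + c];
        }
    }
    return out;
}
```

For a 2 x 3 matrix stored row-major as `{1, 2, 3, 4, 5, 6}`, the column-major layout is `{1, 4, 2, 5, 3, 6}`: each column becomes contiguous in memory.
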
cuFFT includes methods for performing fast Fourier transforms (FFTs) and their inverse.
An FFT is a transformation in signal processing that converts a signal from the time domain to the frequency domain. An inverse FFT does the opposite.
cuFFT consists of two parts:
- the core, high-performance cuFFT library
- the portability library, cuFFTW
cuFFTW is designed to maximize portability from existing code that uses FFTW. A wide range of the functions in the FFTW library are identically supported in cuFFTW. In addition, the cuFFTW library assumes all inputs passed are in host memory and handles all of the allocation (cudaMalloc) and transfers (cudaMemcpy) for the user. Although this might lead to suboptimal performance, it greatly accelerates the porting process.
In cuFFT, handles are called plans.
Supported input/output data types:
- complex to complex
- real to complex
- complex to real
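
To make the forward and inverse transforms concrete, here is a naive O(n^2) discrete Fourier transform in host-only C++. It is a teaching sketch, not how cuFFT computes anything, and the names `dft`, `forward`, and `inverse` are hypothetical. It also illustrates that a real-valued input produces a conjugate-symmetric spectrum, which is why real-to-complex transforms can store only about half of the output.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive unnormalized DFT: out[k] = sum_t x[t] * exp(sign * 2*pi*i * k*t / n).
// sign = -1 gives the forward transform, sign = +1 the inverse direction.
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    assert(sign == 1 || sign == -1);
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            // Reduce k*t modulo n to keep the angle small and accurate.
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>((k * t) % n) /
                                 static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Time domain -> frequency domain.
std::vector<cd> forward(const std::vector<cd>& x) {
    return dft(x, -1);
}

// Frequency domain -> time domain, scaled by 1/n here so that
// inverse(forward(x)) recovers x. Libraries that return unnormalized
// results leave this scaling to the caller.
std::vector<cd> inverse(const std::vector<cd>& spectrum) {
    std::vector<cd> x = dft(spectrum, +1);
    for (cd& v : x) {
        v /= static_cast<double>(spectrum.size());
    }
    return x;
}
```

For the real input `{1, 2, 3, 4}`, the spectrum is `{10, -2+2i, -2, -2-2i}`: the last element is the complex conjugate of the second, so it carries no extra information.
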
cuRAND includes methods for rapid random number generation using the GPU.
Two kinds of random number generators are supported:
- pseudo-random RNGs (PRNG)
  - A PRNG uses an RNG algorithm to produce a sequence of random numbers where each value has an equal probability of being anywhere along the range of valid values for the storage type that the RNG uses.
  - When statistically independent, random-looking samples are required, a PRNG is a better choice than a QRNG.
- quasi-random RNGs (QRNG)
  - A QRNG makes an effort to fill the range of the output type evenly. Hence, if the last value sampled by a QRNG was 2, then P(2) for the next value has actually decreased. The samples in a QRNG's sequence are not statistically independent.
  - QRNGs are useful in exploring spaces that are largely not understood. They guarantee a more even sampling of a multi-dimensional space than PRNGs, and they might also find features that a regular sampling interval will miss.
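
The difference in coverage can be seen with a small host-only experiment. The sketch below uses the base-2 van der Corput sequence as a stand-in quasi-random generator and `std::mt19937` as a pseudo-random one. Both choices are assumptions made for illustration and are not claimed to match the generators cuRAND ships. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// n-th element (n >= 1) of the base-2 van der Corput sequence in [0, 1):
// the binary digits of n are mirrored around the radix point.
double van_der_corput(unsigned n) {
    double value = 0.0;
    double weight = 0.5;
    while (n > 0) {
        if (n & 1u) {
            value += weight;
        }
        n >>= 1;
        weight *= 0.5;
    }
    return value;
}

// First `count` quasi-random samples.
std::vector<double> quasi_samples(unsigned count) {
    std::vector<double> out;
    out.reserve(count);
    for (unsigned i = 1; i <= count; ++i) {
        out.push_back(van_der_corput(i));
    }
    return out;
}

// `count` pseudo-random samples drawn from a uniform distribution on [0, 1).
std::vector<double> pseudo_samples(unsigned count, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> out;
    out.reserve(count);
    for (unsigned i = 0; i < count; ++i) {
        out.push_back(dist(gen));
    }
    return out;
}

// Counts how many of `bins` equal-width bins on [0, 1) received a sample.
std::size_t bins_hit(const std::vector<double>& samples, std::size_t bins) {
    assert(bins > 0);
    std::vector<bool> hit(bins, false);
    for (double s : samples) {
        const std::size_t b = static_cast<std::size_t>(s * static_cast<double>(bins));
        hit[std::min(b, bins - 1)] = true;  // clamp guards the upper edge
    }
    return static_cast<std::size_t>(std::count(hit.begin(), hit.end(), true));
}
```

With 16 samples and 16 bins, the quasi-random sequence lands in every bin, whereas independent pseudo-random draws typically leave some bins empty and hit others more than once. The even coverage comes at the cost of independence: each quasi-random value depends on its position in the sequence.
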