Note: This repository holds the legacy cuFINUFFT codebase. Further development will take place in the FINUFFT repository. Please direct any issues or pull requests to that repository.
cuFINUFFT is a very efficient GPU implementation of the 1-, 2-, and 3-dimensional nonuniform FFT of types 1 and 2, in single and double precision, based on the CPU code FINUFFT.
cuFINUFFT introduces several algorithmic innovations, including load-balancing, bin-sorting for cache-aware access, and use of fast shared memory. Our tests show a speedup of up to 10x over FINUFFT on modern hardware, and of up to 100x over other established GPU NUFFT codes.
The linear transforms it can perform may be summarized as follows: type 1 maps nonuniform data (locations and corresponding strengths) to the uniformly spaced coefficients of a Fourier series (or its bi- or tri-variate generalization, according to dimension). Type 2 is the adjoint of type 1, i.e., it maps in the reverse direction, from Fourier coefficients to values at the nonuniform locations. However, note that type 2 and type 1 are not generally each other's inverse, unlike in the FFT case! These transforms are performed to a user-prescribed tolerance, at close-to-FFT speeds; under the hood, this involves detailed kernel design, custom spreading/interpolation stages, and plain FFTs performed by cuFFT. See the documentation for FINUFFT for a full mathematical description of the transforms and their applications to signal processing, imaging, and scientific computing.
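To make the two transform types concrete, here is a minimal CPU-only sketch that evaluates them in 1D by direct summation. It is not how cuFINUFFT computes them (the library uses spreading/interpolation plus cuFFT), and the function names are hypothetical. It assumes the FINUFFT conventions that the points lie on a 2π-periodic interval and that the integer modes are ordered from -N/2 upward; please check the FINUFFT documentation for the exact definitions.

```python
import cmath


def mode_indices(n_modes):
    """Integer frequencies k, ordered from -n_modes//2 upward (assumed ordering)."""
    return range(-(n_modes // 2), (n_modes + 1) // 2)


def type1_direct(x, c, n_modes, iflag=1):
    """Type 1: strengths c at points x -> n_modes Fourier coefficients."""
    return [sum(cj * cmath.exp(1j * iflag * k * xj) for xj, cj in zip(x, c))
            for k in mode_indices(n_modes)]


def type2_direct(x, f, iflag=-1):
    """Type 2: Fourier coefficients f -> values at the points x."""
    ks = mode_indices(len(f))
    return [sum(fk * cmath.exp(1j * iflag * k * xj) for k, fk in zip(ks, f))
            for xj in x]


def inner(a, b):
    """Complex inner product <a, b> = sum_i a_i * conj(b_i)."""
    return sum(ai * bi.conjugate() for ai, bi in zip(a, b))


# Arbitrary illustrative data: 3 nonuniform points, 4 modes.
x = [0.3, 1.7, 4.0]
c = [1 + 0j, 0.5j, -2 + 0j]
f = [0.5 + 0j, -1j, 2 + 1j, 0.25 + 0j]

# Adjointness: <T1 c, f> equals <c, T2 f> when the two iflag signs are opposite.
adjoint_gap = abs(inner(type1_direct(x, c, 4), f) - inner(c, type2_direct(x, f)))

# Not an inverse: type 2 applied after type 1 does not return c, even after scaling by 1/N.
roundtrip = [v / 4 for v in type2_direct(x, type1_direct(x, c, 4))]
inverse_gap = max(abs(r - cj) for r, cj in zip(roundtrip, c))

print(adjoint_gap < 1e-12, inverse_gap > 1e-3)  # True True
```

Direct summation costs O(NM) operations for N modes and M points, which is why it is only useful as a correctness check on small problems.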
Note: We are currently adapting the cuFINUFFT interface to more closely match that of FINUFFT. This will likely break code that depends on the current interface once the next release is published. At that point we will publish a migration guide detailing the exact changes to the interfaces.
Main developer: Yu-hsuan Melody Shih (NYU). Other main contributors: Garrett Wright (Princeton), Joakim Andén (KTH/Flatiron), Johannes Blaschke (LBNL), Alex Barnett (Flatiron). See GitHub for the full list of contributors. This project came out of Melody's 2018 and 2019 summer internships at the Flatiron Institute, advised by CCM project leader Alex Barnett.
Note: most Python users may skip to the Python Package section first, and consider installing from source only if that solution is not adequate for their needs. Note that 1D is not yet available in Python. Here is the C++ install process:
- Make sure you have the prerequisites: a C++ compiler (e.g. g++) and a recent CUDA installation (nvcc).
- Get the code:
    git clone https://github.com/flatironinstitute/cufinufft.git
- Review the Makefile: if you need to customize build settings, create and edit a make.inc. Example:
  - To override the standard CUDA /usr/local/cuda location, your make.inc should contain: CUDA_ROOT=/your/path/to/cuda
  - For examples, see one for IBM machines (targets/make.inc.power9), and another for the Courant Institute cluster (sites/make.inc.CIMS).
- Compile:
    make all -j
  (this takes several minutes)
- Run test codes:
    make check
  which should complete in less than a minute without error.
- You may then want to try individual test drivers, such as
    bin/cufinufft2d1_test_32 2 1e3 1e3 1e7 1e-3
  which tests the single-precision 2D type 1. Most such executables document their usage when called with no arguments.
Please see the codes in examples/ to see how to call cuFINUFFT and link to it from C++/CUDA, and how to call it from Python.
The default use of the cuFINUFFT API has four stages that match those of the plan interface to FINUFFT (in turn modeled on those of, e.g., FFTW or NFFT). Here they are from C++:
- Plan one transform, or a set of transforms sharing nonuniform points, specifying overall dimension, numbers of Fourier modes, etc.:
    ier = cufinufft_makeplan(type, dim, nmodes, iflag, ntransf, tol, maxbatchsize, &plan, NULL);
- Set the locations of nonuniform points from the arrays x, y, and possibly z:
    ier = cufinufft_setpts(M, x, y, z, 0, NULL, NULL, NULL, plan);
  (Note that here arguments 5-8 are reserved for a future type 3 implementation, to match the FINUFFT interface.)
- Perform the transform(s) using these nonuniform point arrays, which reads strengths c and writes into modes fk for type 1, or vice versa for type 2:
    ier = cufinufft_execute(c, fk, plan);
- Destroy the plan (clean up):
    ier = cufinufft_destroy(plan);
In each case the returned integer ier is a status indicator.
Here is the full C++ documentation.
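The reason for splitting the work into plan, set points, execute, and destroy is that anything depending only on the nonuniform points can be done once and reused across many executions. The following CPU-only toy illustrates that lifecycle in 1D. The class ToyPlan1D and its methods are hypothetical and exist only for this illustration; they are not the cuFINUFFT C++ or Python API, and the direct summation inside stands in for the real spreading/FFT pipeline.

```python
import cmath


class ToyPlan1D:
    """CPU-only stand-in for the four-stage plan lifecycle (NOT the cuFINUFFT API)."""

    def __init__(self, nufft_type, n_modes, iflag):
        # Stage 1 (plan): fix the transform type, mode count, and sign.
        if nufft_type not in (1, 2):
            raise ValueError("only types 1 and 2 are sketched here")
        self.nufft_type = nufft_type
        self.ks = list(range(-(n_modes // 2), (n_modes + 1) // 2))
        self.iflag = 1 if iflag >= 0 else -1
        self.n_pts = 0
        self.phases = None  # filled in by set_pts

    def set_pts(self, x):
        # Stage 2 (set points): precompute everything that depends only on x.
        self.n_pts = len(x)
        self.phases = [[cmath.exp(1j * self.iflag * k * xj) for xj in x]
                       for k in self.ks]

    def execute(self, data):
        # Stage 3 (execute): reuse the stored point data for each new input.
        if self.phases is None:
            raise RuntimeError("call set_pts before execute")
        if self.nufft_type == 1:  # strengths c -> modes fk
            return [sum(p * cj for p, cj in zip(row, data)) for row in self.phases]
        # Type 2: modes fk -> values at the points.
        return [sum(self.phases[i][j] * data[i] for i in range(len(self.ks)))
                for j in range(self.n_pts)]

    def destroy(self):
        # Stage 4 (destroy): release what the plan holds.
        self.phases = None
        self.n_pts = 0


plan = ToyPlan1D(nufft_type=1, n_modes=4, iflag=1)
plan.set_pts([0.0, 1.0, 2.5])
fk_a = plan.execute([1.0, 0.0, 0.0])  # unit strength at x = 0: every mode equals 1
fk_b = plan.execute([0.0, 2.0, 0.0])  # same points, new strengths, no re-planning
plan.destroy()
```

In the real library the per-point work done at the set-points stage is GPU-side preprocessing rather than a table of phases, but the reuse pattern is the same: set the points once, then execute as many times as needed.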
It is also possible to change advanced options by changing the last NULL argument of the cufinufft_makeplan call to a pointer to an options struct, opts. This struct should first be initialized via
    cufinufft_default_opts(type, dim, &opts);
before the user changes any fields. For examples of this advanced usage, see test/cufinufft*.cu
It is up to the user to decide how exactly to link or otherwise install the libraries produced in lib. If you plan to use the Python wrapper you will minimally need to extend your LD_LIBRARY_PATH, such as with
    export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}
or a more permanent installation path of your choosing.
If you would like to always have this installation in your library path, you can add it to your shell rc with something like the following:
    printf '\n# cufinufft library path\nexport LD_LIBRARY_PATH=%s/lib:${LD_LIBRARY_PATH}\n' "${PWD}" >> ~/.bashrc
Because CUDA itself has similar library/path requirements, the user is expected to be somewhat familiar with them. If not, please ask; we might be able to help.
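If you launch your code from a Python script rather than from a shell, the same path extension can be built programmatically. Below is a minimal sketch; the helper name prepend_library_dir is hypothetical, and the directory shown is only a placeholder. It drops empty and duplicate entries so that an unset LD_LIBRARY_PATH does not leave a dangling colon.

```python
import os


def prepend_library_dir(lib_dir, current=None):
    """Return an LD_LIBRARY_PATH value with lib_dir placed first."""
    if current is None:
        current = os.environ.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier copy of lib_dir.
    entries = [e for e in current.split(":") if e and e != lib_dir]
    return ":".join([lib_dir] + entries)


print(prepend_library_dir("/path/to/cufinufft/lib", ""))
# /path/to/cufinufft/lib
print(prepend_library_dir("/path/to/cufinufft/lib", "/usr/lib:/path/to/cufinufft/lib"))
# /path/to/cufinufft/lib:/usr/lib
```

The returned string is meant to go into the environment of a child process (for example via the env argument of subprocess.run), since the dynamic loader typically reads LD_LIBRARY_PATH when a process starts.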
For those installing from source, this code comes with a Python wrapper module cufinufft, which depends on pycuda. Once you have successfully installed and tested the CUDA library, you may run make python to manually install the additional Python package.
General Python users, or Python software packages which would like to automatically depend on cufinufft using setuptools, may use a precompiled binary distribution. On supported systems this avoids installing from source and managing libraries altogether.
Binary distributions are specific to both hardware and software. We currently provide binary wheels targeting Linux systems covered by manylinux2010 for CUDA 10 and later with compatible GPUs. If you have such a system, you may run:
    pip install cufinufft
For other cases, it should be possible to build the Python wrapper from source.
If you want to test/benchmark the spreader and interpolator (the performance-critical components of the NUFFT algorithm) without building the whole library, do this with make checkspread.
In general for make tasks, it's possible to specify the target architecture using the target variable, e.g.:
    make target=power9 -j
By default, the makefile assumes the x86_64 architecture. We've included site-specific configurations -- such as Cori at NERSC, or Summit at OLCF -- which can be accessed using the site variable, e.g.:
    make site=olcf_summit
The currently supported targets and sites are:
- Sites
  - NERSC Cori (site=nersc_cori)
  - NERSC Cori GPU (site=nersc_cgpu)
  - OLCF Summit (site=olcf_summit) -- automatically sets target=power9
  - CIMS (site=CIMS)
  - Flatiron Institute, rusty cluster GPU node (site=FI)
- Targets
  - Default (x86_64) -- do not specify the target variable
  - IBM power9 (target=power9)
A general note about expanding the platform support: targets should contain settings that are specific to a compiler/hardware architecture, whereas sites should contain settings that are specific to an HPC facility's software environment. The site-specific script is loaded before the target-specific settings, hence it is possible to specify a target in a site make.inc.* (but not the other way around).
- TIME - timing for each stage. Enable by adding "-DTIME" to NVCCFLAGS.
- SPREADTIME - more detailed timing from spreading and interpolation
- DEBUG - debug mode, which outputs the results of all intermediate stages
- If you are interested in optimizing for GPU Compute Capability, you may want to specify NVARCH=-arch=sm_XX in your make.inc to reduce compile times, or for other performance reasons. See Matching SM Architectures.
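The XX in sm_XX is the GPU's compute capability with the dot removed, so compute capability 7.0 corresponds to sm_70. As a small illustration, the hypothetical helper below (not part of cuFINUFFT) formats the corresponding make.inc line from a major/minor pair; how you look up the capability of your own GPU is left to NVIDIA's documentation.

```python
def nvarch_line(major, minor):
    """Format a make.inc NVARCH line for compute capability major.minor."""
    if major < 1 or not 0 <= minor <= 9:
        raise ValueError("expected a compute capability such as 7.0 or 8.6")
    return f"NVARCH=-arch=sm_{major}{minor}"


print(nvarch_line(7, 0))  # NVARCH=-arch=sm_70
print(nvarch_line(8, 6))  # NVARCH=-arch=sm_86
```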
- 1D version is close to finished (needs vectorized testers and Py interfaces)
- Type 3 transforms (which are quite tricky) as in FINUFFT are in progress (at least in 3D) on a PR, thanks to Simon Frasch; please go and test!
- We need some more tutorial examples in C++ and Python
- Please help us to write MATLAB (gpuArray) and Julia interfaces
- There are various TensorFlow and related interfaces in progress (please help with them or test them): https://github.com/mrphys/tensorflow-nufft https://github.com/dfm/jax-finufft
- Please see Issues and PRs for other things you can help fix or test
- cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs, Yu-hsuan Shih, Garrett Wright, Joakim Andén, Johannes Blaschke, Alex H. Barnett, PDSEC2021 conference (best paper prize). https://arxiv.org/abs/2102.08463