Larger than memory images: performant and scalable distributed implementation for workstations and clusters #1062

Open · wants to merge 56 commits into base: main
Conversation

GFleishman

This PR solves #1061 by adding distributed_segmentation.py, a self-contained module for segmenting larger-than-memory images on a workstation or cluster. Images are partitioned into overlapping blocks that are each processed separately, in parallel or in series (e.g. if you only have a single GPU). Per-block results are seamlessly stitched into a single segmentation of the entire larger-than-memory image.

Windows, Linux, and macOS workstations, as well as LSF clusters, are automatically supported.
Other cluster managers, such as SLURM or SGE, require implementing your own dask cluster class (a sketch is given below), which is a good opportunity to submit a PR and be added to the author list. I am happy to advise anyone doing this.
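
For example, a minimal SLURM version could be built on dask-jobqueue. This is only a sketch: the wrapper interface (constructor arguments, the .client attribute, the class name) is an assumption here, so mirror the existing LSF cluster class in distributed_segmentation.py when writing a real one.

# A minimal sketch of a SLURM cluster wrapper built on dask-jobqueue.
# The wrapper interface (constructor arguments, .client attribute) and the
# class name are assumptions; mirror the existing LSF cluster class in
# distributed_segmentation.py for the real contract.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

class mySLURMCluster:  # hypothetical name
    def __init__(self, ncpus, min_workers, max_workers, memory='15GB', **kwargs):
        self.cluster = SLURMCluster(
            cores=ncpus,      # cpu cores per worker
            processes=1,      # one dask worker per job
            memory=memory,    # memory per worker
            **kwargs,         # queue, walltime, job_extra_directives, ...
        )
        # scale the number of jobs up and down with the remaining work
        self.cluster.adapt(minimum=min_workers, maximum=max_workers)
        self.client = Client(self.cluster)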

The preferred input is a Zarr or N5 array, but folders of tiff images are also supported. Single large tiff files can be converted to Zarr with the module itself. Your workstation or cluster can be arbitrarily partitioned into workers with arbitrary resources, e.g. "10 workers, 2 CPU cores each, 1 GPU each", or, if you have a workstation with a single GPU, "1 worker with 8 CPU cores and 1 GPU" (sketched below). Computation never exceeds the given worker specification, so you can process huge datasets without occupying your entire machine.
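
For reference, those two specifications might look roughly like this (a sketch only; the key names mirror the examples further down, and the exact accepted keys depend on the cluster class being used):

# sketch only: key names mirror the Examples below
# "10 workers, 2 cpu cores each, 1 gpu each" on a cluster:
cluster_kwargs = {'ncpus': 2, 'min_workers': 10, 'max_workers': 10}
# "1 worker with 8 cpu cores and 1 gpu" on a workstation:
cluster_kwargs = {'n_workers': 1, 'ncpus': 8, 'threads_per_worker': 1}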

Compatible with any Cellpose model. Small crops can be tested before committing to a big data segmentation by calling the function that runs on each individual block directly. A foreground mask can be provided to ensure no time is wasted on voxels that do not contain sample. An arbitrary list of preprocessing steps can be distributed along with Cellpose itself, so if you need to smooth, sharpen, or otherwise transform the data before segmenting, you don't need to do it in advance and save a processed version of your large dataset; you can just distribute those preprocessing functions along with the segmentation (see the sketch below).
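
As a rough sketch of what that can look like (the preprocessing_steps and mask parameter names, and the (function, kwargs) tuple format, are my reading of the docstrings; verify against distributed_eval before copying):

# Sketch: parameter names and the (function, kwargs) step format are
# assumptions; check the distributed_eval docstring for the exact interface.
import numpy as np
from scipy.ndimage import gaussian_filter

# each step is applied to every block, in order, before segmentation
preprocessing_steps = [(gaussian_filter, {'sigma': 1.0}),]

# a small binary foreground mask; blocks with no foreground are skipped
foreground_mask = np.ones((64, 64, 64), dtype=bool)

# both would be passed to distributed_eval alongside the kwargs shown in the
# Examples below, e.g.
#     distributed_eval(..., preprocessing_steps=preprocessing_steps, mask=foreground_mask, ...)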

Installation from scratch in a fresh conda environment tested successfully by @snoreis on a machine with the following specs:
OS: Windows 11 Pro
CPU: 16-core Threadripper PRO 3955WX
GPU: NVIDIA RTX A5000

Of course also tested in my own environments.
Workstation:
OS: Rocky Linux 9.3
CPU: 8-core Intel Skylake
GPU: 1x NVIDIA Tesla L4 15GB

Cluster:
OS: Rocky Linux 9.3
CPU: 100 cores, Intel Skylake
GPU: 100x NVIDIA Tesla L4 15GB

List of functions provided; all have verbose docstrings covering all inputs and outputs:
distributed_eval : run Cellpose on a big image on any machine
process_block : the function run on each block of a big dataset; can be called on its own for testing
numpy_array_to_zarr : create a zarr array, the preferred input to distributed_eval
wrap_folder_of_tiffs : represent a folder of tiff files as a zarr array without duplicating data

New dependencies are correctly specified and install successfully from source with: pip install -e .[distributed]

Examples
Run distributed Cellpose on half the resources of a workstation with 16 CPU cores, 1 GPU, and 128GB of system memory:

from cellpose.contrib.distributed_segmentation import distributed_eval

# parameterize cellpose however you like
model_kwargs = {'gpu':True, 'model_type':'cyto3'}
eval_kwargs = {'diameter':30,
               'z_axis':0,
               'channels':[0,0],
               'do_3D':True,
}

# define myLocalCluster parameters
cluster_kwargs = {
    'n_workers':1,    # we only have 1 gpu, so no need for more workers
    'ncpus':8,
    'memory_limit':'64GB',
    'threads_per_worker':1,
}

# run segmentation
# segments: zarr array containing labels
# boxes: list of bounding boxes around all labels
segments, boxes = distributed_eval(
    input_zarr=large_zarr_array,
    blocksize=(256, 256, 256),
    write_path='/where/zarr/array/containing/results/will/be/written.zarr',
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    cluster_kwargs=cluster_kwargs,
)
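
Once the call returns, the labels can be reopened lazily with plain zarr, and the returned bounding boxes (I believe these are tuples of slices, one per label) can be used to pull out individual cells without loading the whole volume:

# reopen the results lazily (standard zarr API); treating the boxes as
# slice tuples is an assumption - verify in the distributed_eval docstring
import zarr
segments = zarr.open('/where/zarr/array/containing/results/will/be/written.zarr', mode='r')
first_cell = segments[boxes[0]]   # loads only the voxels around the first label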

Run distributed Cellpose on an LSF cluster with 128 GPUs (e.g. Janelia cluster)
(Note this example is identical to the previous one, with only a few small changes to the cluster_kwargs; i.e. it is easy to go back and forth between workstations and clusters.)

from cellpose.contrib.distributed_segmentation import distributed_eval

# parameterize cellpose however you like
model_kwargs = {'gpu':True, 'model_type':'cyto3'}
eval_kwargs = {'diameter':30,
               'z_axis':0,
               'channels':[0,0],
               'do_3D':True,
}

# define LSF cluster parameters
cluster_kwargs = {
    'ncpus':2,           # cpus per worker
    'min_workers':8,     # the cluster adaptively allocates and releases workers
    'max_workers':128,   # based on the number of blocks left to process
    'queue':'gpu_l4',
    'job_extra_directives':['-gpu "num=1"'],
}

# run segmentation
# segments: zarr array containing labels
# boxes: list of bounding boxes around all labels
segments, boxes = distributed_eval(
    input_zarr=large_zarr_array,
    blocksize=(256, 256, 256),
    write_path='/where/zarr/array/containing/results/will/be/written.zarr',
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    cluster_kwargs=cluster_kwargs,
)

Testing a single block before running a distributed computation:

from cellpose.contrib.distributed_segmentation import process_block

# define a crop as the distributed function would
starts = (128, 128, 128)
blocksize = (256, 256, 256)
overlap = 60
crop = tuple(slice(s-overlap, s+b+overlap) for s, b in zip(starts, blocksize))

# call the segmentation
segments, boxes, box_ids = process_block(
    block_index=(0, 0, 0),  # when test_mode=True this is just a dummy value
    crop=crop,
    input_zarr=my_zarr_array,
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    blocksize=blocksize,
    overlap=overlap,
    output_zarr=None,
    test_mode=True,
)
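
A quick sanity check on the test-block output needs nothing beyond numpy:

# plain numpy sanity check, nothing module-specific
import numpy as np
print(segments.shape, segments.dtype)
print('highest label id:', np.max(segments))   # 0 is background
print('bounding boxes returned:', len(boxes))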

Wrap a folder of tiff images/tiles into a single Zarr array:

# Note tiff filenames must indicate the position of each file in the overall tile grid
from cellpose.contrib.distributed_segmentation import wrap_folder_of_tiffs
reconstructed_virtual_zarr_array = wrap_folder_of_tiffs(
    filename_pattern='/path/to/folder/of/*.tiff',
    block_index_pattern=r'_(Z)(\d+)(Y)(\d+)(X)(\d+)',
)
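
To make the block_index_pattern concrete: a (hypothetical) filename like tile_Z000Y001X002.tiff would be placed at grid position (0, 1, 2). That is just standard re behavior:

# illustration only: the filename is hypothetical, the regex is the one above
import re
match = re.search(r'_(Z)(\d+)(Y)(\d+)(X)(\d+)', 'tile_Z000Y001X002.tiff')
grid_position = tuple(int(g) for g in match.groups()[1::2])   # (0, 1, 2)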

Converting a large single tiff image to Zarr:

# Note: the full image will be loaded into system memory
import tifffile
from cellpose.contrib.distributed_segmentation import numpy_array_to_zarr

data_numpy = tifffile.imread('/path/to/image.tiff')
data_zarr = numpy_array_to_zarr('/path/to/output.zarr', data_numpy, chunks=(256, 256, 256))
del data_numpy
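
If a single tiff is too large to load at all, one workaround outside this PR is to memory-map it with tifffile and stream it to Zarr chunk by chunk with dask; note that tifffile.memmap only works for uncompressed, contiguously stored files:

# alternative for tiffs that do not fit in memory (not part of this PR);
# tifffile.memmap only works for uncompressed, contiguously stored tiffs
import dask.array as da
import tifffile

data = tifffile.memmap('/path/to/image.tiff')
da.from_array(data, chunks=(256, 256, 256)).to_zarr('/path/to/output.zarr')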
