Larger than memory images: performant and scalable distributed implementation for workstations and clusters #1062

Open · wants to merge 56 commits into base: main
Conversation

GFleishman

This PR solves #1061 by adding distributed_segmentation.py, a self-contained module for segmenting larger-than-memory images on a workstation or cluster. Images are partitioned into overlapping blocks that are each processed separately, in parallel or in series (e.g. if you only have a single GPU). Per-block results are seamlessly stitched into a single segmentation of the entire larger-than-memory image.

Windows, Linux, and macOS workstations, as well as LSF clusters, are automatically supported.
Other cluster managers, such as SLURM or SGE, require implementing your own dask cluster class (a sketch is given below), which is a good opportunity to submit a PR and be added to the author list. I am happy to advise anyone doing this.
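
For example, a minimal SLURM version could be built on dask-jobqueue. This is only a sketch: the wrapper interface (constructor arguments, the .client attribute, the class name) is an assumption here, so mirror the existing LSF cluster class in distributed_segmentation.py when writing a real one.

# A minimal sketch of a SLURM cluster wrapper built on dask-jobqueue.
# The wrapper interface (constructor arguments, .client attribute) and the
# class name are assumptions; mirror the existing LSF cluster class in
# distributed_segmentation.py for the real contract.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

class mySLURMCluster:  # hypothetical name
    def __init__(self, ncpus, min_workers, max_workers, memory='15GB', **kwargs):
        self.cluster = SLURMCluster(
            cores=ncpus,      # cpu cores per worker
            processes=1,      # one dask worker per job
            memory=memory,    # memory per worker
            **kwargs,         # queue, walltime, job_extra_directives, ...
        )
        # scale the number of jobs up and down with the remaining work
        self.cluster.adapt(minimum=min_workers, maximum=max_workers)
        self.client = Client(self.cluster)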

The preferred input is a Zarr or N5 array, but folders of tiff images are also supported. Single large tiff files can be converted to Zarr with the module itself. Your workstation or cluster can be arbitrarily partitioned into workers with arbitrary resources, e.g. "10 workers, 2 CPU cores each, 1 GPU each", or, if you have a workstation with a single GPU, "1 worker with 8 CPU cores and 1 GPU" (sketched below). Computation never exceeds the given worker specification, so you can process huge datasets without occupying your entire machine.
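
For reference, those two specifications might look roughly like this (a sketch only; the key names mirror the examples further down, and the exact accepted keys depend on the cluster class being used):

# sketch only: key names mirror the Examples below
# "10 workers, 2 cpu cores each, 1 gpu each" on a cluster:
cluster_kwargs = {'ncpus': 2, 'min_workers': 10, 'max_workers': 10}
# "1 worker with 8 cpu cores and 1 gpu" on a workstation:
cluster_kwargs = {'n_workers': 1, 'ncpus': 8, 'threads_per_worker': 1}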

Compatible with any Cellpose model. Small crops can be tested before committing to a big data segmentation by calling the function that runs on each individual block directly. A foreground mask can be provided to ensure no time is wasted on voxels that do not contain sample. An arbitrary list of preprocessing steps can be distributed along with Cellpose itself, so if you need to smooth, sharpen, or otherwise transform the data before segmenting, you don't need to do it in advance and save a processed version of your large dataset; you can just distribute those preprocessing functions along with the segmentation (see the sketch below).
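
As a rough sketch of what that can look like (the preprocessing_steps and mask parameter names, and the (function, kwargs) tuple format, are my reading of the docstrings; verify against distributed_eval before copying):

# Sketch: parameter names and the (function, kwargs) step format are
# assumptions; check the distributed_eval docstring for the exact interface.
import numpy as np
from scipy.ndimage import gaussian_filter

# each step is applied to every block, in order, before segmentation
preprocessing_steps = [(gaussian_filter, {'sigma': 1.0}),]

# a small binary foreground mask; blocks with no foreground are skipped
foreground_mask = np.ones((64, 64, 64), dtype=bool)

# both would be passed to distributed_eval alongside the kwargs shown in the
# Examples below, e.g.
#     distributed_eval(..., preprocessing_steps=preprocessing_steps, mask=foreground_mask, ...)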

Installation from scratch in a fresh conda environment tested successfully by @snoreis on a machine with the following specs:
OS: Windows 11 Pro
CPU: 16-core Threadripper PRO 3955WX
GPU: NVIDIA RTX A5000

Of course also tested in my own environments.
Workstation:
OS: Rocky Linux 9.3
CPU: 8-core Intel Skylake
GPU: 1x NVIDIA Tesla L4 15GB

Cluster:
OS: Rocky Linux 9.3
CPU: 100 cores, Intel Skylake
GPU: 100x NVIDIA Tesla L4 15GB

List of functions provided; all have verbose docstrings covering all inputs and outputs:
distributed_eval : run Cellpose on a big image on any machine
process_block : the function run on each block of a big dataset; can be called on its own for testing
numpy_array_to_zarr : create a zarr array, the preferred input to distributed_eval
wrap_folder_of_tiffs : represent a folder of tiff files as a zarr array without duplicating data

New dependencies are correctly specified and install successfully from source with: pip install -e .[distributed]

Examples
Run distributed Cellpose on half the resources of a workstation with 16 CPU cores, 1 GPU, and 128GB of system memory:

from cellpose.contrib.distributed_segmentation import distributed_eval

# parameterize cellpose however you like
model_kwargs = {'gpu':True, 'model_type':'cyto3'}
eval_kwargs = {'diameter':30,
               'z_axis':0,
               'channels':[0,0],
               'do_3D':True,
}

# define myLocalCluster parameters
cluster_kwargs = {
    'n_workers':1,    # we only have 1 gpu, so no need for more workers
    'ncpus':8,
    'memory_limit':'64GB',
    'threads_per_worker':1,
}

# run segmentation
# segments: zarr array containing labels
# boxes: list of bounding boxes around all labels
segments, boxes = distributed_eval(
    input_zarr=large_zarr_array,
    blocksize=(256, 256, 256),
    write_path='/where/zarr/array/containing/results/will/be/written.zarr',
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    cluster_kwargs=cluster_kwargs,
)
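
Once the call returns, the labels can be reopened lazily with plain zarr, and the returned bounding boxes (I believe these are tuples of slices, one per label) can be used to pull out individual cells without loading the whole volume:

# reopen the results lazily (standard zarr API); treating the boxes as
# slice tuples is an assumption - verify in the distributed_eval docstring
import zarr
segments = zarr.open('/where/zarr/array/containing/results/will/be/written.zarr', mode='r')
first_cell = segments[boxes[0]]   # loads only the voxels around the first label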

Run distributed Cellpose on an LSF cluster with 128 GPUs (e.g. Janelia cluster)
(Note this example is identical to the previous one, with only a few small changes to the cluster_kwargs; i.e. it is easy to go back and forth between workstations and clusters.)

from cellpose.contrib.distributed_segmentation import distributed_eval

# parameterize cellpose however you like
model_kwargs = {'gpu':True, 'model_type':'cyto3'}
eval_kwargs = {'diameter':30,
               'z_axis':0,
               'channels':[0,0],
               'do_3D':True,
}

# define LSF cluster parameters
cluster_kwargs = {
    'ncpus':2,           # cpus per worker
    'min_workers':8,     # the cluster adaptively allocates and releases workers
    'max_workers':128,   # based on the number of blocks left to process
    'queue':'gpu_l4',
    'job_extra_directives':['-gpu "num=1"'],
}

# run segmentation
# segments: zarr array containing labels
# boxes: list of bounding boxes around all labels
segments, boxes = distributed_eval(
    input_zarr=large_zarr_array,
    blocksize=(256, 256, 256),
    write_path='/where/zarr/array/containing/results/will/be/written.zarr',
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    cluster_kwargs=cluster_kwargs,
)

Testing a single block before running a distributed computation:

from cellpose.contrib.distributed_segmentation import process_block

# define a crop as the distributed function would
starts = (128, 128, 128)
blocksize = (256, 256, 256)
overlap = 60
crop = tuple(slice(s-overlap, s+b+overlap) for s, b in zip(starts, blocksize))

# call the segmentation
segments, boxes, box_ids = process_block(
    block_index=(0, 0, 0),  # when test_mode=True this is just a dummy value
    crop=crop,
    input_zarr=my_zarr_array,
    model_kwargs=model_kwargs,
    eval_kwargs=eval_kwargs,
    blocksize=blocksize,
    overlap=overlap,
    output_zarr=None,
    test_mode=True,
)
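
A quick sanity check on the test-block output needs nothing beyond numpy:

# plain numpy sanity check, nothing module-specific
import numpy as np
print(segments.shape, segments.dtype)
print('highest label id:', np.max(segments))   # 0 is background
print('bounding boxes returned:', len(boxes))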

Wrap a folder of tiff images/tiles into a single Zarr array:

# Note tiff filenames must indicate the position of each file in the overall tile grid
from cellpose.contrib.distributed_segmentation import wrap_folder_of_tiffs
reconstructed_virtual_zarr_array = wrap_folder_of_tiffs(
    filename_pattern='/path/to/folder/of/*.tiff',
    block_index_pattern=r'_(Z)(\d+)(Y)(\d+)(X)(\d+)',
)
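
To make the block_index_pattern concrete: a (hypothetical) filename like tile_Z000Y001X002.tiff would be placed at grid position (0, 1, 2). That is just standard re behavior:

# illustration only: the filename is hypothetical, the regex is the one above
import re
match = re.search(r'_(Z)(\d+)(Y)(\d+)(X)(\d+)', 'tile_Z000Y001X002.tiff')
grid_position = tuple(int(g) for g in match.groups()[1::2])   # (0, 1, 2)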

Converting a large single tiff image to Zarr:

# Note: the full image will be loaded into system memory
import tifffile
from cellpose.contrib.distributed_segmentation import numpy_array_to_zarr

data_numpy = tifffile.imread('/path/to/image.tiff')
data_zarr = numpy_array_to_zarr('/path/to/output.zarr', data_numpy, chunks=(256, 256, 256))
del data_numpy
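
If a single tiff is too large to load at all, one workaround outside this PR is to memory-map it with tifffile and stream it to Zarr chunk by chunk with dask; note that tifffile.memmap only works for uncompressed, contiguously stored files:

# alternative for tiffs that do not fit in memory (not part of this PR);
# tifffile.memmap only works for uncompressed, contiguously stored tiffs
import dask.array as da
import tifffile

data = tifffile.memmap('/path/to/image.tiff')
da.from_array(data, chunks=(256, 256, 256)).to_zarr('/path/to/output.zarr')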
