Nov 12, 2020

Image Analysis Redux

By

Summary

Last year we experimented with Dask/ITK/Scikit-Image to perform large scale image analysis on a stack of 3D images. Specifically, we looked at deconvolution, a common method to deblur images. Now, a year later, we return to these experiments with a better understanding of how Dask and CuPy can interact, enhanced serialization methods, and support from the open-source community. This post looks at the following:

  1. Implementing a common deconvolution method for CPU + GPU
  2. Leveraging Dask to perform deconvolution on a larger dataset
  3. Exploring the results with the Napari image viewer

Image Analysis Redux

Previously we used the Richardson Lucy (RL) deconvolution algorithm from ITK and Scikit-Image. We left off at theorizing how GPUs could potentially help accelerate these workflows. Starting with Scikit-Image's implementation, we naively tried replacing the scipy.signal.convolve calls with cupyx.scipy.ndimage.convolve, and while performance improved, it did not improve significantly – that is, we did not get the 100x speedup we were looking for.
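To make that substitution concrete, here is a hedged sketch of a naive real-space Richardson Lucy loop on the GPU (the function name, loop structure, and 3D-PSF assumption are ours for illustration, not the Scikit-Image source):

import cupy as cp
import cupyx.scipy.ndimage as cndi

def richardson_lucy_gpu(image, psf, iterations=20):
    # Move inputs to the GPU
    image = cp.asarray(image, dtype=cp.float32)
    psf = cp.asarray(psf, dtype=cp.float32)
    psf_mirror = psf[::-1, ::-1, ::-1]  # flipped PSF for the correction step (3D PSF assumed)
    im_deconv = cp.full(image.shape, 0.5, dtype=cp.float32)
    for _ in range(iterations):
        # Each iteration performs two full real-space convolutions
        relative_blur = image / cndi.convolve(im_deconv, psf)
        im_deconv *= cndi.convolve(relative_blur, psf_mirror)
    return im_deconv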

As it often turns out in mathematics, a problem that is inefficient to solve in one representation can often be made more efficient by transforming the data beforehand. In the new representation we can solve the same problem (convolution in this case) more easily before transforming the result back into a more familiar representation. When it comes to convolution, the transformation we apply is the Fast Fourier Transform (FFT). Once this is applied we are able to convolve data using a simple multiplication.

As it turns out, this FFT transformation is extremely fast on both CPUs and GPUs, and the algorithm we can write with FFTs is accelerated accordingly. This is a commonly used technique in the image processing field to speed up convolutions. Despite the added step of doing FFTs, the cost of the transformation plus the cost of the algorithm is still lower than performing the original algorithm in real space. We (and others before us) found this was the case for Richardson Lucy (on both CPUs and GPUs), and performance continued increasing when we parallelized with Dask over multiple GPUs.
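As a quick NumPy-only sanity check of the convolution theorem (a small example of ours, not from the original experiments): zero-pad, transform, multiply, and invert, and you recover exactly what direct convolution gives.

import numpy as np

x = np.random.random(256)
k = np.random.random(32)

# Direct (real-space) linear convolution
direct = np.convolve(x, k, mode="full")

# Same result via FFT: zero-pad both signals to the full output length,
# multiply in Fourier space, then transform back
n = len(x) + len(k) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, via_fft)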

Help from Open-Source

An FFT RL equivalent has been around for some time and the good folks at the Solar Dynamics Observatory built and shared a NumPy/CuPy implementation as part of the Atmospheric Imaging Assembly Python package (aiapy). We slightly modified their implementation to handle 3D as well as 2D Point Spread Functions and to take advantage of NEP-18 for convenient dispatching of NumPy and CuPy arrays to NumPy and CuPy functions:

import numpy as np

def deconvolve(img, psf=None, iterations=20):
    # Pad PSF with zeros to match image shape
    pad_l, pad_r = np.divmod(np.array(img.shape) - np.array(psf.shape), 2)
    pad_r += pad_l
    psf = np.pad(psf, tuple(zip(pad_l, pad_r)), 'constant', constant_values=0)

    # Recenter PSF at the origin
    # Needed to ensure PSF doesn't introduce an offset when
    # convolving with image
    for i in range(psf.ndim):
        psf = np.roll(psf, psf.shape[i] // 2, axis=i)

    # Convolution requires FFT of the PSF
    psf = np.fft.rfftn(psf)

    # Perform deconvolution in-place on a copy of the image
    # (avoids changing the original)
    img_decon = np.copy(img)
    for _ in range(iterations):
        ratio = img / np.fft.irfftn(np.fft.rfftn(img_decon) * psf)
        img_decon *= np.fft.irfftn((np.fft.rfftn(ratio).conj() * psf).conj())
    return img_decon
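Because of the NEP-18 dispatching, the same function runs on either backend. A small usage sketch with tiny random arrays (illustrative values of ours, not the data from this post): NumPy in gives NumPy out, CuPy in gives CuPy out.

import numpy as np
import cupy as cp

np_img = np.random.random((64, 64, 64)).astype(np.float32)
np_psf = np.random.random((16, 16, 16)).astype(np.float32)

cpu_result = deconvolve(np_img, psf=np_psf, iterations=5)                           # numpy.ndarray
gpu_result = deconvolve(cp.asarray(np_img), psf=cp.asarray(np_psf), iterations=5)   # cupy.ndarray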

For a 1.3 GB image we measured the following:

  • CuPy ~3 seconds for 20 iterations
  • NumPy ~36 seconds for 2 iterations

We see a 10x increase in speed for 10 times the number of iterations – very close to our desired 100x speedup! Let’s explore how this implementation performs with real biological data and Dask…

Define a Dask Cluster and Load the Data

We were provided sample data from Prof. Shroff’s lab at the NIH. The data was originally provided as a 3D TIFF file, which we subsequently converted to Zarr with a shape of (950, 2048, 2048).
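The conversion script itself is not shown here; a minimal sketch of how one might do it (hypothetical file names, chunked one slice at a time to match the layout seen below) is:

import dask.array as da
import skimage.io

stack = skimage.io.imread("y1z1_C1_A.tif")  # load the full 3D TIFF into memory
da.from_array(stack, chunks=(1, 2048, 2048)).to_zarr("y1z1_C1_A.zarr")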

We start by creating a Dask cluster on a DGX-2 (16 GPUs in a single node) and loading the image stored in Zarr:

Example Notebook

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask.array as da

import numpy as np
import rmm
import cupy as cp

cluster = LocalCUDACluster(
    local_directory="/tmp/bzaitlen",
    enable_nvlink=True,
    rmm_pool_size="26GB",
)
client = Client(cluster)

# Use the RMM pool as CuPy's allocator on every worker
client.run(
    cp.cuda.set_allocator,
    rmm.rmm_cupy_allocator
)

imgs = da.from_zarr("/public/NVMICROSCOPY/y1z1_C1_A.zarr/")
imgs

        Array              Chunk
Bytes   7.97 GB            8.39 MB
Shape   (950, 2048, 2048)  (1, 2048, 2048)
Count   951 Tasks          950 Chunks
Type    uint16             numpy.ndarray

From the Dask output above you can see the data is a z-stack of 950 images, where each slice is 2048x2048. For this data set, we can improve GPU performance if we operate on larger chunks. Additionally, we need to ensure the chunks are at least as big as the PSF, which in this case is (128, 128, 128). As we did our work on a DGX-2, which has 16 GPUs, we can comfortably fit the data and perform deconvolution on each GPU if we rechunk the data accordingly:

# chunk with respect to PSF shape (128, 128, 128)
imgs = imgs.rechunk(chunks={0: 190, 1: 512, 2: 512})
imgs

        Array              Chunk
Bytes   7.97 GB            99.61 MB
Shape   (950, 2048, 2048)  (190, 512, 512)
Count   967 Tasks          80 Chunks
Type    uint16             numpy.ndarray

Next, we convert to float32, as the data may not already be of floating point type. 32-bit is also a bit faster than 64-bit to compute with and saves a bit of memory. Below we map cupy.asarray onto each block of data; cupy.asarray moves the data from host memory (NumPy) to the device/GPU (CuPy).

imgs = imgs.astype(np.float32)
c_imgs = imgs.map_blocks(cp.asarray)

        Array              Chunk
Bytes   15.94 GB           199.23 MB
Shape   (950, 2048, 2048)  (190, 512, 512)
Count   80 Tasks           80 Chunks
Type    float32            cupy.ndarray

What we now have is a Dask array composed of 80 CuPy blocks of data. Notice how Dask provides nice typing information in the SVG output. When we moved from NumPy to CuPy, the block diagram above displays Type: cupy.ndarray – this is a nice sanity check.
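The same sanity check can be done programmatically (an optional snippet of ours, not in the original notebook): the array's ._meta attribute, which we also pass to map_overlap below, now reports a CuPy type.

type(c_imgs._meta)   # cupy.ndarray
type(imgs._meta)     # numpy.ndarray (the original host-backed array)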

The last piece we need before running the deconvolution is the PSF, which should also be loaded onto the GPU:

import skimage.io

psf = skimage.io.imread("/public/NVMICROSCOPY/PSF.tif")
c_psf = cp.asarray(psf)

Lastly, we call map_overlap with the deconvolve function across the Dask array:

out = da.map_overlap(
    deconvolve,
    c_imgs,
    psf=c_psf,
    iterations=100,
    meta=c_imgs._meta,
    depth=tuple(np.array(c_psf.shape) // 2),
    boundary="periodic"
)
out
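Displaying out only builds the task graph. To actually run the deconvolution and keep the result, one option (with a hypothetical output path) is to bring each block back to host memory and write the array to Zarr:

out.map_blocks(cp.asnumpy).to_zarr("/tmp/deconvolved.zarr", overwrite=True)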

[Figure: the image is taken from a mouse intestine.]

With Dask and multiple GPUs, we measured deconvolution of a 16 GB image in ~30 seconds! But this is just the first step towards accelerated image science.

Napari

Deconvolution is just one operation, and just one of the tools an image scientist or microscopist will need. They will need other tools as they study the underlying biology. Before getting to those next steps, they will need tools to visualize the data. Napari, a multi-dimensional image viewer used in the PyData Bio ecosystem, is a good tool for visualizing this data. As an experiment, we ran the same workflow on a local workstation with 2 Quadro RTX 8000 GPUs connected with NVLink (Example Notebook).

By adding a map_blocks call to our array, we can move our data back from GPU to CPU (device to host).

def cupy_to_numpy(x):
    import cupy as cp
    return cp.asnumpy(x)

# meta is an empty NumPy array, declaring that the output blocks live on the host
np_out = out.map_blocks(cupy_to_numpy, meta=np.empty((0, 0, 0), dtype=out.dtype))
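A hedged sketch of handing the lazily evaluated array to Napari (standard napari calls, but the exact invocation is ours, not copied from the post):

import napari

viewer = napari.view_image(np_out)  # Napari pulls slices lazily from the Dask array
napari.run()                        # start the event loop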

When the user moves the slider on the Napari UI, we are instructing Dask to do the following:

  • Load the data from disk onto the GPU (CuPy)
  • Compute the deconvolution
  • Move back to the host (NumPy)
  • Render with Napari

This has about a second of latency, which is great for a naive implementation! We can improve this by adding caching, improving communications with map_overlap, and optimizing the deconvolution kernel.
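As one example of the caching idea (a hedged sketch of ours): persisting the result keeps deconvolved blocks in cluster memory once computed, so revisiting a slice does not trigger a recompute, at the cost of holding the whole array in memory.

np_out = np_out.persist()  # keep computed blocks resident in worker memory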

Conclusion

We have now shown how one can perform Richardson-Lucy deconvolution with Dask + CuPy. This required a minimal amount of code. Combining this with an image viewer (Napari), we were able to inspect the data and our result. All of this performed reasonably well by assembling PyData libraries: Dask, CuPy, Zarr, and Napari with a new deconvolution kernel. Hopefully this provides you a good template to get started analyzing your own data and demonstrates the richness and easy expression of custom workflows. If you run into any challenges, please reach out on the Dask issue tracker and we would be happy to engage with you :)