Jan 3, 2019

GPU Dask Arrays, first steps


The following code creates and manipulates 2 TB of randomly generated data.

import dask.array as da

# 2 TB of normally distributed random data, in 800 MB chunks
rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

# add one, subsample every other element, and sum on all local CPU cores
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

On a single CPU, this computation takes two hours.

On an eight-GPU single-node system this computation takes nineteen seconds.

Combine Dask Array with CuPy

Actually this computation isn’t that impressive. It’s a simple workload, for which most of the time is spent creating and destroying random data. The computation and communication patterns are simple, reflecting the simplicity commonly found in data processing workloads.

What is impressive is that we were able to create a distributed parallel GPU array quickly by composing these four existing libraries:

  1. CuPy provides a partial implementation of Numpy on the GPU.
  2. Dask Array provides chunked algorithms on top of Numpy-like libraries like Numpy and CuPy. This enables us to operate on more data than we could fit in memory by operating on that data in chunks.
  3. The Dask distributed task scheduler runs those algorithms in parallel, easily coordinating work across many CPU cores.
  4. Dask CUDA extends Dask distributed with GPU support.
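
As a rough sketch of how these pieces compose (this is just an illustration, not the exact code used in this post; converting chunks with map_blocks and cupy.asarray is one of several ways to move a Dask array onto the GPU):

import cupy
import dask.array as da

# a chunked Dask array backed by ordinary NumPy blocks
x = da.random.random((20000, 20000), chunks=(2000, 2000))

# convert every block to a CuPy array; the chunked algorithms and the
# task graph stay the same, only the array type behind each chunk changes
gx = x.map_blocks(cupy.asarray)

# the same Dask Array expression now runs on the GPU
(gx + 1)[::2, ::2].sum().compute(scheduler='single-threaded')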

These tools already exist. We had to connect them together with a small amount of glue code and minor modifications. By mashing these tools together we can quickly build and switch between different architectures to explore what is best for our application.

For this example we relied on the following changes upstream:

Comparison among single/multi CPU/GPU

We can now easily run some experiments on different architectures. This is easy because …

  • We can switch between CPU and GPU by switching between Numpy and CuPy.
  • We can switch between single/multi-CPU-core and single/multi-GPU by switching between Dask’s different task schedulers.
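
For example (a minimal sketch, reusing the x array defined above), the scheduler can be chosen per call or set for a whole block of code:

import dask

y = (x + 1)[::2, ::2].sum()

# pick a scheduler for a single call
y.compute(scheduler='single-threaded')  # one CPU core
y.compute(scheduler='threads')          # all local CPU cores

# or set a default scheduler for everything inside the block
with dask.config.set(scheduler='threads'):
    y.compute()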

These libraries allow us to quickly judge the costs of this computation for the following hardware choices:

  1. Single-threaded CPU
  2. Multi-threaded CPU with 40 cores (80 H/T)
  3. Single-GPU
  4. Multi-GPU on a single machine with 8 GPUs

We present code for these four choices below, but first, we present a table of results.

Results

Architecture        Time
Single CPU Core     2hr 39min
Forty CPU Cores     11min 30s
One GPU             1min 37s
Eight GPUs          19s

Setup

import cupy
import dask.array as da

# generate chunked dask arrays of many numpy random arrays
rs = da.random.RandomState()
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

print(x.nbytes / 1e9) # 2 TB
# 2000.0

CPU timing

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')
(x + 1)[::2, ::2].sum().compute(scheduler='threads')

Single GPU timing

We switch from CPU to GPU by changing our data source to generate CuPy arrays rather than NumPy arrays. Everything else should more or less work the same without special handling for CuPy.

(This actually isn’t true yet; many things in dask.array will break for non-NumPy arrays, but we’re working on it actively within Dask, within NumPy, and within the GPU array libraries. Regardless, everything in this example works fine.)

# generate chunked dask arrays of many cupy random arrays
rs = da.random.RandomState(RandomState=cupy.random.RandomState) # <-- we specify cupy here
x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))

(x + 1)[::2, ::2].sum().compute(scheduler='single-threaded')

Multi GPU timing

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# start one worker per local GPU and make the distributed scheduler the default
cluster = LocalCUDACluster()
client = Client(cluster)

(x + 1)[::2, ::2].sum().compute()

And again, here are the results:

Architecture        Time
Single CPU Core     2hr 39min
Forty CPU Cores     11min 30s
One GPU             1min 37s
Eight GPUs          19s

First, this is my first time playing with a 40-core system. I was surprised to see that many cores, and I was pleased to see that Dask’s normal threaded scheduler happily saturates them.

Later on, though, CPU utilization did dive down to around 5000-6000%, and if you do the math you’ll see that we’re not getting a 40x speedup. My guess is that performance would improve if we were to play with some mixture of threads and processes, like having ten processes with eight threads each.
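
If you want to try that mixture yourself, one way to do it (a sketch, not something benchmarked for this post) is with the distributed scheduler’s LocalCluster, which lets you choose the process/thread split explicitly:

from dask.distributed import Client, LocalCluster

# ten worker processes with eight threads each, 80 threads total
cluster = LocalCluster(n_workers=10, threads_per_worker=8)
client = Client(cluster)

(x + 1)[::2, ::2].sum().compute()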

The jump from the biggest multi-core CPU to a single GPU is still an order of magnitude though. The jump to multi-GPU is another order of magnitude, and brings the computation down to 19s, which is short enough that I’m willing to wait for it to finish before walking away from my computer.

Actually, it’s quite fun to watch on the dashboard (especially after you’ve been waiting for three hours for the sequential solution to run).
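
If you want to find the dashboard yourself, recent versions of dask.distributed expose its address on the client (a small aside, assuming the client from the multi-GPU example above; the dashboard usually serves on port 8787):

print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status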

Conclusion

This computation was simple, but the range of architectures we just explored was extensive. We swapped out the underlying architecture from CPU to GPU (which had an entirely different codebase) and tried both multi-core CPU parallelism and multi-GPU many-core parallelism.

We did this in less than twenty lines of code, making this experiment something that an undergraduate student or other novice could perform at home. We’re approaching a point where experimenting with multi-GPU systems is approachable to non-experts (at least for array computing).

Here is a notebook for the experiment above.

Room for improvement

We can work to expand the computation above in a variety of directions. There is a ton of work we still have to do to make this reliable.

  1. Use more complex array computing workloads

     The Dask Array algorithms were designed first around Numpy. We’ve only recently started making them more generic to other kinds of arrays (like GPU arrays, sparse arrays, and so on). As a result there are still many bugs when exploring these non-Numpy workloads.

     For example, if you were to switch sum for mean in the computation above you would get an error, because our mean computation contains an easy-to-fix error that assumes Numpy arrays exactly (sketched just after this list).

  2. Use Pandas and cuDF instead of Numpy and CuPy

     The cuDF library aims to reimplement the Pandas API on the GPU, much like how CuPy reimplements the NumPy API. Using Dask DataFrame with cuDF will require some work on both sides, but is quite doable.

     I believe that there is plenty of low-hanging fruit here.

  3. Improve and move LocalCUDACluster

     The LocalCUDACluster class used above is an experimental Cluster type that creates as many workers locally as you have GPUs, and assigns each worker to prefer a different GPU. This makes it easy for people to load balance across GPUs on a single-node system without thinking too much about it. This appears to be a common pain point in the ecosystem today.

     However, LocalCUDACluster probably shouldn’t live in the dask/distributed repository (it seems too CUDA-specific), so it will probably move to some dask-cuda repository. Additionally, there are still many questions about how to handle concurrency on top of GPUs, balancing between CPU cores and GPU cores, and so on.

  4. Multi-node computation

     There’s no reason that we couldn’t accelerate computations like these further by using multiple multi-GPU nodes. This is doable today with manual setup, but we should also improve the existing deployment solutions dask-kubernetes, dask-yarn, and dask-jobqueue to make this easier for non-experts who want to use a cluster of multi-GPU resources.

  5. Expense

     The machine I ran this on is expensive. Well, it’s nowhere close to as expensive to own and operate as a traditional cluster that you would need for these kinds of results, but it’s still well beyond the price point of a hobbyist or student.

     It would be useful to run this on a more budget-friendly system to get a sense of the tradeoffs at more reasonable price points. I should probably also learn more about provisioning GPUs on the cloud.
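
For concreteness, here is the failing variant mentioned in the first item above. It is a one-word change, and at the time of writing it raised an error for CuPy-backed arrays (it may well be fixed by the time you read this):

# swapping sum for mean hits a code path that assumes NumPy arrays
(x + 1)[::2, ::2].mean().compute(scheduler='single-threaded')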

Come help!

If the work above sounds interesting to you then come help! There is a lot of low-hanging and high-impact work to do.

If you’re interested in being paid to focus more on these topics, then consider applying for a job. The NVIDIA corporation is hiring around the use of Dask with GPUs.

That’s a fairly generic posting. If you’re interested but the posting doesn’t seem to fit, then please apply anyway and we’ll tweak things.