Sep 18, 2017

Dask on HPC - Initial Work


This work is supported by Anaconda Inc. and the NSF EarthCube program.

We recently announced a collaboration between the National Center for Atmospheric Research (NCAR), Columbia University, and Anaconda Inc. to accelerate the analysis of atmospheric and oceanographic data on high performance computers (HPC) with XArray and Dask. The full text of the proposed work is available here. We are very grateful to the NSF EarthCube program for funding this work, which feels particularly relevant today in the wake (and continued threat) of the major storms Harvey, Irma, and Jose.

This is a collaboration of academic scientists (Columbia), infrastructure stewards (NCAR), and software developers (Anaconda and Columbia and NCAR) to scale current workflows with XArray and Jupyter onto big-iron HPC systems and peta-scale datasets. In the first week after the grant closed a few of us focused on the quickest path to get science groups up and running with XArray, Dask, and Jupyter on these HPC systems. This blogpost details what we achieved and some of the new challenges that we’ve found in that first week. We hope to follow this blogpost with many more to come in the future.

Today we cover the following topics:

  1. Deploying Dask with MPI
  2. Interactive deployments on a batch job scheduler, in this case PBS
  3. The virtues of JupyterLab in a remote system
  4. Network performance and 3GB/s infiniband
  5. Modernizing XArray’s interactions with Dask’s distributed scheduler

A video walkthrough deploying Dask and XArray on an HPC system is available on YouTube, and instructions for atmospheric scientists with access to the Cheyenne Supercomputer are available here.

Now let's start with the technical issues:

Deploying Dask with MPI

HPC systems use job schedulers like SGE, SLURM, PBS, LSF, and others. Dask has been deployed on all of these systems before, either by academic groups or financial companies. However, every time we do this it’s a little different and generally tailored to a particular cluster.

We wanted to make something more general. This started out as a GitHub issue on PBS scripts that tried to make a simple common template that people could copy-and-modify. Unfortunately, there were significant challenges with this. HPC systems and their job schedulers seem to focus on and easily support only two common use cases:

  1. Embarrassingly parallel “run this script 1000 times” jobs. This is too simple for what we have to do.
  2. MPI jobs. This seemed like overkill, but is the approach that we ended up taking.

Deploying dask is somewhere between these two. It falls into the master-slave pattern (or perhaps more appropriately coordinator-workers). We ended up building an MPI4Py program that launches Dask. MPI is well supported, and more importantly consistently supported, by all HPC job schedulers, so depending on MPI provides a level of stability across machines. Now dask.distributed ships with a new dask-mpi executable:

mpirun --np 4 dask-mpi

To be clear, Dask isn’t using MPI for inter-process communication. It’s still using TCP. We’re just using MPI to launch a scheduler and several workers and hook them all together. In pseudocode the dask-mpi executable looks something like this:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    start_dask_scheduler()
else:
    start_dask_worker()

Socially this is useful because every cluster management team knows how to support MPI, so anyone with access to such a cluster has someone they can ask for help. We’ve successfully translated the question “How do I start Dask?” to the question “How do I run this MPI program?” which is a question that the technical staff at supercomputer facilities are generally much better equipped to handle.
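
For concreteness, here is a rough sketch of what a PBS submission script around dask-mpi might look like. The queue name, resource selection, and module/environment lines are placeholders that differ from cluster to cluster, and it assumes mpi4py and the dask-mpi executable are installed in the active Python environment; the scheduler file it writes matches the Python examples later in this post.

#!/bin/bash
#PBS -N dask-mpi
#PBS -q regular                   # queue name is a placeholder
#PBS -l select=4:ncpus=36         # resource syntax varies by site
#PBS -l walltime=01:00:00

# load MPI and activate the Python environment (both site-specific)
module load openmpi
source activate pangeo

# rank 0 runs the Dask scheduler, the remaining ranks run workers
mpirun --np 4 dask-mpi --scheduler-file scheduler.json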

Working Interactively on a Batch Scheduler

Our collaboration is focused on interactive analysis of big datasets. This means that people expect to open up Jupyter notebooks, connect to clusters of many machines, and compute on those machines while they sit at their computer.

Unfortunately most job schedulers were designed for batch scheduling. They will try to run your job quickly, but don’t mind waiting for a few hours for a nice set of machines on the supercomputer to open up. As you ask for more time and more machines, waiting times can increase drastically. For most MPI jobs this is fine because people aren’t expecting to get a result right away and they’re certainly not interacting with the program, but in our case we really do want some results right away, even if they’re only part of what we asked for.

Handling this problem long term will require both technical work and policy decisions. In the short term we take advantage of two facts:

  1. Many small jobs can start more quickly than a few large ones. These take advantage of holes in the schedule that are too small to be used by larger jobs.
  2. Dask doesn’t need to be started all at once. Workers can come and go.

And so I find that if I ask for several single-machine jobs I can easily cobble together a sizable cluster that starts very quickly. In practice this looks like the following:

qsub start-dask.sh # only ask for one machine
qsub add-one-worker.sh # ask for one more machine
qsub add-one-worker.sh # ask for one more machine
qsub add-one-worker.sh # ask for one more machine
qsub add-one-worker.sh # ask for one more machine
qsub add-one-worker.sh # ask for one more machine
qsub add-one-worker.sh # ask for one more machine

Our main job has a wall time of about an hour. The workers have shorter wall times. They can come and go as needed throughout the computation as our computational needs change.
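
To give a sense of what these scripts contain, here is a hedged sketch of what an add-one-worker.sh along these lines might look like. It assumes the scheduler wrote a scheduler file to a shared filesystem (as in the Python example below); the queue, resource, and environment lines are placeholders.

#!/bin/bash
#PBS -N dask-worker
#PBS -q regular                   # queue name is a placeholder
#PBS -l select=1:ncpus=36         # one machine per job
#PBS -l walltime=00:30:00         # shorter wall time than the main job

# activate the Python environment (site-specific)
source activate pangeo

# join the already-running cluster through the shared scheduler file
dask-worker --scheduler-file scheduler.json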

Jupyter Lab and Web Frontends

Our scientific collaborators enjoy building Jupyter notebooks of their work. This allows them to manage their code, scientific thoughts, and visual outputs all at once, and for them serves as an artifact that they can share with their scientific teams and collaborators. To help them with this we start a Jupyter server on the same machine in their allocation that is running the Dask scheduler. We then provide them with SSH-tunneling lines that they can copy-and-paste to get access to the Jupyter server from their personal computer.

We’ve been using the new Jupyter Lab rather than the classic notebook. This is especially convenient for us because it provides much of the interactive experience that they lost by not working on their local machine. They get a file browser, terminals, easy visualization of text files, and so on without having to repeatedly SSH into the HPC system. We get all of this functionality on a single connection and with an intuitive Jupyter interface.

For now we give them a script to set all of this up. It starts Jupyter Lab using Dask and then prints out the SSH-tunneling line.

from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

import socket
host = client.run_on_scheduler(socket.gethostname)

def start_jlab(dask_scheduler):
    import subprocess
    proc = subprocess.Popen(['jupyter', 'lab', '--ip', host, '--no-browser'])
    dask_scheduler.jlab_proc = proc

client.run_on_scheduler(start_jlab)

print("ssh -N -L 8787:%s:8787 -L 8888:%s:8888 -L 8789:%s:8789 cheyenne.ucar.edu" % (host, host, host))

Long term we would like to switch to an entirely point-and-click interface (perhaps something like JupyterHub), but this will require additional thinking about deploying distributed resources along with the Jupyter server instance.

Network Performance on Infiniband

The intended computations move several terabytes across the cluster. On this cluster Dask gets about 1GB/s simultaneous read/write network bandwidth per machine using the high-speed Infiniband network. For any commodity or cloud-based system this is very fast (about 10x faster than what I observe on Amazon). However for a supercomputer this is only about 30% of what’s possible (see hardware specs).

I suspect that this is due to byte-handling in Tornado, the networking library that Dask uses under the hood. The following image shows the diagnostic dashboard for one worker after a communication-heavy workload. We see 1GB/s for both read and write. We also see 100% CPU usage.

Network performance is a big question for HPC users looking at Dask. If we can get near MPI bandwidth then that may help to reduce concerns for this performance-oriented community.
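
For readers who want to check this on their own cluster, here is a rough micro-benchmark sketch (not how the numbers above were produced): it scatters roughly 1 GB of data onto one worker and times its replication to a second worker. It assumes a running cluster and a scheduler.json file as elsewhere in this post, and the timing includes some scheduler overhead, so treat the result as approximate.

import time
import numpy as np
from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')
workers = list(client.scheduler_info()['workers'])

# place roughly 1 GB of float64 data on the first worker
[future] = client.scatter([np.random.random(int(1.25e8))], workers=[workers[0]])

# replicating onto a second worker forces a worker-to-worker transfer
start = time.time()
client.replicate([future], n=2, workers=workers[:2])
print("approximate bandwidth: %.2f GB/s" % (1.0 / (time.time() - start)))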

An open question for us: how do we make full use of the Infiniband network with Dask?

XArray and Dask.distributed

XArray was the first major project to use Dask internally. This early integration was critical to prove out Dask’s internals with user feedback. However it also means that some parts of XArray were designed well before some of the newer parts of Dask, notably the asynchronous distributed scheduling features.

XArray can still use Dask on a distributed cluster, but only with the subset of features that are also available with the single-machine scheduler. This means that persisting data in distributed RAM, parallel debugging, publishing shared datasets, and so on all require significantly more work today with XArray than they should.
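
As a concrete, hypothetical example of what we mean, here is roughly the workflow we would like to be this simple. The file path and chunk sizes are made up, and at the time of writing the persist step is exactly the part that needs the interface work described in the next paragraph.

import xarray as xr
from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')

# lazily open many NetCDF files as one Dask-backed dataset
ds = xr.open_mfdataset('/glade/scratch/example/*.nc', chunks={'time': 10})

# the step we want to be seamless: pin the dataset in distributed RAM
# so that later computations reuse it instead of re-reading from disk
ds = client.persist(ds)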

To address this we plan to update XArray to follow a newly proposed Dask interface. This is complex enough to handle all Dask scheduling features, but lightweight enough not to actually require any dependence on the Dask library itself. (Work by Jim Crist.)
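
To give a flavor of what that interface involves, here is a hedged sketch of the protocol, with the caveat that the exact method names and signatures were still being settled at the time of writing (optional hooks for graph optimization and a default scheduler are omitted). The idea is that any object can act as a Dask collection by exposing its task graph, its output keys, and functions that reassemble computed or persisted results, all without importing Dask.

def _finalize(results):
    # assemble the computed results into a concrete, in-memory object;
    # a real collection would do something meaningful here
    return results


class MyCollection(object):
    """Sketch of a duck-typed Dask collection (details subject to change)."""

    def __init__(self, graph, keys):
        self._graph = graph   # plain dict mapping key -> task
        self._keys = keys     # the keys that make up this collection's output

    def __dask_graph__(self):
        # hand Dask the task graph to schedule
        return self._graph

    def __dask_keys__(self):
        # the output keys to pull out of that graph
        return self._keys

    def __dask_postcompute__(self):
        # (function, extra args) used to turn computed results into
        # a concrete result for the user
        return _finalize, ()

    def __dask_postpersist__(self):
        # (rebuild function, extra args) used to make an equivalent lazy
        # collection whose graph now points at persisted data
        return MyCollection, (self._keys,)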

We will also eventually need to look at reducing overhead for inspecting several NetCDF files, but we haven’t yet run into this, so I plan to wait.

Future Work

We think we’re at a decent point for scientific users to start playing with the system. We have a Getting Started with Dask on Cheyenne wiki page that our first set of guinea pig users have successfully run through without much trouble. We’ve also identified a number of issues that the software developers can work on while the scientific teams spin up.

  1. Zero copy Tornado writes to improve network bandwidth
  2. Enable Dask.distributed features in XArray by formalizing dask’s expected interface
  3. Dynamic deployments on batch job schedulers

We would love to engage other collaborators throughout this process. If you or your group work on related problems we would love to hear from you. This grant isn’t just about serving the scientific needs of researchers at Columbia and NCAR, but about building long-term systems that can benefit the entire atmospheric and oceanographic community. Please engage on the Pangeo GitHub issue tracker.