This work is supported by Anaconda Inc.
For the first year of Dask’s life it focused exclusively on single-node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.
After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also to scale up to thousand-node clusters and 100+TB datasets when needed, with the same API.
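As a rough sketch of what that looks like in practice (not code from the original post; the file path, column names, and scheduler address here are hypothetical), the same dataframe computation can run on either scheduler:

```python
import dask.dataframe as dd

# The computation is defined once, independent of where it runs.
df = dd.read_parquet("data/*.parquet")          # hypothetical dataset
result = df.groupby("name")["amount"].mean()    # hypothetical columns

# Single machine: run with the local threaded scheduler.
print(result.compute(scheduler="threads"))

# Cluster: connect to a distributed scheduler and run the same graph there.
from dask.distributed import Client
client = Client("tcp://scheduler-address:8786")  # hypothetical address
print(result.compute())                          # now executes on the cluster
```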
Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking, Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.
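Concretely, a deployment is one scheduler process plus many worker processes. The snippet below is a minimal local stand-in for that architecture; on a real cluster the scheduler and workers run on separate machines (for example, started with the dask-scheduler and dask-worker command line tools) and the client connects to the scheduler's address.

```python
from dask.distributed import LocalCluster, Client

# One scheduler and several worker processes, all on this machine.
# A real deployment runs these on separate nodes and connects with
# Client("tcp://<scheduler-address>:8786") instead.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print(client)  # reports the scheduler address, worker count, and total cores
```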
Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium data on single nodes, may have given people a more humble impression of Dask than is appropriate.
As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.
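That example isn’t reproduced inline here, but a rough sketch of the same kind of workload (the scheduler address is hypothetical, and the array shape is chosen only to reach roughly a terabyte of float64 data) looks like this:

```python
import dask.array as da
from dask.distributed import Client

# Connect to an existing cluster, e.g. 50 machines with 36 cores each.
client = Client("tcp://scheduler-address:8786")   # hypothetical address

# Roughly one terabyte of random float64 data:
# 1,000,000 * 125,000 * 8 bytes = 1e12 bytes, split into ~200 MB chunks.
x = da.random.random((1_000_000, 125_000), chunks=(5_000, 5_000))

# A simple reduction across the whole dataset.
print(x.mean().compute())
```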
This is a typical size for a modest Dask cluster. We usually see Dask deployments either in the tens of machines (often Hadoop-style or ad hoc enterprise clusters) or in the few-thousand range (often high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine when scaled out a couple more orders of magnitude.
For the rest of the article we’ll talk about common causes we see today that get in the way of scaling out. These are collected from experience working with people in the open source community as well as on private contracts.
If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:
If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well; it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:
Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.
Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that have defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be OK, but when operating with full creativity at the thousand-node scale, some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part thanks to some expert users out there.
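As one concrete example of those diagnostics (a minimal sketch, assuming a reasonably recent version of dask.distributed), the live dashboard and performance reports are usually the first places to look when something misbehaves:

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()              # small local cluster, just for illustration
print(client.dashboard_link)   # live dashboard: task stream, worker memory, progress

# Save a static HTML report of a computation for later inspection or sharing.
with performance_report(filename="dask-report.html"):
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    x.sum().compute()
```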
Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.