Jun 26, 2018

Dask Scaling Limits

This work is supported by Anaconda Inc.

History

For the first year of Dask’s life it focused exclusively on single-node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.

After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also to scale up to thousand-node clusters and 100+TB datasets when needed, with the same API.

Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking, Dask scales about as well as a system like Apache Spark, but less well than a high-performance system like MPI.
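As a minimal sketch of what that architecture looks like from Python, the snippet below uses dask.distributed’s LocalCluster to start one scheduler and a handful of workers on a single machine; on a real cluster the workers would instead run on separate nodes and connect to the scheduler’s address. The worker and thread counts here are arbitrary.

    from dask.distributed import Client, LocalCluster

    # One scheduler process coordinating a few worker processes on this machine.
    # On a real deployment the workers live on other nodes and point at the
    # scheduler's address rather than being created locally like this.
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    print(client)  # summarizes the workers, cores, and memory available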

An Example

Most Dask examples in blog posts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium data on single nodes, may have given people a more humble impression of Dask than is appropriate.

As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.

This is a common size for a typical modestly sized Dask cluster. We usually see Dask deployment sizes either in the tens of machines (usually with Hadoop-style or ad hoc enterprise clusters), or in the few-thousand range (usually with high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine when scaled out by another couple of orders of magnitude.
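For readers who want a rough starting point, here is a minimal sketch along the lines of that example, assuming a cluster is already running and that you know its scheduler address (the address below is a placeholder). It builds a roughly one-terabyte random array and runs a simple reduction over it.

    from dask.distributed import Client
    import dask.array as da

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    # Roughly 1 TB of random float64 data (500,000 x 250,000 x 8 bytes),
    # split into ~200 MB chunks so the work spreads across all workers
    x = da.random.random((500000, 250000), chunks=(5000, 5000))
    print(x.nbytes / 1e12)  # about 1.0 terabytes

    # A simple reduction touches every chunk and runs across the cluster
    result = x.mean(axis=0).compute()

Chunk size matters here: smaller chunks mean more tasks, which becomes relevant in the sections below.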

Challenges to Scaling Out

For the rest of the article we’ll talk about common causes that we see today that get in the way of scaling out. These are collected from experience working with people both in the open source community and on private contracts.

Simple Map-Reduce style

If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:

  1. The scheduler will have at least one, and possibly a few, connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.
  2. The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.
  3. If you want to scale larger than this then your tasks will need to do more work each to avoid overhead. Often this involves moving inner for loops within tasks rather than spreading them out to many tasks, as in the sketch after this list.
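To make that last point concrete, here is a minimal sketch, assuming a running cluster and a hypothetical per-element process function: rather than submitting a million tiny tasks, each task handles a batch of a thousand elements, so the work per task comfortably dominates the roughly 200 microsecond scheduling overhead.

    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    def process(x):
        # stand-in for some real per-element work
        return x ** 2

    # Too fine-grained: one million tiny tasks overwhelm the scheduler
    # futures = client.map(process, range(1000000))

    # Better: move the inner for loop inside each task so each task does
    # ~1000x more work and scheduler overhead becomes negligible
    def process_batch(batch):
        return [process(x) for x in batch]

    batches = [list(range(i, i + 1000)) for i in range(0, 1000000, 1000)]
    futures = client.map(process_batch, batches)
    results = client.gather(futures)  # list of per-batch result lists

The batch size of 1000 is arbitrary; the goal is simply that each task runs for something closer to a second than a millisecond.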

More complex algorithms

If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:

  1. Dask collection algorithms may be suboptimal. The parallel algorithms in Dask array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.
  2. The graph size may grow too large for the scheduler. The metadata describing your computation all has to fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow big if you’re not careful. It’s nice to have a scheduler process with at least a few gigabytes of memory if you’re going to be processing million-node task graphs. A task takes up around 1kB of memory if you’re careful to avoid closing over any unnecessary local data. The sketch after this list shows a quick way to check graph size before computing.
  3. The graph serialization time may become annoying for interactive use. Again, if you have million-node task graphs you’re going to be serializing them and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but it can take up some time and limit interactivity. If you press compute and nothing shows up on the dashboard for a minute or two, this is what’s happening.
  4. The interactive dashboard plots stop being as useful. Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, not thousands. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand. This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.
  5. Other components that you rely on, like distributed storage, may also start to break. Dask provides users with more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, or the network, with too many requests. Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provides some primitives like distributed locks and queues to help control access to these resources, but it’s on the user to use them well and not break things.
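On the graph-size point, a quick sanity check before calling compute is to count the tasks in a collection’s graph. The sketch below is only illustrative: it assumes a dask.array workload and uses the collection’s __dask_graph__() method to count tasks, and the array shapes and chunk sizes are arbitrary.

    import dask.array as da

    # The same ~1 TB array, chunked two different ways
    small_chunks = da.random.random((500000, 250000), chunks=(1000, 1000))
    large_chunks = da.random.random((500000, 250000), chunks=(5000, 5000))

    # Here there is one task per chunk, so chunk size controls graph size
    print(len(small_chunks.__dask_graph__()))  # 125,000 tasks
    print(len(large_chunks.__dask_graph__()))  # 5,000 tasks

    # At roughly 1 kB per task, the first graph alone costs the scheduler
    # on the order of 100 MB of metadata before any work starts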

Conclusion

Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.

Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that has defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be OK, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.

A Call for Examples

Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.