May 19, 2015

State of Dask

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project.

tl;dr We lay out the pieces of Dask, a system for parallel computing

Introduction

Dask started five months ago as a parallel on-disk array; it has since broadened out. I’ve enjoyed writing about its development tremendously. With the recent 0.5.0 release I decided to take a moment to give an overview of dask’s various pieces, their state, and current development.

Collections, graphs, and schedulers

Dask modules can be separated as follows:

[Figure: Partitioned Frame design]

On the left there are collections like arrays, bags, and dataframes. These copy APIs for NumPy, PyToolz, and Pandas respectively and are aimed towards data science users, allowing them to interact with larger datasets. Operations on these dask collections produce task graphs, which are recipes to compute the desired result using many smaller computations that each fit in memory. For example, if we want to sum a trillion numbers then we might break the numbers into million-element chunks, sum those, and then sum the sums. A previously impossible task becomes a million and one easy ones.
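As a rough illustration of this chunked pattern (a sketch, not dask’s internals; the sizes here are arbitrary):

```python
import dask.array as da

# One billion ones split into million-element chunks.  Nothing is
# computed yet; this only builds a lazy collection.
x = da.ones(1_000_000_000, chunks=1_000_000)

# sum() builds the task graph: sum each chunk, then sum the partial sums.
total = x.sum()

print(total.compute())  # executes the graph; prints 1000000000.0
```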

On the right there are schedulers. Schedulers execute task graphs in different situations, usually in parallel. Notably there are a few schedulers for a single machine, and a new prototype for a distributed scheduler.

In the center is the directed acyclic graph. This graph serves as glue between collections and schedulers. The dask graph format is simple and doesn’t include any dask classes; it’s just functions, dicts, and tuples and so is easy to build on and low-tech enough to understand immediately. This separation is very useful to dask during development; improvements to one side immediately affect the other and new developers have had surprisingly little trouble. Also developers from a variety of backgrounds have been able to come up to speed in about an hour.
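For concreteness, here is a tiny hand-written graph in this format. Keys map to either data or tasks, where a task is a tuple whose first element is a callable:

```python
from operator import add

# A dask graph is a plain dict.  String arguments that match other
# keys refer to those keys' results.
dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),  # z = add(x, y)
}
```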

This separation is useful to other projects too. Directed acyclic graphs are popular today in many domains. By exposing dask’s schedulers publicly, other projects can bypass dask collections and go straight for the execution engine.
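Running a raw graph takes a single call to a scheduler’s get function. Continuing the small graph sketched above:

```python
from dask.threaded import get

# Execute the graph directly, with no dask collection involved.
result = get(dsk, "z")  # computes add(1, 2) on the threaded scheduler
print(result)           # 3
```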

A flattering quote from a GitHub issue:

dask has been very helpful so far, as it allowed me to skip implementing all of the usual graph operations. Especially doing the asynchronous execution properly would have been a lot of work.

Who uses dask?

Dask developers work closely with a few really amazing users:

  1. Stephan Hoyer at Climate Corp has integrated dask.array into xray, a library to manage large volumes of meteorological data (and other labeled arrays).
  2. Scikit-image now includes an apply_parallel operation (GitHub PR) that uses dask.array to parallelize image processing routines (work by Blake Griffith); a usage sketch follows this list.
  3. Mariano Tepper, a postdoc at Duke, uses dask in his research on matrix factorizations. Mariano is also the primary author of the dask.array.linalg module, which includes efficient and stable QR and SVD for tall-and-skinny matrices. See Mariano’s paper on arXiv.
  4. Finally, I personally use dask on daily work related to the XData project. This tends to drive some of the newer features.
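
As a hypothetical sketch of item 2 (the filter, image size, and chunk shape here are illustrative choices, not from the post):

```python
import numpy as np
from skimage.filters import gaussian
from skimage.util import apply_parallel

# Smooth a large image in 1024x1024 chunks; depth=8 gives each chunk
# an overlapping border so the filter behaves correctly at chunk edges.
image = np.random.random((4096, 4096))
smoothed = apply_parallel(gaussian, image, chunks=(1024, 1024), depth=8)
```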

A few other groups pop up on GitHub from time to time; I’d love to know more detail about how people use dask.

What works and what doesn’t

Dask is modular: each of the collections and each of the schedulers is effectively a separate project. These subprojects are at different states of development. Knowing the stability of each subproject can help you determine how you use and depend on dask.

Dask.array and dask.threaded work well, are stable, and see constant use. They receive relatively minor bug reports, which are dealt with swiftly.

Dask.bag and dask.multiprocessing undergo more API churn but are mostly ready for public use, with a couple of caveats. Neither dask.dataframe nor dask.distributed is ready for public use; they undergo significant API churn and have known errors.

Current work

The current state of development as I see it is as follows:

  1. Dask.bag and dask.dataframe are progressing nicely. My personal work depends on these modules, so they see a lot of attention.
  2. At the moment I focus on grouping and join operations through fast shuffles; I hope to write about this problem soon.
  3. The Pandas API is large and complex. Reimplementing a subset of it in a blocked way is straightforward but also detailed and time consuming. This would be a great place for community contributions.
  4. Dask.distributed is new. It needs its tires kicked, but it’s an exciting development. For deployment we’re planning to bootstrap off of IPython parallel, which already has decent coverage of many parallel job systems (see #208 by Blake).
  5. Dask.array development these days focuses on outreach. We’ve found application domains where dask is very useful; we’d like to find more.
  6. The collections (Array, Bag, DataFrame) don’t cover all cases. I would like to start finding uses for the task schedulers in isolation. They serve as a release valve in complex situations.

More information

You can install dask with conda

conda install dask

or with pip

pip install dask
or
pip install dask[array]
or
pip install dask[bag]

You can read more about dask at the docs or on GitHub.