Dec 21, 2015

Dask is one year old

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Dask turned one yesterday. We discuss success and failures.

Dask began one year ago yesterday with the following commit (with slight edits here for clarity’s sake).

def istask(x):
    # A task is a tuple whose first element is callable, e.g. (add, 'x', 'y')
    return isinstance(x, tuple) and x and callable(x[0])


def get(d, key):
    # Recursively evaluate a key in the graph
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v

... (and around 50 lines of tests)

this is a very inefficient scheduler
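To make that concrete, here is that first scheduler running a small hand-written graph (the graph itself is my own illustration; istask and get are repeated from the commit above so the snippet runs on its own):

```python
from operator import add, mul

def istask(x):
    # A task is a tuple whose first element is callable, e.g. (add, 'x', 'y')
    return isinstance(x, tuple) and x and callable(x[0])

def get(d, key):
    # Recursively evaluate a key in the graph
    v = d[key]
    if istask(v):
        func, args = v[0], v[1:]
        return func(*[get(d, arg) for arg in args])
    else:
        return v

# A dask graph: a plain dict mapping keys to values or tasks
d = {'x': 1,
     'y': 2,
     'z': (add, 'x', 'y'),   # z = x + y
     'w': (mul, 'z', 'z')}   # w = z * z

result = get(d, 'w')
```

Note that 'z' is recomputed once per reference; that kind of redundant work is part of why this first scheduler was so inefficient.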

Since then dask has matured, expanded to new domains, gathered excellent developers, and spawned other open source projects. I thought it’d be a good time to look back on what worked, what didn’t, and what we should work on in the future.

Collections

Most users experience dask through the high-level collections of dask.array/bag/dataframe/imperative. Each of these evolves as a project of its own with different user groups and different levels of maturity.

dask.array

The parallel larger-than-memory array module dask.array has seen the most success of the dask components. It is the oldest, most mature, and most sophisticated subproject. Much of dask.array’s use comes from downstream projects, notably xray, which seems to have taken off in climate science. Dask.array also sees a fair amount of use in imaging, genomics, and numerical algorithms research.

People that I don’t know now use dask.array to do scientific research. From my perspective that’s mission accomplished.

There are still tweaks to make to algorithms, particularly as we scale out to distributed systems (see far below).
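For readers who haven’t seen it, the flavor of dask.array in a few lines (a minimal sketch assuming dask is installed; the array and chunk sizes here are arbitrary):

```python
import dask.array as da

# A ten-element array split into two five-element chunks
x = da.arange(10, chunks=5)

# Operations like this build a task graph rather than executing immediately
y = (x + 1).sum()

# .compute() hands the graph to the scheduler
total = int(y.compute())
```

Each chunk is an ordinary NumPy array, so the work parallelizes across chunks and never requires the whole array in memory at once.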

dask.bag

Dask.bag started out as a weekend project and didn’t evolve much beyond that. Fortunately there wasn’t much to do and this submodule probably has the highest value/effort ratio.

Bag doesn’t get as much attention as its older sibling array though. It’s handy but not as well used and so not as robust.

dask.dataframe

Dataframe is an interesting case: it’s pretty sophisticated and pretty mature, yet it probably generates the most user frustration.

Dask.dataframe gains a lot of value by leveraging Pandas both under the hood (one dask DataFrame is many pandas DataFrames) and by copying its API (Pandas users can use dask.dataframe without learning a new API). However, because dask.dataframe only implements a core subset of Pandas, users end up tripping up on the missing functionality.

This can be decomposed into two issues:

  1. It’s not clear that there exists a core subset of Pandas that would handle most use cases. Users touch many diffuse parts of Pandas in a single workflow. What one user considers core another user considers fringe. It’s not clear how to agree on a sufficient subset to implement.
  2. Once you implement this subset (and we’ve done our best) it’s hard to convey expectations to the user about what is and is not available.

That being said, dask.dataframe is pretty solid. It’s very fast, expressive, and handles common use cases well. It probably generates the most StackOverflow questions. This signals both confusion and active use.

Special thanks here go out to Jeff Reback for making Pandas release the GIL, and to Masaaki Horikoshi (@sinhrks) for greatly improving the maturity of dask.dataframe.

dask.imperative

Also known as dask.do, this little backend remains one of the most powerful and one of the least used (outside of myself). We should rethink the API here and improve learning materials.
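The idea behind dask.imperative is to build graphs like the one in the first commit from ordinary-looking function calls. A toy sketch of that mechanism (my own simplification, not dask’s actual implementation):

```python
import uuid

graph = {}

def do(func):
    """Record a task in the graph instead of calling func immediately."""
    def wrapped(*args):
        key = '%s-%s' % (func.__name__, uuid.uuid4().hex[:8])
        graph[key] = (func,) + args   # args may be values or other keys
        return key
    return wrapped

def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])

def get(d, key):
    # The recursive scheduler from the first commit, with one tweak:
    # arguments that aren't keys in the graph pass through as literals
    v = d[key] if key in d else key
    if istask(v):
        return v[0](*[get(d, a) for a in v[1:]])
    return v

@do
def inc(x):
    return x + 1

@do
def add(x, y):
    return x + y

a = inc(1)        # returns a key and records a task
b = inc(2)
c = add(a, b)     # the graph now holds three tasks
result = get(graph, c)
```

The appeal is that users write plain function calls and get a custom task graph for free, without learning a collection API first.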

General thoughts on collections

Warning: this section is pretty subjective

Big data collections are cool but perhaps less useful than people expect. Parallel applications are often more complex than can be easily described by a big array or a big dataframe. Many real-world parallel computations end up being more particular in their parallelism needs. That’s not to say that the array and dataframe abstractions aren’t central to parallel computing; it’s just that we should not restrict ourselves to them. The world is more complex.

However, it’s reasonable to break this “world is complex” rule within particular domains. NDArrays seem to work well in climate science. Specialized large dataframes like Dato’s SFrame seem to be effective for a particular class of machine learning algorithms. The SQL table is inarguably an effective abstraction in business intelligence. Large collections are useful in specific contexts, but they are perhaps the focus of too much attention. The big dataframe in particular is over-hyped.

Most of the really novel and impressive work I’ve seen with dask has been done either with custom graphs or with the dask.imperative API. I think we should consider APIs that enable users to more easily express custom algorithms.

Avoid Parallelism

When giving talks on parallelism I’ve started to give a brief “avoid parallelism” section. From the problems I see on StackOverflow and from general interactions, when people run into performance challenges their first solution seems to be to parallelize. This is sub-optimal. It’s often far cheaper to improve storage formats, use better algorithms, or use C/Numba-accelerated code than it is to parallelize. Unfortunately storage formats and C aren’t as sexy as big data parallelism, so they’re not in the forefront of people’s minds. We should change this.
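A toy illustration of the “better algorithms” point (my own example, not from the post): before parallelizing a loop, check whether a cheaper algorithm removes the loop entirely.

```python
# Tempting to parallelize: an O(n) loop over a million integers
def slow_total(n):
    return sum(range(1, n + 1))

# The cheaper fix: a closed-form formula.  O(1), no cores required.
def fast_total(n):
    return n * (n + 1) // 2

same = slow_total(1_000_000) == fast_total(1_000_000)
```

No cluster will ever make the loop competitive with not doing the work at all; the same logic applies to choosing a binary storage format over parsing CSV on every run.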

I’ll proudly buy a beer for anyone that helps to make storage formats a sexiertopic.

Scheduling

Single Machine

The single machine dynamic task scheduler is very, very solid. It has roughly two objectives:

  1. Use all the cores of a machine
  2. Choose tasks that allow the release of intermediate results

This is what allows us to quickly execute complex workflows in small space. This scheduler underlies all execution within dask. I’m very happy with it. I would like to find ways to expose it more broadly to other libraries. Suggestions are very welcome here.
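A sketch of the second objective (my own simplification, not the real scheduler): run tasks in dependency order and delete each intermediate result as soon as its last consumer has run.

```python
from operator import add

def istask(x):
    return isinstance(x, tuple) and x and callable(x[0])

def execute_releasing(dsk, key):
    """Evaluate `key`, freeing each intermediate after its last use."""
    # Count how many tasks consume each key
    consumers = {k: 0 for k in dsk}
    for v in dsk.values():
        if istask(v):
            for a in v[1:]:
                if a in consumers:
                    consumers[a] += 1

    cache, released = {}, []

    def run(k):
        if k not in dsk:
            return k                      # literal argument
        if k not in cache:
            v = dsk[k]
            cache[k] = v[0](*[run(a) for a in v[1:]]) if istask(v) else v
        result = cache[k]
        consumers[k] -= 1
        if consumers[k] == 0:             # its last consumer just ran
            del cache[k]                  # free the intermediate
            released.append(k)
        return result

    return run(key), released

def inc(x):
    return x + 1

# A diamond-shaped graph: 'a' feeds both 'b' and 'c'
dsk = {'a': 1,
       'b': (inc, 'a'),
       'c': (inc, 'a'),
       'd': (add, 'b', 'c')}

value, released = execute_releasing(dsk, 'd')
```

The real scheduler goes further by also choosing *which* ready task to run next so that intermediates become releasable sooner, which is what keeps memory small on complex workflows.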

We still run into cases where it doesn’t perform optimally (see issue 874), but so far we’ve always been able to enhance the scheduler whenever these cases arise.

Distributed Cluster

Over the last few months we’ve been working on another scheduler for distributed memory computation. It should be a nice extension of the existing dask collections out to “big data” systems. It’s experimental but usable now, with documentation at the following links:

Feedback is welcome. I recommend waiting for a month or two if you prefer clean and reliable software. It will undergo a name-change to something less generic.