Jan 30, 2017

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-17 and 2017-01-30. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Micro-release of distributed scheduler
  2. as_completed for asynchronous algorithms
  3. Testing ZeroMQ and communication refactor
  4. Persist, and Dask-GLM

Stability enhancements and micro-release

We've released dask.distributed version 1.15.2, which includes some important performance improvements for communicating multi-dimensional arrays, cleaner scheduler shutdown of workers for adaptive deployments, an improved as_completed iterator that can accept new futures in flight, and a few other smaller features.

The feature here that excites me the most is improved communication of multi-dimensional arrays across the network. In a recent blogpost about image processing on a cluster we noticed that communication bandwidth was far lower than expected. This led us to uncover a flaw in our compression heuristics. Dask doesn't compress all data; instead it takes a few 10kB samples of the data, compresses them, and if that goes well, decides to compress the entire thing. Unfortunately, due to our mishandling of memoryviews we ended up taking much larger samples than intended when dealing with multi-dimensional arrays.
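
For readers curious what such a heuristic looks like, here is a minimal sketch of the sample-then-decide approach. It is not Dask's actual implementation; the function name, thresholds, and the use of zlib are all illustrative.

import zlib

def maybe_compress(payload: bytes, sample_size=10_000, min_ratio=0.9):
    # Illustrative sketch of a sample-then-decide compression heuristic.
    # Compress a small sample first; only compress the full payload if the
    # sample compressed well.  Constants are illustrative, not Dask's defaults.
    if len(payload) <= sample_size:
        return payload                              # too small to bother sampling
    # Slicing a memoryview gives a sample without copying the whole buffer;
    # the bug described above came from mishandling this step.
    sample = bytes(memoryview(payload)[:sample_size])
    if len(zlib.compress(sample)) < min_ratio * len(sample):
        return zlib.compress(payload)               # sample compressed well: compress it all
    return payload                                  # otherwise send uncompressed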

XArray release

This improvement is particularly timely because a new release of XArray (a project that wraps around Dask.array for large labeled arrays) is now available with better data ingestion support for NetCDF data on distributed clusters. This opens up distributed array computing to Dask's first (and possibly largest) scientific user community, atmospheric scientists.
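
As a rough illustration of what this looks like from the user side, the snippet below opens a hypothetical collection of NetCDF files as Dask-backed arrays and runs a reduction on a cluster; the file paths, variable name, chunk size, and scheduler address are all made up.

import xarray as xr
from dask.distributed import Client

client = Client('scheduler-address:8786')             # hypothetical scheduler address

# open_mfdataset lazily loads many NetCDF files as Dask-backed arrays;
# the chunks argument controls how the data is partitioned across the cluster.
ds = xr.open_mfdataset('data/*.nc', chunks={'time': 100})

# Computations stay lazy until explicitly evaluated on the cluster.
monthly_mean = ds['temperature'].groupby('time.month').mean()
result = monthly_mean.compute()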

as_completed accepts futures

In addition to arrays, dataframes, and the delayed decorator, Dask.distributed also implements the concurrent.futures interface from the standard library (except that Dask's version parallelizes across a cluster and has a few other benefits). Part of this interface is the as_completed function, which takes in a list of futures and yields those futures in the order in which they finish. This enables the construction of fairly responsive asynchronous computations. As soon as some work finishes you can look at the result and submit more work based on the current state.

That has been around in Dask for some time.

What's new is that you can now push more futures into as_completed:

from dask.distributed import as_completed  # assumes an existing Client named `client`

futures = client.map(my_function, sequence)

ac = as_completed(futures)
for future in ac:
    result = future.result()  # future is already finished, so this is fast
    if condition:
        new_future = client.submit(function, *args, **kwargs)
        ac.add(new_future)  # <<---- This is the new ability

So the as_completed iterator can keep going for quite a while with a set of futures always in flight. This relatively simple change allows for the easy expression of a broad set of asynchronous algorithms.

ZeroMQ and Communication Refactor

As part of a large effort, Antoine Pitrou has been refactoring Dask's communication system. One sub-goal of this refactor was to allow us to explore other transport mechanisms than Tornado IOStreams in a pluggable way.

One such alternative is ZeroMQ sockets. We've gotten both enthusiastic praise and stern warnings about using ZeroMQ. It's not a great fit because Dask mostly just does point-to-point communication, so we don't benefit from all of ZeroMQ's patterns, which become more of a hindrance than a benefit. However, we were interested in the performance impact of managing all of our network communication in a completely separately managed C++ thread unaffected by GIL issues.

Whether you hate or love ZeroMQ, you can now pick and choose. Antoine's branch allows for easy swapping between transport mechanisms and opens the possibility for more in the future, like intra-process communication with Queues, MPI, etc. This doesn't affect most users, but some Dask deployments are on supercomputers with exotic networks capable of very fast speeds. The possibility that we might tap into Infiniband someday and have the ability to manage data locally without copies (Tornado does not achieve this) is very attractive to some user communities.
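
To make that concrete, the intent is roughly the following: the transport is selected by the address scheme while user code stays the same. The zmq:// scheme below is hypothetical, and the exact spelling on Antoine's branch may differ.

from dask.distributed import Client

client = Client('tcp://scheduler:8786')      # default Tornado/TCP transport
# client = Client('zmq://scheduler:8786')    # experimental ZeroMQ transport (hypothetical scheme)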

After very preliminary benchmarking we've found that ZeroMQ offers a small speedup, but results in a lack of stability in the cluster under complex workloads (likely our fault, not ZeroMQ's, but still). ZeroMQ support is strictly experimental and will likely stay that way for a long time. Readers should not get excited about this.

Persist and Dask-GLM

In collaboration with researchers at Capital One we've been working on a set of parallel solvers for first and second order methods, such as are commonly used in a broad class of statistical and machine learning algorithms.
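
For a flavor of what these solvers look like, here is a minimal sketch of a first-order method (plain gradient descent for logistic regression) expressed with dask.array. This is illustrative only, not dask-glm's actual code; the step size and iteration count are arbitrary.

import dask.array as da

def gradient_descent(X, y, step=0.1, n_iter=10):
    # Minimal first-order solver sketch: logistic-regression gradient descent
    # written against the dask.array API.  Illustrative, not dask-glm.
    beta = da.zeros(X.shape[1], chunks=X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + da.exp(-X.dot(beta)))     # predicted probabilities
        grad = X.T.dot(p - y) / X.shape[0]     # gradient of the log-loss
        beta = beta - step * grad
    return beta.compute()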

One challenge in this process has been building algorithms that are simultaneously optimal for both the single machine and distributed schedulers. The distributed scheduler requires that we think about where data is, on the client or on the cluster, whereas for the single machine scheduler this is less of a problem. The distributed scheduler appropriately has a new verb, persist, which keeps data as a Dask collection, but triggers all of the internal computation:

compute(dask array) -> numpy array
persist(dask array) -> dask array

We have now mirrored this verb to the single machine scheduler in dask/dask#1927 and we get very nice performance on dask-glm's algorithms in both cases.
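
A small, made-up example of the difference in practice (the array sizes and chunking are arbitrary):

import dask.array as da

x = da.random.random((100000, 100), chunks=(10000, 100))

y = x.compute()   # concrete numpy array, pulled back to the client
x = x.persist()   # still a dask array, but its chunks are already computed
                  # and held in memory (on the cluster, or locally after
                  # dask/dask#1927 for the single machine scheduler)

# Later operations reuse the materialized chunks, which is exactly what
# iterative algorithms like those in dask-glm rely on.
result = (x - x.mean(axis=0)).sum().compute()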

Working with the developers at Capital One has been very rewarding. I would like to find more machine learning groups that fit the following criteria:

  1. Are focused on performance
  2. Need parallel computing
  3. Are committed to building good open source software
  4. Are sufficiently expert in their field to understand correct algorithms

If you know such a person or such a group, either in industry or just a grad student in university, please encourage them to raise an issue at http://github.com/dask/dask/issues/new. I will likely write a larger blogpost on this topic in the near future.