Sep 24, 2017

Dask Release 0.15.3

By

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.

I’m pleased to announce the release of Dask version 0.15.3. This release contains stability enhancements and bug fixes. This blogpost outlines notable changes since the 0.15.2 release on August 30th.

You can conda install Dask:

conda install -c conda-forge dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Conda packages are available on conda-forge channels. They will be on defaults in a few days.

Full changelogs are available here:

Some notable changes follow.

Masked Arrays

Dask.array now supports masked arrays similar to NumPy.

In [1]: import dask.array as da

In [2]: x = da.arange(10, chunks=(5,))

In [3]: mask = x % 2 == 0

In [4]: m = da.ma.masked_array(x, mask)

In [5]: m
Out[5]: dask.array<masked_array, shape=(10,), dtype=int64, chunksize=(5,)>

In [6]: m.compute()
Out[6]:
masked_array(data = [-- 1 -- 3 -- 5 -- 7 -- 9],
mask = [ True False True False True False True False True False],
fill_value = 999999)

This work was primarily done by Jim Crist and partially funded by the UK Met Office in support of the Iris project.

Constants in atop

Dask.array experts will be familiar with the atop function, which powers a non-trivial amount of dask.array and is commonly used by people building custom algorithms. This function now supports constants when the index given is None.

atop(func, 'ijk', x, 'ik', y, 'kj', CONSTANT, None)
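To make that concrete, here is a minimal, hypothetical sketch (the array and the function are made up for illustration): the constant is paired with None instead of an index string, so it is passed to the function unchanged rather than being chunked.

import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

# Pair the constant 5 with None so atop passes it straight through
# to the function instead of treating it as a blocked array.
y = da.atop(lambda block, c: block + c, 'ij', x, 'ij', 5, None, dtype=x.dtype)

y.compute()  # every element is 1 + 5 == 6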

Memory management for workers

Dask workers spill excess data to disk when they reach 60% of their allotted memory limit. Previously we only measured memory use by adding up the memory use of every piece of data produced by the worker. This could fail in a few situations:

  1. Our per-data estimates were faulty
  2. User code consumed a large amount of memory without our tracking it

To compensate, we now also periodically check the memory use of the worker using system utilities from the psutil module. We dump data to disk if the process rises above 70% use, stop running new tasks if it rises above 80%, and restart the worker if it rises above 95% (assuming that the worker has a nanny process).

Breaking Change: Previously the --memory-limit keyword to the dask-worker process specified the 60% “start pushing to disk” limit. So if you had 100GB of RAM then you previously might have started a dask-worker as follows:

dask-worker ... --memory-limit 60e9 # before specify 60% target

And the worker would start pushing to disk once it had 60GB of data in memory. However, now we are changing this meaning to be the full amount of memory given to the process.

dask-worker ... --memory-limit 100e9  # now specify 100% target
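As a rough worked example (using the 100GB figure above and the percentages quoted earlier in this section; the byte values are just illustrative arithmetic), the thresholds for such a worker come out to:

memory_limit = 100e9                 # full memory given to the process

target    = 0.60 * memory_limit      # 60 GB: start spilling tracked data to disk
spill     = 0.70 * memory_limit      # 70 GB: spill based on process memory (psutil)
pause     = 0.80 * memory_limit      # 80 GB: stop running new tasks
terminate = 0.95 * memory_limit      # 95 GB: nanny restarts the worker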

Of course, you don’t have to specify this limit (many don’t). It will be chosen for you automatically. If you’ve never cared about this then you shouldn’t start caring now.

More about memory management here: http://distributed.readthedocs.io/en/latest/worker.html?highlight=memory-limit#memory-management

Statistical Profiling

Workers now poll their worker threads every 10ms and keep a running count of which functions are being used. This information is available on the diagnostic dashboard as a new “Profile” page. It provides information that is orthogonal to, and generally more detailed than, the typical task-stream plot.

These plots are available on each worker, and an aggregated view is available on the scheduler. The timeseries on the bottom allows you to select time windows of your computation to restrict the parallel profile.
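For readers curious how this style of profiling works in general, here is a rough, self-contained sketch of the sampling idea (illustrative only, not Dask’s implementation): a background thread periodically inspects another thread’s stack and tallies the function names it finds.

import sys
import threading
import time
from collections import Counter

counts = Counter()

def sample(thread_id, interval=0.010, duration=1.0):
    # Poll the target thread's current stack every `interval` seconds
    # and count every function name found on it.
    end = time.time() + duration
    while time.time() < end:
        frame = sys._current_frames().get(thread_id)
        while frame is not None:
            counts[frame.f_code.co_name] += 1
            frame = frame.f_back
        time.sleep(interval)

def work():
    total = 0
    for i in range(5_000_000):
        total += i * i
    return total

sampler = threading.Thread(target=sample, args=(threading.main_thread().ident,))
sampler.start()
work()
sampler.join()
print(counts.most_common(5))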

More information about diagnosing performance is available here: http://distributed.readthedocs.io/en/latest/diagnosing-performance.html

Acknowledgements

The following people contributed to the dask/dask repository since the 0.15.2 release on August 30th:

  • Adonis
  • Christopher Prohm
  • Danilo Horta
  • jakirkham
  • Jim Crist
  • Jon Mease
  • jschendel
  • Keisuke Fujii
  • Martin Durant
  • Matthew Rocklin
  • Tom Augspurger
  • Will Warner

The following people contributed to the dask/distributed repository since the 1.18.3 release on September 2nd:

  • Casey Law
  • Edrian Irizarry
  • Matthew Rocklin
  • rbubley
  • Tom Augspurger
  • ywangd