I’m pleased to announce the release of Dask version 0.15.0. This release contains performance and stability enhancements as well as some breaking changes. This blogpost outlines notable changes since the last release on May 5th.
As always you can conda install Dask:
conda install dask distributed
or pip install from PyPI
pip install dask[complete] --upgrade
Conda packages are available both on the defaults and conda-forge channels.
Full changelogs are available here:
Some notable changes follow.
Thanks to recent changes in NumPy 1.13.0, NumPy ufuncs now operate as Dask.array ufuncs. Previously they would convert their arguments into NumPy arrays and then operate concretely.
import dask.array as da
import numpy as np

x = da.arange(10, chunks=(5,))

>>> np.negative(x)  # previously: converted to a concrete NumPy array
array([ 0, -1, -2, -3, -4, -5, -6, -7, -8, -9])

>>> np.negative(x)  # now: stays lazy as a dask array
dask.array<negative, shape=(10,), dtype=int64, chunksize=(5,)>
To celebrate this change we’ve also improved support for more of the NumPy ufunc and reduction API, such as support for out parameters. This means that a non-trivial subset of the actual NumPy API works directly out of the box with dask.arrays. This makes it easier to write code that seamlessly works with either array type.
Note: the ufunc feature requires that you update NumPy to 1.13.0 or later. Packages are available through PyPI and conda on the defaults and conda-forge channels.
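As a concrete illustration, a function written once against the NumPy API can run unchanged on either array type. This is a minimal sketch; the `normalize` helper below is hypothetical, not part of either library:

```python
import numpy as np
import dask.array as da

def normalize(a):
    # Written purely against the NumPy API; no dask-specific code
    return (a - a.mean()) / a.std()

x = np.arange(10, dtype='f8')                 # concrete NumPy array
dx = da.arange(10, chunks=(5,), dtype='f8')   # lazy dask array

eager = normalize(x)    # a NumPy array, computed immediately
lazy = normalize(dx)    # a dask array, computed on demand
```

Calling `lazy.compute()` materializes the same values that `eager` already holds.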
The Dask.distributed API is capable of operating within a Tornado or Asyncio event loop, which can be useful when integrating with other concurrent systems like web servers or when building some more advanced algorithms in machine learning and other fields. The API to do this used to be somewhat hidden and only known to a few, and used underscores to signify that methods were asynchronous.
client = Client(start=False)
future = client.submit(func, *args)
result = await client._gather(future)
These methods are still around, but the process of starting the client has changed and we now recommend using the fully public methods even in asynchronous situations (these used to block).
client = await Client(asynchronous=True)
future = client.submit(func, *args)
result = await client.gather(future) # no longer use the underscore
You can also await futures directly:
result = await future
You can use yield instead of await if you prefer Python 2.
More information is available at https://distributed.readthedocs.org/en/latest/asynchronous.html.
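Putting the pieces together, a minimal end-to-end coroutine might look like the following. This is a sketch: the `inc` function and the `processes=False` local-cluster setting are illustrative assumptions, not part of the release notes.

```python
import asyncio
from dask.distributed import Client

def inc(x):
    return x + 1

async def main():
    # asynchronous=True keeps the client from blocking the event loop;
    # with no address given, this starts a local cluster in-process
    client = await Client(asynchronous=True, processes=False)
    future = client.submit(inc, 10)
    result = await client.gather(future)  # fully public method, no underscore
    await client.close()
    return result

result = asyncio.run(main())
```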
The single-machine scheduler used to live in the dask.async module. With async becoming a keyword since Python 3.5 we’re forced to rename this. You can now find the code in dask.local. This will particularly affect anyone who was using the single-threaded scheduler, previously known as dask.async.get_sync. The term dask.get can be used to reliably refer to the single-threaded base scheduler across versions.
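For example, a minimal sketch of the rename in practice (the tiny hand-built task graph below is purely illustrative):

```python
import dask
from dask.local import get_sync   # previously dask.async.get_sync

def inc(x):
    return x + 1

# A tiny hand-built task graph, executed single-threaded
dsk = {'x': 1, 'y': (inc, 'x')}
result = get_sync(dsk, 'y')

# dask.get refers to the same single-threaded scheduler across versions
same = dask.get(dsk, 'y')
```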
Early blogposts referred to functions like futures_to_dask_array, which resided in the distributed.collections module. These have since been entirely replaced by better interactions between Futures and Delayed objects. This module has been removed entirely.
Dask workers create a directory where they can place temporary files. Typically this goes into your operating system’s temporary directory (/tmp on Linux and Mac).
Some users on network file systems specify this directory explicitly with the dask-worker ... --local-directory option, pointing to a better location such as a local SSD drive. Previously Dask would dump files directly into the provided directory. Now it will create a new subdirectory and place files there. This tends to be much more convenient for users on network file systems.
$ dask-worker scheduler-address:8786 --local-directory /scratch
$ ls /scratch
worker-1234/
$ ls /scratch/worker-1234/
user-script.py disk-storage/ ...
Previously the map method would inspect functions and automatically expand tuples to fill arguments:
import dask.bag as db
b = db.from_sequence([(1, 10), (2, 20), (3, 30)])
>>> b.map(lambda x, y: x + y).compute()
[11, 22, 33]
While convenient, this behavior gave rise to corner cases and stopped us from being able to support multi-bag mapping functions. It has since been removed. As an advantage though, you can now map two co-partitioned bags together.
a = db.from_sequence([1, 2, 3])
b = db.from_sequence([10, 20, 30])
>>> db.map(lambda x, y: x + y, a, b).compute()
[11, 22, 33]
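If you previously relied on the automatic tuple expansion, the equivalent single-bag version now unpacks explicitly inside the function (a minimal sketch):

```python
import dask.bag as db

b = db.from_sequence([(1, 10), (2, 20), (3, 30)])

# Tuples are no longer expanded into arguments; unpack them yourself
result = b.map(lambda xy: xy[0] + xy[1]).compute()
```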
Clients and Futures have nicer HTML reprs that show up in the Jupyter notebook.
And the dashboard stays a decent width and has a new navigation bar with links to other dashboard pages. This template is now consistently applied to all dashboard pages.
When using Dask to power Joblib computations (such as those that occur in Scikit-Learn) with the joblib.parallel_backend context manager, you can now pre-scatter select data to all workers. This can significantly speed up some scikit-learn computations by reducing repeated data transfer.
import distributed.joblib  # registers the dask.distributed joblib backend
from sklearn.externals.joblib import parallel_backend

# `search` is an estimator (e.g. a GridSearchCV) and `digits` a dataset,
# both defined elsewhere.
# Serialize the training data only once to each worker
with parallel_backend('dask.distributed', scheduler_host='localhost:8786',
                      scatter=[digits.data, digits.target]):
    search.fit(digits.data, digits.target)
As usual, a number of bugs were identified and resolved and a number of performance optimizations were implemented. Thank you to all users and developers who continue to help identify and implement areas for improvement. Users should generally have a smoother experience.
We have removed the experimental ZeroMQ networking backend. This was not particularly useful in practice. However, it was very effective in serving as an example while we were making our network communication layer pluggable with different protocols.
The following related projects have also been released recently and may be worth updating:
The following people contributed to the dask/dask repository since the 0.14.3 release on May 5th:
The following people contributed to the dask/distributed repository since the 1.16.2 release on May 5th: