This work is supported by Anaconda Inc.
I’m pleased to announce the release of Dask version 0.18.0. This is a major release with breaking changes and new features. The last release was 0.17.5 on May 4th. This blogpost outlines notable changes since the last release blogpost for 0.17.2 on March 21st.
You can conda install Dask:
conda install dask
or pip install from PyPI:
pip install dask[complete] --upgrade
Full changelogs are available here:
We list some breaking changes below, followed by changes that are less important, but still fun.
The Dask core library is nearing a 1.0 release. Before that happens, we need to do some housecleaning. This release starts that process, replaces some existing interfaces, and builds up some needed infrastructure. Almost all of the changes in this release include clean deprecation warnings, but future releases will remove the old functionality, so now would be a good time to check in.
As happens with any release that starts breaking things, many other smaller breaks get added on as well. I’m personally very happy with this release because many aspects of using Dask now feel a lot cleaner; however, heavy users of Dask will likely experience mild friction. Hopefully this post helps explain some of the larger changes.
Taking full advantage of Dask sometimes requires user configuration, especially in a distributed setting. This might be to control logging verbosity, specify cluster configuration, provide credentials for security, or any of several other options that arise in production.
We’ve found that different computing cultures like to specify configuration in several different ways: with configuration files, with environment variables, or directly within Python code at runtime.
Previously this was handled with a variety of different solutions among the different dask subprojects. The dask-distributed project had one system, dask-kubernetes had another, and so on.
Now we centralize configuration in the dask.config module, which collects configuration from config files, environment variables, and runtime code, and makes it centrally available to all Dask subprojects. A number of Dask subprojects (dask.distributed, dask-kubernetes, and dask-jobqueue) are being co-released at the same time to take advantage of this.
If you were actively using Dask.distributed’s configuration files, note that some things have changed.
However, your old configuration files will still be found and their values will be used appropriately. We don’t make any attempt to migrate your old config values to the new location though. You may want to delete the auto-generated ~/.dask/config.yaml file at some point, if you felt like being particularly clean.
You can learn more about Dask’s configuration in Dask’s configuration documentation.
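As a brief, hedged illustration (the keys and values below are only examples; array.chunk-size appears again later in this post), configuration can be read and set from Python in addition to YAML files and DASK_-prefixed environment variables:

import dask

# Values are merged from YAML config files, DASK_* environment variables,
# and anything set at runtime in Python
dask.config.get('array.chunk-size')            # e.g. '128MiB'

# Set a value for the session, or temporarily within a block
dask.config.set({'array.chunk-size': '256MiB'})
with dask.config.set(scheduler='threads'):
    ...  # code here runs with the threaded scheduler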
Dask can execute code with a variety of scheduler backends based on threads, processes, single-threaded execution, or distributed clusters.
Previously, users selected between these backends using the somewhat generically named get= keyword:
x.compute(get=dask.threaded.get)
x.compute(get=dask.multiprocessing.get)
x.compute(get=dask.local.get_sync)
We’ve replaced this with a newer, and hopefully clearer, scheduler= keyword:
x.compute(scheduler='threads')
x.compute(scheduler='processes')
x.compute(scheduler='single-threaded')
The get= keyword has been deprecated and will raise a warning. It will be removed entirely in the next major release.
For more information, see documentation on selecting different schedulers.
Related to the configuration changes, we now include runtime state in the configuration. Previously people used to set runtime state with the dask.set_options context manager. Now we recommend using dask.config.set:
with dask.set_options(scheduler='threads'):  # Before
    ...

with dask.config.set(scheduler='threads'):  # After
    ...
The dask.set_options function is now an alias to dask.config.set.
This was unadvertised and saw very little use. All functionality (and much more) is now available in Dask-ML.
Dask.array now supports Numpy-style Generalized Universal Functions (gufuncs) transparently. This means that you can apply normal Numpy GUFuncs, like eig in the example below, directly onto a Dask array:
import dask.array as da
import numpy as np
# Apply a Numpy GUFunc, eig, directly onto a Dask array
x = da.random.normal(size=(10, 10, 10), chunks=(2, 10, 10))
w, v = np.linalg._umath_linalg.eig(x, output_dtypes=(float, float))
# w and v are dask arrays with eig applied along the latter two axes
Numpy has gufuncs for many of its internal functions, but they haven’t yet decided to expose these in the public API. Additionally, we can define gufuncs with other projects, like Numba:
import numba
from numba import float64

# An element-wise function compiled by Numba into a Numpy ufunc
@numba.vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

z = f(x, y)  # if x and y are dask arrays, then z will be too
What I like about this is that Dask and Numba developers didn’t coordinate at all on this feature; it’s just that they both support the Numpy GUFunc protocol, so you get interactions like this for free.
For more information see Dask’s GUFunc documentation. This work was done by Markus Gonser (@magonser).
Dask arrays now accept a value, “auto”, wherever a chunk value would previously be accepted. This asks Dask to rechunk those dimensions to achieve a good default chunk size.
x = x.rechunk({
    0: x.shape[0],  # single chunk in this dimension
    # 1: 100e6 / x.dtype.itemsize / x.shape[0],  # before, we had to calculate this manually
    1: 'auto',      # now we allow this dimension to adjust to a good chunk size
})
# or
x = da.from_array(img, chunks='auto')
This also checks the array.chunk-size config value to determine a good default chunk size:
>>> dask.config.get('array.chunk-size')
'128MiB'
To be clear, this doesn’t support “automatic chunking”, which is a very hard problem in general. Users still need to be aware of their computations and how they want to chunk; this just makes it marginally easier to make good decisions.
Dask.array gained a full einsum implementation thanks to Simon Perkins.
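As a rough sketch of usage (the shapes and the contraction below are arbitrary examples, not taken from the original announcement):

import dask.array as da

a = da.random.random((2000, 3000), chunks=(1000, 1000))
b = da.random.random((3000, 1000), chunks=(1000, 1000))

# Standard einsum notation, evaluated lazily chunk-by-chunk
c = da.einsum('ij,jk->ik', a, b)
c.compute()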
Also, Dask.array’s QR decompositions have become nicer in two ways.
This work is greatly appreciated and was done by Jeremy Chan.
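For context, here is a minimal, hedged sketch of a tall-and-skinny QR decomposition in Dask; the shape and chunking are placeholders, and the improvements mentioned above concern how such decompositions behave internally:

import dask.array as da

# A tall-and-skinny array, chunked only along the first axis
x = da.random.random((100000, 20), chunks=(10000, 20))

q, r = da.linalg.qr(x)
q.compute()
r.compute()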
Native support for the Zarr format for chunked n-dimensional arrays landed, thanks to Martin Durant and John A Kirkham. Zarr has been especially useful due to its speed, simple spec, support of the full NetCDF style conventions, and amenability to cloud storage.
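A hedged sketch of round-tripping a Dask array through a local Zarr store (the path and shapes below are placeholders):

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Write to a local Zarr store, one Zarr chunk per Dask chunk
da.to_zarr(x, 'example-data.zarr')

# Read it back lazily as a Dask array
y = da.from_zarr('example-data.zarr')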
As usual, Dask Dataframes had many small improvements. Of note is continued compatibility with the just-released Pandas 0.23, and some new data ingestion formats.
Dask.dataframe is consistent with changes in the recent Pandas 0.23 release, thanks to Tom Augspurger.
Dask.dataframe has grown a reader for the Apache ORC format.
ORC is a format for tabular data storage that is common in the Hadoop ecosystem. The new dd.read_orc function parallelizes around similarly new ORC functionality within PyArrow. Thanks to Jim Crist for the work on the Arrow side and Martin Durant for parallelizing it with Dask.
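A minimal sketch of usage, assuming ORC files at a hypothetical path with hypothetical column names:

import dask.dataframe as dd

# Read a collection of ORC files into a single Dask dataframe
df = dd.read_orc('myfiles.*.orc', columns=['name', 'amount'])
df.amount.sum().compute()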
Dask.dataframe has also grown a reader for JSON files.
The dd.read_json function matches most of the pandas.read_json API.
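For example, a hedged sketch reading newline-delimited JSON files at a hypothetical path:

import dask.dataframe as dd

# Keywords mirror pandas.read_json
df = dd.read_json('records.*.json', orient='records', lines=True)
df.head()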
The JSON reader came about shortly after a recent PyCon 2018 talk comparing Spark and Dask dataframes, where Irina Truong mentioned that it was missing. Thanks to Martin Durant and Irina Truong for this contribution.
See the dataframe data ingestion documentation for more information about JSON, ORC, or any of the other formats supported by Dask.dataframe.
The Joblib library for parallel computing within Scikit-Learn has had a Dask backend for a while now. While it has always been pretty easy to use, it’s now becoming much easier to use well without much expertise. After using this in practice for a while together with the Scikit-Learn developers, we’ve identified and smoothed over a number of usability issues. These changes will only be fully available after the next Scikit-Learn release (hopefully soon), at which point we’ll probably release a new blogpost dedicated to the topic.
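As a rough, illustrative sketch of the pattern (the estimator, parameters, and data below are made up, and the exact backend registration has shifted across joblib and scikit-learn versions, so treat this as a sketch rather than the finalized API described above):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # connect to (or start) a Dask cluster
X, y = make_classification(n_samples=1000, n_features=20)

search = GridSearchCV(RandomForestClassifier(),
                      {'n_estimators': [10, 50, 100]})

# Route scikit-learn's joblib-parallel work through the Dask cluster
with joblib.parallel_backend('dask'):
    search.fit(X, y)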
This release is timed with releases of the following packages: dask.distributed, dask-kubernetes, and dask-jobqueue.
There is also a new repository for deploying applications on YARN (a job scheduler common in Hadoop environments) called skein. Early adopters welcome.
Since March 21st, the following people have contributed to the following repositories:
The core Dask repository for parallel algorithms:
The dask/distributed repository for distributed computing:
The dask-kubernetes repository for deploying Dask on Kubernetes:
The dask-jobqueue repository for deploying Dask on HPC job schedulers:
The dask-ml repository for scalable machine learning:
Thanks to Scott Sievert and James Bourbeau for their help editing this article.