This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation.
I'm pleased to announce the release of Dask version 0.17.2. This is a minor release with new features and stability improvements. This blogpost outlines notable changes since the 0.17.0 release on February 12th.
You can conda install Dask:
conda install dask
or pip install from PyPI:
pip install dask[complete] --upgrade
Full changelogs are available here:
Some notable changes follow:
Tornado is a popular framework for concurrent network programming that Dask relies on heavily. Tornado recently released a major version update that included both some major features for Dask and a couple of bugs.
The new IOStream.read_into method allows Dask communications (or anyone using this API) to move large datasets more efficiently over the network with fewer copies. This enables Dask to take advantage of the high performance networking available on modern super-computers. On the Cheyenne system, where we tested this, we were able to get the full 3 GB/s bandwidth available through the Infiniband network with this change (when using a few worker processes).
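For the curious, a minimal sketch of the pattern (not Dask's internal code; the host, port, and message size are hypothetical) looks roughly like this:
from tornado.ioloop import IOLoop
from tornado.tcpclient import TCPClient

async def receive_frame(host, port, nbytes):
    # Connect and pre-allocate the destination buffer once
    stream = await TCPClient().connect(host, port)
    buf = bytearray(nbytes)
    # read_into fills the buffer in place, avoiding the extra copy
    # that reading into a fresh bytes object would make
    await stream.read_into(buf)
    stream.close()
    return buf

# IOLoop.current().run_sync(lambda: receive_frame('localhost', 8786, 100_000_000))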
Many thanks to Antoine Pitrou and Ben Darnell for their efforts on this.
At the same time there were some unforeseen issues in the update to Tornado 5.0. More pervasive use of bytearrays over bytes caused issues with compression libraries like Snappy and Python 2 that were not expecting these types. There is a brief window in distributed.__version__ == 1.21.3 that enables this functionality if Tornado 5.0 is present but will misbehave if Snappy is also present.
Dask leverages a file-system-like protocol for access to remote data. This is what makes commands like the following work:
import dask.dataframe as dd
df = dd.read_parquet('s3://...')
df = dd.read_parquet('hdfs://...')
df = dd.read_parquet('gcs://...')
We have now added http and https file systems for reading data directly from web servers. These also support random access if the web server supports range queries.
df = dd.read_parquet('https://...')
As with S3, HDFS, GCS, … you can also use these tools outside of Dask development. Here we read the first twenty bytes of the Pandas license:
from dask.bytes.http import HTTPFileSystem
http = HTTPFileSystem()
with http.open('https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE') as f:
print(f.read(20))
b'BSD 3-Clause License'
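Random access works the same way, assuming the server honors range requests; a brief sketch seeking past the first four bytes of the same file:
with http.open('https://raw.githubusercontent.com/pandas-dev/pandas/master/LICENSE') as f:
    f.seek(4)          # skip past 'BSD '
    print(f.read(8))
b'3-Clause'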
Thanks to Martin Durant who did this work and manages Dask's byte handling generally. See remote data documentation for more information.
We identified and resolved a correctness bug in dask.dataframe's shuffle that resulted in some rows being dropped during complex operations like joins and groupby-applies with many partitions.
See dask/dask #3201 for more information.
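For illustration, a hypothetical example of the kind of many-partition join that exercises the shuffle (the sizes and column names here are made up):
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({'key': range(10000), 'x': 1.0}), npartitions=100)
right = dd.from_pandas(pd.DataFrame({'key': range(10000), 'y': 2.0}), npartitions=100)

# Joining on a non-index column across many partitions shuffles data between partitions
joined = left.merge(right, on='key').compute()
assert len(joined) == 10000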
There are many Python subprojects that help you deploy Dask on different cluster resource managers like Yarn, SGE, Kubernetes, PBS, and more. These have all converged to have more-or-less the same API, which we have now combined into a consistent interface that downstream projects can inherit from in distributed.deploy.Cluster.
Now that we have a consistent interface we have started to invest more in improving the interface and intelligence of these systems as a group. This includes both pleasant IPython widgets like the following:
as well as improved logic around adaptive deployments. Adaptive deployments allow clusters to scale themselves automatically based on current workload. If you have recently submitted a lot of work the scheduler will estimate its duration and ask for an appropriate number of workers to finish the computation quickly. When the computation has finished the scheduler will release the workers back to the system to free up resources.
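As a rough sketch of what that shared interface looks like (using LocalCluster as a stand-in here; the deployment-specific projects expose similar methods, though availability may vary by project and version):
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)      # any Cluster subclass works the same way
client = Client(cluster)

cluster.scale(4)                         # ask for four workers explicitly
cluster.adapt(minimum=0, maximum=10)     # or let the cluster scale itself with the workload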
The logic here has improved substantially including the following:
Some news from related projects:
The following people contributed to the dask/dask repository since the 0.17.0 release on February 12th:
The following people contributed to the dask/distributed repository since the1.21.0 release on February 12th: