This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
Dask just released version 0.14.0. This release contains some significant internal changes as well as the usual set of increased API coverage and bug fixes. This blogpost outlines some of the major changes since the last release on January 27th, 2017.
You can install new versions using Conda or Pip
conda install -c conda-forge dask distributed
or
pip install dask[complete] distributed --upgrade
Dask collections (arrays, bags, dataframes, delayed) hold onto task graphs that have all of the tasks necessary to create the desired result. For larger datasets or complex calculations these graphs may have thousands, or sometimes even millions of tasks. In some cases the overhead of handling these graphs can become significant.
This is especially true because dask collections don't modify their graphs in place; they make new graphs with updated computations. Copying graph data structures with millions of nodes can take seconds and interrupt interactive workflows.
To address this, dask.array and dask.delayed collections now use special graph data structures with structural sharing. This significantly cuts down on the amount of overhead when building repetitive computations.
import dask.array as da
x = da.ones(1000000, chunks=(1000,)) # 1000 chunks of size 1000
With the previous release, 0.13.0:

%time for i in range(100): x = x + 1
CPU times: user 2.69 s, sys: 96 ms, total: 2.78 s
Wall time: 2.78 s
With this release, 0.14.0:

%time for i in range(100): x = x + 1
CPU times: user 756 ms, sys: 8 ms, total: 764 ms
Wall time: 763 ms
The difference in this toy problem is moderate, but for real-world cases this difference can grow fairly large. This was also one of the blockers identified by the climate science community stopping them from handling petabyte-scale analyses.
We chose to roll this out for arrays and delayed first just because those are the two collections that typically produce large task graphs. Dataframes and bags remain as before for the time being.
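To make the "new graph rather than in-place modification" point above concrete, here is a minimal sketch. It assumes the 0.14-era public attributes .name (the collection's key prefix) and .dask (its task-graph mapping); the exact task counts are illustrative, not guaranteed.

import dask.array as da

x = da.ones(1000000, chunks=(1000,))  # roughly 1000 tasks, one per chunk
y = x + 1                             # builds a new collection; x is untouched

# Each collection has its own output name and its own graph.
assert x.name != y.name

# y's graph holds all of x's tasks plus the new addition tasks.  With
# structural sharing the part inherited from x is referenced rather than
# copied, which is what makes long chains of updates cheap to build.
assert len(y.dask) > len(x.dask)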
Dask communicates over TCP sockets. It uses Tornado's IOStreams to handle non-blocking communication, framing, etc. We've run into some performance issues with Tornado when moving large amounts of data. Some of this has been improved upstream in Tornado directly, but we still want the ability to optionally drop Tornado's byte-handling communication stack in the future. This is especially important as dask gets used in institutions with faster and more exotic interconnects (supercomputers). We've been asked a few times to support other transport mechanisms like MPI.
The first step (and probably hardest step) was to make Dask's communication system pluggable so that we can use different communication options without significant source-code changes. We managed this a month ago and now it is possible to add other transports to Dask relatively easily. TCP remains the only real choice today, though there is also an experimental ZeroMQ option (which provides little-to-no performance benefit over TCP) as well as a fully in-memory option in development.
For users, the main difference you'll see is that tcp:// is now prepended to addresses in many places. For example:
$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.115:8786
...
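On the client side nothing else changes; the address printed above can be passed along as-is. A quick sketch, assuming a scheduler is actually running at that address:

from dask.distributed import Client

# Connect using the fully qualified tcp:// address shown by the scheduler.
client = Client('tcp://192.168.1.115:8786')

# A bare host:port string should still work and defaults to the TCP transport.
# client = Client('192.168.1.115:8786')

print(client)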
As usual, the Pandas API has been more fully covered by community contributors. Some representative changes include the following:
There have also been a wide variety of changes to the distributed system. I'll include a representative sample here to give a flavor of what has been happening:
There are a number of smaller details not mentioned in this blogpost. For more information visit the changelogs and documentation.
Additionally, a great deal of Dask work over the last month has happened outside of these core dask repositories.
You can install or upgrade using Conda or Pip
conda install -c conda-forge dask distributed
or
pip install dask[complete] distributed --upgrade
Since the last 0.13.0 release on January 27th, the following developers have contributed to the dask/dask repository:
And the following developers have contributed to the dask/distributed repository: