To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Current development in Dask and Dask-related projects includes the following efforts:
Dask community communication generally happens in GitHub issues for bug and feature tracking, the Stack Overflow #dask tag for user questions, and an infrequently used Gitter chat.
Separately, Dask developers who work for Anaconda Inc (there are about five of us, part-time) use an internal company chat and a closed weekly video meeting. We’re now trying to migrate away from closed systems when possible.
Details about future directions are in dask/dask #2945. Thoughts and comments on that issue would be welcome.
When you start building clusters with 1000 workers the distributed scheduler can become a bottleneck on some workloads. After working with the PyPy and Cython development teams we’ve decided to rewrite parts of the scheduler to make it more amenable to acceleration by those technologies. Note that no actual acceleration has occurred yet, just a refactor of internal state.
Previously the distributed scheduler was focused around a large set of Python dictionaries, sets, and lists that indexed into each other heavily. This was done both for low-tech code technology reasons and for performance reasons (Python core data structures are fast). However, compiler technologies like PyPy and Cython can optimize Python object access down to C speeds, so we’re experimenting with switching away from Python data structures to Python objects to see how much this helps.
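As a rough illustration of the shift (the names below are hypothetical, not the actual scheduler internals), it amounts to replacing parallel dictionaries keyed by task key with one plain Python object per task:

```python
# Hypothetical sketch -- not the real scheduler code.

# Before: task state scattered across dictionaries that index into each other
task_state = {}      # key -> 'waiting' | 'processing' | 'memory'
dependencies = {}    # key -> set of keys this task depends on

# After: one Python object per task; attribute access like ts.state is
# something PyPy and Cython can optimize down to a C struct field read
class TaskState:
    __slots__ = ('key', 'state', 'dependencies')

    def __init__(self, key):
        self.key = key
        self.state = 'waiting'
        self.dependencies = set()

tasks = {'x': TaskState('x')}
tasks['x'].dependencies.add('y')
```

The `__slots__` declaration is the kind of detail that matters here: it gives every task a fixed attribute layout, which is friendlier to compilers than open-ended dictionaries.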
This change will be invisible operationally (the full test suite remains virtually unchanged), but it is a significant change to the scheduler’s internal state. We’re keeping a compatibility layer, but people who were building their own diagnostics around the internal state should review the new changes.
Ongoing work by Antoine Pitrou in dask/distributed #1594
In service of the Pangeo project to enable scalable data analysis of atmospheric and oceanographic data, we’ve been improving the tooling around launching Dask on cloud infrastructure, particularly leveraging Kubernetes.
To that end we’re making some flexible Docker containers and Helm charts for Dask, and hope to combine them with JupyterHub in the coming weeks.
Work done by myself in the following repositories. Feedback would be very welcome. I am learning on the job with Helm here.
If you use Helm on Kubernetes then you might want to try the following:
helm repo add dask https://dask.github.io/helm-chart
helm install dask/dask
This installs a full Dask cluster and a Jupyter server. The Docker containers include entry points that make it easy to update their environments with custom packages.
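For example, assuming the chart exposes an environment variable like `EXTRA_PIP_PACKAGES` for those entry points (the exact key names here are illustrative and may differ from the released chart), a custom `values.yaml` might look like:

```yaml
# Hypothetical values.yaml overrides; key names are illustrative
# and may not match the released chart exactly.
worker:
  replicas: 8
  env:
    - name: EXTRA_PIP_PACKAGES
      value: xarray s3fs
jupyter:
  env:
    - name: EXTRA_PIP_PACKAGES
      value: xarray s3fs
```

You would then pass this file at install time with `helm install -f values.yaml dask/dask`.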
This work extends prior work on the previous package, dask-kubernetes, but is slightly more modular for use alongside other systems.
Adaptive deployments, where a cluster manager scales a Dask cluster up or down based on current workloads, recently got a makeover, including a number of bug fixes around odd or infrequent behavior.
Work done by Russ Bubley here:
NumPy 1.14 is due to release soon. Dask.array had to update how it handled structured dtypes in dask/dask #2694 (work by Tom Augspurger).
Dask.dataframe is gaining the ability to merge/join simultaneously on columns and indices, following a similar feature released in Pandas 0.22. Work done by Jon Mease in dask/dask #2960.
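As a sketch of the semantics involved, the Pandas analogue of such a merge, with one side joining on a column and the other on its index, looks like this (the DataFrames here are made up for illustration):

```python
import pandas as pd

# Join the 'key' column of `left` against the index of `right`.
# Dask.dataframe is gaining support for mixing column- and
# index-based joins like this in dask/dask #2960.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'rval': [10, 20]}, index=['a', 'b'])

result = left.merge(right, left_on='key', right_index=True, how='inner')
# result keeps only keys 'a' and 'b', with columns key, lval, rval
```

In Dask.dataframe the same call pattern applies to partitioned dataframes, where joining against a sorted index is considerably cheaper than a full shuffle.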