Submit New Event

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Submit News Feature

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Contribute a Blog

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Sign up for Newsletter

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Sep 5, 2018

Dask Release 0.19.0

By

This work is supported by Anaconda Inc.

I’m pleased to announce the release of Dask version 0.19.0. This is a majorrelease with bug fixes and new features. The last release was 0.18.2 on July23rd. This blogpost outlines notable changes since the last release blogpostfor 0.18.0 on June 14th.

You can conda install Dask:

conda install dask

or pip install from PyPI:

pip install dask[complete] --upgrade

Full changelogs are available here:

Notable Changes

A ton of work has happened over the past two months, but most of the changesare small and diffuse. Stability, feature parity with upstream libraries (likeNumpy and Pandas), and performance have all significantly improved, but in waysthat are difficult to condense into blogpost form.

That being said, here are a few of the more exciting changes in the newrelease.

Python Versions

We’ve dropped official support for Python 3.4 and added official support forPython 3.7.

Deploy on Hadoop Clusters

Over the past few months Jim Crist has bulit asuite of tools to deploy applications on YARN, the primary cluster manager usedin Hadoop clusters.

  • Conda-pack: packs up Condaenvironments for redistribution to distributed clusters, especially whenPython or Conda may not be present.
  • Skein: easily launches and manages YARNapplications from non-JVM systems
  • Dask-Yarn: a thin libraryaround Skein to launch and manage Dask clusters

Jim has written about Skein and Dask-Yarn in two recent blogposts:

Implement Actors

Some advanced workloads want to directly manage and mutate state on workers. Atask-based framework like Dask can be forced into this kind of workload usinglong-running-tasks, but it’s an uncomfortable experience.

To address this we’ve added an experimental Actors framework to Dask alongsidethe standard task-scheduling system. This provides reduced latencies, removesscheduling overhead, and provides the ability to directly mutate state on aworker, but loses niceties like resilience and diagnostics.The idea to adopt Actors was shamelessly stolen from the Ray Project :)

class Counter:
def __init__(self):
self.n = 0

def increment(self):
self.n += 1
return self.n

counter = client.submit(Counter, actor=True).result()

>>> future = counter.increment()
>>> future.result()
1

You can read more about actors in the Actors documentation.

Dashboard improvements

The Dask dashboard is a critical tool to understand distributed performance.There are a few accessibility issues that trip up beginning users that we’veaddressed in this release.

Save task stream plots

You can now save a task stream record by wrapping a computation in theget_task_stream context manager.

from dask.distributed import Client, get_task_stream
client = Client(processes=False)

import dask
df = dask.datasets.timeseries()

with get_task_stream(plot='save', filename='my-task-stream.html') as ts:
df.x.std().compute()

>>> ts.data
[{'key': "('make-timeseries-edc372a35b317f328bf2bb5e636ae038', 0)",
'nbytes': 8175440,
'startstops': [('compute', 1535661384.2876947, 1535661384.3366017)],
'status': 'OK',
'thread': 139754603898624,
'worker': 'inproc://192.168.50.100/15417/2'},

...

This gives you the start and stop time of every task on every worker doneduring that time. It also saves that data as an HTML file that you can sharewith others. This is very valuable for communicating performance issues withina team. I typically upload the HTML file as a gist and then share it withrawgit.com

$ gist my-task-stream.html
https://gist.github.com/f48a121bf03c869ec586a036296ece1a

Robust to different screen sizes

The Dashboard’s layout was designed to be used on a single screen, side-by-sidewith a Jupyter notebook. This is how many Dask developers operate when workingon a laptop, however it is not how many users operate for one of two reasons:

  1. They are working in an office setting where they have several screens
  2. They are new to Dask and uncomfortable splitting their screen into twohalves

In these cases the styling of the dashboard becomes odd. Fortunately, LukeCanavan and DerekLudwig recently improved the CSS for thedashboard considerably, allowing it to switch between narrow and wide screens.Here is a snapshot.

Jupyter Lab Extension

You can now embed Dashboard panes directly within Jupyter Lab using the newlyupdated dask-labextension.

jupyter labextension install dask-labextension

This allows you to layout your own dashboard directly within JupyterLab. Youcan combine plots from different pages, control their sizing, and so on. Youwill need to provide the address of the dashboard server(http://localhost:8787 by default on local machines) but after thateverything should persist between sessions. Now when I open up JupyterLab andstart up a Dask Client, I get this:

Thanks to Ian Rose for doing most of the workhere.

Outreach

Dask Stories

People who use Dask have been writing about their experiences at DaskStories. In the last couplemonths the following people have written about and contributed their experience:

  1. Civic Modelling at Sidewalk Labs by Brett Naul
  2. Genome Sequencing for Mosquitoes by Alistair Miles
  3. Lending and Banking at Full Spectrum by Hussain Sultan
  4. Detecting Cosmic Rays at IceCube by James Bourbeau
  5. Large Data Earth Science at Pangeo by Ryan Abernathey
  6. Hydrological Modelling at the National Center for Atmospheric Research by Joe Hamman
  7. Mobile Networks Modeling by Sameer Lalwani
  8. Satellite Imagery Processing at the Space Science and Engineering Center by David Hoese

These stories help people understand where Dask is and is not applicable, andprovide useful context around how it gets used in practice. We welcome furthercontributions to this project. It’s very valuable to the broader community.

Dask Examples

The Dask-Examples repository maintainseasy-to-run examples using Dask on a small machine, suitable for an entry-levellaptop or for a small cloud instance. These are hosted onmybinder.org and are integrated into our documentation.A number of new examples have arisen recently, particularly in machinelearning. We encourage people to try them out by clicking the link below.

Binder

Other Projects

  • The dask-image project wasrecently released. It includes a number of image processing routines arounddask arrays.
  • This project is mostly maintained by John Kirkham.
  • Dask-ML saw a recent bugfix release
  • The TPOT library for automatedmachine learning recently published a new release that adds Dask support toparallelize their model training. More information is available on theTPOT documentation

Acknowledgements

Since June 14th, the following people have contributed to the following repositories:

The core Dask repository for parallel algorithms:

  • Anderson Banihirwe
  • Andre Thrill
  • Aurélien Ponte
  • Christoph Moehl
  • Cloves Almeida
  • Daniel Rothenberg
  • Danilo Horta
  • Davis Bennett
  • Elliott Sales de Andrade
  • Eric Bonfadini
  • GPistre
  • George Sakkis
  • Guido Imperiale
  • Hans Moritz Günther
  • Henrique Ribeiro
  • Hugo
  • Irina Truong
  • Itamar Turner-Trauring
  • Jacob Tomlinson
  • James Bourbeau
  • Jan Margeta
  • Javad
  • Jeremy Chen
  • Jim Crist
  • Joe Hamman
  • John Kirkham
  • John Mrziglod
  • Julia Signell
  • Marco Rossi
  • Mark Harfouche
  • Martin Durant
  • Matt Lee
  • Matthew Rocklin
  • Mike Neish
  • Robert Sare
  • Scott Sievert
  • Stephan Hoyer
  • Tobias de Jong
  • Tom Augspurger
  • WZY
  • Yu Feng
  • Yuval Langer
  • minebogy
  • nmiles2718
  • rtobar

The dask/distributed repository for distributed computing:

  • Anderson Banihirwe
  • Aurélien Ponte
  • Bartosz Marcinkowski
  • Dave Hirschfeld
  • Derek Ludwig
  • Dror Birkman
  • Guillaume EB
  • Jacob Tomlinson
  • Joe Hamman
  • John Kirkham
  • Loïc Estève
  • Luke Canavan
  • Marius van Niekerk
  • Martin Durant
  • Matt Nicolls
  • Matthew Rocklin
  • Mike DePalatis
  • Olivier Grisel
  • Phil Tooley
  • Ray Bell
  • Tom Augspurger
  • Yu Feng

The dask/dask-examples repository for easy-to-run examples:

  • Albert DeFusco
  • Dan Vatterott
  • Guillaume EB
  • Matthew Rocklin
  • Scott Sievert
  • Tom Augspurger
  • mholtzscher