Jan 30, 2017

Dask Development Log

This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.

To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-17 and 2017-01-30. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of the last couple of weeks:

  1. Micro-release of distributed scheduler
  2. as_completed for asynchronous algorithms
  3. Testing ZeroMQ and communication refactor
  4. Persist, and Dask-GLM

Stability enhancements and micro-release

We've released dask.distributed version 1.15.2, which includes some important performance improvements for communicating multi-dimensional arrays, cleaner scheduler shutdown of workers for adaptive deployments, an improved as_completed iterator that can accept new futures in flight, and a few other smaller features.

The feature here that excites me the most is improved communication of multi-dimensional arrays across the network. In a recent blogpost about image processing on a cluster we noticed that communication bandwidth was far lower than expected. This led us to uncover a flaw in our compression heuristics. Dask doesn't compress all data; instead it takes a few 10kB samples of the data, compresses them, and if that goes well, decides to compress the entire thing. Unfortunately, due to our mishandling of memoryviews we ended up taking much larger samples than intended when dealing with multi-dimensional arrays.
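
For readers curious what such a heuristic looks like, here is a minimal sketch of the sample-then-decide approach. It is not Dask's actual implementation; the function name, thresholds, and the use of zlib are all illustrative.

import zlib

def maybe_compress(payload: bytes, sample_size=10_000, min_ratio=0.9):
    # Illustrative sketch of a sample-then-decide compression heuristic.
    # Compress a small sample first; only compress the full payload if the
    # sample compressed well.  Constants are illustrative, not Dask's defaults.
    if len(payload) <= sample_size:
        return payload                              # too small to bother sampling
    # Slicing a memoryview gives a sample without copying the whole buffer;
    # the bug described above came from mishandling this step.
    sample = bytes(memoryview(payload)[:sample_size])
    if len(zlib.compress(sample)) < min_ratio * len(sample):
        return zlib.compress(payload)               # sample compressed well: compress it all
    return payload                                  # otherwise send uncompressed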

XArray release

This improvement is particularly timely because a new release of XArray (a project that wraps around Dask.array for large labeled arrays) is now available with better data ingestion support for NetCDF data on distributed clusters. This opens up distributed array computing to Dask's first (and possibly largest) scientific user community, atmospheric scientists.
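
As a rough illustration of what this looks like from the user side, the snippet below opens a hypothetical collection of NetCDF files as Dask-backed arrays and runs a reduction on a cluster; the file paths, variable name, chunk size, and scheduler address are all made up.

import xarray as xr
from dask.distributed import Client

client = Client('scheduler-address:8786')             # hypothetical scheduler address

# open_mfdataset lazily loads many NetCDF files as Dask-backed arrays;
# the chunks argument controls how the data is partitioned across the cluster.
ds = xr.open_mfdataset('data/*.nc', chunks={'time': 100})

# Computations stay lazy until explicitly evaluated on the cluster.
monthly_mean = ds['temperature'].groupby('time.month').mean()
result = monthly_mean.compute()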

as_completed accepts futures

In addition to arrays, dataframes, and the delayed decorator, Dask.distributed also implements the concurrent.futures interface from the standard library (except that Dask's version parallelizes across a cluster and has a few other benefits). Part of this interface is the as_completed function, which takes in a list of futures and yields those futures in the order in which they finish. This enables the construction of fairly responsive asynchronous computations. As soon as some work finishes you can look at the result and submit more work based on the current state.

That has been around in Dask for some time.

What's new is that you can now push more futures into as_completed:

from dask.distributed import as_completed  # assumes an existing Client named `client`

futures = client.map(my_function, sequence)

ac = as_completed(futures)
for future in ac:
    result = future.result()  # future is already finished, so this is fast
    if condition:
        new_future = client.submit(function, *args, **kwargs)
        ac.add(new_future)  # <<---- This is the new ability

So the as_completed iterator can keep going for quite a while with a set of futures always in flight. This relatively simple change allows for the easy expression of a broad set of asynchronous algorithms.

ZeroMQ and Communication Refactor

As part of a large effort, Antoine Pitrou has been refactoring Dask's communication system. One sub-goal of this refactor was to allow us to explore other transport mechanisms than Tornado IOStreams in a pluggable way.

One such alternative is ZeroMQ sockets. We've gotten both enthusiastic praise and stern warnings about using ZeroMQ. It's not a great fit because Dask mostly just does point-to-point communication, so we don't benefit from all of ZeroMQ's patterns, which become more of a hindrance than a benefit. However, we were interested in the performance impact of managing all of our network communication in a completely separately managed C++ thread unaffected by GIL issues.

Whether you hate or love ZeroMQ, you can now pick and choose. Antoine's branch allows for easy swapping between transport mechanisms and opens the possibility for more in the future, like intra-process communication with Queues, MPI, etc. This doesn't affect most users, but some Dask deployments are on supercomputers with exotic networks capable of very fast speeds. The possibility that we might tap into Infiniband someday and have the ability to manage data locally without copies (Tornado does not achieve this) is very attractive to some user communities.
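
To make that concrete, the intent is roughly the following: the transport is selected by the address scheme while user code stays the same. The zmq:// scheme below is hypothetical, and the exact spelling on Antoine's branch may differ.

from dask.distributed import Client

client = Client('tcp://scheduler:8786')      # default Tornado/TCP transport
# client = Client('zmq://scheduler:8786')    # experimental ZeroMQ transport (hypothetical scheme)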

After very preliminary benchmarking we've found that ZeroMQ offers a small speedup, but results in a lack of stability in the cluster under complex workloads (likely our fault, not ZeroMQ's, but still). ZeroMQ support is strictly experimental and will likely stay that way for a long time. Readers should not get excited about this.

Persist and Dask-GLM

In collaboration with researchers at Capital One we've been working on a set of parallel solvers for first and second order methods, such as are commonly used in a broad class of statistical and machine learning algorithms.
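
For a flavor of what these solvers look like, here is a minimal sketch of a first-order method (plain gradient descent for logistic regression) expressed with dask.array. This is illustrative only, not dask-glm's actual code; the step size and iteration count are arbitrary.

import dask.array as da

def gradient_descent(X, y, step=0.1, n_iter=10):
    # Minimal first-order solver sketch: logistic-regression gradient descent
    # written against the dask.array API.  Illustrative, not dask-glm.
    beta = da.zeros(X.shape[1], chunks=X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + da.exp(-X.dot(beta)))     # predicted probabilities
        grad = X.T.dot(p - y) / X.shape[0]     # gradient of the log-loss
        beta = beta - step * grad
    return beta.compute()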

One challenge in this process has been building algorithms that are simultaneously optimal for both the single machine and distributed schedulers. The distributed scheduler requires that we think about where data is, on the client or on the cluster, whereas for the single machine scheduler this is less of a problem. The distributed scheduler appropriately has a new verb, persist, which keeps data as a Dask collection, but triggers all of the internal computation:

compute(dask array) -> numpy array
persist(dask array) -> dask array

We have now mirrored this verb to the single machine scheduler in dask/dask#1927 and we get very nice performance on dask-glm's algorithms in both cases.
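
A small, made-up example of the difference in practice (the array sizes and chunking are arbitrary):

import dask.array as da

x = da.random.random((100000, 100), chunks=(10000, 100))

y = x.compute()   # concrete numpy array, pulled back to the client
x = x.persist()   # still a dask array, but its chunks are already computed
                  # and held in memory (on the cluster, or locally after
                  # dask/dask#1927 for the single machine scheduler)

# Later operations reuse the materialized chunks, which is exactly what
# iterative algorithms like those in dask-glm rely on.
result = (x - x.mean(axis=0)).sum().compute()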

Working with the developers at Capital One has been very rewarding. I would like to find more machine learning groups that fit the following criteria:

  1. Are focused on performance
  2. Need parallel computing
  3. Are committed to building good open source software
  4. Are sufficiently expert in their field to understand correct algorithms

If you know such a person or such a group, either in industry or just a grad student in university, please encourage them to raise an issue at http://github.com/dask/dask/issues/new. I will likely write a larger blogpost on this topic in the near future.