Submit New Event

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Submit News Feature

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Contribute a Blog

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Sign up for Newsletter

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Dec 18, 2016

Dask Development Log


This work is supported by Continuum Analyticsthe XDATA Programand the Data Driven Discovery Initiative from the MooreFoundation

To increase transparency I’m blogging weekly about the work done on Dask andrelated projects during the previous week. This log covers work done between2016-12-11 and 2016-12-18. Nothing here is ready for production. Thisblogpost is written in haste, so refined polish should not be expected.

Themes of last week:

  1. Benchmarking new scheduler and worker on larger systems
  2. Kubernetes and Google Container Engine
  3. Fastparquet on S3

Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker andscheduler. This enables future growth, but also resulted in a loss of our loadbalancing and work stealing algorithms (the old one no longer made sense in thecontext of the new system.) Careful dynamic load balancing is essential torunning atypical workloads (which are surprisingly typical among Dask users) sorebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account theexpected runtime of the task, the size and location of the data that the taskneeds, the duration of other tasks on every worker, and where each piece of datasits on all of the workers. Because the number of tasks can grow into themillions and the number of workers can grow into the thousands, Dask needs tofigure out a near-optimal placement in near-constant time, which is hard.Furthermore, after the system runs for a while, uncertainties in our estimatesbuild, and we need to rebalance work from saturated workers to idle workersrelatively frequently. Load balancing intelligently and responsively isessential to a satisfying user experience.

We have a decently strong test suite around these behaviors, but it’s hard tobe comprehensive on performance-based metrics like this, so there has also beena lot of benchmarking against real systems to identify new failure modes.We’re doing what we can to create isolated tests for every failure mode that wefind to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me thebrittleness of unit tests. As we have repeatedly rewritten internals whilemaintaining the same external API our testing strategy has evolved considerablyaway from fine-grained unit tests to a mixture of behavioral integration testsand a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for mepersonally because these performance issues inhibit current power-users fromusing the development version on their problems as effectively as with thelatest release. I’m looking forward to seeing load-balancing humming nicelyagain so that users can return to git-master and so that I can return tohandling a broader base of issues. (Sorry to everyone I’ve been ignoring thelast couple of weeks).

Test deployments on Google Container Engine

I’ve personally started switching over my development cluster from Amazon’s EC2to Google’s Container Engine. Here are some pro’s and con’s from my particularperspective. Many of these probably have more to do with how I use eachparticular tool rather than intrinsic limitations of the service itself.

In Google’s Favor

  1. Native and immediate support for Kubernetes and Docker, the combination ofwhich allows me to more quickly and dynamically create and scale clustersfor different experiments.
  2. Dynamic scaling from a single node to a hundred nodes and back ten minuteslater allows me to more easily run a much larger range of scales.
  3. I like being charged by the minute rather than by the hour, especiallygiven the ability to dynamically scale up
  4. Authentication and billing feel simpler

In Amazon’s Favor

  1. I already have tools to launch Dask on EC2
  2. All of my data is on Amazon’s S3
  3. I have nice data acquisition tools,s3fs, for S3 based on boto3.Google doesn’t seem to have a nice Python 3 library for accessing GoogleCloud Storage :(

I’m working from Olivier Grisel’s repositorydocker-distributed althoughupdating to newer versions and trying to use as few modifications from naivedeployment as possible. My current branch ishere. I hope tohave something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data onFriday. I was surprised that everything seemed to work out of the box. MartinDurant, who built both fastparquet and s3fs has done some nice work to makesure that all of the pieces play nicely together. We ran into some performanceissues pulling bytes from S3 itself. I expect that there will be some tweakingover the next few weeks.