This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
To increase transparency I'm blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Themes of last week:

1. Rebuilding dynamic load balancing and work stealing in the new scheduler and worker
2. Benchmarking and behavioral testing of those changes
3. Deployment experiments on Google Container Engine
4. fastparquet and Dask.dataframe on data in S3
In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old ones no longer made sense in the context of the new system). Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users), so rebuilding this has been all-consuming this week for me personally.
Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience.
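To give a flavor of the kind of heuristic involved, here is a toy sketch. The data structures, attribute names, and numbers are invented for illustration; this is not Dask's actual scheduler code.

```python
# Hypothetical sketch of a worker-selection heuristic, not Dask's internals.
from collections import namedtuple

Worker = namedtuple('Worker', ['queued_seconds', 'data'])  # data: set of keys held
Task = namedtuple('Task', ['dep_sizes'])                    # dependency key -> nbytes


def estimated_start_time(worker, task, bandwidth=100e6):
    """Seconds until `worker` could plausibly start `task`."""
    backlog = worker.queued_seconds            # time to clear already-queued work
    missing = sum(size for key, size in task.dep_sizes.items()
                  if key not in worker.data)   # bytes it would have to fetch
    return backlog + missing / bandwidth


def choose_worker(workers, task):
    """Pick the worker with the smallest estimated start time."""
    return min(workers, key=lambda w: estimated_start_time(w, task))


# A busy worker that already holds the data loses to an idle one
# only when moving the data is cheaper than waiting in its queue.
busy = Worker(queued_seconds=5.0, data={'x'})
idle = Worker(queued_seconds=0.0, data=set())
task = Task(dep_sizes={'x': 200e6})
assert choose_worker([busy, idle], task) is idle   # 2s transfer beats 5s wait
```

The real system has to make decisions like this for millions of tasks with stale estimates, which is where the rebalancing and work stealing come in.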
We have a decently strong test suite around these behaviors, but it's hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We're doing what we can to create isolated tests for every failure mode that we find, so that future rewrites retain good behavior.
Generally, working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API, our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system.
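As an example of the style, a behavioral test spins up a small cluster and asserts on observable outcomes rather than on internal bookkeeping. The snippet below is a rough sketch using distributed's test utilities as I understand them at the time of writing; the specific assertions are illustrative, not taken from the real test suite.

```python
# A rough sketch of a behavioral integration test; details are illustrative.
from distributed.utils_test import gen_cluster, slowinc


@gen_cluster(client=True)
def test_work_spreads_across_workers(c, s, a, b):
    # c: client, s: scheduler, a and b: two in-process workers
    futures = c.map(slowinc, range(100))
    yield c._gather(futures)          # wait for everything to finish

    # Assert on behavior (both workers ended up doing real work),
    # not on the scheduler's internal data structures.
    assert a.data and b.data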
Rebuilding the load balancing algorithms has been a high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. I'm looking forward to seeing load balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I've been ignoring the last couple of weeks.)
I've personally started switching my development cluster over from Amazon's EC2 to Google's Container Engine. Here are some pros and cons from my particular perspective. Many of these probably have more to do with how I use each particular tool than with intrinsic limitations of the service itself.
In Google’s Favor
In Amazon’s Favor
I'm working from Olivier Grisel's repository docker-distributed, although I'm updating to newer versions and trying to deviate as little as possible from a naive deployment. My current branch is here. I hope to have something more stable for next week.
We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs, has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself. I expect that there will be some tweaking over the next few weeks.
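The workflow was roughly the following; the scheduler address and bucket path are placeholders rather than the actual dataset we used.

```python
# A minimal sketch of reading Parquet data on S3 with Dask.dataframe.
import dask.dataframe as dd
from distributed import Client

client = Client('scheduler-address:8786')   # connect to the running cluster

# fastparquet handles the Parquet format; s3fs handles the byte ranges on S3
df = dd.read_parquet('s3://my-bucket/my-dataset.parquet')

print(df.columns)        # metadata arrives quickly
print(df.head())         # pulls only the first partition
print(len(df))           # triggers a full distributed read
```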