To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Themes of last week:
The last two weeks saw several disruptive changes to the scheduler and workers. This resulted in an overall performance degradation on messy workloads when compared to the most recent release, which stopped bleeding-edge users from using recent dev builds. This has been resolved, and bleeding-edge git-master is back up to the old speed and then some.
As a visual aid, this is what bad (or in this case random) load balancing looks like:
For a while there have been significant gaps of 100ms or more between successive tasks in workers, especially when using Pandas. This was particularly odd because the workers had lots of backed up work to keep them busy (thanks to the nice load balancing from before). The culprit here was the calculation of the size of intermediate results on object dtype dataframes.
Explaining this in greater depth, recall that to schedule intelligently, the workers calculate the size in bytes of every intermediate result they produce. Often this is quite fast; for example, for numpy arrays we can just multiply the number of elements by the dtype itemsize. However for object dtype arrays or dataframes (which are commonly used for text) it can take a long while to calculate an accurate result. Now we no longer calculate an accurate result, but instead take a fairly pessimistic guess. The gaps between tasks shrink considerably.
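The tradeoff can be sketched roughly as follows. The helper function and the per-element constant below are illustrative assumptions for this post, not Dask's actual internals:

```python
import numpy as np

# Assumed pessimistic guess, in bytes, per object-dtype element.
# The real constant inside Dask may differ; this is illustrative.
OBJECT_BYTES_GUESS = 100


def estimate_nbytes(arr):
    """Estimate the size in bytes of a numpy array without a full scan."""
    if arr.dtype == object:
        # Measuring every Python object accurately is slow, so we take a
        # fast, pessimistic per-element guess instead.
        return arr.size * OBJECT_BYTES_GUESS
    # Fixed-width dtypes: number of elements times the itemsize.
    return arr.size * arr.dtype.itemsize


ints = np.arange(1000, dtype="int64")          # fixed-width: exact and fast
texts = np.array(["a", "bb"], dtype=object)    # object dtype: guessed

print(estimate_nbytes(ints))   # 8000: 1000 elements * 8 bytes each
print(estimate_nbytes(texts))  # 200: 2 elements * the guessed constant
```

The guess trades accuracy for speed: a slightly wrong size estimate costs little in scheduling quality, while an exact scan of a large text column can stall the worker between tasks.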
There is still a significant bit of lag, around 10ms, between tasks on these workloads (see the zoomed version on the right). On other workloads we’re able to get inter-task lag down to the tens of microseconds scale. While 10ms may not sound like a long time, when we perform very many very short tasks this can quickly become a bottleneck.
Anyway, this change reduced shuffle overhead by a factor of two. Things are starting to look pretty snappy for many-small-task workloads.
I would like to run a small benchmark comparing Dask and Spark DataFrames. I spent a bit of the last couple of days using Spark locally on the NYC Taxi data and futzing with cluster deployment tools to set up Spark clusters on EC2 for basic benchmarking. I ran across flintrock, which has been highly recommended to me a few times.
I’ve been thinking about how to do benchmarks in an unbiased way. Comparative benchmarks are useful to have around to motivate projects to grow and learn from each other. However in today’s climate, where open source software developers have a vested interest, benchmarks often focus on a project’s strengths and hide its deficiencies. Even with the best of intentions and practices, a developer is likely to correct for deficiencies on the fly. They’re much more able to do this for their own project than for others’. Benchmarks end up looking more like sales documents than trustworthy research.
My tentative plan is to reach out to a few Spark devs and see if we can collaborate on a problem set and hardware before running computations and comparing results.
Rich Postelnik is building on work from Tom Augspurger to build out benchmarks for Dask using airspeed velocity at dask-benchmarks. Building out benchmarks is a great way to get involved if anyone is interested.
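For anyone curious what an airspeed velocity benchmark looks like, here is a minimal sketch. The class, methods, and workload below are hypothetical, not taken from dask-benchmarks; asv discovers methods prefixed with `time_` and reports their runtimes, and `setup` runs before timing starts:

```python
# Minimal shape of an airspeed velocity (asv) benchmark file.
# asv imports this module, finds benchmark classes, and times every
# method whose name starts with ``time_``.


class TimeListOperations:
    """A stand-in benchmark suite; real suites would exercise Dask."""

    def setup(self):
        # State prepared here is excluded from the timed region.
        self.data = list(range(100_000))

    def time_sort(self):
        # asv times only the body of this method.
        sorted(self.data)

    def time_sum(self):
        sum(self.data)
```

Writing a benchmark is mostly a matter of capturing a realistic workload in a `time_` method, which makes it an approachable first contribution.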
I intend to publish a pre-release for a 0.X.0 version bump of dask/dask and dask/distributed sometime next week.