This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation
Dask has been active lately due to a combination of increased adoption and funded feature development by private companies. This increased activity is great; however, an unintended side effect is that I have spent less time writing about development and engaging with the broader community. To address this I hope to write one blogpost a week about general development. These will not be particularly polished, nor will they announce ready-to-use features for users, but they should increase transparency and hopefully better engage the developer community.
So, themes of last week:

- Embedded Bokeh servers for the workers
- Smarter workers and a simpler scheduler
- Dataframe algorithms: approximate nunique and multiple-output-partition groupbys
- Fastparquet, a Parquet reader/writer for Python
The distributed scheduler's web diagnostic page is one of Dask's more flashy features. It shows the passage of every computation on the cluster in real time. These diagnostics are invaluable for understanding performance, both for users and for core developers.
I intend to focus on worker performance soon, so I decided to attach a Bokeh server to every worker to serve web diagnostics about that worker. To make this easier, I also learned how to embed Bokeh servers inside of other Tornado applications. This has reduced the effort to create new visuals and expose real-time information considerably, and I can now create a full live visualization in around 30 minutes. It is now faster for me to build a new diagnostic than to grep through logs. It's pretty useful.
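For a rough sense of the pattern, here is a minimal sketch of embedding a Bokeh server in a process that already runs a Tornado `IOLoop`. The route name, port, and plotted metric are made up for illustration; they are not the actual Dask worker pages.

```python
# A minimal sketch of embedding a Bokeh server inside a Tornado process.
# The route name, port, and plotted metric are hypothetical.
import random
import time

from bokeh.application import Application
from bokeh.application.handlers.function import FunctionHandler
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.server.server import Server
from tornado.ioloop import IOLoop


def diagnostics_doc(doc):
    # Bokeh calls this once per browser session to build the document
    source = ColumnDataSource({'time': [], 'value': []})
    fig = figure(title='Live diagnostic (hypothetical metric)')
    fig.line(x='time', y='value', source=source)

    def update():
        # A real worker would read live metrics here instead
        source.stream({'time': [time.time()], 'value': [random.random()]},
                      rollover=1000)

    doc.add_periodic_callback(update, 100)  # push new data every 100 ms
    doc.add_root(fig)


loop = IOLoop.current()
# The Bokeh server shares the application's existing event loop
server = Server({'/diagnostics': Application(FunctionHandler(diagnostics_doc))},
                io_loop=loop, port=9999)
server.start()
loop.start()
```

The appeal is that the diagnostic shares the host process's event loop and state, so there is no separate service to deploy or logs to ship.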
Here are some screenshots. Nothing too flashy, but this information is highly valuable to me as I measure bandwidths, delays of various parts of the code, how workers send data between each other, etc.
To be clear, these diagnostic pages aren't polished in any way. There's lots missing; it's just what I could get done in a day. Still, everyone running a Tornado application should have an embedded Bokeh server running. They're great for rapidly pushing out visually rich diagnostics.
Previously the scheduler knew everything and the workers were fairly simple-minded. Now we've moved some of the knowledge and responsibility over to the workers. Previously the scheduler would give just enough work to the workers to keep them occupied. This allowed the scheduler to make better decisions about the state of the entire cluster: by delaying committing a task to a worker until the last moment we made sure that we were making the right decision. However, this also means that the worker sometimes has idle resources, particularly network bandwidth, when it could be speculatively preparing for future work.
Now we commit all ready-to-run tasks to a worker immediately and that worker has the ability to pipeline those tasks as it sees fit. This is better locally but slightly worse globally. To counterbalance this we're now being much more aggressive about work stealing and, because the workers have more information, they can manage some of the administrative costs of work stealing themselves. Because this isn't bound to run only on the scheduler we can use more expensive algorithms than when we did everything on the scheduler.
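To make the tradeoff concrete, here is a toy sketch, not Dask's actual policy, of the kind of cost-based check a work-stealing heuristic might perform: moving a task to an idle worker only pays off when the expected compute time outweighs the cost of transferring its inputs. The bandwidth figure and safety factor are illustrative assumptions.

```python
# A toy cost model for work stealing -- not Dask's real implementation.
# The bandwidth default and the 2x safety factor are assumptions.
def worth_stealing(input_nbytes: int,
                   expected_runtime: float,
                   bandwidth: float = 100e6) -> bool:
    """Steal a task only if compute time dominates the transfer cost."""
    transfer_time = input_nbytes / bandwidth      # seconds to move inputs
    return expected_runtime > 2 * transfer_time   # hypothetical threshold


# A 10 MB task expected to run for a second is worth moving ...
worth_stealing(10_000_000, expected_runtime=1.0)     # True (0.1 s transfer)
# ... but a 1 GB task that runs for half a second is not
worth_stealing(1_000_000_000, expected_runtime=0.5)  # False (10 s transfer)
```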
There were a few motivations for this change:

- Workers can pipeline tasks, overlapping communication with computation and using otherwise idle network bandwidth to prepare for future work.
- The scheduler sheds responsibility, which keeps it simpler and faster.
- Administrative costs like work stealing can move to the workers, where we can afford more expensive algorithms.
Approximate nunique and multiple-output-partition groupbys landed in master last week. These arose because some power users had very large dataframes that were running into scalability limits. Thanks to Mike Graham for the approximate nunique algorithm. This has also pushed hashing changes upstream to Pandas.
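For a sense of the API, here is a small usage sketch. My understanding is that these landed as `nunique_approx` on series and a `split_out=` keyword on groupby aggregations, but treat the exact names as assumptions rather than settled API.

```python
# A small usage sketch -- method and keyword names (nunique_approx,
# split_out) are my understanding of what landed, not guaranteed API.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'id': range(10000),
                    'group': [i % 7 for i in range(10000)]})
df = dd.from_pandas(pdf, npartitions=8)

# Approximate distinct count (HyperLogLog-style) avoids a full shuffle
approx = df.id.nunique_approx().compute()

# Spread a large groupby result across several output partitions
result = df.groupby('group').id.count(split_out=4)
```

The `split_out=` idea matters when the grouped result itself is too large for one partition, which is exactly the scalability limit mentioned above.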
Martin Durant has been working on a Parquet reader/writer for Python using Numba. It's pretty slick. He's been using it on internal Continuum projects for a little while and has seen both good performance and a very Pythonic experience for what was previously a format that was pretty inaccessible.
He's planning to write about this in the near future so I won't steal his thunder. Here is a link to the documentation: fastparquet.readthedocs.io
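Without getting ahead of his write-up, the linked docs suggest a round trip roughly like the following; consider it a sketch based on the documentation rather than a tour of the library.

```python
# A minimal read/write round trip with fastparquet, sketched from the docs
import pandas as pd
import fastparquet

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
fastparquet.write('example.parquet', df)         # write a pandas DataFrame

pf = fastparquet.ParquetFile('example.parquet')  # inspect metadata
df2 = pf.to_pandas()                             # read back into pandas
```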