Jul 17, 2018

Dask Development Log, Scipy 2018

This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Last week many Dask developers gathered for the annual SciPy 2018 conference. As a result, very little work was completed, but many projects were started or discussed. To reflect this change in activity this blogpost will highlight possible changes and opportunities for readers to further engage in development.

Dask on HPC Machines

The dask-jobqueue project was a hit at the conference. Dask-jobqueue helps people launch Dask on traditional job schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly found on high performance computers. These machines are very common among scientific, research, and high performance machine learning groups, but tend to be a bit hard to use with anything other than MPI.
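For readers who haven’t seen the project, here is a minimal sketch of what launching Dask through dask-jobqueue looks like on a PBS-style scheduler. The queue name, resource requests, and job count are placeholder values to adapt to your machine, and the example assumes a reasonably recent dask-jobqueue release.

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each job submitted to the PBS queue becomes one Dask worker.  The queue
# name and resource requests below are placeholders; adapt them to your
# site's scheduler (SLURMCluster, SGECluster, LSFCluster, ... work the same way).
cluster = PBSCluster(
    queue="regular",       # hypothetical queue name
    cores=36,              # cores per job
    memory="100GB",        # memory per job
    walltime="02:00:00",
)
cluster.scale(jobs=10)     # ask the job scheduler for ten worker jobs

client = Client(cluster)   # connect Dask to the newly launched workers
```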

This project came up in the Pangeo talk, lightning talks, and the Dask Birds of a Feather session.

During sprints a number of people came up and we went through the process of configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This process usually takes around fifteen minutes and will likely be the subject of a future blogpost. We published known-good configurations for these clusters on our configuration documentation.

Additionally, there is a JupyterHub issue to improve documentation on best practices to deploy JupyterHub on these machines. The community has done this well a few times now, and it might be time to write up something for everyone else.

Get involved

If you have access to a supercomputer then please try things out. There is a 30-minute YouTube video screencast on the dask-jobqueue documentation that should help you get started.

If you are an administrator on a supercomputer you might consider helping to build a configuration file and place it in /etc/dask for your users. You might also want to get involved in the JupyterHub on HPC conversation.

Dask / Scikit-learn talk

Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the current state of the new Dask-ML project.

MyBinder and Bokeh Servers

Not a Dask change, but Min Ragan-Kelley showed how to run services other than Jupyter through mybinder.org. As an example, here is a repository that deploys a Bokeh server application with a single click.

I think that by composing with Binder, Min has effectively created a free-to-use hosted Bokeh server service. Presumably this same model could easily be adapted to other applications.

Dask and Automated Machine Learning with TPOT

Dask and TPOT developers are discussing parallelizing the automated machine learning tool TPOT.

TPOT uses genetic algorithms to search over a space of scikit-learn style pipelines to automatically find a decently performing pipeline and model. This involves a fair amount of computation which Dask can help to parallelize out to multiple machines.
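As a rough sketch of the direction being discussed (not the finished integration), the snippet below hands TPOT’s pipeline evaluations to a Dask cluster. It assumes a TPOT release that exposes the use_dask flag, and it uses a small scikit-learn dataset purely for illustration.

```python
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

client = Client()  # local cluster here; point this at a real cluster instead

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# use_dask=True asks TPOT to evaluate candidate pipelines on the Dask
# cluster instead of its default joblib backend (assumes a TPOT release
# that includes the Dask integration discussed above).
tpot = TPOTClassifier(
    generations=2,
    population_size=10,
    use_dask=True,
    random_state=0,
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```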

Get involved

Trivial things work now, but to make this efficient we’ll need to dive in a bit more deeply. Extending that pull request to dive within pipelines would be a good task if anyone wants to get involved. This would help to share intermediate results between pipelines.

Dask and Scikit-Optimize

Among various features, Scikit-Optimize offers a BayesSearchCV object that is like Scikit-Learn’s GridSearchCV and RandomizedSearchCV, but is a bit smarter about how to choose new parameters to test given previous results. Hyper-parameter optimization is low-hanging fruit for Dask-ML workloads today, so we investigated how the project might help here.
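For flavor, here is a minimal sketch of the kind of experiment involved: running BayesSearchCV with its cross-validation fits dispatched through the Dask joblib backend. The estimator, search space, and local Client are illustrative choices, and the example assumes a distributed version recent enough to register the Dask backend when a Client is created.

```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Categorical, Real

client = Client()  # creating a Client registers the "dask" joblib backend

X, y = load_digits(return_X_y=True)

search = BayesSearchCV(
    SVC(),
    {
        "C": Real(1e-3, 1e3, prior="log-uniform"),
        "gamma": Real(1e-4, 1e1, prior="log-uniform"),
        "kernel": Categorical(["rbf", "poly"]),
    },
    n_iter=16,
    cv=3,
    n_jobs=-1,
)

# The individual cross-validation fits run on Dask workers; the Bayesian
# bookkeeping that proposes new parameters stays on the client.
with parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```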

So far we’re just experimenting with Scikit-Learn/Dask integration through joblib to see what opportunities there are. Discussion among Dask and Scikit-Optimize developers is happening here:

Centralize PyData/Scipy tutorials on Binder

We’re putting a bunch of the PyData/SciPy tutorials on Binder, and hope to embed snippets of YouTube videos into the notebooks themselves.

This effort lives here:

Motivation

The PyData and SciPy community delivers tutorials as part of most conferences. This activity generates both educational Jupyter notebooks and explanatory videos that teach people how to use the ecosystem.

However, this content isn’t very discoverable after the conference. People can search on YouTube for their topic of choice and hopefully find a link to the notebooks to download locally, but this is a somewhat noisy process. It’s not clear which tutorial to choose, and it’s difficult to match up the video with the notebooks during exercises. We’re probably not getting as much value out of these resources as we could be.

To help increase access we’re going to try a few things:

  1. Produce a centralized website with links to recent tutorials delivered for each topic
  2. Ensure that those notebooks run easily on Binder
  3. Embed sections of the talk on YouTube within each notebook so that the explanation of the section is tied to the exercises (a small sketch of this follows the list)
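As a small sketch of point 3, a notebook cell can embed one section of a tutorial video next to the matching exercise using IPython’s YouTubeVideo display class; the video ID and start offset below are placeholders.

```python
# A notebook cell that embeds one section of the tutorial video next to
# the matching exercise.  The video ID and start offset are placeholders.
from IPython.display import YouTubeVideo

YouTubeVideo("VIDEO_ID_HERE", width=640, height=360, start=1260)
```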

Get involved

This only really works long-term under a community maintenance model. So far we’ve only done a few hours of work and there is still plenty to do on the following tasks:

  1. Find good tutorials for inclusion
  2. Ensure that they work well on mybinder.org, meaning that they:
     - are self-contained and don’t rely on external scripts to run
     - have an environment.yml or requirements.txt
     - don’t require a lot of resources
  3. Find the video for the tutorial
  4. Submit a pull request to the tutorial repository that embeds a link to the YouTube talk, at the proper time, in the top cell of each notebook

Dask, Actors, and Ray

I really enjoyed the talk on Ray, another distributed task scheduler for Python. I suspect that Dask will steal ideas for actors for stateful operation. I hope that Ray takes on ideas for using standard Python interfaces so that more of the community can adopt it more quickly. I encourage people to check out the talk and give Ray a try. It’s pretty slick.
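For readers unfamiliar with the pattern, here is a rough sketch of actor-style stateful operation as it looks with the experimental Actor interface in dask.distributed; the Counter class is purely illustrative, and the example assumes a distributed version that includes experimental actors.

```python
from dask.distributed import Client


class Counter:
    """Ordinary Python class holding mutable state; purely illustrative."""

    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


client = Client()

# actor=True pins a single Counter instance to one worker; method calls
# are routed to that worker rather than copying state around the cluster.
counter = client.submit(Counter, actor=True).result()

future = counter.increment()   # returns an ActorFuture
print(future.result())         # -> 1
```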

Planning conversations for Dask-ML

Dask and Scikit-learn developers had the opportunity to sit down again and raise a number of issues to help plan near-term development. This focused mostly on building important case studies to motivate future development, and on identifying algorithms and other projects to target for near-term integration.

Case Studies

Algorithms

Get involved

We could use help in building out case studies to drive future development of the project. There are also several algorithmic areas where new contributors could get involved. Dask-ML is a young and fast-moving project with many opportunities for new developers.

Dask and UMAP for low-dimensional embeddings

Leland McInnes gave a great talk, Uniform Manifold Approximation and Projection for Dimensionality Reduction, in which he lays out a well-founded algorithm for dimensionality reduction, similar to PCA or t-SNE, but with some nice properties. He worked together with some Dask developers, and we identified some challenges due to Dask array slicing with random-ish slices.
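To make the slicing challenge concrete, here is a small sketch of the access pattern in question: fancy-indexing a chunked Dask array with a large, randomly ordered integer index, which forces each output chunk to gather rows from many input chunks. The array shape and chunking are arbitrary.

```python
import numpy as np
import dask.array as da

# A chunked 2-D array, e.g. a dataset of 100-dimensional points.
x = da.random.random((100_000, 100), chunks=(10_000, 100))

# A large, random-ish integer index like those generated while building
# nearest-neighbor structures.
idx = np.random.randint(0, x.shape[0], size=50_000)

# Fancy indexing along the first axis: each output chunk may need rows
# from many input chunks, which is the expensive pattern mentioned above.
y = x[idx]
print(y.chunks)
```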

A proposal to fix this problem lives here, if anyone wants a fun problem to work on:

Dask stories

We soft-launched Dask Stories, a webpage and project to collect and share user stories about how people use Dask in practice. We’re also publishing a separate blogpost about this today.

See blogpost: Who uses Dask?

If you use Dask and want to share your story we would absolutely welcome your experience. Having people like yourself share how they use Dask is incredibly important for the project.