Jul 8, 2018

Dask Development Log


This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Current efforts for June 2018 in Dask and Dask-related projects include the following:

  1. Yarn Deployment
  2. More examples for machine learning
  3. Incremental machine learning
  4. HPC Deployment configuration

Yarn deployment

Dask developers often get asked “How do I deploy Dask on my Hadoop/Spark/Hive cluster?”. We haven’t had a very good answer until recently.

Most Hadoop/Spark/Hive clusters are actually Yarn clusters. Yarn is the most common cluster manager for clusters that typically run Hadoop/Spark/Hive jobs, including any cluster purchased from a vendor like Cloudera or Hortonworks. If your application can run on Yarn then it can be a first-class citizen here.

Unfortunately Yarn has really only been accessible through a Java API, and so has been difficult for Dask to interact with. That’s changing now with a few projects, including:

  • dask-yarn: an easy way to launch Dask on Yarn clusters
  • skein: an easy way to launch generic services on Yarn clusters (this is primarily what backs dask-yarn)
  • conda-pack: an easy way to bundle together a conda environment into a redeployable archive, which is useful when launching Python applications on Yarn
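To give a feel for what conda-pack accomplishes, here is a toy sketch in plain Python: archive an environment directory into a single relocatable tarball that can be shipped to Yarn containers. This is an illustration of the idea only, not conda-pack’s actual implementation (which also rewrites path prefixes so the environment works after relocation).

```python
import os
import tarfile
import tempfile

def pack_environment(env_dir, output):
    """Bundle a directory tree into a gzipped tarball (a toy version
    of what conda-pack does for a conda environment)."""
    with tarfile.open(output, "w:gz") as tar:
        # Store paths relative to the environment root so the archive
        # can be unpacked anywhere on the target cluster.
        tar.add(env_dir, arcname=".")
    return output

# Demo with a throwaway "environment" directory
env = tempfile.mkdtemp()
os.makedirs(os.path.join(env, "bin"))
with open(os.path.join(env, "bin", "python"), "w") as f:
    f.write("#!/usr/bin/env python\n")

archive = pack_environment(env, os.path.join(env, "environment.tar.gz"))
with tarfile.open(archive) as tar:
    names = tar.getnames()
```

The archive produced this way contains `./bin/python` and friends, which is the shape of artifact that dask-yarn ships to worker containers.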

This work is all being done by Jim Crist who is, I believe, currently writing up a blogpost about the topic at large. Dask-yarn was soft-released last week though, so people should give it a try and report feedback on the dask-yarn issue tracker. If you ever wanted direct help on your cluster, now is the right time: Jim is working on this actively and is not yet drowned in user requests, so he generally has a fair bit of time to investigate particular cases.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GB of memory
cluster = YarnCluster(environment='environment.tar.gz',
                      worker_vcores=2,
                      worker_memory='8GB')

# Scale out to ten such workers
cluster.scale(10)

# Connect to the cluster
client = Client(cluster)

More examples for machine learning

Dask maintains a Binder of simple examples that show off various ways to use the project. This allows people to click a link on the web and quickly be taken to a Jupyter notebook running on the cloud. It’s a fun way to quickly experience and learn about a new project.

Previously we had a single example for arrays, dataframes, delayed, machine learning, etc.

Now Scott Sievert is expanding the examples within the machine learning section. He has submitted the following two so far:

  1. Incremental training with Scikit-Learn and large datasets
  2. Dask and XGBoost

I believe he’s planning on more. If you use dask-ml and have recommendations or want to help, you might want to engage in the dask-ml issue tracker or dask-examples issue tracker.

Incremental training

The incremental training mentioned as an example above is also new-ish. This is a Scikit-Learn-style meta-estimator that wraps around other estimators that support the partial_fit method. It enables training on large datasets in an incremental or batchwise fashion.


# Training with plain Scikit-Learn: loop over files manually,
# feeding each chunk to partial_fit
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(...)

import pandas as pd

for filename in filenames:
    df = pd.read_csv(filename)
    X, y = ...

    sgd.partial_fit(X, y)


# The same workflow with dask-ml's Incremental wrapper
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

sgd = SGDClassifier(...)
inc = Incremental(sgd)

import dask.dataframe as dd

df = dd.read_csv(filenames)
X, y = ...

inc.fit(X, y)


From a parallel computing perspective this is a very simple and un-sexy way of doing things. However my understanding is that it’s also quite pragmatic. In a distributed context we leave a lot of possible computation on the table (the solution is inherently sequential) but it’s fun to see the model jump around the cluster as it absorbs various chunks of data and then moves on.
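The batchwise pattern can be illustrated with a toy meta-estimator in plain Python. Everything here is hypothetical (a made-up running-mean “estimator” with a partial_fit-style interface, not dask-ml’s actual code); the point is just to show why each step depends on the state left by the previous one, making the loop inherently sequential.

```python
class RunningMeanEstimator:
    """Hypothetical estimator with a partial_fit-style interface:
    it maintains a running mean of everything it has seen so far."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def partial_fit(self, batch):
        self.total += sum(batch)
        self.count += len(batch)
        return self

    @property
    def mean_(self):
        return self.total / self.count


class IncrementalWrapper:
    """Toy meta-estimator in the spirit of dask-ml's Incremental:
    fit() walks the chunks in order, calling partial_fit on each.
    Chunk N cannot be processed until chunk N-1 has updated the
    estimator's state -- hence the sequential bottleneck."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, chunks):
        for chunk in chunks:        # one chunk at a time, in order
            self.estimator.partial_fit(chunk)
        return self


chunks = [[1, 2, 3], [4, 5], [6]]   # stand-ins for dataframe partitions
inc = IncrementalWrapper(RunningMeanEstimator())
inc.fit(chunks)
print(inc.estimator.mean_)          # running mean over all six values: 3.5
```

In the real distributed setting the chunks live on different workers, so the model object travels to each chunk in turn rather than the data moving to the model.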

Incremental training with Dask-ML

There’s ongoing work on how best to combine this with other work like pipelinesand hyper-parameter searches to fill in the extra computation.

This work was primarily done by Tom Augspurger with help from Scott Sievert.

Dask User Stories

Dask developers are often asked “Who uses Dask?”. This is a hard question to answer because, even though we’re inundated with thousands of requests for help from various companies and research groups, it’s never fully clear who minds having their information shared with others.

We’re now trying to crowdsource this information in a more explicit way by having users tell their own stories. Hopefully this helps other users in their field understand how Dask can help and when it might (or might not) be useful to them.

We originally collected this information in a Google Form but have since then moved it to a Github repository. Eventually we’ll publish this as a proper website and include it in our documentation.

If you use Dask and want to share your story this is a great way to contribute to the project. Arguably Dask needs more help with spreading the word than it does with technical solutions.

HPC Deployments

The Dask Jobqueue package for deploying Dask on traditional HPC machines is nearing another release. We’ve changed around a lot of the parameters and configuration options in order to improve the onboarding experience for new users. It has been going very smoothly in recent engagements with new groups, but will mean a breaking change for existing users of the sub-project.
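For readers curious what the configuration-based approach looks like, here is a sketch of a jobqueue config file. The layout follows dask-jobqueue’s documented `~/.config/dask/jobqueue.yaml` convention, but the exact key names and defaults shown are illustrative and may differ between releases:

```yaml
# ~/.config/dask/jobqueue.yaml (illustrative sketch)
jobqueue:
  pbs:
    cores: 24            # total cores per job
    memory: 100GB        # total memory per job
    processes: 1         # dask worker processes per job
    queue: regular       # scheduler queue to submit into
    walltime: '01:00:00'
```

With site-wide values like these in place, users can create a cluster object without repeating the hardware details in every script.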