We recently enjoyed the 2020 SciPy conference from the comfort of our own homes this year. The 19th annual Scientific Computing with Python conference was a virtual conference this year due to the global pandemic. The annual SciPy Conference brought together over 1500 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.
As part of the maintainers track we presented an update on Dask.
You can find the video on the SciPy YouTube channel. The Dask update runs from 0:00-19:30.
Here’s a summary of the main topics covered in the talk. You can also check out the original thread on Twitter.
We’ve been trying to gauge the size of our community lately. The best proxy we have right now is the number of weekly visitors to the Dask documentation. Which currently stands at around 10,000.
Dask also came up in the Jetbrains Python developer survey. We were excited to see 5% of all the Python developers who filled out the survey said they use Dask. Which shows health in the PyData community as well as Dask.
We are running our own survey at the moment. If you are a Dask user please take a few minutes to fill it out. We would really appreciate it.
In February we had an in-person Dask Summit where a mixture of OSS maintainers and institutional users met. We had talks and workshops to help figure out our challenges and set our direction.
The Dask community also has a monthly meeting! It is held on the first Thursday of the month at 10:00 US Central Time. If you’re a Dask user you are welcome to come to hear updates from maintainers and share what you’re working on.
There are many projects built on Dask. Looking at the preliminary results from the 2020 Dask survey shows some that are especially popular.
Let’s take a look at each of those.
Xarray allows you to work on multi-dimensional datasets that have supporting metadata arrays in a Pandas-like way.
RAPIDS is an open-source suite of GPU accelerated Python libraries. Using these tools you can execute end-to-end data science and analytics pipelines entirely on GPUs. All using familiar PyData APIs.
BlazingSQL builds on RAPIDS and Dask to provide an open-source distributed, GPU accelerated SQL engine.
While XGBoost has been around for a long time you can now prepare your data on your Dask cluster and then bootstrap your XGBoost cluster on top of Dask and hand the distributed dataframes straight over.
Prefect is a workflow manager which is built on top of Dask’s scheduling engine. “Users organize Tasks into Flows, and Prefect takes care of the rest.”
Iris, part of the SciTools suite of tools, uses the CF data model giving you a format-agnostic interface for working with your data. It excels when working with multi-dimensional Earth Science data, where tabular representations become unwieldy and inefficient.
These are the tools our community have told us they like so far. But if you use something which didn’t make the list then head to our survey and let us know! According to PyPI there are many more out there.
There are many user groups who use Dask. Everything from life sciences, geophysical sciences and beamline facilities to finance, retail and logistics. Check out the great “Who uses Dask?” talk from Matthew Rocklin for more info.
There has been an increase in for-profit companies building tools with Dask. Including Coiled Computing, Prefect and Saturn Cloud.
We’ve also seen large companies like Microsoft’s Azure ML team contributing a cluster manager to Dask Cloudprovider. This helps folks get up and running with Dask on AzureML quicker and easier.
Moving on to recent improvements there has been a lot of work to get Open UCX supported as a protocol in Dask. Which allows worker-worker communication to be accelerated vastly with hardware that supports Infiniband or NVLink.
There have also been some recent announcements around NVIDIA blowing away the TPCx-BB benchmark by outperforming the current leader by 20x. This is a huge success for all the open-source projects that were involved, including Dask.
We’ve seen increased adoption of Dask Gateway. Many institutions are using it as a way to provide their staff with on-demand Dask clusters.
The update that got the most 👏 feedback from the SciPy 2020 attendees was the Cluster Map Plot (known to maintainers as the “pew pew pew” plot). This plot shows a high-level overview of your Dask cluster scheduler and workers and the communication between them.
To wrap up with what Dask is going to be doing next we are going to be continuing to work on high-level graph optimization.
With feedback from our community we are also going to be focussing on making the Dask scheduler more performant. There are a few things happening including a Rust implementation of the scheduler, dynamic task creation and ongoing benchmarking.
Lastly I’m excited to share that with funding from the Chan Zuckerberg Foundation, Dask will be hiring a maintainer who will focus on growing usage in the biological sciences field. If that is of interest to you keep an eye on our twitter account for more announcements.