There is a lot of work happening in Dask right now on high level graphs, and we’d like to share a snapshot of it. This post is for people interested in the technical details of behind-the-scenes work that improves performance in Dask. You don’t need to know anything about high level graphs in order to use Dask.
High level graphs are a more compact representation of the instructions needed to generate the full low level task graph. The documentation page on Dask high level graphs is here: https://docs.dask.org/en/latest/high-level-graphs.html
High level graphs are useful for faster scheduling. Instead of sending a very large task graph from the client to the scheduler, we can send the much smaller high level graph representation instead. Reducing the amount of data that needs to be passed around improves overall performance.
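To make the size difference concrete, here is a small sketch (assuming a recent Dask release with `dask.array` available) comparing the number of layers in a high level graph with the number of tasks in the materialized low level graph:

```python
import dask.array as da

# Build a small computation: create an array, add one, then sum.
x = da.ones((100, 100), chunks=(25, 25))
y = (x + 1).sum()

# Dask collections expose their HighLevelGraph via __dask_graph__().
hlg = y.__dask_graph__()

# A handful of layers (roughly one per operation)...
print(len(hlg.layers))
# ...versus many low level tasks once the graph is materialized
# (HighLevelGraph is a Mapping, so len() counts individual tasks).
print(len(hlg))
```

The exact counts depend on the chunking, but the high level representation stays small no matter how many chunks the arrays have.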
No, you won’t need to change any of your code. This work is being done under the hood in Dask, and you should see speed improvements automatically.
In fact, you might already be benefitting from high level graphs:
“Starting with Dask 2021.05.0, Dask DataFrame computations will start sending HighLevelGraph’s directly from the client to the scheduler by default. Because of this, users should observe a much smaller delay between when they call .compute() and when the corresponding tasks begin running on workers for large DataFrame computations” https://coiled.io/dask-heartbeat-by-coiled-2021-06-10/
Read on for a snapshot of progress in other areas.
The Blockwise high level graph layer was introduced in the 2020.12.0 Dask release. Since then, a lot of effort has gone into using the Blockwise layer wherever possible for improved performance, especially for IO operations. The following is a non-exhaustive list.
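As a quick illustration of where Blockwise layers appear, here is a hedged sketch (assuming a Dask version new enough to expose `dask.blockwise.Blockwise`) showing that an ordinary elementwise operation is represented as a compact Blockwise layer rather than a fully materialized dict of tasks:

```python
import dask.array as da
from dask.blockwise import Blockwise

x = da.ones((100,), chunks=(10,))
y = x + 1  # elementwise operations are expressed blockwise

hlg = y.__dask_graph__()

# At least one layer of this graph should be a Blockwise layer.
blockwise_names = [name for name, layer in hlg.layers.items()
                   if isinstance(layer, Blockwise)]
print(blockwise_names)
```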
Highlights include (in no particular order):
Lots of other work with Blockwise is currently in progress:
Investigating a high level graph for Dask’s map_overlap is a project driven by user needs. People have told us that the time taken just to generate the task graph (before any actual computation takes place) can sometimes be a serious user experience problem, so we’re looking into ways to improve it.
This PR defers much of the computation involved in creating the Dask task graph, but does not reduce the total end-to-end computation time. Further optimization is therefore required.
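For readers unfamiliar with it, map_overlap applies a function to each chunk after first sharing a border of `depth` cells with neighbouring chunks, then trims the border off again. A minimal usage sketch:

```python
import dask.array as da

x = da.ones((100, 100), chunks=(25, 25))

# Each chunk is expanded by `depth` cells from its neighbours before
# the function is applied; the overlap is trimmed off afterwards.
y = x.map_overlap(lambda block: block * 2, depth=1, boundary="reflect")

result = y.compute()
```

The neighbour exchange and trimming steps are what make the generated task graph so large, which is why deferring graph construction matters here.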
Followup work includes:
While profiling map_overlap, we saw that a lot of time is spent in slicing operations. Slicing was therefore a logical next place to investigate possible performance improvements with high level graphs.
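To see why slicing matters for graph size, note that even a simple slice adds its own layer to the graph, and the materialized tasks grow with the number of chunks touched. A small sketch:

```python
import dask.array as da

x = da.ones((100,), chunks=(10,))

# A conceptually cheap slice still adds a layer (and, when
# materialized, one task per chunk touched) to the graph.
s = x[5:95]

hlg = s.__dask_graph__()
print(list(hlg.layers))  # the slice shows up as its own graph layer
```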
Meanwhile, Rick Zamora has been working on the dataframe side of Dask, using high level graphs to improve dataframe slicing/selections.
A couple of minor bugfixes/improvements:
We’ve also put some work into making better visualizations for Dask objects (including high level graphs).
Defining a _repr_html_ method for your classes is a great way to get nice HTML output when you’re working with jupyter notebooks. You can read this post to see more neat HTML representations in other scientific python libraries.
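As a minimal sketch (the class and its HTML body here are made up for illustration), defining `_repr_html_` is all it takes to opt in to rich output:

```python
class Greeting:
    """A toy class with a rich HTML representation for Jupyter."""

    def __init__(self, name):
        self.name = name

    def _repr_html_(self):
        # Jupyter notebooks call _repr_html_ (when present) to display
        # rich HTML output instead of the plain repr() string.
        return f"<b>Hello, {self.name}!</b>"

print(Greeting("Dask")._repr_html_())
```

In a notebook, displaying a `Greeting("Dask")` object would render the bold HTML greeting instead of the default `<__main__.Greeting object at ...>` text.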
Dask already uses HTML representations in lots of places (like the Array and Dataframe classes). We now have new HTML representations for HighLevelGraph and Layer objects, as well as Scheduler and Client objects in Dask distributed.
This gives us a much more meaningful representation, and is already being used by developers working on high level graphs.
Finally, the documentation around high level graphs is sparse. This is because they’re relatively new and have been undergoing quite a bit of change. However, the lack of documentation makes it difficult for people to get up to speed. We’re planning to improve the documentation for both users and developers of Dask.
If you’d like to follow these discussions, or help out, you can subscribe to the issues: