Genevieve Buckley was hired as a Dask Life Science Fellow in 2021 funded by CZI. The goal was to improve Dask, with a specific focus on the life science community. This blogpost contains another progress update, and some personal reflections looking back over this year.
A previous progress update for February to September 2021 is available here. Read on for a progress update for the period September to December 2021.
To summarize, between September and December 2021 inclusive, there were:
Read on for a more detailed description of special projects within this time.
Dask stale issues sprint
In two weeks I was able to:
Lots of other people did work around the same time, following up on old pull requests and other maintenance work. The sprint was very successful overall.
Dask user survey results analysis
In September I analyzed the results from the 2021 Dask user survey. This was a really fun task. Because we asked a lot more questions in 2021 (18 new questions, 43 questions in total), there was a lot more data to dig into compared with previous years. You can read the full details about it here.
The biggest benefit from this work is that now we can use this data to prioritize improvements to the documentation and examples. The top two user requests are for more documentation and more examples from their industry. But it wasn’t until this year that we started asking what industries people worked in, so we can target new narrative documentation to the areas that need it most (geoscience, life science, and finance).
ITK compatibility with Dask
I implemented pickle serialization for itk images (ITK PR #2829). This should be one of the last major pieces of the puzzle needed to make ITK images compatible with Dask. It builds on earlier work by Matt McCormick and John Kirkham (you can read a blog post about their earlier work here).
Better cross-compatibility for Dask with other projects was a major goal of mine, so this is an important piece of work. I outline the next steps in the section What’s next in Dask?
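To illustrate why pickle support matters: Dask needs to serialize objects to move them between processes, and pickle is one of the mechanisms it can use. The sketch below is not the ITK implementation. `ToyImage` is a hypothetical stand-in class, and it only shows the general pattern of reducing an object that wraps a raw buffer to plain Python types and rebuilding it on the other side.

```python
import pickle


class ToyImage:
    """Hypothetical stand-in for an image object that wraps a pixel buffer."""

    def __init__(self, shape, pixels, spacing=(1.0, 1.0)):
        self.shape = tuple(shape)
        self.spacing = tuple(spacing)
        # A bytearray plays the role of a native pixel buffer here.
        self._buffer = bytearray(pixels)

    def __getstate__(self):
        # Reduce the object to plain Python types that pickle understands.
        return {
            "shape": self.shape,
            "spacing": self.spacing,
            "pixels": bytes(self._buffer),
        }

    def __setstate__(self, state):
        # Rebuild the buffer and metadata on the receiving side.
        self.shape = state["shape"]
        self.spacing = state["spacing"]
        self._buffer = bytearray(state["pixels"])


image = ToyImage(shape=(2, 3), pixels=range(6), spacing=(0.5, 0.5))

# Round trip through pickle, as would happen when an object is sent to a worker.
restored = pickle.loads(pickle.dumps(image))
```

The important property is that both the pixel data and the metadata (here, `spacing`) survive the round trip.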
Improve rechunking
I implemented PR #8124 to fix a bug where reshaping a Dask array could produce an output array with chunks that are much too large to fit in memory. Feedback from the life science user survey indicates that improving Dask’s performance around rechunking is a priority. This work helps to address that.
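To get a feel for how quickly chunk sizes can blow up, here is a minimal back-of-the-envelope sketch in plain Python. It does not call Dask or reproduce the bug. The two chunk layouts and the 128 MiB limit are assumptions chosen for illustration, and `largest_chunk_nbytes` is a hypothetical helper.

```python
import math


def largest_chunk_nbytes(chunks, itemsize):
    """Size in bytes of the largest chunk.

    `chunks` is a tuple of tuples giving the chunk lengths along each axis,
    in the same style as the `.chunks` attribute of a Dask array.
    """
    return math.prod(max(axis) for axis in chunks) * itemsize


ITEMSIZE = 8  # bytes per float64 element
LIMIT = 128 * 2**20  # assumed memory budget per chunk: 128 MiB

# Hypothetical layout for a (1000, 1000, 1000) array, chunked along axis 0.
before = ((10,) * 100, (1000,), (1000,))

# Hypothetical layout after reshaping to (1000, 1_000_000) if the chunks
# along the first axis were all merged into one.
after = ((1000,), (1_000_000,))

before_bytes = largest_chunk_nbytes(before, ITEMSIZE)  # 80 MB
after_bytes = largest_chunk_nbytes(after, ITEMSIZE)    # 8 GB

print(before_bytes <= LIMIT, after_bytes <= LIMIT)
```

In this example the largest chunk grows from 80 MB to 8 GB, which is the kind of output that will not fit in memory on most machines.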
High level graph work
A major piece of work earlier this year was introducing high level graphs for array slicing and array overlap operations. That is a big effort requiring a lot of ongoing work. PR #8467 tackles one of the next steps for this work.
Find objects function for dask-image
I implemented a find_objects function for dask-image in PR #240. This implementation does not need to know the maximum label number ahead of time, a substantial improvement over the previous attempt. This is a major step forward, because it removes a major blocker to introducing scikit-image-like regionprops functionality.
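The idea of not needing the maximum label in advance can be sketched in plain Python. This is not the dask-image implementation. `find_boxes` and `merge_boxes` are hypothetical helpers that compute per-chunk bounding boxes keyed by label in a dictionary, then merge the results across chunks, so labels are discovered as they appear.

```python
def find_boxes(block, row_offset=0, col_offset=0):
    """Bounding boxes for one chunk of a 2D label image.

    Returns {label: (row_start, row_stop, col_start, col_stop)} in global
    coordinates, with stop values exclusive. Label 0 is background.
    """
    boxes = {}
    for r, row in enumerate(block):
        for c, label in enumerate(row):
            if label == 0:
                continue
            rr, cc = r + row_offset, c + col_offset
            if label in boxes:
                r0, r1, c0, c1 = boxes[label]
                boxes[label] = (min(r0, rr), max(r1, rr + 1),
                                min(c0, cc), max(c1, cc + 1))
            else:
                boxes[label] = (rr, rr + 1, cc, cc + 1)
    return boxes


def merge_boxes(a, b):
    """Combine two box dictionaries, growing boxes for labels seen in both."""
    merged = dict(a)
    for label, (r0, r1, c0, c1) in b.items():
        if label in merged:
            m0, m1, n0, n1 = merged[label]
            merged[label] = (min(m0, r0), max(m1, r1),
                             min(n0, c0), max(n1, c1))
        else:
            merged[label] = (r0, r1, c0, c1)
    return merged


# A 3x6 label image split into two chunks of three columns each.
# Label 1 spans both chunks; labels 7 and 42 are not consecutive.
left = [[1, 1, 1],
        [0, 0, 0],
        [0, 7, 0]]
right = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 42]]

boxes = merge_boxes(find_boxes(left), find_boxes(right, col_offset=3))
```

Because results are keyed by label rather than stored in a list indexed up to a known maximum, sparse or non-consecutive labels need no special handling.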
Blogposts
Dask blogposts published between September and December 2021 include:
Tutorials
Reflecting back over the whole year, there were some things that worked well and some things that were less successful.
My personal highlights include:
Dask stale issues sprint
Community building events
We had a very successful year in terms of community building and events. This included tutorials, workshops, conferences, and community outreach. Summary of major events:
Visualization work
This has been very high impact work, and I’m pleased with what we’ve achieved. Improved tools for visualization were requested by users in our survey of the life science community. This was a high priority, because improvements to visualization tools benefit EVERYONE who uses Dask.
Technical resources
We never really solved the problem of finding someone I could go to with technical questions. I did have people to ask about some specific projects, but in most cases I didn’t have a good way to direct questions to the right people. This is a challenging problem, especially because most Dask maintainers and contributors have full time jobs doing other things too. In my opinion, this negatively impacted the work and what we were able to achieve.
Being added to the @dask/maintenance team
There’s no point getting notifications if you don’t have GitHub permissions to do anything about them. In future I think we should add only people with at least triage or write permissions to the GitHub teams.
Real time interaction
Slack
Slack works well to DM specific people to set up meeting times, etc, but the public channels didn’t end up being very useful for me personally.
Lack of integration with other project teams
You can only get so much done as a solo developer. We had hoped that I would naturally end up working with teams from several different projects, but this didn’t really end up being the case. The napari project is an exception to this, and that relationship was well established before starting work for Dask. Perhaps there’s something more we could have done here to facilitate more interaction.
Genevieve will be starting a new job next year. You can find her on GitHub @GenevieveBuckley.
Lots of stuff has happened in Dask, but there is still lots left to do. Here is a summary of the next steps for several projects. We’d love it if new people would like to take up the torch and contribute to any of these projects.
ITK image compatibility with Dask
Improving performance around rechunking
More performance improvements related to rechunking are required (see #7950 and #7980).
High level graph work for arrays and slicing
The high level graph work for slicing and overlapping arrays has a number of next steps. Ian Rose has written an excellent summary here. Briefly, the cull and get_output_keys methods must be implemented, then low level fusion and optimizations can be done.
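For readers new to the term, culling a task graph means dropping every task that is not needed to compute the requested outputs. Here is a toy sketch on a plain dictionary of dependencies. It is not the high level graph layer method described above. The `cull` function and the graph are hypothetical, and only show the general idea.

```python
def cull(dependencies, output_keys):
    """Keep only the tasks needed to compute `output_keys`.

    `dependencies` maps each task key to the list of keys it depends on.
    """
    needed = set()
    stack = list(output_keys)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        # Walk backwards through the graph to this task's inputs.
        stack.extend(dependencies[key])
    return {k: deps for k, deps in dependencies.items() if k in needed}


# Toy graph: two output chunks of "y", each depending on one chunk of "x".
graph = {
    ("x", 0): [],
    ("x", 1): [],
    ("y", 0): [("x", 0)],
    ("y", 1): [("x", 1)],
}

# Asking only for the first chunk of "y", as a slice might do.
culled = cull(graph, [("y", 0)])
```

In this toy graph, requesting only `("y", 0)` removes half of the tasks, which is why culling matters so much for slicing operations.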
Relevant links:
Documentation