Dask wants to better support the needs of life scientists. We’ve been getting to know the community, in order to better understand:
We’ve learned that:
Our strategic plan for this year involves three parallel streams:
If you still want to have your say, it’s not too late: click this link to get in touch!
Recently Dask won some funding to hire a developer (Genevieve Buckley) to improve Dask specifically for life sciences.
Working with scientists is a really great way to drive growth in open source projects. Both scientists and software developers benefit. Early on, Dask had a lot of success integrating with the geosciences community. It’d be great to see similar success for life sciences too.
There are several areas of life science where we see Dask being used today:
We’ve solicited feedback from the life science community, to come up with a strategic plan to direct our effort over the next year.
When we talked to individual Dask users, we heard a lot of similar themes in their comments.
People wanted:
The most common request was for better documentation with more examples. People across many different areas of life science all said this could help them a lot. A corresponding challenge here is the multitude of different areas of life science, all of which require targeted documentation.
GPU support was also commonly mentioned. Comments about GPUs fit into two of the categories above: GPU memory is often a constraint, and life scientists also want it to be easier to apply deep learning models to their data.
We didn’t only talk with individual users of Dask. We also spoke to developers of scientific software projects.
Software projects wanted to solve problems related to:
Dask is good at solving those kinds of problems, and might be a good solution for these projects.
Some of the software projects we spoke to include:
napari is a Python-based image viewer. Dask is already well-integrated with napari. Areas of opportunity here include:
sgkit is a statistical genetics toolkit. Dask is already well-integrated with sgkit. The developers would like improved infrastructure in the main Dask repositories that they can benefit from. Wishlist items include:
scanpy is a library for single cell analysis in Python. It is built together with anndata, an annotated data structure.
squidpy is a tool for the analysis and visualization of spatial molecular data. It builds on top of scanpy and anndata. Because squidpy involves large imaging data on top of what we’d normally see for datasets in scanpy/anndata, this is a project with a large area of opportunity for Dask.
ilastik does not currently use Dask at all. They are curious to see if Dask can make it easier to scale up from a single machine to a cluster. Users generally train an ilastik model interactively, and then want to apply it to many images. This second step is often when people want an easy way to scale up the computing resources available.
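The "train once, then apply to many images" pattern described above can be sketched with `dask.delayed`. This is only an illustration under stated assumptions: it does not use ilastik's API, the `predict` function is a hypothetical stand-in for a trained model, and the images are small synthetic arrays rather than real data.

```python
import numpy as np
import dask
from dask import delayed


def predict(image):
    # Hypothetical stand-in for applying a trained model to one image:
    # here it is just a threshold at the image mean.
    return image > image.mean()


# Eight small synthetic "images"; real data would be loaded from disk.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(8)]

# Build one lazy task per image; nothing is computed yet.
tasks = [delayed(predict)(img) for img in images]

# Run all tasks. By default this uses a local scheduler.
masks = dask.compute(*tasks)
```

The appeal of this approach is that the task definitions stay the same when the scheduler changes: creating a `dask.distributed` client before calling `dask.compute` would send the same tasks to a cluster instead of local threads.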
CellProfiler is a pipeline tool for image processing. They have briefly experimented with Dask before.
Because large scientific software projects have many users, improvements here would be high value for the scientific community. This is a huge area of opportunity. We plan to collaborate with these developer communities as much as possible to drive this forward.
Another area of opportunity is improving infrastructure for high level graph visualizations. Power users and novices alike would benefit from better tools for identifying inefficiencies in Dask computations.
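To give a sense of what a high level graph contains, here is a minimal sketch that lists the layers of a small Dask array computation and counts the tasks in each. It assumes the array's `.dask` attribute is a `HighLevelGraph` with a `layers` mapping, as in current `dask.array`; the array sizes and the variable names are arbitrary choices for illustration.

```python
import dask.array as da

# A small lazy computation: create, add, then reduce along one axis.
x = da.ones((1000, 1000), chunks=(250, 250))
y = (x + 1).sum(axis=0)

# The high level graph groups tasks into named layers, one per operation.
graph = y.dask
layer_sizes = {name: len(layer) for name, layer in graph.layers.items()}

for name, n_tasks in layer_sizes.items():
    print(f"{name}: {n_tasks} tasks")

# Execute the graph to get the concrete result.
result = y.compute()
```

A layer with far more tasks than its neighbours is often a first hint about where a computation is spending its effort. Calling `y.visualize()` draws the graph instead of printing it, but that requires the optional graphviz dependency.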
Finally, continuing to build support for Dask arrays with non-NumPy chunks is also a high impact area of opportunity. In particular, support for sparse arrays and support for arrays on the GPU were highlighted as very important to the life science community.
We’re going to manage this project with three parallel streams:
Each stream will likely have one primary project at any time, with many more queued. Within each stream, proposed projects will be ranked according to: level of impact, time commitment required, and the availability of other developer resources.
Infrastructure projects are improvements to either:
We’ll aim to spend around 60% of project effort on infrastructure.
Outreach activities include blogposts, talks, webinars, tutorials, and creating examples for documentation. We aim to spend around 20% of project effort on outreach.
If you have outreach ideas you want to share (perhaps you run a student group or popular meetup) then you can get in touch with us here.
The final stream focusses on the application of Dask to a specific problem in life science.
These projects generally involve collaborating with individual labs or groups, and have an end goal of summarizing their workflow in a blogpost. This feeds back into our outreach, so others in the community can learn from it.
Ideally these are short term projects, so we can showcase many different applications of Dask. We aim to spend around 20% of project effort on applications.
If you use Dask and have an example in mind you’d like to share, then you can get in touch with us here.
The role of Dask Life Science Fellow has a very broad scope, so there are a lot of different ways we could be successful within this space.
Some indicators of success are:
We won’t have the time or the resources to do all the things, but we will be able to make an impact by focussing on a subset.
The information we discovered talking to the life science community is likely to be biased in a few different ways.
My (Genevieve’s) network is strongest among imaging scientists, and among people in Australia. It’s much less strong for other fields in life science, as my original training is in physics.
The Dask project has strong links to other open source Python projects, including scientific software. The Dask developer community also has strong links to companies including NVIDIA, Quansight, and others. They are likely to be over-represented among the people we spoke to.
It’s much harder to find people who aren’t using Dask at all yet but have problems that would be a good fit for it. These people are very unlikely to be, say, following Dask on Twitter, and probably won’t be aware that we’re looking for them.
I don’t think there are any perfect solutions to these problems. We’ve tried to mitigate these effects by using loose second and third degree connections to spread awareness, as well as posting in public science forums.
We used a variety of approaches to gather feedback from the life science community.
Come join us in the Dask slack! We have a #life-science channel so there’s a place to discuss things relevant to the Dask life science community. You can request an invite to the Slack here.