This notebook presents the results of the 2019 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.
The raw data, as well as the start of an analysis, can be found in this binder:
Let us know if you find anything in the data.
We had 259 responses to the survey. Overall, we found that respondents care most about improved documentation, ease of use (including ease of deployment), and scaling. While Dask brings together many different communities (big arrays versus big dataframes, traditional HPC users versus cloud-native resource managers), there was general agreement on what is most important for Dask.
Now we’ll go through some individual questions, highlighting particularly interesting results.
For learning resources, almost every respondent uses the documentation.
Most respondents use Dask at least occasionally. Fortunately we had a decent number of respondents who are just looking into Dask, yet still spent the time to take the survey.
I’m curious about how learning resource usage changes as users become more experienced. We might expect those just looking into Dask to start with examples.dask.org, where they can try out Dask without installing anything.
Overall, documentation is still the leader across user groups.
Usage of the Dask tutorial and the Dask examples is relatively consistent across groups. The primary difference between regular and new users is that regular users are more likely to engage on GitHub.
From StackOverflow questions and GitHub issues, we have a vague idea about which parts of the library are used. The survey shows that (for our respondents at least) DataFrame and Delayed are the most commonly used APIs.
About 65.49% of our respondents are using Dask on a cluster.
But the majority of respondents also use Dask on their laptop. This highlights the importance of Dask scaling down, either for prototyping with a LocalCluster, or for out-of-core analysis using LocalCluster or one of the single-machine schedulers.
Most respondents use Dask interactively, at least some of the time.
Most respondents thought that more documentation and examples would be the most valuable improvements to the project. This is especially pronounced among new users. But even among those using Dask every day, more people thought that “More examples” is more valuable than “New features” or “Performance improvements”.
Normalized by row. Darker means that a higher proportion of users with that usage frequency prefer that priority.

Which would help you most right now?  Bug fixes  More documentation  More examples in my field  New features  Performance improvements
How often do you use Dask?
Every day                                     9                  11                         25            22                        23
Just looking for now                          1                   3                         18             9                         5
Occasionally                                 14                  27                         52            18                        15
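To make “normalized by row” concrete, here is a minimal sketch in plain Python using the counts from the table above. The helper name `normalize_rows` and the dictionary layout are assumptions made for illustration; they are not taken from the survey notebook.

```python
# Counts copied from the table above: rows are usage frequency,
# columns are the "Which would help you most right now?" choices.
columns = ["Bug fixes", "More documentation", "More examples in my field",
           "New features", "Performance improvements"]
counts = {
    "Every day": [9, 11, 25, 22, 23],
    "Just looking for now": [1, 3, 18, 9, 5],
    "Occasionally": [14, 27, 52, 18, 15],
}


def normalize_rows(table):
    """Divide each count by its row total so that every row sums to 1.

    Hypothetical helper, named here only for this sketch.
    """
    return {row: [v / sum(values) for v in values]
            for row, values in table.items()}


shares = normalize_rows(counts)

# Print each row as percentages of that row's total.
for row, values in shares.items():
    cells = ", ".join(f"{name}: {share:.0%}"
                      for name, share in zip(columns, values))
    print(f"{row}: {cells}")
```

Read this way, “More examples in my field” accounts for 18 of the 36 answers from people just looking into Dask (50%), compared with 25 of 90 (about 28%) among daily users.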
Perhaps users of certain Dask APIs feel differently from the group as a whole? We perform a similar analysis grouped by API use, rather than frequency of use.
Normalized by row. Darker means that a higher proportion of users of that API prefer that priority.

Which would help you most right now?  Bug fixes  More documentation  More examples in my field  New features  Performance improvements
Dask APIs
Array                                        10                  24                         62            15                        25
Bag                                           3                  11                         16            10                         7
DataFrame                                    16                  32                         71            39                        26
Delayed                                      16                  22                         55            26                        27
Futures                                      12                   9                         25            20                        17
ML                                            5                  11                         23            11                         7
Xarray                                        8                  11                         34             7                         9
Nothing really stands out. The “futures” users (who we expect to be relatively advanced) may prioritize features and performance over documentation. But everyone agrees that more examples are the highest priority.
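Both observations can be checked directly against the counts in the table above. The sketch below is one way to do it in plain Python. The ratio of (new features + performance) to documentation is an ad hoc measure chosen for this illustration, not something computed in the original analysis, and the names `top_priority` and `ratio` are hypothetical.

```python
priorities = ["Bug fixes", "More documentation", "More examples in my field",
              "New features", "Performance improvements"]

# Counts copied from the API-by-priority table above.
by_api = {
    "Array": [10, 24, 62, 15, 25],
    "Bag": [3, 11, 16, 10, 7],
    "DataFrame": [16, 32, 71, 39, 26],
    "Delayed": [16, 22, 55, 26, 27],
    "Futures": [12, 9, 25, 20, 17],
    "ML": [5, 11, 23, 11, 7],
    "Xarray": [8, 11, 34, 7, 9],
}


def top_priority(row):
    """Return the priority with the largest count in one row."""
    best = max(range(len(row)), key=row.__getitem__)
    return priorities[best]


top = {api: top_priority(row) for api, row in by_api.items()}

# Ad hoc measure (an assumption of this sketch): how many
# "features or performance" answers there are per "documentation" answer.
docs = priorities.index("More documentation")
features = priorities.index("New features")
performance = priorities.index("Performance improvements")
ratio = {api: (row[features] + row[performance]) / row[docs]
         for api, row in by_api.items()}

for api in by_api:
    print(f"{api}: top = {top[api]}, "
          f"(features + performance) / documentation = {ratio[api]:.2f}")
```

Every API group has “More examples in my field” as its top choice, and the Futures group has the highest ratio (37 to 9, roughly 4.1, against about 2.4 for the next group, Delayed). Respondents could report using several APIs, so the groups overlap.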
For specific features, we made a list of things that we (as developers) thought might be important.
The clearest standout is how many people thought “Better NumPy/Pandas support” was “most critical”. In hindsight, it’d be good to have a followup fill-in field to understand what each respondent meant by that. The parsimonious interpretation is “cover more of the NumPy / pandas API”.
“Ease of deployment” had a high proportion of “critical to me”. Again in hindsight, I notice a bit of ambiguity. Does this mean people want Dask to be easier to deploy? Or does this mean that Dask, which they currently find easy to deploy, is critically important? Regardless, we can prioritize simplicity in deployment.
Relatively few respondents care about things like “Managing many users”, though we expect that this would be relatively popular among system administrators, who are a smaller population.
And of course, we have people pushing Dask to its limits for whom “Improving scaling” is critically important.
A relatively high proportion of respondents use Python 3 (97% compared to 84% in the most recent Python Developers Survey).
We were a bit surprised to see that SSH is the most popular “cluster resource manager”.
HPC resource manager (SLURM, PBS, SGE, LSF or similar) 61
My workplace has a custom solution for this 23
I don't know, someone else does this for me 16
Hadoop / Yarn / EMR 14
Name: If you use a cluster, how do you launch Dask? , dtype: int64
How does the choice of cluster resource manager compare with API usage?
Dask APIs                                      Array  Bag  DataFrame  Delayed  Futures  ML  Xarray
If you use a cluster, how do you launch Dask?
Custom                                            15    6         18       17       14   6       7
HPC                                               50   13         40       40       22  11      30
Hadoop / Yarn / EMR                                7    6         12        8        4   7       3
Kubernetes                                        40   18         56       47       37  26      21
SSH                                               61   23         72       58       32  30      25
HPC users are relatively heavy users of dask.array and xarray.
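One way to quantify “relatively heavy” is to compare each API's share within a row against its share across all rows pooled together. The sketch below does this in plain Python with the counts from the table above. The term “lift” and the helper `lift` are assumptions introduced for this illustration. Because respondents could select several APIs, the shares are shares of API selections rather than of respondents.

```python
apis = ["Array", "Bag", "DataFrame", "Delayed", "Futures", "ML", "Xarray"]

# Counts copied from the cluster-manager-by-API table above.
by_manager = {
    "Custom": [15, 6, 18, 17, 14, 6, 7],
    "HPC": [50, 13, 40, 40, 22, 11, 30],
    "Hadoop / Yarn / EMR": [7, 6, 12, 8, 4, 7, 3],
    "Kubernetes": [40, 18, 56, 47, 37, 26, 21],
    "SSH": [61, 23, 72, 58, 32, 30, 25],
}

# Share of each API across all rows pooled together.
grand_total = sum(sum(row) for row in by_manager.values())
pooled = [sum(col) / grand_total for col in zip(*by_manager.values())]


def lift(row):
    """Row share divided by pooled share; values above 1 mean over-represented.

    Hypothetical helper, named here only for this sketch.
    """
    total = sum(row)
    return [(value / total) / share for value, share in zip(row, pooled)]


lifts = {manager: dict(zip(apis, lift(row)))
         for manager, row in by_manager.items()}

for manager, values in lifts.items():
    print(f"{manager}: Array {values['Array']:.2f}, "
          f"Xarray {values['Xarray']:.2f}")
```

With these counts, the HPC row has the largest value for both Array and Xarray, which is consistent with the observation above.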
Somewhat surprisingly, Dask’s heaviest users find Dask stable enough. Perhaps they’ve pushed past the bugs and found workarounds (percentages are normalized by row).
Thanks again to all the respondents. We look forward to repeating this process to identify trends over time.