People often ask me “How many people use Dask?”
As with any non-invasive open source software, the answer to this is “I don’t know”.
There are many possible proxies for user counts, like downloads, GitHub stars, and so on, but most of them are wildly incorrect. As a project maintainer who tries to find employment for other maintainers, I’m incentivized to take the highest number I can find, but that is somewhat dishonest. That number today is in the form of this likely false statement.
Dask has 50-100k daily downloads.
This number comes from looking at the Python Package Index (PyPI) (image from pypistats.org)
This is a huge number, but is almost certainly misleading. Common sense tells us that there are not 100k new Dask users every day.
If you dig more deeply into numbers like these, you will find that they are almost entirely due to automated processes. For example, of Dask’s 100k new users, a surprising number seem to be running Linux.
While it’s true that Dask is frequently run on Linux because it is a distributed library, it would be odd to see every machine in that deployment individually pip install dask. It’s more likely that these downloads are the result of automated systems, rather than individual users.
Anecdotally, if you get access to fine-grained download data, you find that a small set of IPs dominates download counts. These tend to come mostly from continuous integration services like Travis and Circle, from AWS, or from a few outliers in the world (sometimes people in China try to mirror everything).
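To make the "small set of IPs dominate" idea concrete, here is a minimal sketch of how one might measure that concentration. The records, source names, and the `top_source_share` helper are all hypothetical, made up for illustration; they are not real PyPI data or any real API.

```python
from collections import Counter


def top_source_share(records, k=1):
    """Return the fraction of all downloads that come from the k busiest sources.

    `records` is an iterable of (source, operating_system) pairs.
    """
    counts = Counter(source for source, _os in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _source, n in counts.most_common(k))
    return top / total


# Hypothetical download log: two automated sources plus five individual users.
records = (
    [("ci-runner-a", "Linux")] * 80
    + [("mirror-b", "Linux")] * 15
    + [
        ("laptop-1", "Windows"),
        ("laptop-2", "Darwin"),
        ("laptop-3", "Windows"),
        ("laptop-4", "Linux"),
        ("laptop-5", "Windows"),
    ]
)

print(top_source_share(records, k=2))  # 0.95
```

In this made-up log, two of the seven sources account for 95% of downloads, which is the kind of skew that makes raw download counts a poor proxy for users.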
So, in an effort to avoid this effect, we start looking at just Windows downloads.
The magnitudes here seem more honest to me. These monthly numbers translate to about 1000 downloads a day (perhaps multiplied by two or three for OSX and Linux), which seems more in line with my expectations.
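The arithmetic behind that estimate can be sketched as follows. The monthly figure of 30,000 is an assumption, back-derived from "about 1000 downloads a day" rather than read from the plot, and the `daily_estimate` helper is hypothetical.

```python
def daily_estimate(monthly_windows_downloads, os_multiplier, days_per_month=30):
    """Convert monthly Windows downloads into a rough all-platform daily figure."""
    return monthly_windows_downloads / days_per_month * os_multiplier


# ASSUMPTION: roughly 30k Windows downloads per month (about 1000 per day).
monthly_windows = 30_000

windows_only = daily_estimate(monthly_windows, os_multiplier=1)
low = daily_estimate(monthly_windows, os_multiplier=2)   # multiply by two for OSX and Linux
high = daily_estimate(monthly_windows, os_multiplier=3)  # multiply by three for OSX and Linux

print(windows_only, low, high)  # 1000.0 2000.0 3000.0
```

Under these assumptions the daily figure lands in the low thousands, well over an order of magnitude below the headline 50-100k.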
However, even this is strange. The structure doesn’t match my personal experience. Why the big change in adoption in 2018? What is the big spike in 2019? Anecdotally, maintainers did not notice a significant jump in users there. Instead, we’ve experienced smooth continuous growth of adoption over time (this is what most long-term software growth looks like). It’s also odd that there hasn’t been continued growth since 2018. Anecdotally, Dask seems to have grown somewhat constantly over the last few years. Phase transitions like these don’t match observed reality (at least insofar as I personally have observed it).
My favorite metric is looking at weekly unique users to documentation.
This is an over-estimate of users because many people look at the documentation without using the project. This is also an under-estimate because many users don’t consult our documentation on a weekly basis (oh I wish).
This growth pattern matches my expectations and my experience with maintaining a project that has steadily gained traction over several years.
Plot taken from Google Analytics
It’s also important to look at dependencies of a project. For example, many users in the earth and geo sciences use Dask through another project, Xarray. These users are much less likely to touch Dask directly, but often use Dask as infrastructure underneath the Xarray library. We should probably add in something like half of Xarray’s users as well.
Plot taken from Google Analytics, supplied by Joe Hamman from Xarray
Dask has somewhere between 100k new users every day (download counts) and something like 10k users total (weekly unique IPs). The 10k number sounds more likely to me, maybe bumping up to 15k due to dependencies. The fact is, though, that no one really knows.
Judging the use of community-maintained OSS is important as we try to value its impact on society. This is also a fundamentally difficult problem. I hope that this post helps to highlight how these numbers may be misleading, and encourages us all to think more deeply about estimating impact.
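As a rough sketch of how the 10k and 15k figures relate: take weekly unique documentation visitors as the base and add about half of a dependent project's users. The Xarray figure of 10,000 below is an assumption, chosen only because it is consistent with the bump from 10k to 15k; the `estimate_users` helper is hypothetical.

```python
def estimate_users(doc_weekly_uniques, dependent_weekly_uniques, dependent_fraction=0.5):
    """Add a fraction of a dependent project's users to the direct-user estimate."""
    return doc_weekly_uniques + dependent_fraction * dependent_weekly_uniques


dask_docs = 10_000    # weekly unique documentation visitors, per the estimate above
xarray_docs = 10_000  # ASSUMPTION: illustrative value, not a measured number

print(estimate_users(dask_docs, xarray_docs))  # 15000.0
```

The result is only as good as its inputs, and both inputs are themselves proxies.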