We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations like the following to work well.
This post describes the kind of work we had to do as a model for future development.
On the Dask side this mostly meant replacing
On the cuDF side this meant making dozens of tiny changes to align the cuDF API to the Pandas API, and to add missing features.
I don’t really expect anyone to go through all of those issues, but my hope is that by skimming over the issue titles people will get a sense for the kinds of changes we’re making here. It’s a large number of small things.
Also, kudos to Thomson Comer, who solved most of the cuDF issues above.
But generally things work pretty well today:
In : import dask_cudf
In : df = dask_cudf.read_csv('yellow_tripdata_2016-*.csv')
In : df.groupby('passenger_count').trip_distance.mean().compute()
Out: <cudf.Series nrows=10 >
In : _.to_pandas()
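To make the shape of this computation concrete, here is a minimal sketch of how a grouped mean can be split across partitions: each partition produces a per-group sum and count, and those partial results are then combined. This is plain Python over lists of dicts, not the Dask or cuDF implementation. The helper names (`partial_sum_count`, `combine_means`) and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def partial_sum_count(rows, key, value):
    """Per-partition step: accumulate [sum, count] of `value` for each `key`."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        entry = acc[row[key]]
        entry[0] += row[value]
        entry[1] += 1
    return dict(acc)


def combine_means(partials):
    """Combine step: add up the partial sums and counts, then divide."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (s, n) in partial.items():
            totals[group][0] += s
            totals[group][1] += n
    return {group: s / n for group, (s, n) in totals.items()}


# Two hypothetical partitions, standing in for two CSV files.
partitions = [
    [
        {"passenger_count": 1, "trip_distance": 2.0},
        {"passenger_count": 2, "trip_distance": 4.0},
    ],
    [
        {"passenger_count": 1, "trip_distance": 4.0},
        {"passenger_count": 2, "trip_distance": 6.0},
        {"passenger_count": 1, "trip_distance": 6.0},
    ],
]

means = combine_means(
    partial_sum_count(p, "passenger_count", "trip_distance") for p in partitions
)
print(means)
```

The point of the split is that the per-partition step only ever touches one partition at a time, so it can run on whichever dataframe library holds that partition, as long as that library supports the same small set of operations.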
First, most of this work was handled by the cuDF developers (which may be evident from the relative lengths of the issue lists above). When we started this process it felt like a never-ending stream of tiny issues. We weren’t able to see the next set of issues until we had finished the current set. Fortunately, most of them were pretty easy to fix, and the work seemed to get a bit easier as we went on.
Additionally, lots of things work other than groupby-aggregations as a result of the changes above. From the perspective of someone accustomed to Pandas, the cuDF library is starting to feel more reliable. We hit missing functionality less frequently when using cuDF on other operations.
More recently we’ve been working on the various join/merge operations in Dask DataFrame, like indexed joins on a sorted column, joins between large and small dataframes (a common special case), and so on. Getting these algorithms from the mainline Dask DataFrame codebase to work with cuDF is resulting in a similar set of issues to what we saw above with groupby-aggregations, but so far the list is much smaller. We hope that this trend holds as we continue on to other sets of functionality in the future, like I/O, time-series operations, rolling windows, and so on.
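As an illustration of why the large-and-small join is a convenient special case, here is a minimal sketch in plain Python: the small table is turned into a lookup once, and every partition of the large table is joined against it independently, with no data moving between partitions. This is an assumption-laden toy, not the Dask DataFrame algorithm. The function name `broadcast_join`, the `vendor_id` and `vendor` columns, and the sample rows are all hypothetical.

```python
def broadcast_join(large_partitions, small_rows, on):
    """Inner-join each partition of a large table against one small table.

    The small table is indexed once; each large partition is then
    processed on its own, so partitions never need to exchange rows.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[on], []).append(row)

    for partition in large_partitions:
        joined = []
        for row in partition:
            for match in lookup.get(row[on], []):
                joined.append({**match, **row})
        yield joined


# Hypothetical data: a partitioned "large" table and a small lookup table.
trips = [
    [
        {"vendor_id": 1, "trip_distance": 2.0},
        {"vendor_id": 2, "trip_distance": 3.5},
    ],
    [
        {"vendor_id": 1, "trip_distance": 1.0},
        {"vendor_id": 3, "trip_distance": 9.0},  # no matching vendor
    ],
]
vendors = [
    {"vendor_id": 1, "vendor": "A"},
    {"vendor_id": 2, "vendor": "B"},
]

result = list(broadcast_join(trips, vendors, "vendor_id"))
for partition in result:
    print(partition)
```

The output keeps the partitioning of the large table, which is what makes this case cheaper than a general join where both sides have to be rearranged by key.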