Mar 4, 2019

Building GPU Groupby-Aggregations for Dask

Summary

We’ve sufficiently aligned Dask DataFrame and cuDF to get groupby aggregations like the following to work well.

df.groupby('x').y.mean()
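For readers less familiar with the operation, here is what that one-liner computes on a toy Pandas DataFrame (the same pattern is what Dask and cuDF now support; the data here is illustrative):

```python
import pandas as pd

# Toy data: for each group in column 'x', average the values in column 'y'.
df = pd.DataFrame({'x': ['a', 'b', 'a', 'b'],
                   'y': [1.0, 2.0, 3.0, 4.0]})

result = df.groupby('x').y.mean()
print(result)
# x
# a    2.0
# b    3.0
# Name: y, dtype: float64
```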

This post describes the kind of work we had to do as a model for future development.

Plan

As outlined in a previous post, Dask, Pandas, and GPUs: first steps, our plan to produce distributed GPU dataframes was to combine Dask DataFrame with cudf. In particular, we had to

  • change Dask DataFrame so that it would parallelize not just around the Pandas DataFrames that it works with today, but around anything that looked enough like a Pandas DataFrame
  • change cuDF so that it would look enough like a Pandas DataFrame to fit within the algorithms in Dask DataFrame
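As a rough illustration of the first point, code that touches only the common DataFrame API runs unchanged on any backend that provides it. This sketch is illustrative, not Dask internals:

```python
import pandas as pd

def column_means(df):
    # Uses only API surface shared by Pandas-like DataFrames
    # (.columns, indexing, .mean()).  A cuDF DataFrame exposing
    # the same methods would work here unchanged.
    return {col: df[col].mean() for col in df.columns}

pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
print(column_means(pdf))  # {'x': 2.0, 'y': 5.0}
```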

Changes

On the Dask side this mostly meant

  • replacing isinstance(df, pd.DataFrame) checks with is_dataframe_like(df) checks (after defining suitable is_dataframe_like/is_series_like/is_index_like functions)
  • avoiding some more exotic functionality in Pandas, and instead trying to use more common functionality that we can expect to be in most DataFrame implementations
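The first bullet can be sketched as follows. This is a simplified, duck-typed stand-in; Dask’s actual check is more thorough:

```python
import pandas as pd

def is_dataframe_like(df):
    # Simplified stand-in for isinstance(df, pd.DataFrame):
    # accept anything exposing the core DataFrame attributes,
    # whether it comes from Pandas, cuDF, or elsewhere.
    return all(hasattr(df, attr)
               for attr in ('dtypes', 'columns', 'groupby', 'head'))

print(is_dataframe_like(pd.DataFrame({'x': [1]})))  # True
print(is_dataframe_like([1, 2, 3]))                 # False
```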

On the cuDF side this meant making dozens of tiny changes to align the cuDF API to the Pandas API, and to add in missing features.

I don’t really expect anyone to go through all of those issues, but my hope is that by skimming over the issue titles people will get a sense for the kinds of changes we’re making here. It’s a large number of small things.

Also, kudos to Thomson Comer who solved most of the cuDF issues above.

There are still some pending issues

But things mostly work

Generally, though, things work pretty well today:

In [1]: import dask_cudf

In [2]: df = dask_cudf.read_csv('yellow_tripdata_2016-*.csv')

In [3]: df.groupby('passenger_count').trip_distance.mean().compute()
Out[3]: <cudf.Series nrows=10 >

In [4]: _.to_pandas()
Out[4]:
0 0.625424
1 4.976895
2 4.470014
3 5.955262
4 4.328076
5 3.079661
6 2.998077
7 3.147452
8 5.165570
9 5.916169
dtype: float64

Experience

First, most of this work was handled by the cuDF developers (which may be evident from the relative lengths of the issue lists above). When we started this process it felt like a never-ending stream of tiny issues. We weren’t able to see the next set of issues until we had finished the current set. Fortunately, most of them were pretty easy to fix. Additionally, as we went on, it seemed to get a bit easier over time.

Moreover, lots of things other than groupby-aggregations work as a result of the changes above. From the perspective of someone accustomed to Pandas, the cuDF library is starting to feel more reliable. We hit missing functionality less frequently when using cuDF on other operations.

What’s next?

More recently we’ve been working on the various join/merge operations in Dask DataFrame, like indexed joins on a sorted column, joins between large and small dataframes (a common special case), and so on. Getting these algorithms from the mainline Dask DataFrame codebase to work with cuDF is resulting in a similar set of issues to what we saw above with groupby-aggregations, but so far the list is much smaller. We hope this trend holds as we move on to other sets of functionality in the future, like I/O, time-series operations, rolling windows, and so on.
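For concreteness, these are the frame-level operations being exercised, shown here with plain Pandas on toy data; the ongoing work is about making the Dask versions of these algorithms run on cuDF as well:

```python
import pandas as pd

large = pd.DataFrame({'key': [1, 2, 3, 4], 'value': [10, 20, 30, 40]})
small = pd.DataFrame({'key': [2, 4], 'label': ['a', 'b']})

# Join between a large and a small dataframe (a common special case:
# in Dask the small side can be broadcast to every partition).
merged = large.merge(small, on='key', how='inner')

# Indexed join on a sorted column: with a sorted index, partitions
# can be aligned without a full shuffle.
joined = large.set_index('key').join(small.set_index('key'))

print(merged)
print(joined)
```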