Mar 27, 2019

cuML and Dask hyperparameter optimization


Setup

  • DGX-1 Workstation
  • Host Memory: 512 GB
  • GPU: Tesla V100 x 8
  • cudf 0.6
  • cuml 0.6
  • dask 1.1.4
  • Jupyter notebook

TL;DR: Hyper-parameter optimization is functional but slow with cuML

cuML and Dask Hyper-parameter Optimization

cuML is an open source GPU-accelerated machine learning library, primarily developed at NVIDIA, which mirrors the Scikit-Learn API. The current suite of algorithms includes GLMs, Kalman filtering, clustering, and dimensionality reduction. Many of these machine learning algorithms use hyper-parameters: parameters that govern the model training process but are not "learned" during training. Often these parameters are coefficients or penalty thresholds, and finding the "best" hyper-parameter can be computationally costly. In the PyData community, we often reach for Scikit-Learn's GridSearchCV or RandomizedSearchCV for easy definition of the search space for hyper-parameters – this is called hyper-parameter optimization. Within the Dask community, Dask-ML has incrementally improved the efficiency of hyper-parameter optimization by leveraging both Scikit-Learn and Dask to use multi-core and distributed schedulers: Grid and RandomizedSearch with Dask-ML.
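
As a quick illustration of the pattern, here is a minimal sketch using Scikit-Learn's RandomizedSearchCV; the dataset choice and the alpha range are illustrative, not the benchmark configuration used below:

import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_diabetes(return_X_y=True)

# sample a handful of alphas at random from a log-spaced range
param_dist = {'alpha': np.logspace(-3, 1, 50)}
search = RandomizedSearchCV(linear_model.Ridge(), param_dist, n_iter=5, cv=5)
search.fit(X, y)
print(search.best_params_)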

With cuML, the newly created drop-in replacement for Scikit-Learn, we experimented with Dask-ML's GridSearchCV. In the upcoming 0.6 release of cuML, the estimators are serializable and functional within the Scikit-Learn/Dask-ML framework, but slow compared with Scikit-Learn estimators. And while speeds are slow now, we know how to boost performance, have filed several issues, and hope to show performance gains in future releases.

All code and timing measurements can be found in this Jupyter notebook.

Fast Fitting!

cuML is fast! But finding that speed requires developing a bit of GPU knowledge and some intuition. For example, there is a non-zero cost to moving data from host to GPU, and when data is "small" there are little to no performance gains. "Small" currently might mean less than 100MB.
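
One way to build that intuition is to time the host-to-device copy directly. Below is a minimal sketch; the helper name, row counts, and column layout are our own illustrative assumptions:

import time
import numpy as np
import cudf

def host_to_device_seconds(n_rows, n_cols=10):
    # time how long it takes to copy a random host array into a cudf DataFrame
    data = np.random.rand(n_rows, n_cols)
    start = time.perf_counter()
    cudf.DataFrame({'fea%d' % i: data[:, i] for i in range(n_cols)})
    return time.perf_counter() - start

# transfer cost grows with data size; compare a "small" and a "large" copy
print(host_to_device_seconds(10000))      # well under 100MB
print(host_to_device_seconds(10000000))   # ~800MB of float64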

In the following example we use the diabetes dataset provided by Scikit-Learn and linearly fit the data with Ridge regression:

\[\min\limits_w ||y - Xw||^2_2 + \alpha ||w||^2_2\]

alpha is the hyper-parameter, which we initially set to 1.

import numpy as np
from cuml import Ridge as cumlRidge
import dask_ml.model_selection as dcv
from sklearn import datasets, linear_model
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import train_test_split, GridSearchCV

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)

fit_intercept = True
normalize = False
alpha = np.array([1.0])

# same model, two implementations: Scikit-Learn on the CPU, cuML on the GPU
ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

ridge.fit(X_train, y_train)
cu_ridge.fit(X_train, y_train)

The above ran with a single timing measurement of:

  • Scikit-Learn Ridge: 28 ms
  • cuML Ridge: 1.12 s

But the data is quite small, ~28KB. Increasing the size to ~2.8GB and re-running, we see significant gains:

import cudf

# duplicate the ~28KB training set (a 1e5x duplication factor is assumed here,
# consistent with the ~2.8GB size quoted above)
dup_data = np.array(np.vstack([X_train] * int(1e5)))
dup_train = np.array(np.hstack([y_train] * int(1e5)))

dup_ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
dup_cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

# move data from host to device
record_data = (('fea%d' % i, dup_data[:, i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

# sklearn: fit on host memory
dup_ridge.fit(dup_data, dup_train)

# cuml: fit on the GPU
dup_cu_ridge.fit(gdf_data, gdf_train.train)

With new timing measurements of:

  • Scikit-Learn Ridge: 4.82 s ± 694 ms
  • cuML Ridge: 450 ms ± 47.6 ms

With more data we clearly see faster fitting times, but the time to move data to the GPU (through cuDF) was 19.7 s. This cost of data movement is one of the reasons why RAPIDS/cuDF was developed – keep data on the GPU and avoid having to move it back and forth.
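
In code, that principle amounts to paying the transfer cost once and then reusing the device-resident frame. A sketch, continuing with the dup_data/dup_train arrays from above (the alpha values are arbitrary):

# pay the host-to-device cost once...
record_data = (('fea%d' % i, dup_data[:, i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

# ...then reuse the device-resident data across many fits
for a in [0.1, 1.0, 10.0]:
    model = cumlRidge(alpha=np.array([a]), fit_intercept=True, normalize=False, solver="eig")
    model.fit(gdf_data, gdf_train.train)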

Hyper-Parameter Optimization Experiments

So moving to the GPU can be costly, but once there, with larger data sizes, we gain significant performance optimizations. Naively, we thought, "well, we have GPU machine learning, we have distributed hyper-parameter optimization…we should have distributed, GPU-accelerated, hyper-parameter optimization!"

Scikit-Learn assumes a specific, well-defined API for estimators over which it will perform hyper-parameter optimization. Most estimators/classifiers in Scikit-Learn look like the following:

class DummyEstimator(BaseEstimator):
    def __init__(self, params=...):
        ...

    def fit(self, X, y=None):
        ...

    def predict(self, X):
        ...

    def score(self, X, y=None):
        ...

    def get_params(self, deep=True):
        ...

    def set_params(self, **params):
        ...
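
To make the contract concrete, here is a hypothetical toy estimator (not part of cuML) that satisfies the interface; because its constructor arguments are plain attributes, BaseEstimator supplies get_params/set_params for free:

import numpy as np
from sklearn.base import BaseEstimator

class MeanEstimator(BaseEstimator):
    # toy estimator: predicts the training mean shifted by a 'bias' hyper-parameter
    def __init__(self, bias=0.0):
        self.bias = bias

    def fit(self, X, y=None):
        self.mean_ = np.mean(y) + self.bias
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def score(self, X, y=None):
        # negative mean squared error, so larger is better for GridSearchCV
        return -np.mean((self.predict(X) - y) ** 2)

An estimator like this can be dropped straight into GridSearchCV or Dask-ML's dcv.GridSearchCV.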

When we started experimenting with hyper-parameter optimization, we found a few holes in the API. These were resolved, mostly by matching argument structures and adding various getters/setters:

  • get_params and set_params (#271)
  • fix/clf-solver (#318)
  • map fit_transform to sklearn implementation (#330)
  • Fea get params small changes (#322)

With the holes plugged, we tested again. Using the same diabetes dataset, we now perform hyper-parameter optimization, searching over many alpha parameters for the best-scoring alpha.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)

Again, reminding ourselves that the data is small, ~28KB, we don't expect to observe cuML performing faster than Scikit-Learn; instead, we want to demonstrate functionality. Additionally, we also tried swapping in Dask-ML's implementation of GridSearchCV (which adheres to the same API as Scikit-Learn's) to use all of the GPUs we have available in parallel.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)
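
To fan the Dask-ML search out across the machine (in our case, one worker per GPU), a distributed client is attached first. A sketch, assuming a scheduler is already running; the address below is a placeholder:

from dask.distributed import Client

# connect to a running Dask scheduler (address is a placeholder);
# Client() with no arguments would spin up a local cluster instead
client = Client('tcp://scheduler-address:8786')

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)  # tasks now execute on the cluster's workers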

Timing Measurements:

GridSearchCV    sklearn Ridge        cuML Ridge
Scikit-Learn    88.4 ms ± 6.11 ms    6.51 s ± 132 ms
Dask-ML         873 ms ± 347 ms      740 ms ± 142 ms

Unsurprisingly, we see that GridSearchCV with Ridge regression from Scikit-Learn is the fastest in this context. There is a cost to distributing work and data and, as we previously mentioned, to moving data from host to device.

How does performance scale as we scale data?

# scale the training set up by 1e2x (~2.8 MB) and 1e3x (~28 MB)
two_dup_data = np.array(np.vstack([X_train] * int(1e2)))
two_dup_train = np.array(np.hstack([y_train] * int(1e2)))
three_dup_data = np.array(np.vstack([X_train] * int(1e3)))
three_dup_train = np.array(np.hstack([y_train] * int(1e3)))

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(two_dup_data, two_dup_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(three_dup_data, three_dup_train)

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(three_dup_data, three_dup_train)

Timing Measurements:

Data      cuML + Dask-ML    sklearn + Dask-ML
2.8 MB    13.8 s
28 MB     1 min 17 s        4.87 s

cuML + Dask-ML (distributed GridSearchCV) does significantly worse as data sizes increase! Why? Primarily two reasons:

  1. Non-optimized movement of data between host and device, compounded by N devices and the size of the parameter space
  2. Scoring methods are not implemented within cuML

Below is the Dask graph for the GridSearch

There are 50 instances (cv=5 folds times 10 alpha parameters) of chunking up our test data set and scoring performance. That means we move data back and forth between host and device 50 times for fitting and 50 times for scoring. That's not great, but it's also very solvable – build scoring functions for GPUs!
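
As a sketch of what such a scorer could look like (hypothetical, not cuML's implementation): an R² computed entirely with cuDF Series operations, so predictions and labels never leave the device.

def gpu_r2_score(y_true, y_pred):
    # y_true, y_pred: cudf.Series already resident on the GPU
    residual = y_true - y_pred
    ss_res = (residual * residual).sum()
    deviation = y_true - y_true.mean()
    ss_tot = (deviation * deviation).sum()
    return 1.0 - ss_res / ss_tot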

Immediate Future Work

We know the problems, GitHub issues have been filed, and we are working on them – come help!

  • Built In Scorers (#242)
  • DeviceNDArray as input data (#369)
  • Communication with UCX (#2344)