Scott Sievert wrote this post. The original post lives at https://stsievert.com/blog/2019/09/27/dask-hyperparam-opt/ with better styling. This work is supported by Anaconda, Inc.
Dask’s machine learning package, Dask-ML, now implements Hyperband, an advanced “hyperparameter optimization” algorithm that performs rather well. In this post, I’ll walk through a practical example and highlight key portions of the paper “Better and faster hyperparameter optimization with Dask”, which is also summarized in a ~25 minute SciPy 2019 talk.
Machine learning requires data, an untrained model and “hyperparameters”: parameters that are chosen before training begins and that help with cohesion between the model and data. The user needs to specify values for these hyperparameters in order to use the model. A good example is adapting ridge regression or LASSO to the amount of noise in the data with the regularization parameter.1
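For instance, here is a minimal sketch with Scikit-learn's Ridge, where the regularization strength alpha has to be picked before fit is ever called (the value 1.0 below is an arbitrary guess, not a recommendation):

from sklearn.linear_model import Ridge

# alpha is a hyperparameter: it's fixed before training begins and controls
# how strongly the coefficients are regularized towards zero.
model = Ridge(alpha=1.0)  # is 1.0 right for this data's noise level? Who knows.
# model.fit(X, y) would then train this particular configuration.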
Model performance strongly depends on the hyperparameters provided. A fairly complex example is with a particular visualization tool, t-SNE. This tool requires (at least) three hyperparameters and performance depends radically on the hyperparameters. In fact, the first section in “How to Use t-SNE Effectively” is titled “Those hyperparameters really matter”.
Finding good values for these hyperparameters is critical and has an entire Scikit-learn documentation page, “Tuning the hyperparameters of an estimator.” Briefly, finding decent values of hyperparameters is difficult and requires guessing or searching.
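For reference, here is roughly what “searching” looks like with Scikit-learn's RandomizedSearchCV (a small sketch on made-up data; the parameter range is arbitrary). Every sampled model is trained to completion, which is exactly what Hyperband will improve on:

from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=1_000, noise=10)

# Draw 20 values of alpha at random, fit a model for each, keep the best one.
search = RandomizedSearchCV(Ridge(), {"alpha": loguniform(1e-4, 1e2)}, n_iter=20)
search.fit(X, y)
print(search.best_params_)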
How can these hyperparameters be found quickly and efficiently with an advanced task scheduler like Dask? Parallelism will pose some challenges, but the Dask architecture enables some advanced algorithms.
Note: this post presumes knowledge of Dask basics. This material is covered in Dask’s documentation on Why Dask?, a ~15 minute video introduction to Dask, a video introduction to Dask-ML and a blog post I wrote on my first use of Dask.
Dask-ML can quickly find high-performing hyperparameters. I will back this claim with intuition and experimental evidence.
Specifically, this is because Dask-ML now implements an algorithm introduced by Li et al. in “Hyperband: A novel bandit-based approach to hyperparameter optimization”. The pairing of Dask and Hyperband enables some exciting new performance opportunities, especially because Hyperband has a simple implementation and Dask is an advanced task scheduler.2
Let’s go through the basics of Hyperband, then illustrate its use and performance with an example. This will highlight some key points of the corresponding paper.
The motivation for Hyperband is to find high-performing hyperparameters with minimal training. Given this goal, it makes sense to spend more time training high-performing models – why spend more training time on a model if it has performed poorly in the past?
One method to spend more time on high-performing models is to initialize many models, start training all of them, and then stop training low-performing models before training is finished. That’s what Hyperband does. At the most basic level, Hyperband is a (principled) early-stopping scheme for RandomizedSearchCV.
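A rough sketch of that idea (this is illustrative pseudocode for one “bracket”, not Dask-ML's implementation; partial_fit and score are stand-ins for real training and validation):

import numpy as np

def successive_halving(models, partial_fit, score, aggressiveness=3, n_calls=1):
    # Train every model a little, keep the best 1/aggressiveness fraction,
    # give the survivors more training, and repeat until one model remains.
    while len(models) > 1:
        for model in models:
            for _ in range(n_calls):
                partial_fit(model)                    # one "unit" of training
        scores = [score(model) for model in models]
        keep = max(1, len(models) // aggressiveness)
        models = [models[i] for i in np.argsort(scores)[-keep:]]
        n_calls *= aggressiveness                     # survivors train longer
    return models[0]

# Toy usage: "models" are plain numbers, training is a no-op, and the score is
# the number itself, so the sketch keeps the largest one.
print(successive_halving(list(range(10)), lambda m: None, lambda m: m))  # 9

Hyperband's extra trick, described next, is to run several loops like this with different stopping aggressiveness and keep the best model overall.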
Deciding when to stop the training of models depends on how strongly the training data affects the score. There are two extremes:

- if the hyperparameters mostly determine the score, low-performing models can be identified and stopped after very little training
- if the amount of training mostly determines the score, models need more partial_fit calls before they can be judged
Hyperband balances these two extremes by sweeping over how frequently models are stopped. This sweep allows a mathematical proof that Hyperband will find the best model possible with minimal partial_fit calls3.
Hyperband has significant parallelism because it has two “embarrassingly parallel” for-loops – Dask can exploit this. Hyperband has been implemented in Dask, specifically in Dask’s machine learning library Dask-ML.
How well does it perform? Let’s illustrate via example. Some setup is required before the performance comparison in Performance.
Note: want to try HyperbandSearchCV out yourself? Dask has an example use. It can even be run in-browser!
I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:
>>> from experiment import make_circles
>>> X, y = make_circles(n_classes=4, n_features=6, n_informative=2)
>>> scatter(X[:, :2], color=y)
Note: this content is pulled from stsievert/dask-hyperband-comparison, with slight modifications.
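If you don't have that repository handy, a roughly similar 4-class dataset can be built with Scikit-learn and turned into Dask arrays (an approximation, not the exact dataset used here):

import dask.array as da
from sklearn.datasets import make_classification

# Roughly similar synthetic data: 4 classes, 6 features, 2 informative.
X, y = make_classification(
    n_samples=50_000, n_classes=4, n_features=6, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42,
)
# Dask-ML's searches expect Dask arrays so the data can be handed out chunk by chunk.
X = da.from_array(X, chunks=5_000)
y = da.from_array(y, chunks=5_000)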
Let’s build a fully connected neural net with 24 neurons for classification:
>>> from sklearn.neural_network import MLPClassifier
>>> model = MLPClassifier()
Building the neural net with PyTorch is also possible4 (and what I used in development).
This neural net’s behavior is dictated by 7 hyperparameters. Only one controls the architecture of the optimal model (hidden_layer_sizes, the number of neurons in each layer). The rest control finding the best model of that architecture. Details on the hyperparameters are in the Appendix.
>>> params = ... # details in appendix
>>> params.keys()
dict_keys(['hidden_layer_sizes', 'alpha', 'batch_size', 'learning_rate',
           'learning_rate_init', 'power_t', 'momentum'])
>>> params["hidden_layer_sizes"] # always 24 neurons
[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]
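The exact dictionary is in the Appendix and the linked repository; here is a hedged sketch of what it might look like (the distributions below are illustrative, not necessarily the ones used in the experiment):

from scipy.stats import loguniform, uniform

params = {
    # Architecture: always 24 neurons, arranged in different ways.
    "hidden_layer_sizes": [
        (24,), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)
    ],
    # Optimization hyperparameters (illustrative ranges).
    "alpha": loguniform(1e-6, 1e-2),               # L2 regularization strength
    "batch_size": [32, 64, 128, 256],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "power_t": uniform(0.1, 0.8),                  # exponent for "invscaling"
    "momentum": uniform(0.5, 0.5),                 # i.e. between 0.5 and 1.0
}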
I chose these hyperparameters to have a complex search space that mimics the searches performed for most neural networks. These searches typically involve hyperparameters like “dropout”, “learning rate”, “momentum” and “weight decay”.5 End users don’t care about hyperparameters like these: they don’t change the model architecture, only how the best model of a particular architecture is found.
How can high performing hyperparameter values be found quickly?
First, let’s look at the parameters required for Dask-ML’s implementation of Hyperband (which is in the class HyperbandSearchCV).
HyperbandSearchCV has two inputs:

- max_iter, which determines the maximum number of partial_fit calls any one model receives
- the chunk size of the Dask array, which determines how many examples each partial_fit call receives
These fall out pretty naturally once it’s known how long to train the best model and very approximately how many parameters to sample:
n_examples = 50 * len(X_train) # 50 passes through dataset for best model
n_params = 299 # sample about 300 parameters
# inputs to hyperband
max_iter = n_params
chunk_size = n_examples // n_params
The inputs to this rule-of-thumb are exactly what the user cares about:

- how long to train the best model (via n_examples)
- roughly how many hyperparameter combinations to sample (via n_params)
Notably, there’s no tradeoff between n_examples and n_params like with Scikit-learn’s RandomizedSearchCV because n_examples is only for some models, not for all models. There are more details on this rule-of-thumb in the “Notes” section of the HyperbandSearchCV docs.
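To make the arithmetic concrete, suppose X_train has 50,000 rows (a made-up number):

n_examples = 50 * 50_000              # best model sees 2,500,000 examples
n_params = 299

max_iter = n_params                   # at most 299 partial_fit calls per model
chunk_size = n_examples // n_params   # about 8,361 examples per call

# The most-trained model then sees max_iter * chunk_size, which is roughly
# n_examples; most models are stopped long before that.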
With these inputs a HyperbandSearchCV object can easily be created.
This model selection algorithm, Hyperband, is implemented in the class HyperbandSearchCV. Let’s create an instance of that class:
>>> from dask_ml.model_selection import HyperbandSearchCV
>>>
>>> search = HyperbandSearchCV(
...     model, params, max_iter=max_iter, aggressiveness=4
... )
aggressiveness defaults to 3. aggressiveness=4 is chosen because this is an initial search; I know nothing about this search space, so the search should be more aggressive in culling off bad models.
Hyperband hides some details from the user (which enables the mathematical guarantees), specifically the details on the amount of training and the number of models created. These details are available in the metadata attribute:
>>> search.metadata["n_models"]
378
>>> search.metadata["partial_fit_calls"]
5721
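For scale, training every one of those models to completion would take far more partial_fit calls; quick arithmetic with the numbers above (not an output of HyperbandSearchCV, just a back-of-the-envelope check):

n_models = search.metadata["n_models"]               # 378
actual_calls = search.metadata["partial_fit_calls"]  # 5,721

calls_without_stopping = n_models * max_iter         # 378 * 299 = 113,022
print(calls_without_stopping / actual_calls)         # roughly 20x more work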
Now that we have some idea of how long the computation will take, let’s ask it to find the best set of hyperparameters:
>>> from dask_ml.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>>
>>> X_train = X_train.rechunk(chunk_size)
>>> y_train = y_train.rechunk(chunk_size)
>>>
>>> search.fit(X_train, y_train)
The dashboard will be active during this time6:
(Video: the Dask dashboard while the search runs.)
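The dashboard is a feature of Dask’s distributed scheduler; if a client isn’t already running, starting one before calling fit makes it available (a minimal sketch):

from dask.distributed import Client

client = Client()              # start a local cluster with one worker per core
print(client.dashboard_link)   # usually http://localhost:8787/status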
How well do these hyperparameters perform?
>>> search.best_score_
0.9019221418447483
HyperbandSearchCV mirrors Scikit-learn’s API for RandomizedSearchCV, so it has access to all the expected attributes and methods:
>>> search.best_params_
{"batch_size": 64, "hidden_layer_sizes": [6, 6, 6, 6], ...}
>>> search.score(X_test, y_test)
0.8989070100111217
>>> search.best_model_
MLPClassifier(...)
Details on the attributes and methods are in the HyperbandSearchCV documentation.
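For example, cv_results_ records every sampled parameter set and its scores, just like RandomizedSearchCV’s, and loads nicely into a DataFrame (a small sketch, assuming the search above has been fit):

import pandas as pd

cv_results = pd.DataFrame(search.cv_results_)
print(cv_results.head())   # one row per sampled parameter set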
I ran this 200 times on my personal laptop with 4 cores. Let’s look at the distribution of final validation scores:
The “passive” comparison is really RandomizedSearchCV configured so it performs the same amount of work as HyperbandSearchCV. Let’s see how this does over time:
This graph shows the mean score over the 200 runs with the solid line, and the shaded region represents the interquartile range. The dotted green line indicates the data required to train 4 models to completion. “Passes through the dataset” is a good proxy for “time to solution” because there are only 4 workers.
This graph shows that HyperbandSearchCV will find parameters at least 3 times quicker than RandomizedSearchCV.
What opportunities does combining Hyperband and Dask create? HyperbandSearchCV has a lot of internal parallelism and Dask is an advanced task scheduler.
The most obvious opportunity involves job prioritization. Hyperband fits many models in parallel, and Dask might not have that many workers available. This means some jobs have to wait for other jobs to finish. Of course, Dask can prioritize jobs7 and choose which models to fit first.
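To be clear about the mechanism: Dask’s distributed scheduler accepts a per-task priority, and higher-priority tasks run first. This toy sketch only illustrates that keyword; it is not HyperbandSearchCV’s internal code:

import time
from dask.distributed import Client

client = Client(n_workers=2)

def fake_partial_fit(model_id):
    time.sleep(1)        # stand-in for a real partial_fit call
    return model_id

# Submit one "training" task per model, prioritized by its most recent score,
# so the scheduler works on promising models first when workers are scarce.
recent_scores = {"model-a": 0.91, "model-b": 0.55, "model-c": 0.78}
futures = [
    client.submit(fake_partial_fit, model_id, priority=score)
    for model_id, score in recent_scores.items()
]
print(client.gather(futures))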
Let’s assign the priority for fitting a certain model to be the model’s most recent score. How does this prioritization scheme influence the score? Let’s compare the prioritization schemes in a single run of the 200 above:
These two lines are the same in every way except for the prioritization scheme. This graph compares the “high scores” prioritization scheme and Dask’s default prioritization scheme (“fifo”).
This graph is certainly helped by the fact that it is run with only 4 workers. Job priority does not matter if every job can be run right away (there’s nothing to assign priority to!).
How does Hyperband scale with the number of workers?
I ran a separate experiment to measure this scaling. This experiment is described more in the corresponding paper, but the relevant difference is that a PyTorch neural network is used through skorch instead of Scikit-learn’s MLPClassifier.
I ran the same experiment with a different number of Dask workers.8 Here’s how HyperbandSearchCV scales:
Training one model to completion requires 243 seconds (which is marked by the white line). This is a comparison with patience, which stops training models if their scores aren’t increasing enough. Functionally, this is very useful because the user might accidentally specify n_examples to be too large.
It looks like the speedups start to saturate somewhere between 16 and 24 workers, at least for this example. Of course, patience doesn’t work as well for a large number of workers.9
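Dask-ML exposes this stopping behavior through HyperbandSearchCV’s patience and tol parameters; a sketch (the values here are illustrative):

# Stop training a model early if its score hasn't improved by at least tol
# over its most recent partial_fit calls.
search = HyperbandSearchCV(
    model, params, max_iter=max_iter, patience=True, tol=1e-3
)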
There are some ongoing pull requests to improve HyperbandSearchCV. The most significant of these involves tweaking some Hyperband internals so HyperbandSearchCV works better with initial or very exploratory searches (dask/dask-ml #532).
The biggest improvement I see is treating dataset size as the scarce resource that needs to be preserved instead of training time. This would allow Hyperband to work with any model, instead of only models that implement partial_fit.
Serialization is an important part of the distributed Hyperband implementation in HyperbandSearchCV. Scikit-learn and PyTorch can easily handle this because they support the Pickle protocol10, but Keras/Tensorflow/MXNet present challenges. The use of HyperbandSearchCV could be increased by resolving this issue.
I chose to tune 7 hyperparameters. Only one, hidden_layer_sizes, controls the architecture: the number of layers and the number of neurons in each layer.

More hyperparameters control finding the best neural network of that architecture: alpha, batch_size, learning_rate, learning_rate_init, power_t and momentum. These are all standard MLPClassifier parameters; details on each are in Scikit-learn’s MLPClassifier documentation.