Using bbo for auto-tuning of Machine Learning models
In this example, we will use bbo as a stand-alone black-box optimization library to find the optimum hyper-parameters of two Machine Learning models (a Support Vector Machine and a RandomForestClassifier) on the breast_cancer dataset.
Loading the dataset
The breast_cancer dataset is a classic, easy binary classification dataset, which makes it a convenient example for tuning the models. After loading the data, we use the train_test_split function to divide it into a train dataset, on which the models are trained, and a test dataset, on which they are evaluated to get their accuracy on unseen data. As bbo is a minimizer, the quantity to optimize is the opposite of the accuracy (i.e. minus the accuracy).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data["data"], data["target"], test_size=0.33, random_state=42)
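Because bbo minimizes, the sign flip matters: the parametrization with the lowest negated accuracy is the one with the highest accuracy. The sketch below illustrates this on hand-written predictions. The `accuracy` helper and the candidate names are hypothetical and only stand in for a fitted model's `score` method.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def objective(y_true, y_pred):
    """Negated accuracy, so that a lower value means a better model."""
    return -accuracy(y_true, y_pred)


# Toy labels and predictions from three hypothetical candidates.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
candidates = {
    "candidate_a": [0, 1, 0, 0, 1, 1, 1, 1],  # 6 of 8 correct
    "candidate_b": [0, 1, 1, 0, 1, 0, 1, 0],  # 7 of 8 correct
    "candidate_c": [1, 0, 1, 0, 0, 0, 1, 1],  # 5 of 8 correct
}

scores = {name: objective(y_true, pred) for name, pred in candidates.items()}

# Minimizing the negated accuracy selects the same candidate as
# maximizing the accuracy.
best_by_minimizing = min(scores, key=scores.get)
best_by_accuracy = max(candidates, key=lambda name: accuracy(y_true, candidates[name]))
print(best_by_minimizing, best_by_accuracy)
```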
Design black-box class
To use bbo, we need to design two classes that act as black-boxes that can be tuned: each has a compute method that takes a parametrization as input, trains the model with this parametrization on the train data, evaluates it on the test data and returns the opposite of the accuracy.
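The only contract described here is a `compute` method that maps one parametrization to a value to minimize. As a minimal sketch of that contract, the toy class below replaces model training with a simple formula, and a plain exhaustive search (a stand-in, not bbo's optimizer) calls `compute` on every point of a small grid. The names `ToyBlackBox` and `exhaustive_search` are hypothetical.

```python
import itertools


class ToyBlackBox:
    """Toy black-box: compute() maps a parametrization to a value to minimize."""

    def compute(self, parameters):
        # parameters[0] is numeric, parameters[1] is a qualitative choice.
        x = float(parameters[0])
        mode = str(parameters[1])
        penalty = 0.0 if mode == "good" else 1.0
        return (x - 3.0) ** 2 + penalty


def exhaustive_search(black_box, grid):
    """Evaluate every combination of the grid and return the best one."""
    best_parameters, best_value = None, float("inf")
    for parameters in itertools.product(*grid):
        value = black_box.compute(parameters)
        if value < best_value:
            best_parameters, best_value = parameters, value
    return best_parameters, best_value


# One list of admissible values per parameter.
grid = [[1, 2, 3, 4, 5], ["good", "bad"]]
best_parameters, best_value = exhaustive_search(ToyBlackBox(), grid)
print(best_parameters, best_value)
```

Evaluating every combination is only feasible on a grid this small, which is why a heuristic is used for the real models.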
We will tune two sklearn models:
- SVM: we will look for the optimum value of:
  * C
  * kernel
  * degree
  * gamma
  * coef0
  * shrinking
  * probability
  * tol
- Random forest: we will look for the optimum value of:
  * n_estimators
  * criterion
  * max_depth
  * min_samples_split
  * min_weight_fraction_leaf
  * max_features
For each model, we will define the class with the compute method and the corresponding parametric grid (i.e. the values that can be tested by the optimizer for each hyper-parameter).
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
class OptimizableSVM:
    """
    Optimizable class to find the optimal parametrization of a SVM model.
    """
    def compute(self, parameters):
        """
        Trains the model on the train data and outputs the opposite of its accuracy on the test data.
        """
        parameters_dict = {
            "C": float(parameters[0]),
            "kernel": str(parameters[1]),
            "degree": int(parameters[2]),
            "gamma": str(parameters[3]),
            "coef0": float(parameters[4]),
            "shrinking": bool(parameters[5]),
            "probability": bool(parameters[6]),
            "tol": float(parameters[7]),
        }
        svc = SVC(**parameters_dict)
        svc.fit(X_train, y_train)
        return -svc.score(X_test, y_test)
# Define parametric grid
c = np.arange(1, 20, 1)
kernel = np.array(["linear", "poly", "rbf", "sigmoid"])
degree = np.arange(1, 4, 1)
gamma = np.array(['scale', 'auto'])
coef = np.arange(0, 1, 0.01)
shrinking = np.array([True, False])
probability = np.array([True, False])
tol = np.arange(0.01, 0.05, 0.01)
svm_parametric_grid = np.array([c, kernel, degree, gamma, coef, shrinking, probability, tol], dtype=object)
class OptimizableRandomForest:
    """
    Model that will act as a black-box.
    """
    def compute(self, parameters):
        parameters_dict = {
            "n_estimators": int(parameters[0]),
            "criterion": str(parameters[1]),
            "max_depth": int(parameters[2]),  # max_depth must be an integer
            "min_samples_split": int(parameters[3]),
            "min_weight_fraction_leaf": float(parameters[4]),
            "max_features": str(parameters[5]),
        }
        random_forest = RandomForestClassifier(**parameters_dict)
        random_forest.fit(X_train, y_train)
        return -random_forest.score(X_test, y_test)
n_estimators = np.arange(50, 200, 20)
criterion = np.array(["gini", "entropy"])
max_depth = np.arange(5, 10, 1)
min_samples_split = np.arange(2, 10, 1)
min_weight_fraction_leaf = np.arange(0, 0.4, 0.1)
max_features = np.array(["sqrt", "log2"])  # "auto" (same as "sqrt" for classifiers) is rejected by scikit-learn >= 1.3
rf_parametric_grid = np.array([n_estimators, criterion, max_depth, min_samples_split, min_weight_fraction_leaf, max_features], dtype=object)
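To get a feel for why a heuristic is needed, note that the number of distinct parametrizations in a grid is the product of the number of values on each axis. The sketch below counts this for the first four axes of the SVM grid, rewritten as plain Python lists (an assumption made to keep the example free of dependencies), and draws one random parametrization from it.

```python
import math
import random

# First four axes of the SVM grid, as plain lists.
c_values = list(range(1, 20))  # mirrors np.arange(1, 20, 1)
kernel_values = ["linear", "poly", "rbf", "sigmoid"]
degree_values = list(range(1, 4))  # mirrors np.arange(1, 4, 1)
gamma_values = ["scale", "auto"]

partial_grid = [c_values, kernel_values, degree_values, gamma_values]

# Number of distinct parametrizations: product of the axis lengths.
n_combinations = math.prod(len(axis) for axis in partial_grid)
print(n_combinations)  # 19 * 4 * 3 * 2 = 456

# One parametrization is one value picked on each axis.
rng = random.Random(0)
sample = [rng.choice(axis) for axis in partial_grid]
print(sample)
```

Each of the remaining axes multiplies this count again, so evaluating every combination quickly becomes impractical.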
Setup optimizer
We then need to set up the optimizer. As genetic algorithms are currently the only heuristic that supports qualitative variables, we will use them, with single-point crossover and a tournament pick. The mutation rate is set to 0.3. We will start from 10 initial data points and run at most 20 iterations, with a time-out of 200 seconds.
from bbo.optimizer import BBOptimizer
from bbo.heuristics.genetic_algorithm.mutations import mutate_chromosome_to_neighbor
from bbo.heuristics.genetic_algorithm.selections import tournament_pick
from bbo.heuristics.genetic_algorithm.crossover import single_point_crossover
svm_model = OptimizableSVM()
svm_bb = BBOptimizer(
    black_box=svm_model,  # the black-box to optimize
    parameter_space=svm_parametric_grid,  # the grid on which to perform the optimization
    initial_sample_size=10,  # the initial size of the sample
    heuristic="genetic_algorithm",  # the name of the heuristic to use
    max_iteration=20,  # the maximum number of iterations
    time_out=200,  # in seconds, the maximum elapsed time
    # the following arguments are specific to genetic algorithms:
    mutation_method=mutate_chromosome_to_neighbor,  # the mutation function
    mutation_rate=0.3,  # the mutation rate
    crossover_method=single_point_crossover,  # the crossover function
    selection_method=tournament_pick,  # the selection function
)
svm_bb.optimize()
rf_model = OptimizableRandomForest()
rf_bb = BBOptimizer(
    black_box=rf_model,  # the black-box to optimize
    parameter_space=rf_parametric_grid,  # the grid on which to perform the optimization
    initial_sample_size=10,  # the initial size of the sample
    heuristic="genetic_algorithm",  # the name of the heuristic to use
    max_iteration=20,  # the maximum number of iterations
    time_out=200,  # in seconds, the maximum elapsed time
    # the following arguments are specific to genetic algorithms:
    mutation_method=mutate_chromosome_to_neighbor,  # the mutation function
    mutation_rate=0.3,  # the mutation rate
    crossover_method=single_point_crossover,  # the crossover function
    selection_method=tournament_pick,  # the selection function
)
rf_bb.optimize()
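For readers unfamiliar with the three genetic operators passed to the optimizer, here is a simplified, dependency-free sketch of what a tournament pick, a single-point crossover and a mutation to a neighboring grid value typically do. These functions are illustrative assumptions and are not bbo's implementations; their names and signatures are hypothetical.

```python
import random


def tournament_pick_sketch(population, fitness, rng, size=2):
    """Draw `size` individuals at random and keep the fittest (lowest fitness)."""
    contenders = rng.sample(range(len(population)), size)
    winner = min(contenders, key=lambda i: fitness[i])
    return population[winner]


def single_point_crossover_sketch(parent_1, parent_2, rng):
    """Take the head of one parent and the tail of the other."""
    point = rng.randint(1, len(parent_1) - 1)
    return parent_1[:point] + parent_2[point:]


def mutate_to_neighbor_sketch(child, grid, rate, rng):
    """With probability `rate`, move each gene to an adjacent value on its axis."""
    mutated = []
    for gene, axis in zip(child, grid):
        if rng.random() < rate:
            index = axis.index(gene)
            neighbors = [i for i in (index - 1, index + 1) if 0 <= i < len(axis)]
            gene = axis[rng.choice(neighbors)]
        mutated.append(gene)
    return mutated


rng = random.Random(42)
grid = [[1, 2, 3, 4, 5], ["linear", "poly", "rbf"], [0.01, 0.02, 0.03]]
population = [[1, "linear", 0.01], [5, "rbf", 0.03], [3, "poly", 0.02]]
fitness = [-0.90, -0.95, -0.92]  # made-up negated accuracies

# One generation step: select two parents, cross them, mutate the child.
parent_1 = tournament_pick_sketch(population, fitness, rng)
parent_2 = tournament_pick_sketch(population, fitness, rng)
child = single_point_crossover_sketch(parent_1, parent_2, rng)
child = mutate_to_neighbor_sketch(child, grid, rate=0.3, rng=rng)
print(child)
```

Moving a qualitative value such as the kernel to a "neighbor" only relies on its position in the grid, which is one way such operators can handle non-numeric parameters.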
Read results
The results of the optimization can be read using the summarize method on each object. The fitness history (i.e. the opposite of the accuracy at each iteration step) can also be plotted, with its sign flipped back, to look at the convergence trajectory.
from matplotlib import pyplot as plt

svm_bb.summarize()
plt.figure()  # one figure per model, so the curves do not overlap
plt.plot(-svm_bb.history["fitness"])  # flip the sign back to get the accuracy
plt.title("Accuracy as a function of the number of iterations for SVM model")

rf_bb.summarize()
plt.figure()
plt.plot(-rf_bb.history["fitness"])
plt.title("Accuracy as a function of the number of iterations for Random Forest model")
plt.show()
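The raw fitness history can be noisy, because each iteration evaluates a new parametrization that is not necessarily better than the previous ones. A best-so-far curve is often easier to read. The sketch below computes it on a made-up list of negated accuracies; it is an assumption that the values stored in `history["fitness"]` can be converted to such a list.

```python
from itertools import accumulate

# Made-up negated accuracies, one per evaluation (lower is better).
fitness_history = [-0.91, -0.89, -0.94, -0.93, -0.96, -0.95]

# Running minimum of the fitness: the best value found so far.
best_so_far = list(accumulate(fitness_history, min))

# Flip the sign back to report accuracies.
best_accuracy_so_far = [-value for value in best_so_far]
print(best_accuracy_so_far)
```

Passing `best_accuracy_so_far` to `plt.plot` gives a curve that never decreases, which makes it easier to see when the optimizer stops improving.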