In this post and the next, we will look at one of the trickiest and most critical problems in Machine Learning (ML): hyper-parameter tuning. After reviewing what hyper-parameters (hyper-params for short) are and how they differ from plain-vanilla learnable parameters, we introduce three general-purpose discrete optimization algorithms aimed at searching for the optimal hyper-param combination: grid search, coordinate descent, and genetic algorithms. We then report on the results of an experiment in which we use each of these to come up with good hyper-params on an example ML problem taken from Kaggle. Although we focus on optimizing XGBoost hyper-params in our experiment, pretty much all of what we present applies to any other advanced ML algorithm.

The most powerful ML algorithms are famous for picking up patterns and regularities in the data by automatically tuning thousands (or even millions) of so-called "learnable" parameters. For instance, in tree-based models (decision trees, random forests, XGBoost), the learnable parameters are the choice of decision variable at each node and the numeric thresholds used to decide whether to take the left or right branch when generating predictions. In neural networks, the learnable parameters are the weights on each connection, which amplify or dampen the activation passed from one layer to the next.

Learnable parameters are, however, only part of the story. The more flexible and powerful an algorithm is, the more design decisions and adjustable hyper-parameters it will have. These are parameters specified by hand to the algorithm and held fixed throughout a training pass; the algorithm typically does not include any logic to optimize them for us. In tree-based models, hyper-params include things like the maximum depth of each tree, the number of trees to grow, the number of variables to consider when building each tree, the minimum number of samples in a leaf, the fraction of observations used to build each tree, and a few others. For neural networks, the list includes the number of hidden layers, the size (and shape) of each layer, the choice of activation function, the drop-out rate, and the L1/L2 regularization constants.

From a computational point of view, supervised ML boils down to minimizing a certain loss function L (e.g. the MSE or the classification error) that depends on the training data (Xt, Yt), the learnable params, which we denote by a, and the hyper-params. The crucial observation here is that this minimization is done by letting only the learnable parameters vary, while holding the data and the hyper-params constant.

To choose the optimal set of hyper-params, the usual approach is to perform cross-validation. Given a hyper-param vector h, its quality is assessed by evaluating the loss function on a held-out set of validation data (Xv, Yv), using the best set of learnable parameters, a*, for that value of h. Symbolically, we have

a*(h) = argmin_a L(a; h, Xt, Yt)    and    h* = argmin_h L(a*(h); h, Xv, Yv)
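To make this two-level optimization concrete, here is a minimal sketch of the inner evaluation in Python. It assumes a regression problem, the scikit-learn wrapper that ships with the xgboost package, and MSE as the loss; the helper name `cv_loss` and the 5-fold setup are our own illustrative choices, not something prescribed above.

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def cv_loss(h, X_train, y_train, n_folds=5):
    """Cross-validated loss of one hyper-param vector h (a dict)."""
    # Fitting the model inside each fold is the inner minimization over
    # the learnable parameters a, with the hyper-params h held fixed.
    model = XGBRegressor(**h)
    scores = cross_val_score(model, X_train, y_train, cv=n_folds,
                             scoring="neg_mean_squared_error")
    # scikit-learn reports negated MSE, so flip the sign to get a loss.
    return -scores.mean()
```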
Even a coarse discretization of each hyper-param leads to a large search space. Suppose we limit the range of the (continuous) learning_rate hyper-param to only six values, that of max_depth to eight, and so forth for the remaining hyper-params. Even then, there are 6 x 8 x 4 x 5 x 4 = 3840 possible combinations. This discrete subspace of all possible hyper-parameter values is called the hyper-parameter grid. In what follows, we will use the vector notation h = (h1, ..., h5) to denote any such combination, that is, any point in the grid.

[Figure: a section of the hyper-param grid, showing only the first two variables (coordinate directions).]

Exhaustive Grid Search (GS)

Exhaustive grid search (GS) is nothing other than the brute-force approach that scans the whole grid of hyper-param combinations h in some order, computes the cross-validation loss for each one, and finds the optimal h* in this manner (a sketch follows below). Generally, the use of lexicographic ordering, that is, the dictionary order imposed on hyper-param vectors, is discouraged, and a different order should be considered. The reason is that with lexicographic ordering there is a high chance that the search will focus on a rather uninteresting part of the search space for a rather long time. An interesting alternative is scanning the whole grid in a fully randomized way, that is, according to a random permutation of the whole grid. With this type of search, it is likely that one encounters close-to-optimal regions of the hyper-param space early on. Both variants are sketched below.
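Exhaustive GS then reduces to a loop over all 3840 grid points, reusing the `cv_loss` helper sketched earlier. The grid below is a stand-in: only learning_rate (6 values) and max_depth (8 values) are named in the text, so the remaining parameter names and all concrete values are assumptions chosen to reproduce the 6 x 8 x 4 x 5 x 4 shape.

```python
from itertools import product

# Hypothetical grid reproducing the 6 x 8 x 4 x 5 x 4 = 3840 combinations;
# names other than learning_rate and max_depth, and all values, are assumed.
grid = {
    "learning_rate":    [0.01, 0.03, 0.05, 0.1, 0.2, 0.3],   # 6 values
    "max_depth":        [2, 3, 4, 5, 6, 8, 10, 12],          # 8 values
    "n_estimators":     [100, 200, 400, 800],                # 4 values
    "subsample":        [0.5, 0.6, 0.7, 0.8, 0.9],           # 5 values
    "colsample_bytree": [0.4, 0.6, 0.8, 1.0],                # 4 values
}

def exhaustive_grid_search(grid, X_train, y_train):
    """Brute force: evaluate every point h of the grid, keep the best."""
    names = list(grid)
    best_h, best_loss = None, float("inf")
    # itertools.product walks the grid in lexicographic order.
    for values in product(*(grid[n] for n in names)):
        h = dict(zip(names, values))
        loss = cv_loss(h, X_train, y_train)  # helper from the sketch above
        if loss < best_loss:
            best_h, best_loss = h, loss
    return best_h, best_loss
```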
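The randomized variant differs only in shuffling the enumeration order before scanning, which also makes it natural to stop early once the evaluation budget is spent. The `budget` and `seed` parameters are our own additions for illustration.

```python
import random
from itertools import product

def randomized_grid_search(grid, X_train, y_train, budget=None, seed=0):
    """Scan the grid according to a random permutation of all its points."""
    names = list(grid)
    points = list(product(*(grid[n] for n in names)))
    random.Random(seed).shuffle(points)  # the random permutation of the grid
    best_h, best_loss = None, float("inf")
    # Close-to-optimal regions are likely to be visited early on, so
    # stopping after `budget` points is a reasonable compromise.
    for values in points[:budget]:
        h = dict(zip(names, values))
        loss = cv_loss(h, X_train, y_train)
        if loss < best_loss:
            best_h, best_loss = h, loss
    return best_h, best_loss
```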
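Putting it together, a budget-limited run on the Kaggle problem's training set might look like the following (the variable names are assumed):

```python
# Hypothetical usage: try 200 of the 3840 grid points, chosen at random.
best_h, best_loss = randomized_grid_search(grid, X_train, y_train, budget=200)
print(best_h, best_loss)
```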