The process of selecting a good model can be time-consuming, as there are many parameters to tune and an effectively infinite number of combinations to consider. Once reasonable values have been identified for the model and dataset, performing hyperparameter searches is an efficient way to explore the hyperparameter space and potentially find an even better model fit.
The Randomized Search Cross-Validation option is best run near the beginning of your hyperparameter search process: it randomly selects combinations of values from user-specified ranges, which you provide based on your knowledge of the dataset. The NIML model is then run with random combinations of these values a specified number of times, sampling without replacement. Ideally, this search will help you narrow down the parameters worth focusing on, at which point you can fine-tune manually or try a different search method to refine your parameter values even further.
Code Example
The following code is a detailed demonstration of how to run RandomizedSearchCV on the NIML NPU using a Base Estimator wrapper.
Users define the scoring metric, the number of cross-validation folds, the number of parallel jobs, the number of searches, and whether or not to include the training scores. A fit is then performed on the Base Estimator wrapper with randomly chosen parameters on the train and test data. Command-line output then displays the best estimator's fit score and the parameters that led to that score. A CSV file is also created that contains the parameters of the NPU for each fit, the time taken to perform the fit, and the chosen metric results.
import csv
import json
import os
import time

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.estimator_checks import check_estimator

from niml.encoder import encoder
from niml.model.nispooler import sp_pooler_c
Split Data for Train and Test
If you do not already have train and test datasets, load the desired dataset and split it into train and test subsets.
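If you need a dataset to experiment with, a minimal sketch using scikit-learn's digits dataset is shown below; placing the label in the first column is an assumption that matches how the labels and features are indexed in the encoding step later on.

# Hypothetical starting point: load scikit-learn's digits dataset and
# arrange it with the label in column 0 and the features after it,
# matching the indexing used in the encoding step below.
digits = datasets.load_digits()
data = np.column_stack((digits.target, digits.data))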
# Separate data into train and test splits
# RandomizedSearchCV will train and validate within the training data split only
train, test = train_test_split(data, test_size=0.20)
Construct a Default Model or NPU (The Base Estimator)
Construct a default model.Model object or NPU object that will be used by the estimator. These default values will be used for any parameter that is not given a search range.
This is referred to as the Base Estimator, and it requires a default value for every variable needed to run the model.
estimator_p = Pooler(
    # Encoder/Data parameters
    sdr_width=784,
    enc_set_bits=4,
    enc_sparsity=0.25,
    enc_missing_val_ind="?",
    tts_test_size=0.20,  # Set encoding train_test_split test size
    tts_random_state=np.random.RandomState(seed=592),  # Set encoding train_test_split random_state
    # NPU
    neurons=500,
    active_neurons=5,
    input_pct=0.75,
    seed=123,  # Typically the arg seed
    synapse_inc=5,
    synapse_dec=1,
    activity_reset_cnt=50,
    decay_cnt_target=10,
    learning=True,
    # Boosting
    boost_max=0.9,
    boost_str=2,
    boost_tbl_size=21,
    boost_tbl_step=0.03,
    # Classifier
    classifier_=KNeighborsClassifier(n_neighbors=3),  # Set the classifier
)
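Since the imports above pull in check_estimator, the wrapper is evidently intended to follow the scikit-learn estimator API. As an optional sanity check (a sketch; the full check suite is slow, and a custom estimator may not pass every generic test), you can run it against the Base Estimator:

# Optional: run scikit-learn's estimator compatibility checks against
# the Base Estimator wrapper. Treat failures as informational; some
# NPU-specific behavior may not satisfy every generic check.
check_estimator(estimator_p)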
Define Scoring Metrics
Define the type of scoring you would like to use, as well as the number of cross-validation folds, total searches, number of jobs to run in parallel, and whether or not you'd like training scores to be included in the resulting CSV.
SCORING = 'f1_weighted'
NUM_CV_JOBS = 2         # Default is 2
NUM_SEARCHES = 1        # Default is 1
NUM_PARALLEL_JOBS = -1  # Default is all (-1)
TRAIN = False           # Whether or not to include training scores
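SCORING accepts any scoring string recognized by scikit-learn. Assuming scikit-learn 1.0 or newer, you can list the valid names:

# List every scoring string accepted by the scoring= argument
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))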
Define Search Ranges
Define the dictionary of hyperparameters over which you would like to perform a Randomized Search. As shown below, the parameter ranges must be specified as lists of values.
Note that Random Search's methodology is just that: random! Depending on your specified ranges, you could end up with non-functional models if you're not careful.
For example, if you specify neurons as [2000, 1000, 500, 200] and active_neurons as [350, 150, 50, 25], Random Search might choose neurons = 200 and active_neurons = 350 for one of its runs. If this happens, you'll receive an error instead of an output, since neurons must always be greater than active_neurons! One way to guard against this is sketched after the dictionary below.
param_dis = {
    'neurons': [1000],
    'active_neurons': list(range(1, 1000)),
    'input_pct': [x / 100.0 for x in range(1, 99)],
    'synapse_inc': list(range(1, 20)),
    'synapse_dec': list(range(1, 20)),
    'activity_reset_cnt': list(range(1, 50)),
    'enc_set_bits': list(range(3, 30)),
    'enc_sparsity': [x / 100.0 for x in range(1, 99)],
    'boost_max': [x / 100.0 for x in range(1, 99)],
    'boost_str': list(range(1, 10)),
    'boost_tbl_size': [21],
    'boost_tbl_step': [x / 100.0 for x in range(1, 99)],
}
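To check whether your ranges can produce invalid combinations before committing to a long search, one option is to pre-draw samples with scikit-learn's ParameterSampler (a sketch; ParameterSampler is not part of the NIML workflow itself) and count constraint violations:

from sklearn.model_selection import ParameterSampler

# Draw 100 candidate combinations from param_dis and count how many
# violate the neurons > active_neurons constraint described above.
samples = list(ParameterSampler(param_dis, n_iter=100, random_state=0))
bad = [s for s in samples if s['neurons'] <= s['active_neurons']]
print("%d of %d sampled combinations would be invalid" % (len(bad), len(samples)))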
Set Up the Search
Construct the RandomizedSearchCV object and pass it the Pooler object created before as the estimator, the parameter dictionary, the desired scoring, how many cross-validation folds to use, how many parallel jobs to run, whether or not you would like to return the results on the train dataset, and how many messages you would like printed as searches are completed.
search = RandomizedSearchCV(
    estimator=estimator_p,
    param_distributions=param_dis,
    scoring=SCORING,
    cv=NUM_CV_JOBS,
    n_jobs=NUM_PARALLEL_JOBS,
    n_iter=NUM_SEARCHES,
    return_train_score=TRAIN,
)
print("Random search - %d searches x %d cross validations" % (NUM_SEARCHES, NUM_CV_JOBS))
Random search - 1 searches x 2 cross validations
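The progress messages mentioned above are presumably controlled by RandomizedSearchCV's verbose argument, which this example leaves at its default of 0. To see per-fit progress you could, for example, set:

# Optional: higher verbose values print progress for each cross-validation fit
search.set_params(verbose=2)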
Set Up Encoder and Search Requirements
pd_df_train = pd.DataFrame(train)
pd_df_test = pd.DataFrame(test)
estimator_p._get_dtypes()
estimator_p._configure_encoder_params()
# Use the full dataset here: _create_encoder performs its own
# train_test_split and configures the encoder on the train split
estimator_p.encoder_ = estimator_p._create_encoder(data)
train_labels, train_isdrs, sdr_width = estimator_p.encoder_.encode(
    input_data=pd_df_train.iloc[:, 1:], label_col=None)
test_labels, test_isdrs, sdr_width = estimator_p.encoder_.encode(
    input_data=pd_df_test.iloc[:, 1:], label_col=None)
# Flatten the label column (index 0) and convert to integers
train_labels = [int(train[index][0]) for index in range(len(train))]
test_labels = [int(test[index][0]) for index in range(len(test))]
df = pd.DataFrame()
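A quick optional sanity check, using only the variables defined above, confirms that the encoder produced matching numbers of SDRs and labels:

# Confirm the encoded SDRs line up with the flattened labels
print("SDR width:", sdr_width)
print("Train samples:", len(train_isdrs), " Test samples:", len(test_isdrs))
assert len(train_isdrs) == len(train_labels)
assert len(test_isdrs) == len(test_labels)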
Specify the Number of Searches
Since this search picks random combinations, we suggest scaling your total number of runs with the size of your search dictionary, so that values across the full ranges of the hyperparameters get tested.
runs = 5 # Total number of runs to perform (Default 1)
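To gauge how many runs are reasonable, you can count the total number of combinations in param_dis (a quick sketch):

import math

# Total number of distinct combinations in the search dictionary
n_combos = math.prod(len(v) for v in param_dis.values())
print("Search space size: %d combinations" % n_combos)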
Set Up Your CSV Requirements
You'll need to specify the name of the output CSV you would like to save, as well as the parameters you want recorded in it. We recommend saving all the parameters used to build the model, not just the ones randomly selected from the search space.
# Name your CSV file
csv_file = "JupyterNB_Rand"
# List all the parameters you want recorded in the CSV file
params = ['sdr_width', 'neurons', 'active_neurons', 'input_pct', 'seed', 'synapse_inc',
          'synapse_dec', 'activity_reset_cnt', 'decay_cnt_target', 'feat_count', 'boost_max',
          'boost_str', 'boost_tbl_size', 'boost_tbl_step', 'enc_set_bits', 'enc_sparsity',
          'enc_missing_val_ind', 'learning']
Run Your Search
Perform as many runs as declared previously by fitting on the train data and searching for the best-scoring Base Estimator.
best_score = 0
best_score_params = []
run_count = 0
pooler_file = csv_file  # Base name for the saved pooler binary

while (run_count < runs):
    print("#############################################")
    print("Run %d" % run_count)
    search.fit(train_isdrs, train_labels)
    clf = search.best_estimator_
    print("params:", search.best_estimator_.get_params())
    print("score:", search.best_score_)
    run_count = run_count + 1

    ### Keep track of the best overall score
    if search.best_score_ >= best_score:
        best_score = search.best_score_
        best_score_params = search.best_estimator_.get_params()
        search.best_estimator_.save_pooler("%s_pooler.bin" % pooler_file, format="bin")

    ## Save the CSV after each run, so no data is lost
    clf_params = clf.get_params()
    defaults = {}
    for param in params:
        if param in clf_params:
            defaults[f'param_{param}'] = clf_params[param]
    df_ap = pd.DataFrame(search.cv_results_)
    # Fill in the parameters that were not part of the search space
    for param, val in defaults.items():
        if param not in df_ap.columns:
            df_ap[param] = val
    # Append this run's results (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, df_ap], ignore_index=True)
    name = csv_file + "_RUNNING.csv"
    df.to_csv("%s" % name)  # Save the CSV in case of the user cancelling a run, or an error
    print("Saved running file %s" % name)
#############################################
Run 0
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 824, 'input_pct': 0.75, 'seed': 123, 'synapse_inc': 6, 'synapse_dec': 9, 'activity_reset_cnt': 9, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.83, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.68, 'enc_set_bits': 10, 'enc_sparsity': 0.54}
score: 0.9423720115545755
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 1
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 563, 'input_pct': 0.5, 'seed': 123, 'synapse_inc': 19, 'synapse_dec': 9, 'activity_reset_cnt': 36, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.67, 'boost_str': 7, 'boost_tbl_size': 21, 'boost_tbl_step': 0.69, 'enc_set_bits': 19, 'enc_sparsity': 0.58}
score: 0.9469344347598943
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 2
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 620, 'input_pct': 0.14, 'seed': 123, 'synapse_inc': 4, 'synapse_dec': 10, 'activity_reset_cnt': 35, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.98, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.26, 'enc_set_bits': 16, 'enc_sparsity': 0.07}
score: 0.9421523256300532
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 3
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 280, 'input_pct': 0.17, 'seed': 123, 'synapse_inc': 16, 'synapse_dec': 2, 'activity_reset_cnt': 30, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.38, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.94, 'enc_set_bits': 19, 'enc_sparsity': 0.78}
score: 0.9512090777358317
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 4
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 685, 'input_pct': 0.43, 'seed': 123, 'synapse_inc': 19, 'synapse_dec': 6, 'activity_reset_cnt': 7, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.1, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.46, 'enc_set_bits': 28, 'enc_sparsity': 0.96}
score: 0.9405274621078465
Saved running file JupyterNB_Rand_RUNNING.csv
Print the highest-scoring fit and the parameters that achieved it.
print("#############################################") print("Best score after %d runs: %f" % (run_count, best_score)) print("Best parameters: ", best_score_params) print("#############################################")
#############################################
Best score after 5 runs: 0.951209
Best parameters:  {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 280, 'input_pct': 0.17, 'seed': 123, 'synapse_inc': 16, 'synapse_dec': 2, 'activity_reset_cnt': 30, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.38, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.94, 'enc_set_bits': 19, 'enc_sparsity': 0.78}
#############################################
Save the CSV with the desired name and remove the previously saved running CSV file.
df.to_csv("%s.csv" % csv_file)
print("csv saved as %s.csv" % csv_file)
# Remove the intermediate running file now that the final CSV is saved
if (os.path.exists(name)):
    os.remove(name)
csv saved as JupyterNB_Rand.csv
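Once the final CSV is saved, you can explore it with pandas. A minimal sketch, assuming the standard scikit-learn cv_results_ columns (mean_test_score, std_test_score, and the param_* columns) were written to the file:

import pandas as pd

# Load the saved results and show the ten best-scoring configurations
results = pd.read_csv("JupyterNB_Rand.csv", index_col=0)
top = results.sort_values("mean_test_score", ascending=False).head(10)
print(top[["mean_test_score", "std_test_score", "param_active_neurons"]])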