The process of selecting a good model can be time-consuming, as there are many parameters to tune and an effectively infinite number of combinations to consider. Once reasonable values have been identified for the model and dataset, performing hyperparameter searches is an efficient way to explore the hyperparameter space and potentially find an even better model fit.
The Randomized Search Cross-Validation option is best run near the beginning of your hyperparameter search process: it randomly selects combinations of values from user-specified ranges, which you provide based on your knowledge of the dataset. The NIML model is then run with random combinations of these values a specified number of times, sampling without replacement. Ideally, this search will help you narrow down the parameters worth focusing on, at which point you can fine-tune manually or try a different search method to refine your parameter values even further.
Code Example
The following code is a detailed demonstration of how to run RandomizedSearchCV on the NIML NPU using a Base Estimator wrapper.
Users define the scoring metric, the number of cross-validation folds, the number of parallel jobs, the number of searches, and whether or not to include the training scores. A fit is then performed on the Base Estimator wrapper with randomly chosen parameters on the train and test data. Command-line output then displays the best estimator's fit score and the parameters that led to that score. A CSV file is also created that contains the parameters of the NPU for each fit, the time taken to perform the fit, and the chosen metric results.
import csv
import json
import os
import time

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.estimator_checks import check_estimator

from niml.encoder import encoder
from niml.model.nispooler import sp_pooler_c
Split Data for Train and Test
If you do not already have train and test datasets, load the desired dataset and split it into train and test subsets.
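If you need a dataset to experiment with, a minimal sketch using scikit-learn's digits dataset is shown below; placing the label in the first column is an assumption that matches how the labels and features are indexed in the encoding step later on.

# Hypothetical starting point: load scikit-learn's digits dataset and
# arrange it with the label in column 0 and the features after it,
# matching the indexing used in the encoding step below.
digits = datasets.load_digits()
data = np.column_stack((digits.target, digits.data))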
# Separate data into train and test splits
# RandomizedSearchCV will train and validate within the training data split only
train, test = train_test_split(data, test_size=0.20)
Construct a Default Model or NPU (The Base Estimator)
Construct a default model.Model object or NPU object that will be used by the estimator. These default values will be used for any parameter that is not given a search range.
This is referred to as the Base Estimator, and it requires a default value for every variable needed to run the model.
estimator_p = Pooler(
    # Encoder/Data parameters
    sdr_width=784,
    enc_set_bits=4,
    enc_sparsity=0.25,
    enc_missing_val_ind="?",
    tts_test_size=0.20,  # Set encoding train_test_split test size
    tts_random_state=np.random.RandomState(seed=592),  # Set encoding train_test_split random_state
    # NPU
    neurons=500,
    active_neurons=5,
    input_pct=0.75,
    seed=123,  # Typically the arg seed
    synapse_inc=5,
    synapse_dec=1,
    activity_reset_cnt=50,
    decay_cnt_target=10,
    learning=True,
    # Boosting
    boost_max=0.9,
    boost_str=2,
    boost_tbl_size=21,
    boost_tbl_step=0.03,
    # Classifier
    classifier_=KNeighborsClassifier(n_neighbors=3),  # Set the classifier
)
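Since the imports above pull in check_estimator, the wrapper is evidently intended to follow the scikit-learn estimator API. As an optional sanity check (a sketch; the full check suite is slow, and a custom estimator may not pass every generic test), you can run it against the Base Estimator:

# Optional: run scikit-learn's estimator compatibility checks against
# the Base Estimator wrapper. Treat failures as informational; some
# NPU-specific behavior may not satisfy every generic check.
check_estimator(estimator_p)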
Define Scoring Metrics
Define the type of scoring you would like to use, as well as the number of cross-validation folds, total searches, number of jobs to run in parallel, and whether or not you'd like training scores to be included in the resulting CSV.
SCORING = 'f1_weighted'
NUM_CV_JOBS = 2         # Default is 2
NUM_SEARCHES = 1        # Default is 1
NUM_PARALLEL_JOBS = -1  # Default is all (-1)
TRAIN = False           # Whether or not to include training scores
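SCORING accepts any scoring string recognized by scikit-learn. Assuming scikit-learn 1.0 or newer, you can list the valid names:

# List every scoring string accepted by the scoring= argument
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))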
Define Search Ranges
Define the dictionary of hyperparameters over which you would like to perform a Randomized Search. As shown below, the parameter ranges must be specified as lists of values.
Note that Random Search's methodology is just that: random! Depending on your specified ranges, you could end up with non-functional models if you're not careful.
For example, if you specify neurons as [2000, 1000, 500, 200] and active_neurons as [350, 150, 50, 25], Random Search might choose neurons = 200 and active_neurons = 350 for one of its runs. If this happens, you'll receive an error instead of an output, since neurons must always be greater than active_neurons! One way to guard against this is sketched after the dictionary below.
param_dis = {
    'neurons': [1000],
    'active_neurons': list(range(1, 1000)),
    'input_pct': [x / 100.0 for x in range(1, 99)],
    'synapse_inc': list(range(1, 20)),
    'synapse_dec': list(range(1, 20)),
    'activity_reset_cnt': list(range(1, 50)),
    'enc_set_bits': list(range(3, 30)),
    'enc_sparsity': [x / 100.0 for x in range(1, 99)],
    'boost_max': [x / 100.0 for x in range(1, 99)],
    'boost_str': list(range(1, 10)),
    'boost_tbl_size': [21],
    'boost_tbl_step': [x / 100.0 for x in range(1, 99)],
}
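To check whether your ranges can produce invalid combinations before committing to a long search, one option is to pre-draw samples with scikit-learn's ParameterSampler (a sketch; ParameterSampler is not part of the NIML workflow itself) and count constraint violations:

from sklearn.model_selection import ParameterSampler

# Draw 100 candidate combinations from param_dis and count how many
# violate the neurons > active_neurons constraint described above.
samples = list(ParameterSampler(param_dis, n_iter=100, random_state=0))
bad = [s for s in samples if s['neurons'] <= s['active_neurons']]
print("%d of %d sampled combinations would be invalid" % (len(bad), len(samples)))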
Set Up the Search
Construct the RandomizedSearchCV object and pass it the Pooler object created before as the estimator, the parameter dictionary, the desired scoring, how many cross-validation folds to use, how many parallel jobs to run, whether or not you would like to return the results on the train dataset, and how many messages you would like printed as searches are completed.
search = RandomizedSearchCV(
    estimator=estimator_p,
    param_distributions=param_dis,
    scoring=SCORING,
    cv=NUM_CV_JOBS,
    n_jobs=NUM_PARALLEL_JOBS,
    n_iter=NUM_SEARCHES,
    return_train_score=TRAIN,
)
print("Random search - %d searches x %d cross validations" % (NUM_SEARCHES, NUM_CV_JOBS))
Random search - 1 searches x 2 cross validations
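The progress messages mentioned above are presumably controlled by RandomizedSearchCV's verbose argument, which this example leaves at its default of 0. To see per-fit progress you could, for example, set:

# Optional: higher verbose values print progress for each cross-validation fit
search.set_params(verbose=2)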
Set Up Encoder and Search Requirements
pd_df_train = pd.DataFrame(train)
pd_df_test = pd.DataFrame(test)
estimator_p._get_dtypes()
estimator_p._configure_encoder_params()
# Use the full dataset here: _create_encoder performs its own
# train_test_split and configures the encoder on the train split
estimator_p.encoder_ = estimator_p._create_encoder(data)
train_labels, train_isdrs, sdr_width = estimator_p.encoder_.encode(
    input_data=pd_df_train.iloc[:, 1:], label_col=None)
test_labels, test_isdrs, sdr_width = estimator_p.encoder_.encode(
    input_data=pd_df_test.iloc[:, 1:], label_col=None)
# Flatten the label column (index 0) and convert to integers
train_labels = [int(train[index][0]) for index in range(len(train))]
test_labels = [int(test[index][0]) for index in range(len(test))]
df = pd.DataFrame()
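A quick optional sanity check, using only the variables defined above, confirms that the encoder produced matching numbers of SDRs and labels:

# Confirm the encoded SDRs line up with the flattened labels
print("SDR width:", sdr_width)
print("Train samples:", len(train_isdrs), " Test samples:", len(test_isdrs))
assert len(train_isdrs) == len(train_labels)
assert len(test_isdrs) == len(test_labels)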
Specify the Number of Searches
Since this search picks random combinations, we suggest scaling your total number of runs with the size of your search dictionary, so that values across the full ranges of the hyperparameters get tested.
runs = 5 # Total number of runs to perform (Default 1)
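To gauge how many runs are reasonable, you can count the total number of combinations in param_dis (a quick sketch):

import math

# Total number of distinct combinations in the search dictionary
n_combos = math.prod(len(v) for v in param_dis.values())
print("Search space size: %d combinations" % n_combos)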
Set Up Your CSV Requirements
You'll need to specify the name of the output CSV you would like to save, as well as the parameters you want recorded in it. We recommend saving all the parameters used to build the model, not just the ones randomly selected from the search space.
# Name your CSV file
csv_file = "JupyterNB_Rand"
# List all the parameters you want recorded in the CSV file
params = ['sdr_width', 'neurons', 'active_neurons', 'input_pct', 'seed', 'synapse_inc',
          'synapse_dec', 'activity_reset_cnt', 'decay_cnt_target', 'feat_count', 'boost_max',
          'boost_str', 'boost_tbl_size', 'boost_tbl_step', 'enc_set_bits', 'enc_sparsity',
          'enc_missing_val_ind', 'learning']
Run Your Search
Perform as many runs as declared previously by fitting on the train data and searching for the best-scoring Base Estimator.
best_score = 0
best_score_params = []
run_count = 0
pooler_file = csv_file  # Base name for the saved pooler binary

while (run_count < runs):
    print("#############################################")
    print("Run %d" % run_count)
    search.fit(train_isdrs, train_labels)
    clf = search.best_estimator_
    print("params:", search.best_estimator_.get_params())
    print("score:", search.best_score_)
    run_count = run_count + 1

    ### Keep track of the best overall score
    if search.best_score_ >= best_score:
        best_score = search.best_score_
        best_score_params = search.best_estimator_.get_params()
        search.best_estimator_.save_pooler("%s_pooler.bin" % pooler_file, format="bin")

    ## Save the CSV after each run, so no data is lost
    clf_params = clf.get_params()
    defaults = {}
    for param in params:
        if param in clf_params:
            defaults[f'param_{param}'] = clf_params[param]
    df_ap = pd.DataFrame(search.cv_results_)
    # Fill in the parameters that were not part of the search space
    for param, val in defaults.items():
        if param not in df_ap.columns:
            df_ap[param] = val
    # Append this run's results (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, df_ap], ignore_index=True)
    name = csv_file + "_RUNNING.csv"
    df.to_csv("%s" % name)  # Save the CSV in case of the user cancelling a run, or an error
    print("Saved running file %s" % name)
#############################################
Run 0
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 824, 'input_pct': 0.75, 'seed': 123, 'synapse_inc': 6, 'synapse_dec': 9, 'activity_reset_cnt': 9, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.83, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.68, 'enc_set_bits': 10, 'enc_sparsity': 0.54}
score: 0.9423720115545755
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 1
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 563, 'input_pct': 0.5, 'seed': 123, 'synapse_inc': 19, 'synapse_dec': 9, 'activity_reset_cnt': 36, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.67, 'boost_str': 7, 'boost_tbl_size': 21, 'boost_tbl_step': 0.69, 'enc_set_bits': 19, 'enc_sparsity': 0.58}
score: 0.9469344347598943
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 2
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 620, 'input_pct': 0.14, 'seed': 123, 'synapse_inc': 4, 'synapse_dec': 10, 'activity_reset_cnt': 35, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.98, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.26, 'enc_set_bits': 16, 'enc_sparsity': 0.07}
score: 0.9421523256300532
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 3
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 280, 'input_pct': 0.17, 'seed': 123, 'synapse_inc': 16, 'synapse_dec': 2, 'activity_reset_cnt': 30, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.38, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.94, 'enc_set_bits': 19, 'enc_sparsity': 0.78}
score: 0.9512090777358317
Saved running file JupyterNB_Rand_RUNNING.csv
#############################################
Run 4
params: {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 685, 'input_pct': 0.43, 'seed': 123, 'synapse_inc': 19, 'synapse_dec': 6, 'activity_reset_cnt': 7, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.1, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.46, 'enc_set_bits': 28, 'enc_sparsity': 0.96}
score: 0.9405274621078465
Saved running file JupyterNB_Rand_RUNNING.csv
Print the highest-scoring fit and the parameters that achieved it.
print("#############################################") print("Best score after %d runs: %f" % (run_count, best_score)) print("Best parameters: ", best_score_params) print("#############################################")
#############################################
Best score after 5 runs: 0.951209
Best parameters:  {'sdr_width': 784, 'neurons': 1000, 'active_neurons': 280, 'input_pct': 0.17, 'seed': 123, 'synapse_inc': 16, 'synapse_dec': 2, 'activity_reset_cnt': 30, 'decay_cnt_target': 10, 'learning': False, 'feat_count': 30, 'boost_max': 0.38, 'boost_str': 1, 'boost_tbl_size': 21, 'boost_tbl_step': 0.94, 'enc_set_bits': 19, 'enc_sparsity': 0.78}
#############################################
Save the CSV with the desired name and remove the previously saved running CSV file.
df.to_csv("%s.csv" % csv_file)
print("csv saved as %s.csv" % csv_file)
# Remove the intermediate running file now that the final CSV is saved
if (os.path.exists(name)):
    os.remove(name)
csv saved as JupyterNB_Rand.csv
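Once the final CSV is saved, you can explore it with pandas. A minimal sketch, assuming the standard scikit-learn cv_results_ columns (mean_test_score, std_test_score, and the param_* columns) were written to the file:

import pandas as pd

# Load the saved results and show the ten best-scoring configurations
results = pd.read_csv("JupyterNB_Rand.csv", index_col=0)
top = results.sort_values("mean_test_score", ascending=False).head(10)
print(top[["mean_test_score", "std_test_score", "param_active_neurons"]])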