The NIML model is well suited to classifying noisy data, allowing lower-quality data to still provide meaningful predictive insights.
In this demo we will run some experiments that inject varying amounts of noise into the data and assess how resilient the model is.
For the purposes of this demo, noise is defined as any value that lies outside of the "true" distribution of the data. This can include outliers as well as values that fall within the range of the actual data, but that are not correlated in any way to the target variable. While it might be possible to filter out some noise and remove implausible extreme values, it can take lengthy analysis and a high level of subject matter expertise to do this in a thoughtful way.
Just like with missing data, noise can impact specific features or classes more heavily than others. In this demo, we'll be looking at noise that is present randomly throughout the dataset, although the same principles hold across noise impacting a subset of features or classes.
Step 1: Format the Data and Make It Noisy
First things first, let's set up our import statements:
import numpy as np
import pandas as pd
import niml
from niml.model import model
from niml.encoder import encoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
print(niml.__version__)  # use this to double check the version of niml that we're running
'0.7.1'
Next, we'll "inject" noise into an existing dataset. This is accomplished by considering every numeric entry in the dataset as a replaceable candidate. Then, a percentage of those entries are selected by randomly choosing the row and column indices where noisy values will be placed.
The actual noisy values are selected uniformly from a range that extends up to five standard deviations above and below the mean of the feature being replaced. This means the noisy values are roughly on the same scale as the underlying data, but some of them will be outliers too. We've also performed this experiment using several smaller ranges of standard deviations, and recommend you test it out on your own too!
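def inject_noise(df, perc):
    """Return a copy of df with roughly perc of its cells replaced by noise."""
    df = df.copy()
    means = df.mean(axis=0)
    stdev = df.std(axis=0)
    size = int(perc * df.shape[0] * df.shape[1])
    # randint's upper bound is exclusive, so pass the full shape
    # to make the last row and last column eligible as well
    row_inds = np.random.randint(0, df.shape[0], size=size)
    col_inds = np.random.randint(0, df.shape[1], size=size)
    for i in range(size):
        col = col_inds[i]
        # draw one value from mean +/- 5 standard deviations of this column
        low = means.iloc[col] - 5 * stdev.iloc[col]
        high = means.iloc[col] + 5 * stdev.iloc[col]
        df.iloc[row_inds[i], col] = np.random.uniform(low, high)
    return df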
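One subtlety worth keeping in mind: because the row and column indices are drawn independently and with replacement, the same cell can be picked more than once, so the share of distinct cells that actually get overwritten is somewhat lower than perc. The sketch below estimates that share by simulation and compares it with the approximation 1 - exp(-perc), which holds when the table has many cells. The helper name effective_noise_fraction and the 864 x 77 table shape are illustrative assumptions, not part of the demo.

```python
import numpy as np

def effective_noise_fraction(n_rows, n_cols, perc, seed=0):
    """Estimate the share of distinct cells hit when drawing
    int(perc * n_rows * n_cols) (row, col) pairs with replacement."""
    rng = np.random.default_rng(seed)
    size = int(perc * n_rows * n_cols)
    rows = rng.integers(0, n_rows, size=size)
    cols = rng.integers(0, n_cols, size=size)
    # Flatten each (row, col) pair to a single cell id and count unique ids
    distinct = np.unique(rows * n_cols + cols).size
    return distinct / (n_rows * n_cols)

# Illustrative table shape (an assumption, chosen only for the example)
for perc in (0.1, 0.3, 0.9):
    simulated = effective_noise_fraction(864, 77, perc)
    approx = 1 - np.exp(-perc)
    print(f"perc={perc:.1f}  simulated={simulated:.3f}  approx={approx:.3f}")
```

In other words, the noise percentages used in this demo are best read as the number of replacement draws relative to the table size, and at high settings the fraction of corrupted cells is noticeably smaller than the nominal value.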
Now it's time to format the dataset. We're going to be using a cleaned version of the Mice Protein Expression Dataset. The raw data contains quite a few missing values (check out this demo to see our model's performance on this data with missing values included), but today we'll be filling them using KNN imputation. Our model can easily accommodate missing values without any imputation, but, for this demo, we are focusing solely on the noise resilience of the model. The code below fills the missing values and formats the data so it's ready to go.
# Load the data and fill the missing values with KNN imputation
df = pd.read_csv("Data_Cortex_Nuclear.csv")
df = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior"])
df_clean = df.copy()
# Columns without any missing values are used as predictors
non_null_feats = df.isna().sum().loc[df.isna().sum() == 0].index
non_null_feats = non_null_feats[1:-4]
# Columns that contain at least one missing value
null_feats = df.isna().sum().loc[df.isna().sum() > 0].index
for null_feat in null_feats:
    # Fit on the rows where this column is observed
    df_train, y_train = df.loc[df[null_feat].notna(), non_null_feats], df.loc[df[null_feat].notna(), null_feat]
    # Predict the rows where this column is missing
    df_test = df.loc[df[null_feat].isnull(), non_null_feats]
    knn = KNeighborsRegressor()
    knn.fit(df_train, y_train)
    df_clean.loc[df[null_feat].isnull(), null_feat] = knn.predict(df_test)
df = df_clean
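For comparison, scikit-learn also ships KNNImputer, which fills each missing entry from the nearest rows in a single call. The sketch below is an alternative to the per-column regression loop, not the code used to produce this demo's results. The two are not equivalent either: KNNImputer measures distances with a NaN-aware Euclidean metric over all features, rather than only over the fully observed columns. The toy DataFrame and its column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with a few gaps (placeholder data, not the mice dataset)
toy = pd.DataFrame({
    "protein_a": [0.50, 0.52, np.nan, 0.90, 0.88],
    "protein_b": [1.10, 1.00, 1.05, np.nan, 2.00],
    "protein_c": [0.30, 0.31, 0.29, 0.70, 0.72],
})

# Each missing entry becomes the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
print("missing values left:", int(filled.isna().sum().sum()))
```

Observed values pass through unchanged; only the gaps are filled.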
To make it easier to apply our inject_noise function, we'll split off the labels from the rest of the dataset and split off the final test data. For any given noise threshold, we can check on the test data to see how well the model performs.
X, y = df.drop('class', axis=1), df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=400, test_size=0.20)
n_feats = 77
n_cv = 3
Step 2: Encode the Data and Fit the Model
In the cell below, we'll loop over a wide range of noise levels. At each level the model is run through three cross-validation cycles to give us a clear view of its average performance. The average of these values is saved and plotted so we can visualize the model's performance as the noise in the data increases.
Note that nothing specific has been done to help the model perform better as the noise increases. Remember that noise resilience stems from the sparse pattern-based encoding of our inputs. If you suspect your data is quite noisy, make sure your input encoding has a small enough sparsity value. When sparsity is present in the input pattern, the model is less likely to learn the patterns that correspond to noise, and will instead focus on learning the segments of the pattern that the inputs have in common as these will be the regions where similar patterns are produced.
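To build some intuition for why sparsity helps, here is a toy illustration in plain NumPy. It does not use or reproduce the niml encoder; the pattern width, the number of active bits, and the corruption scheme are all arbitrary assumptions. A sparse binary pattern that has 30% of its active bits moved to random positions still shares most of its active bits with the original, while an unrelated sparse pattern shares almost none.

```python
import numpy as np

rng = np.random.default_rng(0)
WIDTH, ACTIVE = 1024, 20  # hypothetical pattern width and active-bit count

def random_pattern():
    """Return a binary vector with exactly ACTIVE bits set."""
    pattern = np.zeros(WIDTH, dtype=bool)
    pattern[rng.choice(WIDTH, size=ACTIVE, replace=False)] = True
    return pattern

def corrupt(pattern, frac):
    """Switch off a fraction of the active bits and switch on the same
    number of randomly chosen, previously inactive bits."""
    noisy = pattern.copy()
    n_move = int(round(frac * ACTIVE))
    on_bits = np.flatnonzero(pattern)
    off_bits = np.flatnonzero(~pattern)
    noisy[rng.choice(on_bits, size=n_move, replace=False)] = False
    noisy[rng.choice(off_bits, size=n_move, replace=False)] = True
    return noisy

original = random_pattern()
noisy = corrupt(original, frac=0.3)
unrelated = random_pattern()

# Overlap = number of active bits two patterns have in common
overlap_noisy = int(np.sum(original & noisy))
overlap_unrelated = int(np.sum(original & unrelated))
print("overlap with corrupted copy:  ", overlap_noisy)
print("overlap with unrelated pattern:", overlap_unrelated)
```

By construction the corrupted copy keeps 14 of its 20 active bits, whereas two independent patterns of this sparsity are expected to share fewer than one bit on average (20 * 20 / 1024).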
acc_avg, f1_avg = [], []
for perc in range(0, 91, 5):  # this range determines the percentage of values that will be replaced with noisy values
    cv_accuracy, cv_f1 = [], []
    cv_nn_acc, cv_nn_f1 = [], []
    for i in range(n_cv):
        # below we're injecting the amount of noise specified for this loop
        X_train_noisy = inject_noise(X_train, perc / 100.0)
        X_s1, X_s2, y_s1, y_s2 = train_test_split(X_train_noisy, y_train, stratify=y_train,
                                                  random_state=2010*i,
                                                  test_size=0.20)
        # The encoder is configured to the data presented in this loop
        my_encoder = encoder.Encoder(set_bits=5, sparsity=.09,
                                     field_types=["N"] * n_feats,      # numeric features
                                     cyclic_flags=[False] * n_feats,   # none of the fields are cyclic
                                     spans=[0] * n_feats,              # use simple/basic encoding bit-patterns
                                     cat_overlaps=[0] * n_feats,       # N/A, data is numeric, not categorical. Set all features to 0
                                     cat_values=[None] * n_feats,      # N/A, data is numeric, not categorical. Set all features to None
                                     missing_val_ind="?")
        # configure the encoder according to the training dataset's distribution
        my_encoder.config_encoder(input_data=X_s1, label_col=None)
        # Encode the training and validation data into iSDRs for the NPU
        train_labels, train_isdrs, sdr_width = my_encoder.encode(input_data=X_s1, label_col=None)
        test_labels, test_isdrs, sdr_width = my_encoder.encode(input_data=X_s2, label_col=None)
        my_model = model.Model(
            sdr_width=sdr_width,
            sdr_set_bits=13,
            neurons=2000,
            active_neurons=22,
            input_pct=0.65,
            synapse_inc=15,
            synapse_dec=3,
            seed=123,
            boost_frequency=6,
            boost_strength=0.09,
            boost_bend_factor=0.175,
            boost_table_length=21,
            subclass_thresh=0.4,
            min_overlap=0.0,
            unknown_thresh=0.0,
        )
        # The model fit is run on the noisy data
        my_model.fit(labels=list(y_s1), isdrs=train_isdrs, epochs=10)
        preds = my_model.predict(isdrs=test_isdrs)
        cv_f1.append(f1_score(y_s2, preds, average='macro'))
        cv_accuracy.append(accuracy_score(y_s2, preds))
    # average the cross-validation scores for this noise level
    acc_avg.append(np.mean(cv_accuracy))
    f1_avg.append(np.mean(cv_f1))
We used the code below to create the plot showing the level of accuracy that can be sustained as the percentage of noisy values in the dataset increases drastically.
plt.rcParams['figure.dpi'] = 300  # set before creating the figure so the setting applies to it
plt.figure(figsize=(18, 8))
plt.plot(range(0, 91, 5), acc_avg, label='NIS Model Accuracy', linewidth= 3)
plt.axhline(0.9, linestyle='dotted')
plt.axhline(0.8, linestyle='dotted')
plt.axhline(0.7, linestyle='dotted')
plt.legend(fontsize=18)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.title("Noise Resilience", fontsize=18)
plt.xlabel("Percentage of garbage values", fontsize=18)
plt.ylabel("Accuracy", fontsize=18)
plt.show()
As you can see, the model maintains accuracy over 90% until about 30% of the possible numeric values in the dataset have been replaced with noisy/meaningless data points.
Further analysis has suggested that accuracy can be sustained for even longer when model.fit() is run on fairly clean data and evaluation is performed on very noisy data.
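One way to set up that kind of comparison is sketched below. Since the niml calls for this variant aren't shown in this demo, the sketch substitutes a scikit-learn KNeighborsClassifier and synthetic data as stand-ins, and add_uniform_noise is a hypothetical vectorized analogue of inject_noise. Only the protocol is the point here: fit once on clean data, then corrupt the held-out split alone. The numbers it prints say nothing about NIML's performance.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def add_uniform_noise(df, perc, rng):
    """Return a copy of df where int(perc * df.size) randomly drawn cells are
    overwritten with uniform draws from mean +/- 5 std of their column."""
    arr = df.to_numpy(dtype=float, copy=True)
    means, stds = df.mean().to_numpy(), df.std().to_numpy()
    size = int(perc * df.size)
    rows = rng.integers(0, df.shape[0], size=size)
    cols = rng.integers(0, df.shape[1], size=size)
    arr[rows, cols] = rng.uniform(means[cols] - 5 * stds[cols],
                                  means[cols] + 5 * stds[cols])
    return pd.DataFrame(arr, columns=df.columns, index=df.index)

# Synthetic stand-in data (an assumption; not the mice protein dataset)
X_arr, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                               n_classes=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=400)

# Stand-in classifier, fitted once on the clean training split
clf = KNeighborsClassifier().fit(X_train, y_train)

rng = np.random.default_rng(0)
scores = {}
for perc in (0.0, 0.3, 0.6, 0.9):
    # Noise is injected into the held-out data only
    X_test_noisy = add_uniform_noise(X_test, perc, rng)
    scores[perc] = accuracy_score(y_test, clf.predict(X_test_noisy))
print(scores)
```

To repeat this with the NIML model, the same loop structure would apply with the encoder configured on the clean training split and the noisy test split passed through encode() before predict().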