Noise Resilience

The NIML model is uniquely capable of classifying noisy data, allowing lower-quality data to still provide meaningful predictive insights.

In this demo we will run some experiments that inject varying amounts of noise into the data and assess the model's level of resilience.

For the purposes of this demo, noise is defined as any value that lies outside of the "true" distribution of the data. This can include outliers as well as values that fall within the range of the actual data, but that are not correlated in any way to the target variable. While it might be possible to filter out some noise and remove implausible extreme values, it can take lengthy analysis and a high level of subject matter expertise to do this in a thoughtful way.

Just like with missing data, noise can impact specific features or classes more heavily than others. In this demo, we'll be looking at noise that is present randomly throughout the dataset, although the same principles hold across noise impacting a subset of features or classes.

Step 1: Format data and make it noisy

First things first, let's set up our import statements:

import numpy as np
import pandas as pd

import niml
from niml.model import model
from niml.encoder import encoder

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

print(niml.__version__) #use this to double check the version of niml that we're running

'0.7.1'

Next, we'll "inject" noise into an existing dataset. This is accomplished by considering every numeric entry in the dataset as a replaceable candidate. Then, a percentage of those entries are selected by randomly choosing the row and column indices where noisy values will be placed.

The actual noisy values are drawn uniformly from a range that extends five standard deviations above and below the mean of the feature being replaced. This keeps the numbers on a scale that makes sense for the underlying data, while still producing some outliers. We've also run this experiment with several smaller standard deviation ranges, and recommend you test it out on your own too!

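def inject_noise(df, perc):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    means = df.mean(axis=0)
    stdev = df.std(axis=0)
    # Number of (row, column) positions to overwrite
    size = int(perc * df.shape[0] * df.shape[1])
    # randint's upper bound is exclusive, so pass the full shape to reach the last row and column
    row_inds = np.random.randint(0, df.shape[0], size=size)
    col_inds = np.random.randint(0, df.shape[1], size=size)

    for r, c in zip(row_inds, col_inds):
        # Draw one value uniformly from mean +/- 5 standard deviations of that column
        df.iloc[r, c] = np.random.uniform(means.iloc[c] - 5 * stdev.iloc[c], means.iloc[c] + 5 * stdev.iloc[c])
    return df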
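
Because the row and column indices are drawn independently and with replacement, the same cell can be picked more than once, so the share of distinct cells that end up replaced is somewhat lower than perc. The sketch below measures that effect under the assumption of uniform index draws like those in inject_noise. The helper name effective_noise_fraction and the 1000 x 77 shape are illustrative choices, not part of the demo.

```python
import numpy as np

def effective_noise_fraction(n_rows, n_cols, perc, seed=0):
    """Fraction of distinct cells hit when positions are drawn with replacement."""
    rng = np.random.default_rng(seed)
    size = int(perc * n_rows * n_cols)
    rows = rng.integers(0, n_rows, size=size)  # upper bound is exclusive
    cols = rng.integers(0, n_cols, size=size)
    hit = np.zeros((n_rows, n_cols), dtype=bool)
    hit[rows, cols] = True  # repeated positions only count once
    return hit.mean()

measured = {}
for perc in (0.1, 0.3, 0.9):
    measured[perc] = effective_noise_fraction(1000, 77, perc)
    # For a large table the expected share is close to 1 - exp(-perc)
    expected = 1 - np.exp(-perc)
    print(f"perc={perc:.1f}  measured={measured[perc]:.3f}  expected={expected:.3f}")
```

In other words, the x-axis value used later in this demo is the proportion of draws rather than the exact proportion of distinct cells replaced, and the gap between the two grows as perc gets larger.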

Now it's time to format the dataset. We're going to be using a cleaned version of the Mice Protein Expression Dataset. The raw data contains quite a few missing values (check out this demo to see our model's performance on this data with missing values included), but today we'll fill them using KNN imputation. Our model can accommodate missing values without any imputation, but this demo focuses solely on the model's noise resilience. The code below fills the missing values so the data is ready to go.

# Load the data and drop the ID and metadata columns
df = pd.read_csv("Data_Cortex_Nuclear.csv")
df = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior"])
df_clean = df.copy()

# Columns with no missing values serve as predictors for the imputation.
# The slice trims the ends of that list, which removes the trailing 'class' label column.
non_null_feats = df.isna().sum().loc[df.isna().sum()==0].index
non_null_feats = non_null_feats[1:-4]
# Columns that contain at least one missing value
null_feats = df.isna().sum().loc[df.isna().sum()>0].index


for null_feat in null_feats:
    df_train, y_train = df.loc[df[null_feat].notna(), non_null_feats], df.loc[df[null_feat].notna(), null_feat]
    df_test = df.loc[df[null_feat].isnull(), non_null_feats]
    knn = KNeighborsRegressor()
    knn.fit(df_train, y_train)
    df_clean.loc[df[null_feat].isnull(), null_feat] = knn.predict(df_test)
    
df = df_clean
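
For comparison, scikit-learn also ships a ready-made KNNImputer. This is a different technique from the per-feature KNeighborsRegressor loop above: it fills each gap with the average of that feature over the nearest rows, using a distance that skips missing entries. The sketch below runs it on a small made-up table. The column names and values are placeholders, not taken from the mice dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy table with a couple of gaps (placeholder values for illustration only)
toy = pd.DataFrame({
    "protein_a": [0.50, 0.60, np.nan, 0.55, 0.70],
    "protein_b": [1.00, 1.10, 1.05, np.nan, 1.30],
    "protein_c": [2.00, 2.20, 2.10, 2.05, 2.40],
})

# Each missing entry is replaced by the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(toy_filled)
print("remaining missing values:", int(toy_filled.isna().sum().sum()))
```

Either approach leaves a fully numeric table, which is all the noise experiment below needs.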

To make it easier to apply our inject_noise function, we'll split off the labels from the rest of the dataset and split off the final test data. For any given noise threshold, we can check on the test data to see how well the model performs.

X, y = df.drop('class', axis=1), df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=400, test_size=0.20)
n_feats = 77
n_cv = 3
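
The stratify argument used above keeps the class proportions the same on both sides of the split, which matters when accuracy is compared across many noise levels. A quick sanity check on made-up labels (the 80/20 label mix is an arbitrary example) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 of class "a", 20 of class "b"
toy_y = np.array(["a"] * 80 + ["b"] * 20)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.20, random_state=400
)

# Share of the minority class on each side of the split
share_train = float(np.mean(toy_y_tr == "b"))
share_test = float(np.mean(toy_y_te == "b"))
print(share_train, share_test)
```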

Step 2: Encode the Data and Fit the Model

In the cell below, we'll loop over a wide range of noise levels. At each noise level the model is run through three cross-validation cycles to give us a clear view of its average performance. The average of these values is saved and plotted so we can visualize the model's performance as the noise in the data increases.

Note that nothing specific has been done to help the model perform better as the noise increases. Remember that noise resilience stems from the sparse pattern-based encoding of our inputs. If you suspect your data is quite noisy, make sure your input encoding has a small enough sparsity value. When sparsity is present in the input pattern, the model is less likely to learn the patterns that correspond to noise, and will instead focus on learning the segments of the pattern that the inputs have in common as these will be the regions where similar patterns are produced.

acc_avg, f1_avg = [], []
for perc in range(0, 91, 5): #this range determines the proportion of values that will be replaced with noisy values
    cv_accuracy, cv_f1 = [], []
    for i in range(n_cv):
        #below we're injecting the amount of noise specified for this loop into the training data
        X_train_noisy = inject_noise(X_train, perc / 100.0)
        #both the fitting split and the validation split come from the noisy copy
        X_s1, X_s2, y_s1, y_s2 = train_test_split(X_train_noisy, y_train, stratify=y_train,
                                                    random_state=2010*i,
                                                    test_size=0.20)
        #The encoder is configured to the data presented in this loop 
        my_encoder = encoder.Encoder(set_bits=5, sparsity=.09, 
            field_types= ["N"] * n_feats, # numeric features
            cyclic_flags=[False]* n_feats, # none of the fields are cyclic
            spans = [0] * n_feats, # use simple/basic encoding bit-patterns
            cat_overlaps=[0] * n_feats, # N/A, data is numeric, not categorical. Set all features to 0
            cat_values= [None] * n_feats, # N/A, data is numeric, not categorical. Set all features to None
            missing_val_ind = "?")

        #configure the encoder according to the training dataset's distribution
        my_encoder.config_encoder(input_data=X_s1, label_col=None)


        # Encode the training and validation splits into iSDRs for the NPU
        train_labels, train_isdrs, sdr_width = my_encoder.encode(input_data=X_s1, label_col=None)
        test_labels, test_isdrs, sdr_width = my_encoder.encode(input_data=X_s2, label_col=None)



        my_model = model.Model(
            sdr_width=sdr_width,
            sdr_set_bits=13,
            neurons=2000,
            active_neurons=22,
            input_pct=0.65,
            synapse_inc=15,
            synapse_dec=3,
            seed=123,

            boost_frequency=6,
            boost_strength=0.09,
            boost_bend_factor=0.175,
            boost_table_length=21,

            subclass_thresh=0.4,
            min_overlap=0.0,
            unknown_thresh = 0.0,
        )

        #The model fit is run on the noisy data
        my_model.fit(labels=list(y_s1), isdrs=train_isdrs, epochs=10)
        preds = my_model.predict(isdrs = test_isdrs)
        cv_f1.append(f1_score(y_s2, preds, average='macro'))
        cv_accuracy.append(accuracy_score(y_s2, preds))
    acc_avg.append(np.mean(cv_accuracy))
    f1_avg.append(np.mean(cv_f1))
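
The sparsity argument made before this loop can be illustrated with a toy example. The sketch below does not use NIML's encoder. It builds random sparse binary patterns with NumPy (the width of 2048 and the 40 set bits are arbitrary assumptions) and compares how much a corrupted pattern still overlaps with its original against how much an unrelated pattern does.

```python
import numpy as np

rng = np.random.default_rng(0)
width, set_bits = 2048, 40  # arbitrary toy sizes, not NIML defaults

def random_pattern():
    """Indices of the active bits in a random sparse pattern."""
    return rng.choice(width, size=set_bits, replace=False)

def corrupt(pattern, frac):
    """Move a fraction of the active bits to positions that were inactive."""
    n_move = int(round(frac * pattern.size))
    kept = rng.choice(pattern, size=pattern.size - n_move, replace=False)
    free = np.setdiff1d(np.arange(width), pattern)
    moved = rng.choice(free, size=n_move, replace=False)
    return np.concatenate([kept, moved])

def overlap(a, b):
    """Number of active bits two patterns share."""
    return np.intersect1d(a, b).size

base = random_pattern()
noisy_overlap = overlap(base, corrupt(base, 0.30))
unrelated_overlap = overlap(base, random_pattern())
print("overlap with a 30% corrupted copy:", noisy_overlap)
print("overlap with an unrelated pattern:", unrelated_overlap)
```

With patterns this sparse, two unrelated patterns share only a handful of bits by chance, so a copy that has lost 30% of its active bits remains far closer to its original than to anything else. That gap is the intuition behind matching on the segments that similar inputs have in common.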

We used the code below to create the plot showing the level of accuracy that can be sustained as the percentage of noisy values in the dataset increases drastically.

plt.rcParams['figure.dpi']=300 #set the dpi before creating the figure so that it applies to it
plt.figure(figsize=(18, 8))
plt.plot(range(0, 91, 5), acc_avg, label='NIS Model Accuracy', linewidth= 3)
plt.axhline(0.9, linestyle='dotted')
plt.axhline(0.8, linestyle='dotted')
plt.axhline(0.7, linestyle='dotted')
plt.legend(fontsize=18)
plt.xticks(fontsize= 18)
plt.yticks(fontsize= 18)
plt.title("Noise Resilience", fontsize=18)
plt.xlabel("Percentage of garbage values", fontsize=18)
plt.ylabel("Accuracy", fontsize=18)
plt.show()

[Figure: "Noise Resilience" line plot of accuracy against the percentage of garbage values]

As you can see, the model maintains accuracy over 90% until about 30% of the possible numeric values in the dataset have been replaced with noisy/meaningless data points.
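
To read a threshold like this off the averaged scores rather than off the plot, a small helper can walk the noise levels in order and report the last one reached before the score first dips below the threshold. The function name max_noise_at_or_above is hypothetical, and the numbers in the example are placeholders chosen to exercise the function, not results from this demo.

```python
def max_noise_at_or_above(noise_levels, scores, threshold):
    """Largest noise level reached before the score first drops below threshold."""
    last_ok = None
    for level, score in zip(noise_levels, scores):
        if score < threshold:
            break
        last_ok = level
    return last_ok

# Placeholder values for illustration only; NOT output of the model above
example_levels = [0, 5, 10, 15, 20]
example_scores = [0.99, 0.97, 0.93, 0.88, 0.91]
example_result = max_noise_at_or_above(example_levels, example_scores, 0.9)
print(example_result)  # → 10
```

With the lists built in the loop above, the call would be max_noise_at_or_above(range(0, 91, 5), acc_avg, 0.9).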

Further analysis has suggested that accuracy can be sustained for even longer when model.fit() is performed on fairly clean data and the evaluation is performed on very noisy data.
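
The layout of such an experiment, fitting once on clean data and then scoring on increasingly noisy copies of the held-out data, can be sketched as follows. To keep the sketch self-contained, it swaps in a scikit-learn KNeighborsClassifier and synthetic data from make_classification instead of the NIML model and the mice dataset, so its numbers say nothing about NIML itself. The helper add_uniform_noise is a hypothetical NumPy-array variant of inject_noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def add_uniform_noise(X, perc, n_std=5):
    """Overwrite int(perc * X.size) randomly drawn cells (with replacement)
    with uniform draws from mean +/- n_std standard deviations of each column."""
    X = X.copy()
    size = int(perc * X.size)
    rows = rng.integers(0, X.shape[0], size=size)
    cols = rng.integers(0, X.shape[1], size=size)
    means, stds = X.mean(axis=0), X.std(axis=0)
    X[rows, cols] = rng.uniform(means[cols] - n_std * stds[cols],
                                means[cols] + n_std * stds[cols])
    return X

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=600, n_features=20,
                                   n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn,
                                          test_size=0.2, random_state=0)

# Fit once, on clean data only
clf = KNeighborsClassifier().fit(X_tr, y_tr)

results = {}
for perc in (0.0, 0.3, 0.6, 0.9):
    X_te_noisy = add_uniform_noise(X_te, perc)  # noise is added at evaluation time only
    results[perc] = accuracy_score(y_te, clf.predict(X_te_noisy))

print(results)
```

To run the same layout with the NIML model, the encoder would be configured and the model fit on the clean training split, and inject_noise would be applied only to the data passed to predict.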