Modeling with Incomplete Observations on Mice Protein Expression Data

Below, we'll encode and model a dataset containing incomplete observations to demonstrate the NIML model's resilience to missing data.

# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import niml
from sklearn.model_selection import train_test_split
from niml.model import model
from niml.encoder import encoder

niml.__version__
'1.0.1'

Step 1: Read in and format the data

When working with a dataset containing missing values, begin by importing it in its raw form. In this example we'll be using the Mice Protein Expression dataset.

Note: The categorical features of this dataset must be removed as they are provided as alternative classification labels, not as predictive features.

# pull in raw dataset
input_data = pd.read_csv("Data_Cortex_Nuclear.csv")
# remove alternative labels
input_data.drop(["MouseID", "Genotype", "Behavior", "Treatment"], axis=1, inplace=True)

Identify missing data symbol(s)

Next, let's confirm that the dataset contains missing values. This will also help us see whether they are evenly distributed across the dataset's features.

After running the code below, we can see that missing data is prevalent in 6 features, while another 43 features contain only a couple of missing entries each. We could investigate whether the missing values are connected to particular class outcomes to better understand the context of the problem we are solving, but this ultimately does not change the best practices for building our model.

plt.rcParams['figure.dpi'] = 300  # set before plotting so the figure is created at this dpi
input_data.isna().sum().plot.bar(color='darkblue', figsize=(18, 4))
plt.xlabel('Feature')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=90)
plt.title("Number of Missing Values per Feature", fontsize=14)
plt.show()

[Figure: bar chart of the number of missing values per feature]

Now we must identify which symbol is used to indicate missing values in the dataset. Here the default is np.nan, but it could be a NULL value or some other character. If multiple symbols are used to indicate missing values, you'll need to make them consistent across the dataset. For example, below we replace the np.nan values with "?".

input_data.replace(np.nan, "?", inplace=True)
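
If the raw file had mixed several different markers, they could all be collapsed into the single "?" indicator in one pass. The "NULL" and empty-string markers below are hypothetical stand-ins, not values present in this dataset:

# hypothetical: collapse multiple assumed missing-value markers into one symbol
input_data.replace([np.nan, "NULL", ""], "?", inplace=True)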

Functionally, these are the only adjustments that need to be applied to the data to make it compatible with the model when missing data is present. When the encoder encounters an entry containing the missing data symbol, it will not assign any set bits to that feature and will simply move on to the next feature.

Note: The missing value indicator must be a string datatype. A numeric character such as 0 cannot be used to identify missing values, because numeric characters are read into the field as numerically encodable data values.
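
As a quick sanity check (a minimal sketch using only pandas), we can confirm that every formerly-missing entry now carries the "?" marker and that no raw NaN values remain:

# sanity check: missing entries are now "?" strings and no NaNs remain
print((input_data == "?").sum().sum(), "entries marked as missing")
assert input_data.isna().sum().sum() == 0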

Encoding considerations for missing data

For datasets with few missing values, tuning the model can now proceed normally. However, in scenarios with many missing values, it may be beneficial to consider the following:

  1. The model relies on overlap between inputs with similar set bit representations to identify when values are similar.

  2. Due to high sparsity, set bits do not overlap with each other very often. When missing data is present and there are even fewer set bits than normal, this overlap may happen even less often.

  3. If the model is underperforming, increasing the number of set bits and slightly decreasing sparsity may aid in improving metrics. Changing set bits and sparsity will change the resulting iSDR width. Be cautious not to create a larger iSDR than the NPU can run on hardware if speed is a concern, as the current maximum is 8k. 

Why does this work? If missing data is present in sufficient quantity, it artificially increases the sparsity of the model's inputs. If sparsity is too high, inputs that are actually quite similar may appear quite different from each other when presented to the NPU. Recall that encoding should make things of different classes look as different as possible, and make things of the same class look as similar as possible.

To best achieve this goal when missing data is plentiful, consider increasing set bits slightly, or decreasing sparsity slightly from what would otherwise be an appropriate starting point as suggested in our Tuning the Encoder tutorial. This will help the features of similar observations to appear more similar to each other, slightly offsetting the features where no data is available.

Infrequently, these changes may be large enough to require minor adjustments to the NPU parameters; in most cases, however, the NPU can be initialized without special consideration for missing data.
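
To make point 3 above concrete, here is a back-of-envelope width check. The per-field width formula is an illustrative assumption (roughly set_bits / sparsity bit positions per field), not the encoder's documented formula; the sdr_width returned by encode() is always the authoritative value:

# rough iSDR width estimate (assumed formula, for illustration only)
set_bits, sparsity, n_feats = 6, 0.07, 77
approx_width = round(set_bits / sparsity) * n_feats
print(approx_width)  # ~6,600 bit positions, under the 8k hardware ceiling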

# split off final test data
model_development_data, holdout_test = train_test_split(input_data, stratify=input_data['class'],
                                                        test_size=0.15, random_state=2010)

When missing data is left unfilled in the dataset, the optimal number of set bits turned out to be 6, with sparsity set to 7%.

No further parameter adjustments were made to accommodate missing values.
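
For context, these values could be found with a coarse grid search like the hypothetical sketch below. The parameter grid is illustrative, and the model parameters are simply borrowed from the next section; the Tuning the Encoder tutorial describes the recommended procedure:

# hypothetical coarse grid search over encoder parameters (illustrative sketch)
n_feats = 77
tune_train, tune_val = train_test_split(model_development_data,
                                        stratify=model_development_data['class'],
                                        test_size=0.15, random_state=0)
for sb in (4, 5, 6, 7):
    for sp in (0.05, 0.07, 0.09):
        enc = encoder.Encoder(set_bits=sb, sparsity=sp,
                              field_types=["N"] * n_feats, cyclic_flags=[False] * n_feats,
                              spans=[0] * n_feats, cat_overlaps=[0] * n_feats,
                              cat_values=[None] * n_feats, missing_val_ind="?")
        enc.config_encoder(input_data=tune_train, label_col=-1)
        tr_labels, tr_isdrs, width = enc.encode(input_data=tune_train, label_col=-1)
        va_labels, va_isdrs, _ = enc.encode(input_data=tune_val, label_col=-1)
        m = model.Model(sdr_width=width, sdr_set_bits=11, neurons=2000,
                        active_neurons=18, input_pct=0.6, synapse_inc=7, synapse_dec=3,
                        seed=123, boost_frequency=2, boost_strength=0.3,
                        boost_bend_factor=0.175, boost_table_length=180,
                        subclass_thresh=0.4, min_overlap=0.1, force_software=False)
        m.fit(labels=tr_labels, isdrs=tr_isdrs, epochs=10, verbose=False)
        print(sb, sp, m.evaluate(labels=va_labels, isdrs=va_isdrs)['accuracy_score'])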

Run encoder and model on the training and validation data

Below, the full model is run on the dataset with its missing values intact. Note that in the encoder initialization, missing_val_ind has been set to '?'. When the encoder encounters the '?' symbol, it knows to interpret it as missing data.

Due to the dataset's small size, we're using a 10-split Monte Carlo cross-validation, drawing a new random train/test split on each iteration.

cv_accuracy = []
cv_f1 = []
n_cv = 10
for i in range(n_cv):
    train, test = train_test_split(model_development_data, stratify=model_development_data['class'],
                                   test_size=0.15, random_state=201*i)

    n_feats = 77
    my_encoder = encoder.Encoder(set_bits=6, sparsity=0.07,
                                 field_types=["N"] * n_feats,     # numeric features
                                 cyclic_flags=[False] * n_feats,  # none of the fields are cyclic
                                 spans=[0] * n_feats,             # use simple/basic encoding bit-patterns
                                 cat_overlaps=[0] * n_feats,      # N/A, data is numeric, not categorical
                                 cat_values=[None] * n_feats,     # N/A, data is numeric, not categorical
                                 missing_val_ind="?"
                                 )
    # configure the encoder according to the training dataset's distribution
    my_encoder.config_encoder(input_data=train, label_col=-1)

    # encode the data -> produce encoded inputs to be sent to the NPU for learning
    train_labels, train_isdrs, sdr_width = my_encoder.encode(input_data=train, label_col=-1)
    test_labels, test_isdrs, sdr_width = my_encoder.encode(input_data=test, label_col=-1)

    my_model = model.Model(
        sdr_width=sdr_width,
        sdr_set_bits=11,
        neurons=2000,
        active_neurons=18,
        input_pct=0.6,
        synapse_inc=7,
        synapse_dec=3,
        seed=123,

        boost_frequency=2,
        boost_strength=0.3,
        boost_bend_factor=0.175,
        boost_table_length=180,

        subclass_thresh=0.4,
        min_overlap=0.1,
        force_software=False
    )

    fitting = my_model.fit(labels=train_labels, isdrs=train_isdrs, epochs=10, verbose=False)
    results = my_model.evaluate(labels=test_labels, isdrs=test_isdrs)
    training_accuracy = my_model.evaluate(labels=train_labels, isdrs=train_isdrs)
    cv_accuracy.append(results['accuracy_score'])
    cv_f1.append(results['f1_score'])

acc_results = sum(cv_accuracy)/len(cv_accuracy)
f1_results = sum(cv_f1)/len(cv_f1)
print("Accuracy average:", acc_results, "F1 Score average:", f1_results)
print("Accuracy Standard Deviation:", '%.10f' % np.std(cv_accuracy), ", F1 Standard Deviation:", '%.10f' % np.std(cv_f1))
Accuracy average: 0.9507246376811596 F1 Score average: 0.950247123514902
Accuracy Standard Deviation: 0.0161384474 , F1 Standard Deviation: 0.0160187858

Check model performance on test data

Performance on the final test data is computed below:

holdout_labels, holdout_isdrs, sdr_width = my_encoder.encode(input_data=holdout_test, label_col=-1)
holdout_results = my_model.evaluate(labels=holdout_labels, isdrs=holdout_isdrs)
print(holdout_results)
{'f1_score': 0.9814823999221163, 'accuracy_score': 0.9814814814814815, 'confusion_matrix': [[23, 0, 0, 0, 0, 0, 0, 0], [0, 19, 0, 0, 0, 1, 0, 0], [0, 0, 23, 0, 0, 0, 0, 0], [0, 0, 1, 19, 0, 0, 0, 0], [0, 0, 0, 0, 20, 0, 0, 0], [0, 0, 0, 0, 0, 16, 0, 0], [0, 0, 0, 1, 0, 0, 19, 0], [0, 0, 0, 0, 0, 0, 0, 20]]}

When the missing values in this same dataset were instead filled in through imputation, the optimal set bit value was 5, with sparsity set to 8%. The resulting model also achieved roughly 95% accuracy, but keep in mind that imputed values can introduce bias or other issues that might have degraded performance had the model been deployed; such issues are not visible while the model is in development.
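
For reference, the imputation comparison could be reproduced along the lines of the sketch below. The original method isn't specified here, so mean imputation via scikit-learn's SimpleImputer is an assumption:

from sklearn.impute import SimpleImputer

# assumption: simple mean imputation; the method used in the original comparison is not stated
imputed = input_data.copy()
feature_cols = imputed.columns[:-1]  # every column except the class label
values = imputed[feature_cols].replace("?", np.nan).astype(float)
imputed[feature_cols] = SimpleImputer(strategy="mean").fit_transform(values)
# 'imputed' can then be encoded and modeled exactly as above, with no missing_val_ind needed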