Modeling the Wisconsin Breast Cancer Dataset

Below, we'll walk through the steps needed to encode the data and train our model for successful classification on the WBC dataset.

Before we begin, note that we'll be using Jupyter notebooks and FSP release 0.0.045 for this demo.
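If you want to confirm that your installed release matches, a quick check is shown below. Note this assumes the fsp package exposes a __version__ attribute, which we haven't verified here; adjust for your install if it doesn't.

import fsp

# Assumption: the package exposes a __version__ attribute
print(fsp.__version__)  # expected: 0.0.045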

Step 1: Read in and split the data

The Wisconsin Breast Cancer data can be loaded directly from scikit-learn as shown below. We'll also use scikit-learn's train_test_split to break the data into 75% train and 25% test.

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the dataset
data_file = datasets.load_breast_cancer()

# Extract the features and target labels from the source dataset
X = data_file.data
Y = data_file.target

# Create train and test splits of the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=4)

# Cast the integer targets to strings, since we'll work with string class labels
y_train = list(map(str, y_train))
y_test = list(map(str, y_test))
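Before moving on, it can be worth a quick sanity check on the split. This is an optional addition, not part of the original demo; the expected sizes follow from the dataset's 569 observations and 30 features.

import numpy as np

# Optional sanity check: confirm split sizes and class balance
print(X_train.shape, X_test.shape)             # expected: (426, 30) (143, 30)
print(np.unique(y_train, return_counts=True))  # '0' = malignant, '1' = benign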

Step 2: Construct the Encoder

Next, we'll create an instance of the BalancedLinearEncoder using the training data.

from encoder.balanced_encoder import BalancedLinearEncoder

# Build the Balanced Linear Encoder
num_features = len(X_train[0])
set_bits = 7
sparsity = 0.25
my_encoder = BalancedLinearEncoder(
    data=X_train,
    set_bits=set_bits,
    sparsity=sparsity,
    field_types=["N"] * num_features,  # one field-type flag per feature
)
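For intuition about what an encoder like this produces, each numeric feature gets mapped to a sparse binary pattern. The snippet below is a rough analogy only, built with scikit-learn's KBinsDiscretizer; it is not the BalancedLinearEncoder implementation.

from sklearn.preprocessing import KBinsDiscretizer

# Analogy only: one-hot bin codes are sparse binary vectors, loosely similar in
# spirit to a linear encoding. This is NOT how BalancedLinearEncoder works internally.
binner = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
X_binary = binner.fit_transform(X_train)
print(X_binary.shape)  # typically (426, 300): 30 features x 10 bins, one active bit each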

Step 3: Create and train the model

Now we're ready to set up the NIML model. To instantiate the model, we need to select values for the hyperparameters listed below. Note that in this demo we're using our F34 classifier, so classification parameters are also set in this step.

Tip: Choosing optimal values for the FSP can be challenging! To learn the most efficient way to tune your model, please refer to the tutorial How to tune the NPU Hyperparameters. A minimal tuning-loop sketch also appears after the training code below.

from fsp.fsp import FSP

hyperparameters = {
    "encoder": my_encoder,
    "winner_func": "k_winners",
    "winner_pct": 0.1,
    "ensemble_size": 10,
    "max_neurons": 10,
    "pos_syn_updates": (-1, 5),
    "neg_syn_updates": (1, -5),
    "post_ensemble_epochs": 10,
    "random_state": 525,
    "batch_size": None,
    "loss_func": "one_neuron",
    "predict_func": "majority_vote",
    "seed_isdrs_per_class": 1,
    "high_dis_penalty": 0,
    "low_rep_penalty": 0,
}
model = FSP(**hyperparameters)

# Fit the model to the training split
model.fit(X_train, y_train)

During fitting, the model emits the following warning as output:

WARNING: num_winners has been rounded-up to 1
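As promised above, here is a minimal tuning-loop sketch using only the FSP calls shown in this demo. It is illustrative: the winner_pct grid is made up, and a rigorous version would also rebuild the encoder on the reduced training split rather than reuse my_encoder.

from sklearn.model_selection import train_test_split

# Carve a validation set out of the training data (illustrative sketch only)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=4)

best_acc, best_pct = 0.0, None
for pct in (0.05, 0.1, 0.2):  # candidate winner_pct values -- purely illustrative
    candidate = FSP(**{**hyperparameters, "winner_pct": pct})
    candidate.fit(X_tr, y_tr)
    preds = candidate.predict(X_val)
    acc = sum(str(p) == str(t) for p, t in zip(preds, y_val)) / len(y_val)
    if acc > best_acc:
        best_acc, best_pct = acc, pct

print(f"Best winner_pct: {best_pct} (validation accuracy {best_acc:.3f})")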

Step 4: Get predictions

There are two ways to make use of the predict functionality. When a model is ready to be deployed, we can use it on unlabeled incoming data to get classifications. When a model is still being tuned, we can use it on our test data to identify which observations are being misclassified.

To get predictions, call the model.predict() function.
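In a deployed setting, the call looks exactly the same on unlabeled data. Here's an illustrative stand-in that treats a single test row as if it were a new incoming observation:

# Illustrative only: treat one test row as new, unlabeled incoming data
new_observation = X_test[:1]
print(model.predict(new_observation))  # prints the predicted class label(s)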

Below, we've printed out each observation the model failed to predict correctly, along with the correct label and our model's prediction. This information can help us better understand our model's behavior and further tune it for superior performance.

# Predict on the test split
test_predictions = model.predict(X_test)

# Assess the test accuracy
num_correct = 0
for idx, pred in enumerate(test_predictions):
    if str(pred) == str(y_test[idx]):
        num_correct += 1
    else:
        print(f"Missed on observation {idx:3d}  GT: {repr(str(y_test[idx])):3s} Pred: {repr(str(pred))}")

print(f"Got {num_correct:3d}/{len(test_predictions):3d} correct = {100*(num_correct / len(test_predictions)):0.4f}%")
Missed on observation  43  GT: '1' Pred: '0'
Missed on observation  49  GT: '0' Pred: '1'
Missed on observation  70  GT: '1' Pred: '0'
Missed on observation  95  GT: '1' Pred: '0'
Missed on observation 110  GT: '1' Pred: '0'
Missed on observation 111  GT: '0' Pred: '1'
Missed on observation 142  GT: '0' Pred: '1'
Got 136/143 correct = 95.1049%
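For a fuller picture of the error profile, standard scikit-learn metrics work directly on these string labels. This is an optional follow-up, not part of the original demo:

from sklearn.metrics import confusion_matrix, classification_report

# Optional: summarize errors by class using scikit-learn's standard metrics
preds = [str(p) for p in test_predictions]
print(confusion_matrix(y_test, preds, labels=["0", "1"]))
print(classification_report(y_test, preds, target_names=["malignant", "benign"]))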

Feel free to check out our API documentation for more parameter details, and see our tutorials for more on model tuning and capabilities!