Encoding is the key step where raw data is converted into a pattern-based structure that can then be interpreted and learned by our model. Here's how we recommend choosing your encoder hyperparameters:
The short version:
- Plot histograms of your features—especially those of high importance
- Check if the data for any feature is skewed and would benefit from transformations
- Visually assess how separable the classes are in these features
- For more separable data, use fewer set bits, a higher sparsity value, or both. For less separable data, use more set bits, a lower sparsity value, or both
- Check that selected values are compatible with hardware size limitations
The long version:
Finding the best encoder settings starts with analyzing your dataset and using insights from that to create the most effective pattern-based representation possible. Since some of the terms used in this guide are NIS model specific, we have included useful definitions at the end of this article.
By plotting a simple histogram of each feature, and coloring each class a different color, you can quickly visualize which features contain the clearest distinctions between classes. Although it’s unlikely that any single feature will have clear delineations for each class, any feature where one or more classes can be slightly separated out will help the model’s predictive power. These features will be the ones to focus on as you work to establish optimal encoding settings.
Below is a quick example using the popular Iris dataset which contains 3 classes and 4 features.
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes `df` holds the Iris data: class labels in column 0, the 4 features in columns 1-4
feat_cols = [1, 2, 3, 4]
for feat in feat_cols:
    sns.histplot(data=df, x=feat, hue=df.iloc[:, 0])
    plt.show()
In this example, we can see that features 3 and 4, in particular, have a high degree of class separation. This means that fewer set bits and a larger sparsity value will sufficiently capture the distinctions between classes and the similarities within classes for these specific features.
If all of the features exhibited a higher degree of overlap, similar to feature 1, then more set bits, a smaller sparsity value, or both would likely be required.
Creating histograms like these, especially for features that have been identified as the most important, is an easy way to assess how complex your dataset is and, therefore, how detailed your encoding needs to be. More separation = fewer set bits and higher sparsity; less separation = more set bits and lower sparsity.
Finally, histograms give us an initial assessment of the distribution of our features. If a feature is highly skewed, transforming it toward a normal distribution and eliminating extreme outliers can help the encoding better distinguish between members of different classes.
Now let's look a little deeper at what the encoder accomplishes and how set bits and sparsity impact the outcome of that encoding.
Tuning the set bits and the sparsity
The two knobs we can turn to control our encoding patterns are set bits and sparsity. Let's start by looking at a visual example below. In this picture, the red dots represent set bits. Suppose you have a dataset with 4 observations represented by each horizontal line of dots. Each observation has 4 features which are separated by the vertical dotted lines in the illustration. The values for each feature are encoded into a pattern by placing the set bits in the region corresponding to the observation's values in a linear manner.
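To make these mechanics concrete, below is a toy sketch of this kind of linear encoding in Python. It is an illustration only, not the actual NIS encoder; the function names, the value ranges, and the use of the relationship field width = set bits / sparsity (from the terms section at the end) are our assumptions.

import numpy as np

def encode_feature(value, vmin, vmax, set_bits=4, sparsity=0.16):
    # Field width allotted to one feature, per the sparsity definition at the end of this article
    width = int(round(set_bits / sparsity))
    pattern = np.zeros(width, dtype=int)
    # Scale the value into [0, 1] and place a contiguous run of set bits at that position
    frac = (value - vmin) / (vmax - vmin)
    start = int(round(frac * (width - set_bits)))
    pattern[start:start + set_bits] = 1
    return pattern

def encode_observation(values, ranges, set_bits=4, sparsity=0.16):
    # One iSDR per observation: concatenate the per-feature patterns
    return np.concatenate([encode_feature(v, lo, hi, set_bits, sparsity)
                           for v, (lo, hi) in zip(values, ranges)])

# A single 4-feature observation, each feature assumed to lie in [0, 10]
isdr = encode_observation([2.0, 5.0, 7.5, 9.0], ranges=[(0, 10)] * 4)
print(isdr.reshape(4, -1))   # one row per feature field

Printing the result one feature field per row makes it easy to see each contiguous run of set bits shift with its feature's value.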
Let's start by looking at the number of set bits. In the topmost image below, each feature is encoded with only two set bits, placed in a position that represents the feature value for each specific observation. With only two set bits representing each value, there isn't much capacity to tease out complexities within the input data; the general range the value falls in is captured and that's about it. If we begin to use more set bits for each feature, the iSDRs begin to depict the complexities of the data more thoroughly. When four set bits are used instead of two, patterns that appeared identical in the two-set-bit encoding now have some noticeable distinctions. By the time 32 set bits are used, a very high level of detail is captured by the iSDR pattern.
However, notice that, as the number of set bits used in encoding increases, so does the total size of the resulting iSDR. It expands from a little over 50 to 800 (noted on the y-axis). This is because the sparsity of each iSDR was held constant (that is, the proportion of set bits to total bit positions). The encodings above all had a sparsity value of 0.16; however, sparsity itself can (and should) be adjusted to best fit the problem at hand.
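As a quick sanity check on those sizes (assuming, as in the terms section at the end, that each feature's field width equals set bits divided by sparsity):

sparsity = 0.16
for set_bits in (2, 4, 32):
    width = set_bits / sparsity      # bit positions allotted to one feature
    total = width * 4                # 4 features per observation
    print(f"{set_bits} set bits -> ~{width:.0f} positions per feature, ~{total:.0f} total")

This reproduces the growth from roughly 50 to 800 bit positions described above.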
The series of images below shows iSDRs created using 8 set bits but with varying levels of sparsity. When sparsity values are small, the differences between the values of each feature are magnified. The encoded patterns for each feature in the top graph look quite distinct from each other, with minimal overlap in the set bits that create the patterns. As the sparsity values increase, the groups of set bits begin to overlap more and more, increasing the visual similarity of the features in each observation.
With this knowledge, we can use the sparsity parameter to help control how distinct different observations should be from one another. Suppose the four observations above each came from the same class of data. In this case, we would want them to appear as similar to each other as possible, thus, a larger sparsity value might be better. However, if the observations come from different classes, we want to preserve these differences in our encoding by selecting a smaller sparsity value.
If this principle is not followed, the pattern representation of the input will muddle the distinction between the classes. For example, too many set bits in a less sparse vector will dilute the separation of classes in the feature. This encoding would increase the overlap between classes making them look similar and resulting in low predictive power. In this case, an encoder with a smaller sparsity value and fewer set bits will result in better representation.
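The same toy encoding logic illustrates this numerically. The sketch below (again an illustration, not the real encoder) counts how many set bits the patterns of two nearby feature values share at different sparsity values:

set_bits = 8
v1, v2 = 0.50, 0.55                    # two nearby feature values, already scaled to [0, 1]
for sparsity in (0.08, 0.16, 0.32):
    width = round(set_bits / sparsity)                 # field width for this feature
    start1 = round(v1 * (width - set_bits))            # leftmost set bit for each value
    start2 = round(v2 * (width - set_bits))
    shared = max(0, set_bits - abs(start1 - start2))   # set bits the two patterns have in common
    print(f"sparsity={sparsity}: field width {width}, shared set bits {shared} of {set_bits}")

At a sparsity of 0.08 the two values share only a few set bits, while at 0.32 their patterns are nearly identical.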
Now that we understand the basic ways that set bits and sparsity change our iSDR patterns, let's consider some examples and the encoding scheme that will lead to the best outcome.
Feature transformation:
In the examples above, the features have all been normally distributed. But what happens when an input feature is heavily skewed? Usually, in a skewed feature, the target classes are only separable in a small region out at the end of the tail. This causes an encoder (with normal parameter settings) to place the majority of set bits in the central region, where the classes are essentially indistinguishable from each other, rather than in the regions of the pattern where the classes can best be partitioned from each other.
Figure 1: A histogram of a feature's distribution, with the target labels colored differently. At the bottom are different encoder settings (set bits/sparsity) with the same number of patterns (number of buckets). Notice how the feature's distribution influences the choice of encoder parameters. In the first case, with only 1 set bit and a sparsity of 0.1, the patterns generated are very local and unique, while in the second case (same number of patterns) they are far more spread out, with heavy overlap between the set bits of different class labels.
In the example shown above, the first encoding makes the different classes look very different from each other; however, it also makes observations from the same class look different from each other as well! More set bits or a less sparse representation would allow the similarity between same-class observations to be preserved. The second encoding makes entries from the same class look very similar; however, the distinction between classes is mostly lost because there are too many set bits and the pattern is not sparse enough to maintain the distinction between classes. These tradeoffs are inevitable, but considering them at the outset of model building will help you create the best and most efficient model for your dataset. Note that it is advisable to transform features with heavily skewed distributions to make them more normally (Gaussian) distributed.
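One way to do this, sketched below with a synthetic skewed feature, is to clip extreme outliers and then apply scikit-learn's PowerTransformer before encoding. The column name and the 1st/99th-percentile clipping thresholds are illustrative choices, not fixed recommendations.

import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic, heavily right-skewed feature standing in for a real column
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_x": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

# Trim extreme outliers, then pull the feature toward a Gaussian shape
lo, hi = df["feature_x"].quantile([0.01, 0.99])
clipped = df["feature_x"].clip(lo, hi).to_numpy().reshape(-1, 1)
pt = PowerTransformer(method="yeo-johnson")   # also handles zero and negative values
df["feature_x_gaussian"] = pt.fit_transform(clipped).ravel()

Plotting a histogram of the transformed column (as in the example near the top of this guide) is a quick way to confirm the skew has been reduced before choosing encoder settings.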
The examples below provide a simple, comprehensible visual of these tradeoffs; use them to assess the differences firsthand and to see what they mean for encoding your own dataset.
Examples:
The Iris dataset contains 150 observations, 4 features, and 3 classes. The feature distributions are as follows:
Figure 2: Iris dataset feature distributions
Above we can see that the first two features contain a lot of overlap between the classes, while the last two features are more separable. We’ll want our encoder to be granular enough to capture as much information as possible from the first 2 features, keeping in mind that less granular settings will still capture the information in the last 2 features. For this problem, we might select a sparsity of 0.2 and 4 set bits.
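As a rough check using the field-width relationship from the terms section below, 4 set bits at a sparsity of 0.2 give each feature a field of 4 / 0.2 = 20 bit positions, so the full iSDR for the 4 Iris features would be about 20 × 4 = 80 bits, comfortably within the hardware limit discussed later.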
Consider Skewness and Outliers:
However, if all the features were more heavily skewed and the classes were overlapping, we would need an even more fine-grained encoder, as the subtle differences may not be captured by settings that were sufficient when a number of the classes showed separation.
This is demonstrated in the example shown in Figure 3. In that case, the encoder has to be extremely sensitive to the underlying changes in the feature values in order to deliver the important information to the other parts of the system. A more sensible set of parameter settings here would be sparsity = 0.08 and set_bits = 6.
Figure 3: An example of a feature with an extremely skewed distribution and overlapping class labels.
Hardware Acceleration Compatibility
One final consideration: our hardware-accelerated system has limits, and it will only run accelerated models when the input pattern length is less than 2,000.
The length of your input is computed as follows:
(set_bits / sparsity) × number of features ≤ 2,000
You can use this formula to double check the length of the iSDR produced by your chosen encoder settings.
Creating a very sparse input (a very small sparsity value) or using too many set bits will cause this number to exceed the 2,000 threshold, making the model no longer compatible with our hardware system. To this end, we recommend removing less informative features in situations where more set bits or smaller sparsity values would be optimal for the most important features.
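Below is a small sketch of this check; the function name is ours, while the formula and the 2,000 threshold are the ones given above.

def isdr_length(set_bits, sparsity, n_features):
    # Input pattern length per the formula above
    return (set_bits / sparsity) * n_features

# Example: the Iris settings suggested earlier (4 set bits, sparsity 0.2, 4 features)
length = isdr_length(set_bits=4, sparsity=0.2, n_features=4)
print(length, "<= 2000:", length <= 2000)   # compatible with hardware acceleration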
Review of Terms Used:
Sparsity: Indicates how sparse the allocated field width is for a given feature. Equals the number of set bits / size of the field width. The higher this value is, the fewer patterns the encoder will have for the feature (low resolution). This value usually spans from 0.08 to 0.4.
Set Bits: Indicates the number of 1-bits for a given pattern in the feature. This parameter influences how similar or dissimilar two patterns are in the feature. The higher this value is, the more overlap there is between 2 adjacent patterns and the more similar their neuronal representations are. Typically, no fewer than 3 and no more than 15 set bits are used.
iSDR (Input Sparse Distributed Representation): The fancy name for the pattern-based representation that results from encoding. Each observation will produce one iSDR. These iSDRs will be fed into the NPU, which is where the model actually “learns.” (Click here to learn more about using the NPU)
Class Separability: This can be numerically ascertained by a number of distance metrics, but here we simply use histogram visualization to assess how overlapped the features of different classes are. High separability means there is little overlap.
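If you do want a quick number rather than an eyeball check, one simple option (our suggestion, not part of the NIS tooling) is the overlap coefficient of two classes' normalized histograms for a feature, where 0 means fully separable and 1 means the distributions are identical:

import numpy as np
import seaborn as sns

def histogram_overlap(a, b, bins=30):
    # Overlap coefficient of two normalized histograms (0 = separable, 1 = identical)
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    pa, _ = np.histogram(a, bins=edges, density=True)
    pb, _ = np.histogram(b, bins=edges, density=True)
    return float(np.sum(np.minimum(pa, pb) * np.diff(edges)))

# Example: petal length in the Iris dataset (seaborn's column names)
iris = sns.load_dataset("iris")
setosa = iris.loc[iris["species"] == "setosa", "petal_length"].to_numpy()
versicolor = iris.loc[iris["species"] == "versicolor", "petal_length"].to_numpy()
print(round(histogram_overlap(setosa, versicolor), 3))   # near 0 -> highly separable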