Today, you will learn every detail about t-SNE! I am excited to introduce you to Avi Chawla! He is an exceptional Data Scientist and Data Science content creator, and in this guest post, he presents the most thorough introduction to t-SNE that I have ever seen! Avi is the author of the Daily Dose of Data Science and the Daily Dose of Data Science Newsletter. Make sure to subscribe!
t-SNE is a technique to project high-dimensional data into a lower-dimensional space. It is a dimensionality reduction technique that is often used for data visualization. We are going to cover:
Limitations of PCA
What is t-SNE?
Background: Stochastic Neighbor Embedding (SNE) algorithm
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Implementing t-SNE using NumPy
PCA vs. t-SNE
Limitations of t-SNE and Takeaways
Limitations of PCA
1) PCA for visualization
Many use PCA as a data visualization technique. This is done by projecting the given data into two dimensions and visualizing it.
While this may appear to be a fair thing to do, there is a big problem here that often gets overlooked.
As we discussed in the PCA article, after applying PCA, each new feature captures a fraction of the original variance.
This means that a two-dimensional visualization will only be helpful if the first two principal components collectively capture most of the original data variance, as shown below:
If not, the two-dimensional visualization will be highly misleading, because the first two components fail to capture most of the original variance. This is depicted below:
Thus, using PCA for 2D visualizations is only recommended if the cumulative explained variance plot suggests so. If not, one should refrain from using PCA for 2D visualization.
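As a quick sanity check, here is a minimal NumPy sketch of that test; the data matrix `X` below is a made-up placeholder, not a dataset from this article:

```python
import numpy as np

# Hypothetical data matrix of shape (n_samples, n_features).
X = np.random.randn(500, 10)

# Center the data and compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvalues of the covariance matrix give the variance along each component.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]      # sorted, largest first
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Only trust a 2-D PCA plot if the first two components explain most variance.
print(f"Variance captured by the first two components: {cumulative[1]:.2%}")
```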
2) PCA is a linear dimensionality reduction technique
PCA has two main steps:
Find the eigenvectors and eigenvalues of the covariance matrix.
Use the eigenvectors to project the data to another space.
👉 Why eigenvectors, you might be wondering? We discussed the origin of eigenvectors in detail in the PCA article. It is recommended to read that article before this one: “Formulating the Principal Component Analysis (PCA) Algorithm From Scratch”.
Projecting the data using eigenvectors creates uncorrelated features.
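To make the two steps concrete, here is a minimal NumPy sketch (the function and variable names are illustrative, not taken from the PCA article's implementation); it also checks that the projected features are uncorrelated:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components (a from-scratch sketch)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # Step 1: eigenvectors and eigenvalues of the covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]        # largest eigenvalues first
    top_vectors = eigenvectors[:, order[:n_components]]

    # Step 2: project the data using the eigenvectors.
    return X_centered @ top_vectors

X = np.random.randn(300, 5)                      # made-up data
X_projected = pca_project(X, n_components=2)

# The covariance of the projected features is (approximately) diagonal,
# i.e., the new features are uncorrelated.
print(np.round(np.cov(X_projected, rowvar=False), 3))
```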
Nonetheless, the new features created by PCA (x′0, x′1, x′2) are always a linear combination of the original features (x0, x1, x2).
This is depicted below:
As shown above, every new feature in X_projected is a linear combination of the features in X.
We can also prove this experimentally.
As depicted above, we have a linearly inseparable dataset. Next, we apply PCA and reduce the dimensionality to 1. The dataset remains linearly inseparable.
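Here is a minimal NumPy sketch of this experiment on a synthetic stand-in dataset (two concentric circles, my own toy construction rather than the figure's data):

```python
import numpy as np

# Two concentric circles: a linearly inseparable 2-D dataset (synthetic).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])]        # radius 1
outer = 3 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]    # radius 3
X = np.vstack([inner, outer])
labels = np.array([0] * 100 + [1] * 100)

# Reduce to 1 dimension with PCA (eigendecomposition of the covariance matrix).
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
X_1d = X_centered @ eigenvectors[:, -1]          # top principal component

# The 1-D ranges of the two classes overlap: still linearly inseparable.
print("inner circle range:", X_1d[labels == 0].min(), X_1d[labels == 0].max())
print("outer circle range:", X_1d[labels == 1].min(), X_1d[labels == 1].max())
```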
On the flip side, if we consider a linearly separable dataset, apply PCA, and reduce the dimensions, we notice that the dataset remains linearly separable:
This proves that PCA is a linear dimensionality reduction technique.
However, not all real-world datasets are linear. In such cases, PCA will underperform.
3) PCA only focuses on the global structure
As discussed above, PCA’s primary objective is to capture the overall (or global) data variance.
In other words, PCA aims to find the orthogonal axes along which the entire dataset exhibits the most variability.
Thus, during this process, it does not pay much attention to the local relationships between data points.
It inherently assumes that the global patterns are sufficient to represent the overall data variance.
This is demonstrated below:
Because it primarily emphasizes the global structure, PCA is not ideal for visualizing complex datasets where the underlying structure might rely on local relationships or pairwise similarities.
In cases where the data is nonlinear and contains intricate clusters or groups, PCA can fall short of preserving these finer details, as shown in another illustration below:
As depicted above, data points from different clusters may overlap when projected onto the principal components.
This leads to a loss of information about intricate relationships within and between clusters.
So far, we have seen what is specifically lacking in PCA.
While the overall approach is indeed promising, its limitations make it impractical for many real-world datasets.
What is t-SNE?
t-distributed stochastic neighbor embedding (t-SNE) is a powerful dimensionality reduction technique mainly used to visualize high-dimensional datasets by projecting them into a lower-dimensional space (typically 2-D).
As we will see shortly, t-SNE addresses each of the above-mentioned limitations of PCA:
It is well-suited for visualization.
It works well for linearly inseparable datasets.
It focuses beyond just capturing the global data relationships.
t-SNE is an improvement over the Stochastic Neighbor Embedding (SNE) algorithm. Compared to SNE, t-SNE is much easier to optimize.
💡 t-SNE and SNE are two different techniques, both proposed by one common author — Geoffrey Hinton.
So, before getting into the technical details of t-SNE, let’s spend some time understanding the SNE algorithm instead.
Background: Stochastic Neighbor Embedding (SNE) algorithm
To begin, we are given a high-dimensional dataset, which is difficult to visualize.
Thus, the dimensionality is higher than 3 (typically, it’s much higher than 3).
The objective is to project it to a lower dimension (say, 2 or 3), such that the lower-dimensional representation preserves as much of the local and global structure in the original dataset as possible.
What do we mean by a local and global structure?
Let’s understand!
Imagine this is our high-dimensional dataset:
💡 Of course, the above dataset is not a high-dimensional dataset. But for the sake of simplicity, let’s assume that it is.
Local structure, as the name suggests, refers to the arrangement of data points that are close to each other in the high-dimensional space.
Thus, preserving the local structure would mean that:
Red points should stay closer to other red points.
Blue points should stay closer to other blue points.
Green points should stay closer to other green points.
So, is preserving the local structure sufficient?
Absolutely not!
If we were to focus solely on preserving the local structure, it may lead to a situation where blue points indeed stay closer to each other, but they overlap with the red points, as shown below:
This is not desirable.
Instead, we also want the low-dimensional projections to capture the global structure.
Thus, preserving the global structure would mean that:
The red cluster is well separated from the other clusters.
The blue cluster is well separated from the other clusters.
The green cluster is well separated from the other clusters.
To summarize:
Preserving the local structure means maintaining the relationships among nearby data points within each cluster.
Preserving the global structure involves maintaining the broader trends and relationships that apply across all clusters.
How does SNE work?
Let’s stick to understanding this as visually as possible.
Consider the above high-dimensional dataset again.
Euclidean distance is a good measure of whether two points are close to each other or not.
For instance, in the figure above, it is easy to figure out that Point A and Point B are close to each other, but Point C is relatively much more distant from Point A.
Thus, the first step of the SNE algorithm is to convert these high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.
Step 1: Define the conditional probabilities in high-dimensional space
For better understanding, consider a specific data point in the above dataset:
Here, points that are closer to the marked point will have smaller Euclidean distances, while other points that are far away will have larger Euclidean distances.
Thus, as mentioned above, for every data point (i), the SNE algorithm first converts these high-dimensional Euclidean distances into conditional probabilities pj|i.
Here, pj|i represents the conditional probability that a point xi will pick another point xj as its neighbor.
This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at xi.
This makes intuitive sense as well. To elaborate further, consider a Gaussian centered at xi.
It is evident from the above Gaussian distribution centered at xi that:
For points near xi, pj|i will be relatively high.
For points far from xi, pj|i will be small.
So, to summarize, for a data point xi, we convert its Euclidean distances to all other points xj into conditional probabilities pj|i.
This conditional probability is assumed to be proportional to the probability density of a Gaussian centered at xi.
Also, as we are only interested in modeling pairwise similarities, we set the value of pi|i = 0. In other words, a point cannot be its own neighbor.
💡 A Gaussian for a point xi will be parameterized by two parameters: (μi, σi²). As the x-axis of the Gaussian measures Euclidean distance, the mean μi is zero. What about σi²? We’ll get back to it shortly. Until then, just assume that we have somehow figured out the ideal value of σi².
Based on what we have discussed so far, our conditional probabilities pj|i may be calculated using a Gaussian probability density function as follows:
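Written out (reconstructed here in LaTeX from the description above; σi² is the variance of the Gaussian centered at xi):

$$p_{j|i} \propto \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}\right)$$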
However, there’s a problem here.
Yet again, let’s understand this visually.
Earlier, we considered the data points in the same cluster to be closely packed.
Thus, the resultant conditional probabilities were also high:
However, the data points of a cluster might be far from other clusters, yet they themselves can be a bit more scattered, as depicted below:
If we were to determine the resultant conditional probabilities pj|i for the marked data point xi, we will get:
In this case, even though the data points belong to the same cluster, the conditional probabilities are much smaller than what we had earlier.
We need to fix this, and a common way to do this is by normalizing the individual conditional probability between (xi, xj) → pj|i by the sum of all conditional probabilities pk|i.
Thus, we can now estimate the final conditional probability pj|i as follows:
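In LaTeX form (a reconstruction matching the standard SNE definition and the description below):

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$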
The numerator is the conditional probability between (xi, xj) → pj|i.
Each of the terms inside the summation in the denominator is the conditional probability between (xi, xk) → pk|i.
To reiterate, we are only interested in modeling pairwise similarities. Thus, we set the value of pi|i = 0. In other words, a point cannot be its own neighbor.
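To make Step 1 concrete, here is a minimal NumPy sketch assuming one fixed, shared σ for every point (the per-point bandwidth search based on perplexity is deliberately omitted; the function name is illustrative):

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """Compute p_{j|i} for all pairs, using one fixed sigma for every point.

    X: array of shape (n_samples, n_features) in the high-dimensional space.
    Returns an (n, n) matrix P with P[i, j] = p_{j|i} and P[i, i] = 0.
    """
    # Pairwise squared Euclidean distances between all points.
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T

    # Unnormalized Gaussian similarities centered at each x_i.
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point cannot be its own neighbor

    # Normalize each row so that the probabilities sum to 1 for every i.
    P = P / P.sum(axis=1, keepdims=True)
    return P

# Example: 100 random 10-dimensional points (made-up data).
P = conditional_probabilities(np.random.randn(100, 10), sigma=1.0)
print(P.shape, P.sum(axis=1)[:3])  # each row sums to 1
```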
Step 2: Define the low-dimensional conditional probabilities
Recall the objective again.
We intend to project the given high-dimensional dataset to lower dimensions (say, 2 or 3).
Thus, for every data point xi ∈ Rn, we define its counterpart yi ∈ R2:
💡 yi does not necessarily have to be in two dimensions. Here, we have defined yi ∈ R2 just to emphasize that n is larger than 2.
Next, we use a similar notion to compute the pairwise conditional probability using a Gaussian, which we denote as qj|i in the low-dimensional space.
Furthermore, to simplify calculations, we set the variance of the Gaussian used to compute the conditional probabilities qj|i to 1/√2.
As a result, we can denote qj|i as follows:
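In the original paper, this choice makes the exponent simply the negative squared distance between the low-dimensional points, giving:

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}$$

As before, we set qi|i = 0, mirroring pi|i = 0 in the high-dimensional space.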