Using Embeddings to Make Complex Data Simple | Toptal®

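Working with non-numerical data can be tough, even for experienced data scientists. A typical machine learning model expects its features to be numbers, not words, emails, website pages, lists, graphs, or probability distributions. To be useful, data has to be transformed into a vector space first. But how?

One popular approach would be to treat a non-numerical feature as categorical. This could work well if the number of categories is small (for example, if data indicates a profession or a country). However, if we try to apply this method to emails, we will likely get as many categories as there are samples. No two emails are exactly the same, hence this approach would be of no use.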

Another approach would be to define a distance between data samples, a function that tells us how close any two samples are. Or we could define a similarity measure, which would give us the same information except that the distance between two close samples is small while similarity is large. Computing distance (similarity) between all data samples would give us a distance (or similarity) matrix. This is numerical data we could use.
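To make the idea of a distance matrix concrete, here is a minimal sketch for text samples. The Jaccard distance on word sets and the helper names `jaccard_distance` and `distance_matrix` are assumptions chosen for illustration; they are not taken from the article, and any other distance could be plugged in.

```python
import numpy as np

def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over the sets of lowercase words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:  # two empty texts are treated as identical
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)

def distance_matrix(samples, distance):
    """Compute the symmetric matrix of pairwise distances."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = distance(samples[i], samples[j])
    return D

# Hypothetical toy "emails"
emails = [
    "meeting moved to friday",
    "meeting moved to monday",
    "your invoice is attached",
]
D = distance_matrix(emails, jaccard_distance)
print(D)
```

The first two texts share three of five distinct words, so they end up closer to each other than to the third one.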

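However, this data would have as many dimensions as there are samples, which is usually not great if we want to use it as a feature (see curse of dimensionality) or to visualize it (while one plot can handle even 6D, I have yet to see a 100D plot). Can we reduce the number of dimensions to a reasonable amount?

The answer is yes! That’s what we have embeddings for.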

What Is an Embedding and Why Use It?

An embedding is a low-dimensional representation of high-dimensional data. Typically, an embedding won’t capture all information contained in the original data. A good embedding, however, will capture enough to solve the problem at hand.

There exist many embeddings tailored for a particular data structure. For example, you might have heard of word2vec for text data, or Fourier descriptors for shape image data. Instead, we will discuss how to apply embeddings to any data where we can define a distance or a similarity measure. As long as we can compute a distance matrix, the nature of data is completely irrelevant. It will work the same, be it emails, lists, trees, or web pages.

In this article, we will introduce you to different types of embedding and discuss how some popular embeddings work and how we could use embeddings to solve real-world problems involving complex data. We will also go through the pros and cons of this method, as well as some alternatives. Yes, some problems can be solved better by other means, but unfortunately, there is no silver bullet in machine learning.

Let’s get started.

How Embeddings Work

All embeddings attempt to reduce the dimensionality of data while preserving “essential” information in the data, but every embedding does it in its own way. Here, we will go through a few popular embeddings that can be applied to a distance or similarity matrix.

We won’t even try to cover all the embeddings out there. There are at least a dozen well-known embeddings that can do that and many more lesser-known embeddings and their variations. Each of them has its own approach, advantages, and disadvantages.

If you’d like to see what other embeddings are out there, you could start here:

Scikit-learn User Guide
The Elements of Statistical Learning (Second Edition), Chapter 14

Distance Matrix

Let’s briefly touch on distance matrices. Finding an appropriate distance for data requires a good understanding of the problem, some knowledge of math, and sometimes sheer luck. In the approach described in this article, that might be the most important factor contributing to the overall success or failure of your project.

You should also keep a few technical details in mind. Many embedding algorithms will assume that a distance (or dissimilarity) matrix $\textbf{D}$ has zeros on its diagonal and is symmetric. If it’s not symmetric, we can use $(\textbf{D} + \textbf{D}^T) / 2$ instead. Algorithms using the kernel trick will also assume that a distance is a metric, which means that the triangle inequality holds:

\[\forall a, b, c \;\; d(a,c) \leq d(a,b) + d(b,c)\]

Also, if an algorithm requires a similarity matrix instead, we could apply any monotone-decreasing function to transform a distance matrix to a similarity matrix: for example, $\exp(-x)$.
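The technical details above can be sketched in a few lines of NumPy. The function names and the `gamma` scaling parameter are hypothetical helpers introduced here for illustration, not part of the article’s code.

```python
import numpy as np

def clean_distance_matrix(D):
    """Symmetrize a distance matrix and force zeros on its diagonal."""
    D = np.asarray(D, dtype=float)
    D = (D + D.T) / 2.0       # (D + D^T) / 2
    np.fill_diagonal(D, 0.0)  # zero self-distance
    return D

def to_similarity(D, gamma=1.0):
    """Map distances to similarities with the monotone-decreasing exp(-gamma * x)."""
    return np.exp(-gamma * D)

def triangle_inequality_holds(D, tol=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every triple (a, b, c)."""
    via = D[:, :, None] + D[None, :, :]  # via[a, b, c] = d(a, b) + d(b, c)
    return bool(np.all(D[:, None, :] <= via + tol))

# A slightly asymmetric toy matrix with a non-zero diagonal entry
D_raw = np.array([[0.0, 1.0, 4.0],
                  [1.2, 0.0, 2.0],
                  [4.0, 2.0, 0.1]])
D = clean_distance_matrix(D_raw)
S = to_similarity(D)
print(triangle_inequality_holds(D))  # False: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 3.1
```

The triple-loop check is vectorized and needs $O(n^3)$ memory, so it is only practical for small matrices or for a random subsample.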

Principal Component Analysis (PCA)

Principal Component Analysis, or PCA, is probably the most widely used embedding to date. The idea is simple: Find a linear transformation of features that maximizes the captured variance or (equivalently) minimizes the quadratic reconstruction error.

Specifically, let the features be a sample matrix $\textbf{X} \in \mathbb{R}^{n \times p}$ with $n$ samples and $p$ dimensions. For simplicity, let’s assume that the data sample mean is zero. We can reduce the number of dimensions from $p$ to $q$ by multiplying $\textbf{X}$ by an orthonormal matrix $\textbf{V}_q \in \mathbb{R}^{p \times q}$:

\[\hat{\textbf{X}} = \textbf{X} \textbf{V}_q\]

Then, $\hat{\textbf{X}} \in \mathbb{R}^{n \times q}$ will be the new set of features. To map the new features back to the original space (this operation is called reconstruction), we simply need to multiply it again by $\textbf{V}_q^T$.

Now, we are to find the matrix $\textbf{V}_q$ that minimizes the reconstruction error:

\[\min_{\textbf{V}_q} ||\textbf{X}\textbf{V}_q\textbf{V}_q^T - \textbf{X}||^2\]

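Columns of matrix $\textbf{V}_q$ are called principal component directions, and columns of $\hat{\textbf{X}}$ are called principal components. Numerically, we can find $\textbf{V}_q$ by applying SVD decomposition to $\textbf{X}$, although there are other equally valid ways to do it.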
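As an illustration of the SVD route, the following sketch computes $\textbf{V}_q$, the principal components, and the reconstruction error with NumPy. The random data and the choice $q = 2$ are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # n = 100 samples, p = 5
X = X - X.mean(axis=0)                                   # enforce zero sample mean

q = 2
# Thin SVD: X = U diag(s) Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_q = Vt[:q].T                 # p x q, principal component directions

X_hat = X @ V_q                # n x q, principal components
X_rec = X_hat @ V_q.T          # reconstruction in the original space
error = np.sum((X_rec - X) ** 2)

# The squared error equals the sum of the discarded squared singular values
print(error, np.sum(s[q:] ** 2))
```

The final line prints the same number twice (up to rounding), which is a handy sanity check that the top $q$ right singular vectors are indeed the minimizer.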

PCA can be applied directly to numerical features. Or, if our features are non-numerical, we can apply it to a distance or similarity matrix.

If you use Python, PCA is implemented in scikit-learn.

The advantage of this method is that it is fast to compute and quite robust to noise in data.

The disadvantage would be that it can only capture linear structures, so non-linear information contained in the original data is likely to be lost.

Kernel PCA

Kernel PCA is a non-linear version of PCA. The idea is to use the kernel trick, which you have probably heard of if you are familiar with Support Vector Machines (SVM).

Specifically, there exist a few different ways to compute PCA. One of them is to compute the eigendecomposition of the double-centered version of the Gram matrix $\textbf{X} \textbf{X}^T \in \mathbb{R}^{n \times n}$. Now, if we compute a kernel matrix $\textbf{K} \in \mathbb{R}^{n \times n}$ for our data, Kernel PCA will treat it as a Gram matrix in order to find principal components.
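The equivalence between the two ways of computing PCA can be checked numerically. This is a small sketch on random placeholder data; the variable names are illustrative, and the principal components are compared up to sign because eigenvectors are only defined up to a sign flip.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
n, q = X.shape[0], 2

# Route 1: PCA through the SVD of the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt[:q].T

# Route 2: eigendecomposition of the double-centered Gram matrix
G = X @ X.T
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
G_c = H @ G @ H                            # double centering
eigvals, eigvecs = np.linalg.eigh(G_c)     # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:q]        # indices of the q largest
scores_gram = eigvecs[:, top] * np.sqrt(eigvals[top])

print(np.allclose(np.abs(scores_svd), np.abs(scores_gram)))
```

Replacing `G` with a kernel matrix in route 2 is, in essence, what Kernel PCA does.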

Let $x_i$, $i \in \{1,..,n\}$ be the feature samples. The kernel matrix is defined by a kernel function $K(x_i,x_j)=\langle \phi(x_i),\phi(x_j) \rangle$.

A popular choice is a radial kernel:

\[K(x_i,x_j)=\exp(-\gamma \cdot d(x_i,x_j))\]

where $d$ is a distance function.

Kernel PCA requires us to specify a distance. For example, for numerical features, we could use Euclidean distance (squared, as is conventional for the radial kernel): $d(x_i,x_j)=\vert\vert x_i-x_j \vert \vert ^2$.

For non-numerical features, we may need to get creative. One thing to remember is that this algorithm assumes our distance to be a metric.

If you use Python, Kernel PCA is implemented in scikit-learn.
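A minimal sketch of how this could look with scikit-learn’s `KernelPCA` and a precomputed kernel. The random points stand in for real samples, and `gamma = 0.5` is an arbitrary placeholder value; in practice the distance matrix would come from whatever distance suits the data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
points = rng.normal(size=(60, 3))                 # placeholder samples

# Distance matrix (squared Euclidean here) and radial kernel on top of it
D = cdist(points, points, metric="sqeuclidean")
gamma = 0.5                                       # assumed value, should be tuned
K = np.exp(-gamma * D)

# kernel="precomputed" tells KernelPCA to treat K as the kernel matrix itself
kpca = KernelPCA(n_components=2, kernel="precomputed")
Z = kpca.fit_transform(K)
print(Z.shape)
```

Because the kernel is precomputed, the same code works unchanged for any distance, including distances between non-numerical objects.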

The advantage of the Kernel PCA method is that it can capture non-linear data structures.

The disadvantage is that it is sensitive to noise in data and that the choice of distance and kernel functions will greatly affect the results.

Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) tries to preserve distances between samples globally. The idea is quite intuitive and works well with distance matrices.

Specifically, given feature samples $x_i$, $i \in \{1,..,n\}$ and a distance function $d$, we compute new feature samples $z_i \in \mathbb{R}^{q}$, $i \in \{1,..,n\}$ by minimizing a stress function:

\[\min_{z_1,..,z_n} \sum_{1 \leq i < j \leq n} (d(x_i, x_j) - ||z_i - z_j||)^2\]

If you use Python, MDS is implemented in scikit-learn. However, scikit-learn does not support transformation of out-of-sample points, which could be inconvenient if we want to use an embedding in conjunction with a regression or classification model. In principle, however, it is possible.
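A hedged sketch of MDS on a precomputed distance matrix with scikit-learn. The random 2D points are placeholders. The `dissimilarity="precomputed"` argument is the parameter name in the scikit-learn releases this sketch was written against; parameter names have shifted between releases, so check the documentation of your installed version.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))     # placeholder samples
D = squareform(pdist(points))         # symmetric matrix with a zero diagonal

# Only fit_transform is available: there is no transform() for unseen samples
mds = MDS(n_components=2, dissimilarity="precomputed", n_init=4, random_state=0)
Z = mds.fit_transform(D)

# Compare the original distances with the distances between embedded points
corr = np.corrcoef(pdist(points), pdist(Z))[0, 1]
print(Z.shape, corr)
```

The embedding is only defined up to rotation, reflection, and translation, which is why the check compares pairwise distances rather than coordinates.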

The advantage of MDS is that its idea accords perfectly with our framework and that it is not much affected by noise in data.

The disadvantage is that its implementation in scikit-learn is quite slow and does not support out-of-sample transformation.

Use Case: Shipment Tracking

A few settlements on a small tropical island have developed parcel shipment services to cater to the local tourism industry. A merchant in one of these settlements decided to take action to gain an edge over the competition, so he set up a satellite surveillance system tracking all package shipments on the island. Once the data was collected, the merchant called a data scientist (that’s us!) to help him answer the following question: Can we predict the destination of a package that is currently en route?

The dataset contains information on 200 tracked shipments. For every tracked shipment, there is a list of (x,y)-coordinates of all locations where the package was spotted, typically somewhere between 20 and 50 observations. The plot below shows how this data looks.

This data looks like trouble: two different flavors of trouble, actually.

The first problem is that the data we’re dealing with is high-dimensional. For example, if every package was spotted at 50 locations, our data would have 100 dimensions, which sounds like a lot compared to the 200 samples at your disposition.

The second problem: Different shipment routes actually have a different number of observations, so we cannot simply stack the lists with coordinates to represent the data in a tabular form (and even if they had the same number, that still wouldn’t really make sense).

The merchant is impatiently drumming the table with his fingers, and the data scientist is trying hard not to show any sign of panic.

This is where distance matrices and embeddings will come in handy. We just need to find a way to compare two shipment routes. A curve distance such as the Fréchet distance seems to be a reasonable choice. With a distance, we can compute a distance matrix.

Note: This step might take a while. We need to compute $O(n^2)$ distances with each distance having $O(k^2)$ iterations, where $n$ is the number of samples and $k$ is the number of observations in one sample. Writing a distance function efficiently is key. For example, in Python, you could use numba to accelerate this computation manyfold.
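As a rough illustration of this step, here is a plain NumPy sketch that assumes the discrete Fréchet distance as the route distance; that choice, the function names, and the toy routes are assumptions of this sketch rather than the code used for the article. The double loop is the $O(k^2)$ part that a JIT compiler such as numba would speed up.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polylines of (x, y) points."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    k1, k2 = len(P), len(Q)
    # d[i, j] is the Euclidean distance between P[i] and Q[j]
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.empty((k1, k2))  # ca[i, j]: best coupling cost of P[:i+1] and Q[:j+1]
    ca[0, 0] = d[0, 0]
    for i in range(1, k1):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, k2):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, k1):
        for j in range(1, k2):
            best_prev = min(ca[i - 1, j], ca[i, j - 1], ca[i - 1, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return float(ca[-1, -1])

def route_distance_matrix(routes):
    """Pairwise distances between routes that may differ in length."""
    n = len(routes)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = discrete_frechet(routes[i], routes[j])
    return D

# Toy routes: the third is the first one traversed in the opposite direction
routes = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(2, 0), (1, 0), (0, 0)],
]
D = route_distance_matrix(routes)
print(D)
```

Note that this distance handles routes with different numbers of observations and is sensitive to direction: the first and third routes cover the same points but are a distance of 2 apart.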

Visualizing Embeddings

Now, we can use an embedding to reduce the number of dimensions from 200 to just a few. We can clearly see that there are only a few trade routes, so we may hope to find a good representation of the data even in two or three dimensions. We will use the embeddings we discussed earlier: PCA, Kernel PCA, and MDS.

On the plots below, you can see the labeled route data (given for the sake of demonstration) and its representation by an embedding in 2D and 3D (from left to right). The labeled data marks four trading posts connected by six trade routes. Two of the six trade routes are bidirectional, which makes eight shipment groups in total (6+2). As you can see, we got a pretty clear separation of all the eight shipment groups with 3D embeddings.

This is a good start.

Embeddings in a Model Pipeline

Now, we are ready to train an embedding. Although MDS showed the best results, it is rather slow; also, scikit-learn’s implementation does not support out-of-sample transformation. It’s not a problem for research but it can be for production, so we will use Kernel PCA instead. For Kernel PCA, we should not forget to apply a radial kernel to the distance matrix beforehand.

How do you select the number of output dimensions? The analysis showed that even 3D works okay. To be on the safe side and not leave out any important information, let’s set the embedding output to 10D. For the best performance, the number of output dimensions can be set as a model hyper-parameter and then tuned by cross-validation.

So, we will have 10 numerical features that we can use as an input for pretty much any classification model. How about one linear and one non-linear model: say, Logistic Regression and Gradient Boosting? For comparison, let’s also use these two models with a full distance matrix as the input. On top of it, let’s test SVM too (SVM is designed to work with a distance matrix directly, so no embedding would be required).
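The pipeline described here could be sketched as follows. This is not the article’s code: the clustered 2D points are a stand-in for shipment routes, Euclidean distance replaces the route distance, and `gamma = 0.5` is an assumed value. The point of the sketch is the out-of-sample step, where test samples are embedded through their distances to the training samples.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder data: three well-separated groups standing in for shipment groups
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 40)
samples = centers[y] + rng.normal(scale=0.5, size=(120, 2))

train, test, y_train, y_test = train_test_split(
    samples, y, test_size=0.25, random_state=0, stratify=y
)

gamma = 0.5                                      # assumed value, should be tuned
K_train = np.exp(-gamma * cdist(train, train))   # (n_train, n_train)
K_test = np.exp(-gamma * cdist(test, train))     # (n_test, n_train)

# Fit the embedding on training data only, then map test samples into it
kpca = KernelPCA(n_components=10, kernel="precomputed")
Z_train = kpca.fit_transform(K_train)
Z_test = kpca.transform(K_test)

model = GradientBoostingClassifier(random_state=0).fit(Z_train, y_train)
accuracy = model.score(Z_test, y_test)
print(Z_test.shape, accuracy)
```

Swapping `GradientBoostingClassifier` for `LogisticRegression`, or feeding `K_train` and `K_test` to the model directly, gives the other variants compared in the text.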

The model accuracy on the test set is shown below (10 train and test datasets were generated so we could estimate the variance of the model):

Gradient Boosting paired with an embedding (KernelPCA+GB) got first place. It outperformed Gradient Boosting with no embedding (GB). Here, Kernel PCA proved to be useful.
Logistic Regression did okay. What’s interesting is that Logistic Regression with no embedding (LR) did better than with an embedding (KernelPCA+LR). This is not entirely unexpected. Linear models are not very flexible but relatively difficult to overfit. Here, the loss of information caused by an embedding seems to outweigh the benefit of smaller input dimensionality.
Last but not least, SVM performed well too, although the variance of this model is quite significant.

Model accuracy

The Python code for this use case is available on GitHub.

Conclusion

We’ve explained what embeddings are and demonstrated how they can be used in conjunction with distance matrices to solve real-world problems. Time for the verdict:

Should a data scientist use embeddings? Let’s take a look at both sides of the story.

Pros & Cons of Using Embeddings

Pros:

This approach allows us to work with unusual or complex data structures as long as we can define a distance, which, with a certain degree of knowledge, imagination, and luck, we usually can.
The output is low-dimensional numerical data, which you can easily analyze, cluster, or use as model features for pretty much any machine learning model out there.

Cons:

Using this approach, we will necessarily lose some information:
- During the first step, when we replace the original data with a distance (or similarity) matrix
- During the second step, when we reduce dimensionality using an embedding
Depending on data and distance function, computation of a distance matrix may be time-consuming. This may be mitigated by an efficiently written distance function.
Some embeddings are very sensitive to noise in data. This may be mitigated by additional data cleaning.
Some embeddings are sensitive to the choice of their hyper-parameters. This may be mitigated by careful analysis or hyper-parameter tuning.

Alternatives: Why Not Use…?

Why not just use an embedding directly on data, rather than a distance matrix?
If you know an embedding that can efficiently encode your data directly, by all means, use it. The problem is that it does not always exist.
Why not just use clustering on a distance matrix?
If your only goal is to segment your dataset, it would be totally okay to do so. Some clustering methods leverage embeddings too (for example, Spectral Clustering). If you’d like to learn more, here is a tutorial on clustering.
Why not just use a distance matrix as features?
The size of a distance matrix is $(n_{samples}, n_{samples})$. Not all models can deal with it efficiently: some may overfit, some may be slow to fit, some may fail to fit altogether. Models with low variance would be a good choice here, such as linear and/or regularized models.
Why not just use SVM with a distance matrix?
SVM is a great model, which performed well in our use case. However, there are some caveats. First, if we want to add other features (could be just simple numerical ones), we won’t be able to do it directly. We’d have to incorporate them into our similarity matrix and potentially lose some valuable information. Second, as good as SVM is, another model may work better for your particular problem.
Why not just use deep learning?
It is true that, for any problem, you can find a suitable neural network if you search long enough. Keep in mind, though, that the process of finding, training, validating, and deploying this neural network will not necessarily be a simple one. So, as always, use your best judgment.
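For reference, the SVM alternative mentioned above can be sketched with scikit-learn’s `SVC` and a precomputed kernel, so no embedding step is involved. The synthetic points, the Euclidean distance, and `gamma = 0.5` are placeholder assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder data: three well-separated groups of 2D points
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 40)
samples = centers[y] + rng.normal(scale=0.5, size=(120, 2))

train, test, y_train, y_test = train_test_split(
    samples, y, test_size=0.25, random_state=0, stratify=y
)

gamma = 0.5                                      # assumed value, should be tuned
K_train = np.exp(-gamma * cdist(train, train))   # similarities among training samples
K_test = np.exp(-gamma * cdist(test, train))     # test rows versus training columns

# The SVM consumes the similarity matrix directly
svm = SVC(kernel="precomputed").fit(K_train, y_train)
accuracy = svm.score(K_test, y_test)
print(accuracy)
```

The caveat from the text is visible here: the model sees only the kernel matrix, so any extra numerical feature would have to be folded into the similarity itself.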

The Bottom Line

Embeddings in conjunction with distance matrices are an incredibly useful tool if you happen to work with complex non-numerical data, especially when you cannot transform your data into a vector space directly and would prefer to have a low-dimensional input for your model.

Understanding the basics

What is an embedding?
An embedding is a low-dimensional representation of data. For example, a world map is a 2D representation of the 3D surface of Earth, and a discrete Fourier series is a finite-dimensional representation of an infinite-dimensional sound wave.
What is the purpose of embeddings?
An embedding can reduce the number of data dimensions while preserving important internal relationships within the data. For example, a world map preserves relative positions of terrains and oceans.
How are embeddings trained?
Embedding algorithms in machine learning typically belong to unsupervised learning. They work on unlabeled data but require manually setting hyper-parameters, such as number of output dimensions.
Why are embeddings important?
High-dimensional data can be difficult to analyze, plot, or use to train an ML model. An embedding can reduce the number of dimensions and greatly simplify these tasks for a data scientist.
Do embeddings work with non-numerical data?
Some embeddings are designed specifically to work with non-numerical data. For example, the famous embedding word2vec converts words into vectors. This article shows how embeddings can work with non-numerical data in very general settings.

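About the author

Yaroslav is a data scientist with experience in business analysis, predictive modeling, data visualization, data orchestration, and deployment.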


Expertise

Machine Learning, Data Science
