Building similarity kernels from local environments

Measuring the similarity of structures is straightforward when a single feature vector represents the whole structure, as with the Coulomb matrix or MBTR. In these cases the feature vectors can be compared directly with standard kernels, e.g. the linear or Gaussian kernel.

Local descriptors such as SOAP or ACSF can be used in the same way to compare individual local atomic environments, but additional tools are needed to compare entire structures based on their local environments. This tutorial goes through two different strategies for building such global similarity measures by comparing local atomic environments between structures. For more insight, see [1].

Average kernel

The simplest approach is to average over the local contributions to create a global similarity measure. This average kernel \(K\) is defined as:

\[K(A, B) = \frac{1}{N M}\sum_{ij} C_{ij}(A, B)\]

where \(N\) is the number of atoms in structure \(A\), \(M\) is the number of atoms in structure \(B\), and \(C_{ij}\) is the similarity between local atomic environment \(i\) of \(A\) and local atomic environment \(j\) of \(B\). In general, \(C_{ij}\) can be calculated with any pairwise metric (e.g. linear, Gaussian).

The class AverageKernel can be used to calculate this similarity, for example for two relatively similar molecules described with SOAP, using either a linear or a Gaussian similarity metric.
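The averaging step itself can be sketched in plain NumPy, which may help in checking results by hand. This is a minimal sketch, not the AverageKernel implementation: the helper names `local_similarity` and `average_kernel` are hypothetical, random unit vectors stand in for real SOAP output, and the library class may apply additional steps (such as normalisation) that are not reproduced here.

```python
import numpy as np


def local_similarity(x, y, metric="linear", gamma=1.0):
    """Pairwise similarity C_ij between rows of x (N, d) and rows of y (M, d)."""
    if metric == "linear":
        return x @ y.T
    if metric == "rbf":
        # Gaussian similarity exp(-gamma * |x_i - y_j|^2)
        sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dist)
    raise ValueError(f"unknown metric: {metric}")


def average_kernel(x, y, metric="linear", gamma=1.0):
    """K(A, B) = 1 / (N M) * sum_ij C_ij(A, B)."""
    return float(local_similarity(x, y, metric, gamma).mean())


def random_unit_rows(rng, n, d):
    """Hypothetical stand-in for normalised local feature vectors."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


rng = np.random.default_rng(0)
feat_a = random_unit_rows(rng, 3, 8)  # structure A: 3 atoms
feat_b = random_unit_rows(rng, 4, 8)  # structure B: 4 atoms

k_linear = average_kernel(feat_a, feat_b, metric="linear")
k_gaussian = average_kernel(feat_a, feat_b, metric="rbf", gamma=1.0)
print(k_linear, k_gaussian)
```

For the linear metric the double sum factorises, so the result equals the dot product of the mean feature vectors of the two structures. This is one way to see that the average kernel discards information about how individual environments pair up.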

Best-match kernel

TODO

REMatch kernel

The REMatch kernel interpolates between the best-match and the averaging strategies. The parameter \(\alpha\) determines the balance between the two: \(\alpha = 0\) means that only the similarity of the best-matching local environments is taken into account, and \(\alpha \rightarrow \infty\) recovers the average kernel. The similarity kernel \(K\) is defined as:

\[\begin{aligned} K(A, B) &= \mathrm{Tr}\, \mathbf{P}^\alpha \mathbf{C}(A, B)\\ \mathbf{P}^\alpha &= \operatorname*{argmin}_{\mathbf{P} \in \mathcal{U}(N, N)} \sum_{ij} P_{ij} (1 - C_{ij} + \alpha \ln P_{ij})\end{aligned}\]

where the similarity between local atomic environments \(C_{ij}\) can once again be calculated with any pairwise metric (e.g. linear, Gaussian).

The class REMatchKernel can be used to calculate this similarity.
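To illustrate how \(\alpha\) moves the result between the two limits, the entropy-regularised matching can be sketched with Sinkhorn iterations in NumPy. This is a sketch under stated assumptions rather than the REMatchKernel implementation: the function name `rematch_kernel` and its `threshold` and `max_iter` arguments are hypothetical, the rows and columns of \(\mathbf{P}\) are assumed to sum to \(1/N\) and \(1/M\) respectively, and the trace is evaluated as \(\sum_{ij} P_{ij} C_{ij}\).

```python
import numpy as np


def rematch_kernel(C, alpha=0.1, threshold=1e-8, max_iter=10000):
    """Entropy-regularised matching of local similarities C (N, M).

    Assumption: rows of P sum to 1/N and columns to 1/M.
    """
    n, m = C.shape
    K = np.exp(-(1.0 - C) / alpha)  # Gibbs kernel of the cost 1 - C_ij
    row_target = np.full(n, 1.0 / n)
    col_target = np.full(m, 1.0 / m)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(max_iter):
        v = col_target / (K.T @ u)  # enforce the column sums
        u = row_target / (K @ v)    # enforce the row sums
        # After the u update the row sums are exact; check the column sums.
        if np.abs(v * (K.T @ u) - col_target).max() < threshold:
            break
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum())


# Hypothetical stand-in for normalised local feature vectors.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(3, 8))
feat_a /= np.linalg.norm(feat_a, axis=1, keepdims=True)
feat_b = rng.normal(size=(4, 8))
feat_b /= np.linalg.norm(feat_b, axis=1, keepdims=True)

C_ab = feat_a @ feat_b.T  # linear local similarity
k_small_alpha = rematch_kernel(C_ab, alpha=0.1)
k_large_alpha = rematch_kernel(C_ab, alpha=1e3)
print(k_small_alpha, k_large_alpha, C_ab.mean())
```

With a large \(\alpha\) the matching matrix becomes nearly uniform and the result approaches the plain average of \(C_{ij}\). With a small \(\alpha\) the weight concentrates on the best-matching pairs. Very small values of \(\alpha\) can make the exponential underflow in this naive form.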

[1] Sandip De, Albert P. Bartók, Gábor Csányi, and Michele Ceriotti. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys., 18(20):13754–13769, 2016. arXiv:1601.04077.