Building similarity kernels from local environments

Measuring the similarity of structures becomes easy when the feature vectors represent the whole structure, such as in the case of Coulomb matrix or MBTR. In these cases the feature vectors are directly comparable with different kernels, e.g. the linear or Gaussian kernel.

Local descriptors such as SOAP or ACSF can be used in the same way to compare individual local atomic environments, but additional tools are needed to make comparison of entire structures based on local environments. This tutorial goes through two different strategies for building such global similarity measures by comparing local atomic environments between structures. For more insight, see [1].

Average kernel

The simplest approach is to average over the local contributions to create a global similarity measure. This average kernel \(K\) is defined as:

\[K(A, B) = \frac{1}{N M}\sum_{ij} C_{ij}(A, B)\]

where \(N\) is the number of atoms in structure \(A\), \(M\) is the number of atoms in structure \(B\) and the similarity between local atomic environments \(C_{ij}\) can in general be calculated with any pairwise metric (e.g. linear, gaussian).

The class AverageKernel can be used to calculate this similarity. Here is an example of calculating an average kernel for two relatively similar molecules by using SOAP and a linear and Gaussian similarity metric:

from dscribe.descriptors import SOAP
from dscribe.kernels import AverageKernel

from ase.build import molecule

# We will compare two similar molecules
a = molecule("H2O")
b = molecule("H2O2")

# First we will have to create the features for atomic environments. Lets
# use SOAP.
desc = SOAP(species=[1, 6, 7, 8], r_cut=5.0, n_max=2, l_max=2, sigma=0.2, periodic=False, compression={"mode":"off"}, sparse=False)
a_features = desc.create(a)
b_features = desc.create(b)

# Calculates the similarity with an average kernel and a linear metric. The
# result will be a full similarity matrix.
re = AverageKernel(metric="linear")
re_kernel = re.create([a_features, b_features])

# Any metric supported by scikit-learn will work: e.g. a Gaussian:
re = AverageKernel(metric="rbf", gamma=1)
re_kernel = re.create([a_features, b_features])

REMatch kernel

The REMatch kernel lets you choose between the best match of local environments and the averaging strategy. The parameter \(\alpha\) determines the contribution of the two: \(\alpha = 0\) means only the similarity of the best matching local environments is taken into account and \(\alpha \rightarrow \infty\) channels in the average solution. The similarity kernel \(K\) is defined as:

\[ \begin{align}\begin{aligned}\DeclareMathOperator*{\argmax}{argmax} K(A, B) &= \mathrm{Tr} \mathbf{P}^\alpha \mathbf{C}(A, B)\\\mathbf{P}^\alpha &= \argmax_{\mathbf{P} \in \mathcal{U}(N, N)} \sum_{ij} P_{ij} (1-C_{ij} +\alpha \ln P_{ij})\end{aligned}\end{align} \]

where the similarity between local atomic environments \(C_{ij}\) can once again be calculated with any pairwise metric (e.g. linear, gaussian).

The class REMatchKernel can be used to calculate this similarity:

from dscribe.descriptors import SOAP
from dscribe.kernels import REMatchKernel

from ase.build import molecule

from sklearn.preprocessing import normalize

# We will compare two similar molecules
a = molecule("H2O")
b = molecule("H2O2")

# First we will have to create the features for atomic environments. Lets
# use SOAP.
desc = SOAP(species=["H", "O"], r_cut=5.0, n_max=2, l_max=2, sigma=0.2, periodic=False, compression={"mode":"off"}, sparse=False)
a_features = desc.create(a)
b_features = desc.create(b)

# Before passing the features we normalize them. Depending on the metric, the
# REMatch kernel can become numerically unstable if some kind of normalization
# is not done.
a_features = normalize(a_features)
b_features = normalize(b_features)

# Calculates the similarity with the REMatch kernel and a linear metric. The
# result will be a full similarity matrix.
re = REMatchKernel(metric="linear", alpha=1, threshold=1e-6)
re_kernel = re.create([a_features, b_features])

# Any metric supported by scikit-learn will work: e.g. a Gaussian.
re = REMatchKernel(metric="rbf", gamma=1, alpha=1, threshold=1e-6)
re_kernel = re.create([a_features, b_features])

[1]

Sandip De, Albert P. Bartók, Gábor Csányi, and Michele Ceriotti. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys., 18(20):13754–13769, 2016. arXiv:1601.04077.