Smooth Overlap of Atomic Positions

Smooth Overlap of Atomic Positions (SOAP) is a descriptor that encodes regions of atomic geometries by using a local expansion of a gaussian smeared atomic density with orthonormal functions based on spherical harmonics and radial basis functions.

The SOAP output from DScribe is the partial power spectrum vector \(\mathbf{p}(\mathbf{r})\), where the elements are defined as [1]

\[p(\mathbf{r})^{Z_1 Z_2}_{n n' l} = \pi \sqrt{\frac{8}{2l+1}}\sum_m c^{Z_1}_{n l m}(\mathbf{r})^*c^{Z_2}_{n' l m}(\mathbf{r})\]

where \(n\) and \(n'\) are indices for the different radial basis functions up to \(n_\mathrm{max}\), \(l\) is the angular degree of the spherical harmonics up to \(l_\mathrm{max}\) and \(Z_1\) and \(Z_2\) are atomic species.
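The contraction over \(m\) in the formula above can be sketched with numpy. This is only an illustration under stated assumptions: the coefficients are random stand-ins rather than real expansion coefficients, and the array layout `(n_max, l_max + 1, 2 * l_max + 1)` with zero padding for unused \(m\) entries, as well as the helper names `random_coefficients` and `partial_power_spectrum`, are choices made for this sketch and not part of DScribe.

```python
import numpy as np


def random_coefficients(rng, n_max, l_max):
    """Random stand-in for c_{nlm}; entries beyond the 2l + 1 valid m values are zero."""
    c = rng.normal(size=(n_max, l_max + 1, 2 * l_max + 1))
    for l in range(l_max + 1):
        c[:, l, 2 * l + 1:] = 0.0
    return c


def partial_power_spectrum(c1, c2):
    """Return p[n, n', l] = pi * sqrt(8 / (2l + 1)) * sum_m conj(c1[n, l, m]) * c2[n', l, m]."""
    l = np.arange(c1.shape[1])
    prefactor = np.pi * np.sqrt(8.0 / (2 * l + 1))
    # Contract over m only; n, n' and l remain as free indices.
    p = np.einsum("nlm,klm->nkl", np.conj(c1), c2)
    return prefactor * p


rng = np.random.default_rng(0)
n_max, l_max = 3, 2
c_H = random_coefficients(rng, n_max, l_max)  # hypothetical coefficients for species H
c_O = random_coefficients(rng, n_max, l_max)  # hypothetical coefficients for species O

p_HH = partial_power_spectrum(c_H, c_H)
p_HO = partial_power_spectrum(c_H, c_O)
p_OH = partial_power_spectrum(c_O, c_H)
print(p_HH.shape)  # (3, 3, 3), i.e. (n_max, n_max, l_max + 1)
```

For real coefficients the same-species block is symmetric in \(n\) and \(n'\), and the cross-species blocks satisfy \(p^{Z_1 Z_2}_{n n' l} = p^{Z_2 Z_1}_{n' n l}\), which is why only part of the index combinations needs to be stored.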

The coefficients \(c^Z_{nlm}\) are defined as the following inner products:

\[c^Z_{nlm}(\mathbf{r}) =\iiint_{\mathcal{R}^3}\mathrm{d}V g_{n}(r)Y_{lm}(\theta, \phi)\rho^Z(\mathbf{r}),\]

where \(\mathbf{r}\) is a position in space, \(\rho^Z(\mathbf{r})\) is the gaussian smoothed atomic density for atoms with atomic number \(Z\), \(Y_{lm}(\theta, \phi)\) are the real spherical harmonics, and \(g_{n}(r)\) is the radial basis function.

For the radial degree of freedom, the choice of the basis function \(g_{n}(r)\) is less straightforward, and multiple approaches may be used. By default, the DScribe implementation uses spherical gaussian type orbitals as radial basis functions [2], as they allow much faster analytic computation. The original polynomial radial basis set [3] is also available.

The spherical harmonics definition used by DScribe is based on real (tesseral) spherical harmonics. This real form spans the same space as the complex version, and is defined as a linear combination of the complex basis. As the atomic density is a real-valued quantity (no imaginary part), it is natural and computationally easier to use this form, which does not require complex algebra.
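For reference, one common convention for building the real spherical harmonics \(Y_{lm}\) from the complex ones \(Y_l^m\) (with the Condon-Shortley phase included in \(Y_l^m\)) is written out below. Whether DScribe uses exactly this sign and ordering convention is an assumption here and should be checked against the implementation.

```latex
\[
Y_{lm}(\theta, \phi) =
\begin{cases}
\sqrt{2}\,(-1)^m\,\operatorname{Im}\left[Y_l^{|m|}(\theta, \phi)\right] & \text{if } m < 0 \\
Y_l^{0}(\theta, \phi) & \text{if } m = 0 \\
\sqrt{2}\,(-1)^m\,\operatorname{Re}\left[Y_l^{m}(\theta, \phi)\right] & \text{if } m > 0
\end{cases}
\]
```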

The SOAP kernel [3] between two atomic environments can be retrieved as a normalized polynomial kernel of the partial power spectra:

\[K^\mathrm{SOAP}(\mathbf{p}, \mathbf{p'}) = \left( \frac{\mathbf{p} \cdot \mathbf{p'}}{\sqrt{\mathbf{p} \cdot \mathbf{p}~\mathbf{p'} \cdot \mathbf{p'}}}\right)^{\xi}\]

Although this is the original similarity definition, nothing in practice prevents the usage of the output in non-kernel based methods or with other kernel definitions.
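The normalized polynomial kernel above can be sketched in a few lines of numpy. The function name `soap_kernel` and the example vectors are made up for illustration; in practice the inputs would be rows of the SOAP output.

```python
import numpy as np


def soap_kernel(p, q, xi=2):
    """Normalized polynomial kernel between two power spectrum vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Cosine similarity of the two vectors, raised to the power xi.
    normalized_overlap = np.dot(p, q) / np.sqrt(np.dot(p, p) * np.dot(q, q))
    return normalized_overlap ** xi


# Hypothetical feature vectors standing in for two atomic environments.
p_a = np.array([1.0, 2.0, 0.5])
p_b = np.array([0.9, 2.1, 0.4])

k_self = soap_kernel(p_a, p_a)          # identical environments
k_scaled = soap_kernel(p_a, 3.0 * p_a)  # the normalization removes the overall scale
k_ab = soap_kernel(p_a, p_b, xi=4)      # larger xi makes the kernel more selective
print(k_self, k_scaled, k_ab)
```

Because of the normalization, the kernel of an environment with itself is one, and for non-negative overlaps the value stays between zero and one.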

The partial SOAP spectrum ensures stratification of the output by species and also provides information about cross-species interaction. See the get_location() method for a way of easily accessing parts of the output that correspond to a particular species combination. In pseudo-code the ordering of the output vector is as follows:

for Z in atomic numbers in increasing order:
   for Z' in atomic numbers in increasing order:
      for l in range(l_max+1):
         for n in range(n_max):
            for n' in range(n_max):
               if n' >= n and Z' >= Z:
                  append p(\mathbf{r})^{Z Z'}_{n n' l} to output
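The pseudo-code above translates directly into runnable Python. The sketch below is a literal transcription that only produces index labels, not feature values. The function name `soap_feature_labels` is hypothetical, and the resulting count is not guaranteed to match a given DScribe release, so get_number_of_features() remains the authoritative source for the output size.

```python
def soap_feature_labels(atomic_numbers, n_max, l_max):
    """List (Z, Z', n, n', l) labels in the order given by the pseudo-code above."""
    species = sorted(set(atomic_numbers))
    labels = []
    for z1 in species:
        for z2 in species:
            for l in range(l_max + 1):
                for n1 in range(n_max):
                    for n2 in range(n_max):
                        # Keep each unordered combination only once.
                        if n2 >= n1 and z2 >= z1:
                            labels.append((z1, z2, n1, n2, l))
    return labels


labels = soap_feature_labels([1, 8], n_max=2, l_max=1)
print(len(labels))   # 18
print(labels[:3])    # [(1, 1, 0, 0, 0), (1, 1, 0, 1, 0), (1, 1, 1, 1, 0)]
```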


SOAP is set up by instantiating the SOAP class. The constructor takes the following parameters:

SOAP.__init__(rcut, nmax, lmax, sigma=1.0, rbf='gto', species=None, periodic=False, crossover=True, average='off', sparse=False)
  • rcut (float) – A cutoff radius for the local region in angstroms. Should be larger than 1 angstrom.

  • nmax (int) – The number of radial basis functions.

  • lmax (int) – The maximum degree of spherical harmonics.

  • species (iterable) – The chemical species as a list of atomic numbers or as a list of chemical symbols. Notice that this is not the atomic numbers that are present for an individual system, but should contain all the elements that are ever going to be encountered when creating the descriptors for a set of systems. Keeping the number of chemical species as low as possible is preferable.

  • sigma (float) – The standard deviation of the gaussians used to expand the atomic density.

  • rbf (str) –

    The radial basis functions to use. The available options are:

    • "gto": Spherical gaussian type orbitals defined as \(g_{nl}(r) = \sum_{n'=1}^{n_\mathrm{max}}\,\beta_{nn'l} r^l e^{-\alpha_{n'l}r^2}\)

    • "polynomial": Polynomial basis defined as \(g_{n}(r) = \sum_{n'=1}^{n_\mathrm{max}}\,\beta_{nn'} (r-r_\mathrm{cut})^{n'+2}\)

  • periodic (bool) – Determines whether the system is considered to be periodic.

  • crossover (bool) – Determines if crossover of atomic types should be included in the power spectrum. If enabled, the power spectrum is calculated over all unique species combinations Z and Z’. If disabled, the power spectrum does not contain cross-species information and is only calculated for each unique species Z. Turned on by default to correspond to the original definition.

  • average (str) –

    The averaging mode over the centers of interest. Valid options are:

    • "off": No averaging.

    • "inner": Averaging over sites before summing up the magnetic quantum numbers: \(p_{nn'l}^{Z_1,Z_2} \sim \sum_m (\frac{1}{n} \sum_i c_{nlm}^{i, Z_1})^{*} (\frac{1}{n} \sum_i c_{n'lm}^{i, Z_2})\)

    • "outer": Averaging over the power spectrum of different sites: \(p_{nn'l}^{Z_1,Z_2} \sim \frac{1}{n} \sum_i \sum_m (c_{nlm}^{i, Z_1})^{*} (c_{n'lm}^{i, Z_2})\)

  • sparse (bool) – Whether the output should be a sparse matrix or a dense numpy array.

Increasing the arguments nmax and lmax makes SOAP more accurate but also increases the number of features.
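To illustrate the "polynomial" option of rbf, the sketch below builds coefficients \(\beta_{nn'}\) that make the functions \(g_n(r)\) orthonormal on \([0, r_\mathrm{cut}]\). It assumes the radial weight \(r^2\) in the overlap integral and a Löwdin (symmetric) orthonormalization, following the usual construction for this basis. How DScribe actually computes \(\beta_{nn'}\) is not stated in this document, so the helper `polynomial_radial_basis` is hypothetical.

```python
import numpy as np


def polynomial_radial_basis(r_cut, n_max, n_quad=64):
    """Return (beta, r, w, g): coefficients, quadrature nodes/weights and g_n sampled on the nodes."""
    # Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, r_cut].
    x, w = np.polynomial.legendre.leggauss(n_quad)
    r = 0.5 * r_cut * (x + 1.0)
    w = 0.5 * r_cut * w
    # Primitive functions (r - r_cut)^(n' + 2) for n' = 1, ..., n_max.
    powers = np.arange(1, n_max + 1) + 2
    phi = (r[None, :] - r_cut) ** powers[:, None]
    # Overlap matrix S_ab = integral of phi_a(r) * phi_b(r) * r^2 over [0, r_cut].
    S = (phi * (w * r**2)) @ phi.T
    # Loewdin orthonormalization: beta = S^(-1/2), via the eigendecomposition of S.
    evals, evecs = np.linalg.eigh(S)
    beta = evecs @ np.diag(evals ** -0.5) @ evecs.T
    g = beta @ phi
    return beta, r, w, g


beta, r, w, g = polynomial_radial_basis(r_cut=3.0, n_max=3)
overlap = (g * (w * r**2)) @ g.T  # should be close to the identity matrix
print(np.allclose(overlap, np.eye(3), atol=1e-6))
```

The overlap matrix of such polynomial primitives becomes increasingly ill-conditioned as nmax grows, so a production implementation would need more care than this sketch.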


After SOAP has been set up, it may be used on atomic structures with the create()-method.

As SOAP is a local descriptor, it also takes as input a list of atomic indices or positions. If no such positions are defined, SOAP will be created for each atom in the system. The call syntax for the create-method is as follows:

SOAP.create(system, positions=None, n_jobs=1, verbose=False)

Return the SOAP output for the given systems and given positions.

  • system (ase.Atoms or list of ase.Atoms) – One or many atomic structures.

  • positions (list) – Positions where to calculate SOAP. Can be provided as cartesian positions or atomic indices. If no positions are defined, the SOAP output will be created for all atoms in the system. When calculating SOAP for multiple systems, provide the positions as a list for each system.

  • n_jobs (int) – Number of parallel jobs to instantiate. Parallelizes the calculation across samples. Defaults to serial calculation with n_jobs=1.

  • verbose (bool) – Controls whether to print the progress of each job to the console.


Returns

The SOAP output for the given systems and positions. The return type depends on the ‘sparse’-attribute. The first dimension is determined by the number of positions and systems and the second dimension is determined by the get_number_of_features()-function. When multiple systems are provided the results are ordered by the input order of systems and their positions.

Return type

np.ndarray | scipy.sparse.csr_matrix

The output will in this case be a numpy array with shape [#positions, #features]. The number of features may be requested beforehand with the get_number_of_features()-method.


The following examples demonstrate common use cases for the descriptor. These examples are also available in dscribe/examples/

Finite systems

For a finite system such as a water molecule, SOAP is created by passing the structure to the create()-method.

We are expecting a matrix where each row represents the local environment of one atom of the molecule. The length of the feature vector depends on the number of species defined in species as well as on nmax and lmax. You can verify this by changing nmax and lmax.

Periodic systems

Crystals can also be SOAPed by simply setting the periodic keyword to True. In this case a cell needs to be defined for the ase object.

In a cubic copper unit cell with four atoms, for example, the SOAP feature vectors of all four atoms match, which shows that the atoms are equivalent.

Locating information

The SOAP class provides the get_location()-method, which can be used to query for the slice of the output that contains a specific element combination.
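To make the idea of such a slice concrete, the sketch below computes where a species pair would sit in the output if the ordering followed the pseudo-code given earlier on this page exactly. The function `species_pair_slice` is a hypothetical stand-in written for illustration and is not DScribe's implementation. With a real descriptor object, get_location() should be used instead.

```python
def species_pair_slice(species, z1, z2, n_max, l_max):
    """Slice of the output vector for the pair (z1, z2), assuming the pseudo-code ordering."""
    species = sorted(set(species))
    low, high = min(z1, z2), max(z1, z2)
    # Features per species pair: (l_max + 1) values of l times the (n, n') pairs with n' >= n.
    block = (l_max + 1) * n_max * (n_max + 1) // 2
    # Pairs are visited with Z ascending and, for each Z, Z' >= Z ascending.
    pairs = [(a, b) for a in species for b in species if b >= a]
    k = pairs.index((low, high))
    return slice(k * block, (k + 1) * block)


loc_HO = species_pair_slice([1, 8], 8, 1, n_max=2, l_max=1)
print(loc_HO)  # slice(6, 12, None)
```

The returned slice can be applied to the second axis of the output, for example `output[:, loc_HO]`, to select the H-O part of every feature vector.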

Sparse output

If the descriptor size is large (for instance with many different element types and high nmax and lmax), considerable parts of the features will often be zero. In this case, saving the results in a sparse matrix saves memory. DScribe supports this through the scipy library when sparse=True is set in the constructor. Be aware of the difference between the two output types: a scipy sparse matrix or a dense numpy array.

Most operations work on sparse matrices as they would on numpy arrays. Otherwise, a sparse matrix can simply be converted by calling the .toarray() method. For further information check the scipy documentation on sparse matrices.
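The difference between the two container types can be seen without running SOAP at all. The sketch below uses a small hand-made matrix as a stand-in for a mostly empty descriptor output and relies only on numpy and scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for a SOAP output with 4 positions and 1000 features, mostly zero.
dense = np.zeros((4, 1000))
dense[0, 3] = 1.5
dense[2, 10] = 0.7

sparse = csr_matrix(dense)
print(sparse.shape, sparse.nnz)  # (4, 1000) 2

# Matrix products work directly on the sparse container.
linear_kernel = (sparse @ sparse.T).toarray()

# Convert back to a dense numpy array when an operation requires it.
restored = sparse.toarray()
```

Only the two non-zero entries and their indices are stored in the sparse container, whereas the dense array stores all 4000 values.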

Average output

One way of turning a local descriptor into a global descriptor is simply by taking the average over all atoms. Since SOAP separates features by atom types, this essentially means averaging over atoms of the same type.
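The two averaging modes of the average parameter differ in where the mean over sites is taken. The sketch below contrasts them on random stand-in coefficients. The array layout `(n_sites, n_max, l_max + 1, 2 * l_max + 1)` and the function names are assumptions made for this illustration, and the prefactor is omitted, as in the proportionality formulas of the parameter description.

```python
import numpy as np


def power_spectrum(c1, c2):
    """Sum over m of conj(c1[n, l, m]) * c2[n', l, m], without the prefactor."""
    return np.einsum("nlm,klm->nkl", np.conj(c1), c2)


def average_inner(c1_sites, c2_sites):
    """Average the coefficients over sites first, then form the power spectrum."""
    return power_spectrum(c1_sites.mean(axis=0), c2_sites.mean(axis=0))


def average_outer(c1_sites, c2_sites):
    """Form the power spectrum for every site, then average over sites."""
    per_site = [power_spectrum(a, b) for a, b in zip(c1_sites, c2_sites)]
    return np.mean(per_site, axis=0)


rng = np.random.default_rng(1)
n_sites, n_max, l_max = 4, 2, 1
c_sites = rng.normal(size=(n_sites, n_max, l_max + 1, 2 * l_max + 1))

inner = average_inner(c_sites, c_sites)
outer = average_outer(c_sites, c_sites)
print(inner.shape, outer.shape)
print(np.allclose(inner, outer))
```

Both modes give an array whose size is independent of the number of sites, but the values generally differ because the power spectrum is quadratic in the coefficients. For a single site the two modes coincide.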

The result will be a feature vector and not a matrix, so it no longer depends on the system size. This makes it possible to compare two or more structures with different numbers of atoms, for example by applying a distance metric of our choice.

It seems that the local environments of water and hydrogen peroxide are more similar to each other. To see more advanced methods for comparing structures of different sizes with each other, see the kernel building tutorial. Notice that simply averaging the SOAP vector does not always correspond to the Average Kernel discussed in the kernel building tutorial, as for non-linear kernels the order of kernel calculation and averaging matters.
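The remark about the order of averaging and kernel evaluation can be checked numerically. The sketch below uses a plain, unnormalized polynomial kernel \((\mathbf{p} \cdot \mathbf{p'})^{\xi}\), chosen for simplicity, and two tiny hand-made sets of feature vectors. The function names are hypothetical and do not refer to the kernel building tutorial's code.

```python
import numpy as np


def polynomial_kernel(p, q, xi):
    """Unnormalized polynomial kernel, a simplification chosen for this sketch."""
    return np.dot(p, q) ** xi


def kernel_of_averages(A, B, xi):
    """Average the local feature vectors first, then evaluate the kernel once."""
    return polynomial_kernel(A.mean(axis=0), B.mean(axis=0), xi)


def average_of_kernels(A, B, xi):
    """Evaluate the kernel for every pair of local environments, then average."""
    return np.mean([[polynomial_kernel(a, b, xi) for b in B] for a in A])


# Two hypothetical structures with two local environments each.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])

print(kernel_of_averages(A, B, xi=1), average_of_kernels(A, B, xi=1))  # 0.5 0.5
print(kernel_of_averages(A, B, xi=2), average_of_kernels(A, B, xi=2))  # 0.25 0.5
```

For the linear case the two orders agree because the dot product is bilinear, while for \(\xi = 2\) they no longer do.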


[1] Sandip De, Albert P. Bartók, Gábor Csányi, and Michele Ceriotti. Comparing molecules and solids across structural and alchemical space. Physical Chemistry Chemical Physics, 18(20):13754–13769, 2016. doi:10.1039/c6cp00415f.

[2] Marc O J Jäger, Eiaki V Morooka, Filippo Federici Canova, Lauri Himanen, and Adam S Foster. Machine learning hydrogen adsorption on nanoclusters through structural descriptors. npj Computational Materials, 2018. doi:10.1038/s41524-018-0096-5.

[3] Albert P. Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Physical Review B - Condensed Matter and Materials Physics, 87(18):1–16, 2013. doi:10.1103/PhysRevB.87.184115.