Coulomb Matrix

Coulomb Matrix (CM) [1] is a simple global descriptor which mimics the electrostatic interaction between nuclei.

Coulomb matrix is calculated with the equation below.

\[\begin{split}\begin{equation} M_{ij}^\mathrm{Coulomb}=\left\{ \begin{matrix} 0.5 Z_i^{2.4} & \text{for } i = j \\ \frac{Z_i Z_j}{R_{ij}} & \text{for } i \neq j \end{matrix} \right. \end{equation}\end{split}\]

The diagonal elements can be seen as the interaction of an atom with itself and are essentially a polynomial fit of the atomic energies to the nuclear charge \(Z_i\). The off-diagonal elements represent the Coulomb repulsion between nuclei \(i\) and \(j\).

Let’s have a look at the CM for methanol:

image of methanol
\[\begin{split}\begin{bmatrix} 36.9 & 33.7 & 5.5 & 3.1 & 5.5 & 5.5 \\ 33.7 & 73.5 & 4.0 & 8.2 & 3.8 & 3.8 \\ 5.5 & 4.0 & 0.5 & 0.35 & 0.56 & 0.56 \\ 3.1 & 8.2 & 0.35 & 0.5 & 0.43 & 0.43 \\ 5.5 & 3.8 & 0.56 & 0.43 & 0.5 & 0.56 \\ 5.5 & 3.8 & 0.56 & 0.43 & 0.56 & 0.5 \end{bmatrix}\end{split}\]

In the matrix above the first row corresponds to carbon (C) in methanol interacting with all the other atoms (columns 2-5) and itself (column 1). Likewise, the first column displays the same numbers, since the matrix is symmetric. Furthermore, the second row (column) corresponds to oxygen and the remaining rows (columns) correspond to hydrogen (H). Can you determine which one is which?

Since the Coulomb Matrix was published in 2012 more sophisticated descriptors have been developed. However, CM still does a reasonably good job when comparing molecules with each other. Apart from that, it can be understood intuitively and is a good introduction to descriptors.

Setup

Instantiating the object that is used to create Coulomb matrices can be done as follows:

from dscribe.descriptors import CoulombMatrix

# Setting up the CM descriptor
cm = CoulombMatrix(n_atoms_max=6)

The constructor takes the following parameters:

CoulombMatrix.__init__(n_atoms_max, permutation='sorted_l2', sigma=None, seed=None, sparse=False)[source]
Parameters:
  • n_atoms_max (int) – The maximum nuber of atoms that any of the samples can have. This controls how much zeros need to be padded to the final result.

  • permutation (string) –

    Defines the method for handling permutational invariance. Can be one of the following:

    • none: The matrix is returned in the order defined by the Atoms.

    • sorted_l2: The rows and columns are sorted by the L2 norm.

    • eigenspectrum: Only the eigenvalues are returned sorted by their absolute value in descending order.

    • random: The rows and columns are sorted by their L2 norm after applying Gaussian noise to the norms. The standard deviation of the noise is determined by the sigma-parameter.

  • sigma (float) – Provide only when using the random-permutation option. Standard deviation of the gaussian distributed noise determining how much the rows and columns of the randomly sorted matrix are scrambled.

  • seed (int) – Provide only when using the random-permutation option. A seed to use for drawing samples from a normal distribution.

  • sparse (bool) – Whether the output should be a sparse matrix or a dense numpy array.

Creation

After CM has been set up, it may be used on atomic structures with the create-method.

# Create CM output for the system
cm_methanol = cm.create(methanol)

print(cm_methanol)

# Create output for multiple system
samples = [molecule("H2O"), molecule("NO2"), molecule("CO2")]
coulomb_matrices = cm.create(samples)            # Serial
coulomb_matrices = cm.create(samples, n_jobs=2)  # Parallel

The call syntax for the create-function is as follows:

CoulombMatrix.create(system, n_jobs=1, only_physical_cores=False, verbose=False)[source]

Return the Coulomb matrix for the given systems.

Parameters:
  • system (ase.Atoms or list of ase.Atoms) – One or many atomic structures.

  • n_jobs (int) – Number of parallel jobs to instantiate. Parallellizes the calculation across samples. Defaults to serial calculation with n_jobs=1. If a negative number is given, the used cpus will be calculated with, n_cpus + n_jobs, where n_cpus is the amount of CPUs as reported by the OS. With only_physical_cores you can control which types of CPUs are counted in n_cpus.

  • only_physical_cores (bool) – If a negative n_jobs is given, determines which types of CPUs are used in calculating the number of jobs. If set to False (default), also virtual CPUs are counted. If set to True, only physical CPUs are counted.

  • verbose (bool) – Controls whether to print the progress of each job into to the console.

Returns:

Coulomb matrix for the given systems. The return type depends on the ‘sparse’-attribute. The first dimension is determined by the amount of systems.

Return type:

np.ndarray | sparse.COO

Note that if you specify in n_atoms_max a lower number than atoms in your structure it will cause an error. The output will in this case be a flattened matrix, specifically a numpy array with size n_atoms * n_atoms. The number of features may be requested beforehand with the get_number_of_features()-method.

In the case of multiple samples, the creation can also be parallellized by using the n_jobs-parameter. This splits the list of structures into equally sized parts and spaws a separate process to handle each part.

Examples

The following examples demonstrate usage of the descriptor. These examples are also available in dscribe/examples/coulombmatrix.py.

Flattening

You can control whether the returned array is two-dimensional or one-dimensional by using the flatten-parameter

Options for permutation

The following snippet introduces the different options for handling permutation invariance. See [2] for more information on these methods.

# No sorting
cm = CoulombMatrix(n_atoms_max=6, permutation='none')

cm_methanol = cm.create(methanol)
print(methanol.get_chemical_symbols())
print("in order of appearance", cm_methanol)

# Sort by Euclidean (L2) norm.
cm = CoulombMatrix(n_atoms_max=6, permutation='sorted_l2')

cm_methanol = cm.create(methanol)
print("default: sorted by L2-norm", cm_methanol)

# Random
cm = CoulombMatrix(
    n_atoms_max=6,
    permutation='random',
    sigma=70,
    seed=None
)

cm_methanol = cm.create(methanol)
print("randomly sorted", cm_methanol)

# Eigenspectrum
cm = CoulombMatrix(
    n_atoms_max=6,
    permutation='eigenspectrum'
)

cm_methanol = cm.create(methanol)
print("eigenvalues", cm_methanol)
  • sorted_l2 (default): Sorts rows and columns by their L2-norm.

  • none: keeps the order of the rows and columns as the atoms are read from the ase object.

  • random: The term random can be misleading at first sight because it does not scramble the rows and columns completely randomly. The rows and columns are sorted by their L2-norm after applying Gaussian noise to the norms. The standard deviation of the noise is determined by the additionally required sigma-parameter. sigma determines the standard deviation of the gaussian distributed noise determining how much the rows and columns of the randomly sorted matrix are scrambled. Feel free to try different sigma values to see the effect on the ordering. Optionally, you can specify a random seed. sigma and seed are ignored if permutation` is other than random. This option is useful if you want to augment your dataset, similar to augmented image datasets where each image gets mirrored, rotated, cropped or otherwise transformed. You would need to create several instances of the randomly sorted CM in a loop. The advantage of augmenting data like this over using completely random CM lies in the lower number of “likely permutations”. Rows and columns of the CM are allowed to flip just so that the feature space (all possible CM) is smooth but also compact.

  • eigenspectrum: Only the eigenvalues of the matrix are returned sorted by their absolute value in descending order. On one hand, it is a more compact descriptor, but on the other hand, it potentially loses information encoded in the CM interactions.

Zero-padding

The number of non-zero features in CM depends on the size of the system. If the structure has fewer atoms than n_atoms_max, the rest of the CM will be zero-padded. One can imagine non-interacting ghost atoms as place-holders to ensure the same number of atoms in every system.

# Zero-padding
cm = CoulombMatrix(n_atoms_max=10)
cm_methanol = cm.create(methanol)

print("zero-padded", cm_methanol)
print(cm_methanol.shape)

Not meant for periodic systems

The CM was not designed for periodic systems. If you do add periodic boundary conditions, you will see that it does not change the elements.

# Not meant for periodic systems
methanol.set_pbc([True, True, True])
methanol.set_cell([[10.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 10.0],
    ])

cm = CoulombMatrix(n_atoms_max=6)

cm_methanol = cm.create(methanol)
print("with pbc", cm_methanol)

Instead, the Sine Matrix and the Ewald Sum Matrix have been designed as periodic counterparts to the CM.

Invariance

A good descriptor should be invariant with respect to translation, rotation and permutation. No matter how you translate or rotate it or change the indexing of the atoms (not the atom types!), it will still be the same molecule! The following lines confirm that this is true for CM.

# Invariance
cm = CoulombMatrix(
    n_atoms_max=6,
    permutation="sorted_l2"
)

# Translation
methanol.translate((5, 7, 9))
cm_methanol = cm.create(methanol)
print(cm_methanol)

# Rotation
methanol.rotate(90, 'z', center=(0, 0, 0))
cm_methanol = cm.create(methanol)
print(cm_methanol)

# Permutation
upside_down_methanol = methanol[::-1]
cm_methanol = cm.create(upside_down_methanol)
print(cm_methanol)
[1]

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, Jan 2012. doi:10.1103/PhysRevLett.108.058301.

[2]

Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 440–448. Curran Associates, Inc., 2012. URL: http://papers.nips.cc/paper/4830-learning-invariant-representations-of-molecules-for-atomization-energy-prediction.pdf.