Coulomb Matrix

Coulomb Matrix (CM) [1] is a simple global descriptor which mimics the electrostatic interaction between nuclei.

Coulomb matrix is calculated with the equation below.

\[\begin{split}\begin{equation} M_{ij}^\mathrm{Coulomb}=\left\{ \begin{matrix} 0.5 Z_i^{2.4} & \text{for } i = j \\ \frac{Z_i Z_j}{R_{ij}} & \text{for } i \neq j \end{matrix} \right. \end{equation}\end{split}\]

The diagonal elements can be seen as the interaction of an atom with itself and are essentially a polynomial fit of the atomic energies to the nuclear charge \(Z_i\). The off-diagonal elements represent the Coulomb repulsion between nuclei \(i\) and \(j\).

Let’s have a look at the CM for methanol:

\[\begin{split}\begin{bmatrix} 36.9 & 33.7 & 5.5 & 3.1 & 5.5 & 5.5 \\ 33.7 & 73.5 & 4.0 & 8.2 & 3.8 & 3.8 \\ 5.5 & 4.0 & 0.5 & 0.35 & 0.56 & 0.56 \\ 3.1 & 8.2 & 0.35 & 0.5 & 0.43 & 0.43 \\ 5.5 & 3.8 & 0.56 & 0.43 & 0.5 & 0.56 \\ 5.5 & 3.8 & 0.56 & 0.43 & 0.56 & 0.5 \end{bmatrix}\end{split}\]

In the matrix above the first row corresponds to carbon (C) in methanol interacting with all the other atoms (columns 2-5) and itself (column 1). Likewise, the first column displays the same numbers, since the matrix is symmetric. Furthermore, the second row (column) corresponds to oxygen and the remaining rows (columns) correspond to hydrogen (H). Can you determine which one is which?

Since the Coulomb Matrix was published in 2012 more sophisticated descriptors have been developed. However, CM still does a reasonably good job when comparing molecules with each other. Apart from that, it can be understood intuitively and is a good introduction to descriptors.

Setup

Instantiating the object that is used to create Coulomb matrices can be done as follows:

from dscribe.descriptors import CoulombMatrix

atomic_numbers = [1, 8]
rcut = 6.0
nmax = 8
lmax = 6

# Setting up the CM descriptor
cm = CoulombMatrix(
    n_atoms_max=6,
)

The constructor takes the following parameters:

Creation

After CM has been set up, it may be used on atomic structures with the create-method.

from ase.build import molecule

# Molecule created as an ASE.Atoms
methanol = molecule("CH3OH")

# Create CM output for the system
cm_methanol = cm.create(methanol)

print(cm_methanol)
print("flattened", cm_methanol.shape)

# Create output for multiple system
samples = [molecule("H2O"), molecule("NO2"), molecule("CO2")]
coulomb_matrices = cm.create(samples)            # Serial
coulomb_matrices = cm.create(samples, n_jobs=2)  # Parallel

The call syntax for the create-function is as follows:

Note that if you specify in n_atoms_max a lower number than atoms in your structure it will cause an error. The output will in this case be a flattened matrix, specifically a numpy array with size #atoms * #atoms. The number of features may be requested beforehand with the get_number_of_features()-method.

In the case of multiple samples, the creation can also be parallellized by using the n_jobs-parameter. This splits the list of structures into equally sized parts and spaws a separate process to handle each part.

Examples

The following examples demonstrate usage of the descriptor. These examples are also available in dscribe/examples/coulombmatrix.py.

Flattening

You can control whether the returned array is two-dimensional or one-dimensional by using the flatten-parameter

cm = CoulombMatrix(
    n_atoms_max=6, flatten=False
)
cm_methanol = cm.create(methanol)

print(cm_methanol)
print("not flattened", cm_methanol.shape)

Options for permutation

The following snippet introduces the different options for handling permutation invariance. See [2] for more information on these methods.

cm = CoulombMatrix(
    n_atoms_max=6, flatten=False,
    permutation='none'
)

cm_methanol = cm.create(methanol)
print(methanol.get_chemical_symbols())
print("in order of appearance", cm_methanol)

# Sort by Euclidean (L2) norm.
cm = CoulombMatrix(
    n_atoms_max=6, flatten=False,
    permutation='sorted_l2'
)

cm_methanol = cm.create(methanol)
print("default: sorted by L2-norm", cm_methanol)

# Random
cm = CoulombMatrix(
    n_atoms_max=6, flatten=False,
    permutation='random',
    sigma=70,
    seed=None
)

cm_methanol = cm.create(methanol)
print("randomly sorted", cm_methanol)

# Eigenspectrum
cm = CoulombMatrix(
    n_atoms_max=6, flatten=False,
    permutation='eigenspectrum'
)

cm_methanol = cm.create(methanol)
print("eigenvalues", cm_methanol)

sorted_l2 (default): Sorts rows and columns by their L2-norm.
none: keeps the order of the rows and columns as the atoms are read from the ase object.
random: The term random can be misleading at first sight because it does not scramble the rows and columns completely randomly. The rows and columns are sorted by their L2-norm after applying Gaussian noise to the norms. The standard deviation of the noise is determined by the additionally required sigma-parameter. sigma determines the standard deviation of the gaussian distributed noise determining how much the rows and columns of the randomly sorted matrix are scrambled. Feel free to try different sigma values to see the effect on the ordering. Optionally, you can specify a random seed. sigma and seed are ignored if permutation is other than “random”. This option is useful if you want to augment your dataset, similar to augmented image datasets where each image gets mirrored, rotated, cropped or otherwise transformed. You would need to create several instances of the randomly sorted CM in a loop. The advantage of augmenting data like this over using completely random CM lies in the lower number of “likely permutations”. Rows and columns of the CM are allowed to flip just so that the feature space (all possible CM) is smooth but also compact.
eigenspectrum: Only the eigenvalues of the matrix are returned sorted by their absolute value in descending order. On one hand, it is a more compact descriptor, but on the other hand, it potentially loses information encoded in the CM interactions.

Zero-padding

The number of features in CM depends on the size of the system. Since most machine learning methods require size-consistent inputs it is convenient to define the maximum number of atoms n_atoms_max in a dataset. If the structure has fewer atoms, the rest of the CM will be zero-padded. One can imagine non-interacting ghost atoms as place-holders to ensure the same number of atoms in every system.

cm = CoulombMatrix(
    n_atoms_max=10, flatten=False
)
cm_methanol = cm.create(methanol)

print("zero-padded", cm_methanol)
print(cm_methanol.shape)

Not meant for periodic systems

The CM was not designed for periodic systems. If you do add periodic boundary conditions, you will see that it does not change the elements.

methanol.set_pbc([True, True, True])
methanol.set_cell([[10.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 10.0],
    ])

cm = CoulombMatrix(
    n_atoms_max=6, flatten=False
)

cm_methanol = cm.create(methanol)
print("with pbc", cm_methanol)

Instead, the Sine Matrix and the Ewald Sum Matrix have been designed as periodic counterparts to the CM.

Invariance

A good descriptor should be invariant with respect to translation, rotation and permutation. No matter how you translate or rotate it or change the indexing of the atoms (not the atom types!), it will still be the same molecule! The following lines confirm that this is true for CM.

cm = CoulombMatrix(
    n_atoms_max=6,
    flatten=False,
    permutation="sorted_l2"
)

# Translation
methanol.translate((5, 7, 9))
cm_methanol = cm.create(methanol)
print(cm_methanol)

# Rotation
methanol.rotate(90, 'z', center=(0, 0, 0))
cm_methanol = cm.create(methanol)
print(cm_methanol)

# Permutation
upside_down_methanol = methanol[::-1]
cm_methanol = cm.create(upside_down_methanol)
print(cm_methanol)

1: Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, Jan 2012. doi:10.1103/PhysRevLett.108.058301.
2: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 440–448. Curran Associates, Inc., 2012. URL: http://papers.nips.cc/paper/4830-learning-invariant-representations-of-molecules-for-atomization-energy-prediction.pdf.