Coulomb Matrix
==============
Coulomb Matrix (CM) :cite:`cm` is a simple global descriptor which mimics the
electrostatic interaction between nuclei.
Coulomb matrix is calculated with the equation below.
.. math::
\begin{equation}
M_{ij}^\mathrm{Coulomb}=\left\{
\begin{matrix}
0.5 Z_i^{2.4} & \text{for } i = j \\
\frac{Z_i Z_j}{R_{ij}} & \text{for } i \neq j
\end{matrix}
\right.
\end{equation}
The diagonal elements can be seen as the interaction of an atom with itself and
are essentially a polynomial fit of the atomic energies to the nuclear charge
:math:`Z_i`. The off-diagonal elements represent the Coulomb repulsion between
nuclei :math:`i` and :math:`j`.
Let's have a look at the CM for methanol:
.. image:: /_static/img/methanol-3d-balls.png
:width: 344px
:height: 229px
:scale: 50 %
:alt: image of methanol
:align: center
.. math::
\begin{bmatrix}
36.9 & 33.7 & 5.5 & 3.1 & 5.5 & 5.5 \\
33.7 & 73.5 & 4.0 & 8.2 & 3.8 & 3.8 \\
5.5 & 4.0 & 0.5 & 0.35 & 0.56 & 0.56 \\
3.1 & 8.2 & 0.35 & 0.5 & 0.43 & 0.43 \\
5.5 & 3.8 & 0.56 & 0.43 & 0.5 & 0.56 \\
5.5 & 3.8 & 0.56 & 0.43 & 0.56 & 0.5
\end{bmatrix}
In the matrix above the first row corresponds to carbon (C) in methanol
interacting with all the other atoms (columns 2-5) and itself (column 1).
Likewise, the first column displays the same numbers, since the matrix is
symmetric. Furthermore, the second row (column) corresponds to oxygen and the
remaining rows (columns) correspond to hydrogen (H). Can you determine which
one is which?
Since the Coulomb Matrix was published in 2012 more sophisticated descriptors
have been developed. However, CM still does a reasonably good job when
comparing molecules with each other. Apart from that, it can be understood
intuitively and is a good introduction to descriptors.
Setup
-----
Instantiating the object that is used to create Coulomb matrices can be done as
follows:
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:lines: 1-11
The constructor takes the following parameters:
.. automethod:: dscribe.descriptors.coulombmatrix.CoulombMatrix.__init__
Creation
--------
After CM has been set up, it may be used on atomic structures with the
:meth:`~.CoulombMatrix.create`-method.
.. literalinclude:: ../../../examples/coulombmatrix.py
:start-after: Creation
:language: python
:lines: 1-15
The call syntax for the create-function is as follows:
.. automethod:: dscribe.descriptors.coulombmatrix.CoulombMatrix.create
Note that if you specify in *n_atoms_max* a lower number than atoms in your
structure it will cause an error. The output will in this case be a flattened
matrix, specifically a numpy array with size #atoms * #atoms. The number of
features may be requested beforehand with the
:meth:`~.MatrixDescriptor.get_number_of_features`-method.
In the case of multiple samples, the creation can also be parallellized by using the
*n_jobs*-parameter. This splits the list of structures into equally sized parts
and spaws a separate process to handle each part.
Examples
--------
The following examples demonstrate usage of the descriptor. These
examples are also available in dscribe/examples/coulombmatrix.py.
No flattening
~~~~~~~~~~~~~
You can control whether the returned array is two-dimensional or
one-dimensional by using the *flatten*-parameter
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: No flattening
:lines: 1-7
No Sorting
~~~~~~~~~~~
By default, CM is sorted by the L2-norm (more on that later). In order to get
the unsorted CM it is necessary to specify the keyword *permutation = "none"*
when setting it up.
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: No sorting
:lines: 1-8
Zero-padding
~~~~~~~~~~~~~
The number of features in CM depends on the size of the system. Since most
machine learning methods require size-consistent inputs it is convenient to
define the maximum number of atoms *n_atoms_max* in a dataset. If the structure
has fewer atoms, the rest of the CM will be zero-padded. One can imagine
non-interacting ghost atoms as place-holders to ensure the same number of atoms
in every system.
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: Zero-padding
:lines: 1-7
Not meant for periodic systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CM was not designed for periodic systems. If you do add periodic boundary
conditions, you will see that it does not change the elements.
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: Not meant for periodic systems
:lines: 1-12
Instead, the :doc:`Sine Matrix ` and the `Ewald Matrix
` have been designed as periodic counterparts to the CM.
Invariance
-----------
A good descriptor should be invariant with respect to translation, rotation and
permutation. No matter how you translate or rotate it or change the indexing of
the atoms (not the atom types!), it will still be the same molecule! The
following lines confirm that this is true for CM.
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: Invariance
:lines: 1-20
Options for permutation
-----------------------
The following snippet introduces the different options for handling permutation
invariance. See :cite:`cm_versions` for more information on these methods.
.. literalinclude:: ../../../examples/coulombmatrix.py
:language: python
:start-after: No sorting
:lines: 1-37
- **sorted_l2 (default)**: Sorts rows and columns by their L2-norm.
- **none**: keeps the order of the rows and columns as the atoms are read from
the ase object.
- **random**: The term random can be misleading at first sight because it does not
scramble the rows and columns completely randomly. The rows and columns are
sorted by their L2-norm after applying Gaussian noise to the norms. The
standard deviation of the noise is determined by the additionally required
*sigma*-parameter. *sigma* determines the standard deviation of the gaussian
distributed noise determining how much the rows and columns of the randomly
sorted matrix are scrambled. Feel free to try different *sigma* values to
see the effect on the ordering. Optionally, you can specify a random *seed*.
*sigma* and *seed* are ignored if *permutation* is other than "random". This
option is useful if you want to augment your dataset, similar to augmented
image datasets where each image gets mirrored, rotated, cropped or otherwise
transformed. You would need to create several instances of the randomly
sorted CM in a loop. The advantage of augmenting data like this over using
completely random CM lies in the lower number of "likely permutations". Rows
and columns of the CM are allowed to flip just so that the feature space
(all possible CM) is smooth but also compact.
- **eigenspectrum**: Only the eigenvalues of the matrix are returned sorted by
their absolute value in descending order. On one hand, it is a more compact
descriptor, but on the other hand, it potentially loses information encoded
in the CM interactions.
.. bibliography:: ../references.bib
:style: unsrt
:filter: docname in docnames