Basic concepts

This tutorial covers the very basic concepts that are needed to get started with DScribe and descriptors in general. Please read through this page before moving on to other tutorials if you have no previous experience or simply want to refresh some core concepts.

DScribe provides methods to transform atomic structures into fixed-size numeric vectors. These vectors are built so that they efficiently summarize the contents of the input structure. Such a transformation is useful for various purposes, e.g.

Input for supervised machine learning models, e.g. regression.

Input for unsupervised machine learning models, e.g. clustering.

Visualizing and analyzing a local chemical environment.

Measuring similarity of structures or local structural sites.

etc.
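
As a rough illustration of the similarity use case, two feature vectors can be compared with a cosine similarity. This is only a minimal sketch in plain Python: the vectors are made-up placeholders rather than real DScribe output, and `cosine_similarity` is a hypothetical helper, not part of DScribe.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Feature vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder feature vectors (illustrative values, not real descriptor output).
features_a = [1.0, 2.0, 0.0, 0.5]
features_b = [2.0, 4.0, 0.0, 1.0]  # points in the same direction as features_a
features_c = [0.0, 0.0, 3.0, 0.0]  # orthogonal to features_a

sim_ab = cosine_similarity(features_a, features_b)
sim_ac = cosine_similarity(features_a, features_c)
```

Because every structure is mapped to a vector of the same length, a comparison like this works regardless of how many atoms the original structures contain.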

You can find more details in our open-access articles:

DScribe: Library of descriptors for machine learning in materials science

Updates to the DScribe library: New descriptors and derivatives

Terminology

structure: An atomic geometry containing the chemical species, atomic positions and optionally a unit cell for periodic structures.

descriptor: A particular method for transforming a structure into a constant-sized vector. There are various options which are suitable for different use cases.

descriptor object: In DScribe there is a single Python class for each descriptor. The object that is instantiated from this class is called a descriptor object.

feature vector: The descriptor objects produce a single one-dimensional vector for each input structure. This is called a feature vector.

feature: A single channel/dimension in the multi-dimensional feature vector produced by a descriptor object for a structure. Each feature is a number that represents a specific structural/chemical property in the structure.
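
To make these terms concrete, here is a deliberately trivial sketch in plain Python. The `ElementCount` class is hypothetical and not part of DScribe; it only mimics the idea of a descriptor object whose `create()` call returns a feature vector. The "structure" is reduced to a list of chemical symbols, and each feature is simply the number of atoms of one species.

```python
from collections import Counter


class ElementCount:
    """Toy descriptor (hypothetical, not part of DScribe)."""

    def __init__(self, species):
        # Fixing the species order up front fixes the feature vector length.
        self.species = sorted(species)

    def create(self, symbols):
        """Return one feature per species: the number of atoms of that species."""
        counts = Counter(symbols)
        unknown = set(counts) - set(self.species)
        if unknown:
            raise ValueError(f"Unexpected species: {sorted(unknown)}")
        return [float(counts[s]) for s in self.species]


# One descriptor object handles every structure in the dataset.
descriptor = ElementCount(species={"H", "C", "O"})

# Structures with different numbers of atoms give vectors of equal length.
water_vector = descriptor.create(["O", "H", "H"])
methane_vector = descriptor.create(["C", "H", "H", "H", "H"])
```

Real descriptors encode far more than composition, but the roles are the same: one descriptor object, one feature vector per structure, and a fixed meaning for each feature.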

Typical workflow

DScribe uses the Atomic Simulation Environment (ASE) to represent and work with atomic structures as it provides convenient ways to read, write, create and manipulate them. The first step is thus to transform your atomic structures into ASE Atoms.

For example:

```
from ase.io import read
from ase.build import molecule
from ase import Atoms

# Let's use ASE to create atomic structures as ase.Atoms objects.
structure1 = read("water.xyz")
structure2 = molecule("H2O")
structure3 = Atoms(symbols=["C", "O"], positions=[[0, 0, 0], [1.128, 0, 0]])
```

Usually the descriptors require some knowledge about the dataset you are analyzing. This means that you will need to gather information about the expected input space of all your analyzed structures. Often simply gathering a list of the chemical species present is enough.

For example:
```
# Let's create a list of structures and gather the chemical elements that
# appear in any of the structures.
structures = [structure1, structure2, structure3]
species = set()
for structure in structures:
    species.update(structure.get_chemical_symbols())
```
Set up the descriptor object. The exact setup depends on the descriptor you use and on your use case. Notice that typically you will want to instantiate only one descriptor object, which will handle all structures in your dataset. You should read our open-access articles or the specific tutorials to understand the meaning of the different settings. For machine learning purposes you may also want to cross-validate the different settings to find the best-performing ones.

For example:
```
# Let's configure the SOAP descriptor.
from dscribe.descriptors import SOAP

soap = SOAP(
    species=species,
    periodic=False,
    r_cut=5,
    n_max=8,
    l_max=8,
    average="outer",
    sparse=False
)
```
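If you want to cross-validate the settings, one possible starting point is to enumerate candidate combinations first. The sketch below uses only the Python standard library; the candidate values are illustrative assumptions, not recommendations, and the scoring of each combination is left to your own cross-validation routine.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommendations.
candidate_values = {
    "r_cut": [4.0, 5.0, 6.0],
    "n_max": [4, 8],
    "l_max": [4, 8],
}

# Build one dictionary of settings per combination of candidate values.
names = list(candidate_values)
settings_grid = [
    dict(zip(names, combination))
    for combination in product(*candidate_values.values())
]
```

Each dictionary in `settings_grid` could then be unpacked into the constructor, e.g. `SOAP(species=species, periodic=False, **settings)`, and the resulting feature vectors scored with the model you are training.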
Call the create() function of the descriptor object on a single Atoms object or a list of them. Optionally provide a number of cores with n_jobs. Note that the computation is parallelized across structures, so you will only see proper scaling once you feed more than one structure to create().
```
# Let's create SOAP feature vectors for each structure
feature_vectors = soap.create(structures, n_jobs=1)
```
The output is either a 2D NumPy array (number of structures \(\times\) number of features) or a sparse.COO array, depending on the sparse setting of your descriptor object. You can store the output for later use. For more information about the sparse format, please see the documentation on sparse formats.
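For dense output, one simple way to store the result is NumPy's own binary format. The following is a minimal sketch: the array is a small stand-in with the same 2D layout (structures along the first axis, features along the second), and the file name is a hypothetical choice.

```python
import os
import tempfile

import numpy as np

# Stand-in for dense descriptor output: 2 structures x 3 features.
feature_vectors = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path ending in .npy works.
    path = os.path.join(tmp_dir, "feature_vectors.npy")
    np.save(path, feature_vectors)

    # Load the array back, e.g. in a later analysis session.
    loaded = np.load(path)
```

Sparse output needs a different storage route, since `np.save` is intended for dense arrays.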
If you are interested in the derivatives with respect to atomic positions, use the derivatives() function. It can also be configured to return the descriptor at the same time which can be faster than calculating the two separately.
```
# Let's create derivatives and feature vectors for each structure
derivatives, feature_vectors = soap.derivatives(
    structures,
    return_descriptor=True,
    n_jobs=1
)
```
For more information on the derivatives, please see the documentation on derivatives.