Basic concepts
This tutorial covers the very basic concepts that are needed to get started with DScribe and descriptors in general. Please read through this page before moving on to other tutorials if you have no previous experience or simply want to refresh some core concepts.
DScribe provides methods to transform atomic structures into fixed-size numeric vectors. These vectors are built so that they efficiently summarize the contents of the input structure. Such a transformation is useful for various purposes, for example:
Input for supervised machine learning models, e.g. regression.
Input for unsupervised machine learning models, e.g. clustering.
Visualizing and analyzing a local chemical environment.
Measuring similarity of structures or local structural sites.
etc.
Please read more details in our open-access article: DScribe: Library of descriptors for machine learning in materials science
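To make the similarity use case listed above concrete, here is a minimal sketch that compares feature vectors with cosine similarity. The vectors are hypothetical stand-ins for descriptor output, and cosine similarity is only one of several possible measures; it is not presented here as the method DScribe itself uses.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical feature vectors standing in for descriptor output.
site_a = np.array([1.0, 0.5, 0.0, 2.0])
site_b = np.array([2.0, 1.0, 0.0, 4.0])  # parallel to site_a
site_c = np.array([0.0, 0.0, 3.0, 0.0])  # orthogonal to site_a

print("a vs b:", cosine_similarity(site_a, site_b))
print("a vs c:", cosine_similarity(site_a, site_c))
```

Because every structure is mapped to a vector of the same length, any such vector-based measure can be applied directly to pairs of structures or local sites.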
Terminology
structure: An atomic geometry containing the chemical species, atomic positions and optionally a unit cell for periodic structures.
descriptor: A particular method for transforming a structure into a constant sized vector. There are various options which are suitable for different use cases. DScribe currently provides the following descriptors: Coulomb matrix, Sine matrix, Ewald sum matrix, Atom-centered Symmetry Functions (ACSF), Smooth Overlap of Atomic Positions (SOAP), Many-body Tensor Representation (MBTR) and Local Many-body Tensor Representation (LMBTR).
descriptor object: In DScribe there is a single Python class for each descriptor. The object that is instantiated from this class is called a descriptor object.
feature vector: The descriptor objects produce a single one-dimensional vector for each input structure. This is called a feature vector.
feature: A single channel/dimension in the multi-dimensional feature vector produced by a descriptor object for a structure. Each feature is a number that represents a specific structural/chemical property in the structure.
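The terms above can be illustrated with a small NumPy-only sketch of the textbook Coulomb matrix, with diagonal elements 0.5 Z_i^2.4 and off-diagonal elements Z_i Z_j / |R_i - R_j|. This is a simplified illustration, not DScribe's implementation: the function name `coulomb_matrix_vector`, the `n_atoms_max` parameter and the approximate water coordinates are assumptions made for the example, and units are left as given.

```python
import numpy as np


def coulomb_matrix_vector(atomic_numbers, positions, n_atoms_max):
    """Return a flattened, zero-padded Coulomb matrix (illustrative only)."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(positions, dtype=float)
    n = len(z)
    if n > n_atoms_max:
        raise ValueError("structure has more atoms than n_atoms_max")

    # Pairwise interatomic distances.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)

    # Off-diagonal: Z_i * Z_j / |R_i - R_j|; diagonal: 0.5 * Z_i ** 2.4.
    block = np.zeros((n, n))
    off_diag = ~np.eye(n, dtype=bool)
    block[off_diag] = np.outer(z, z)[off_diag] / dist[off_diag]
    block[np.diag_indices(n)] = 0.5 * z ** 2.4

    # Zero-pad so that every structure yields the same number of features.
    matrix = np.zeros((n_atoms_max, n_atoms_max))
    matrix[:n, :n] = block
    return matrix.ravel()


# Two structures with different atom counts give vectors of equal length.
co = coulomb_matrix_vector([6, 8], [[0, 0, 0], [1.128, 0, 0]], n_atoms_max=3)
water = coulomb_matrix_vector(
    [8, 1, 1],
    [[0, 0, 0.119], [0, 0.763, -0.477], [0, -0.763, -0.477]],  # approximate
    n_atoms_max=3,
)
print(co.shape, water.shape)
```

In this picture the function plays the role of the descriptor, each returned array is a feature vector, and each entry of the array is a feature.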
Typical workflow
DScribe uses the Atomic Simulation Environment (ASE) to represent and work with atomic structures as it provides convenient ways to read, write, create and manipulate them. The first step is thus to transform your atomic structures into ASE Atoms.
For example:
    from ase.io import read
    from ase.build import molecule
    from ase import Atoms

    # Let's use ASE to create atomic structures as ase.Atoms objects.
    structure1 = read("water.xyz")
    structure2 = molecule("H2O")
    structure3 = Atoms(symbols=["C", "O"], positions=[[0, 0, 0], [1.128, 0, 0]])

Usually the descriptors require some knowledge about the dataset you are analyzing. This means that you will need to gather information about the expected input space of all your analyzed structures. Often simply gathering a list of the chemical species present is enough.
For example:
    # Let's create a list of structures and gather the chemical elements that are
    # in all the structures.
    structures = [structure1, structure2, structure3]
    species = set()
    for structure in structures:
        species.update(structure.get_chemical_symbols())

Set up the descriptor object. The exact setup depends on the descriptor used and on your use case. Notice that typically you will want to instantiate only one descriptor object which will handle all structures in your dataset. You should read the original articles or the specific tutorials to understand the meaning of different settings. For machine learning purposes you may also want to cross-validate the different settings to find the best-performing ones.
For example:
    # Let's configure the SOAP descriptor.
    from dscribe.descriptors import SOAP

    soap = SOAP(
        species=species,
        periodic=False,
        rcut=5,
        nmax=8,
        lmax=8,
        average="outer",
        sparse=False
    )

Call the create() function of the descriptor object on a single Atoms object or a list of them. Optionally provide a number of cores to parallelize the work.

    # Let's create SOAP feature vectors for each structure
    feature_vectors = soap.create(structures, n_jobs=1)
The output is a 2D array (number of structures \(\times\) number of features), either a numpy array or a scipy sparse matrix depending on the sparse setting of your descriptor object, that you can store for later use. Have fun!
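Storing the output for later use can be sketched as follows for the dense case. The array below is a hypothetical stand-in for descriptor output, and the file name is arbitrary; for sparse output a sparse-aware routine would be needed instead (for scipy sparse matrices, `scipy.sparse.save_npz` is one option).

```python
import os
import tempfile

import numpy as np

# Stand-in for dense descriptor output: 3 structures x 5 features.
feature_vectors = np.arange(15, dtype=float).reshape(3, 5)
n_structures, n_features = feature_vectors.shape
print(n_structures, "structures,", n_features, "features each")

# Save the array to disk and load it back.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, feature_vectors)
restored = np.load(path)
```

Each row of the restored array corresponds to one input structure, in the same order in which the structures were passed in.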