Sparse output

Many descriptors become very sparse when the chemical space is large or when derivatives are calculated. For this reason, DScribe can create the output in a sparse format, in which only the non-zero entries are stored. This can save a significant amount of both RAM and disk space.
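To make the idea concrete, the following minimal sketch stores a mostly-zero array in coordinate (COO) form by hand, using only numpy. The array shape and values are made up for illustration, and this is not how DScribe builds its output internally. It only shows why keeping the non-zero entries alone can take far less memory.

```python
import numpy as np

# Hypothetical dense array: 1000 x 1000 floats with only three non-zero entries.
dense = np.zeros((1000, 1000))
dense[0, 5] = 1.5
dense[10, 20] = -2.0
dense[999, 999] = 3.0

# Coordinate (COO) form: keep the indices and the values of the non-zero entries.
coords = np.nonzero(dense)  # tuple of index arrays, one per dimension
values = dense[coords]      # the non-zero values, in row-major order

# Compare the memory taken by the two representations.
dense_bytes = dense.nbytes
sparse_bytes = sum(c.nbytes for c in coords) + values.nbytes

# The dense array can be rebuilt exactly from the stored entries.
rebuilt = np.zeros(dense.shape)
rebuilt[coords] = values
```

Here sparse_bytes grows with the number of non-zero entries, whereas dense_bytes grows with the full array size.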

From version 1.0.0 onwards, the sparse output uses the sparse.COO class from the sparse library. The main benefit compared to e.g. the sparse formats provided by scipy is that sparse.COO supports n-dimensional sparse output with a convenient slicing syntax.

Persistence

To save and load the sparse output, use the sparse.save_npz and sparse.load_npz functions from the sparse library. The following example demonstrates this:

import sparse
from ase.build import molecule
from dscribe.descriptors import SOAP

# Let's create SOAP feature vectors for two structures and all positions. If
# the output sizes are the same for each structure, a single 3D array is
# created.
soap = SOAP(
    species=["C", "H", "O"],
    periodic=False,
    r_cut=5,
    n_max=8,
    l_max=8,
    average="off",
    sparse=True
)
soap_features = soap.create([molecule("H2O"), molecule("CO2")])

# Save the output to disk and load it back.
sparse.save_npz("soap.npz", soap_features)
soap_features = sparse.load_npz("soap.npz")

Note

Do not confuse sparse.save_npz/sparse.load_npz with the similarly named functions in scipy.sparse.

Conversion

Many external libraries still support only dense numpy arrays or the 2D sparse matrices from scipy.sparse, mostly because efficient linear algebra routines are implemented for these formats. Whenever you need such a format, you can convert the output provided by DScribe with todense(), tocsr() or tocsc():

dense = soap_features.todense()
csr = soap_features[0, :, :].tocsr()
csc = soap_features[0, :, :].tocsc()

Note

Because scipy.sparse only supports 2D sparse arrays, you can only call tocsr() and tocsc() on 2D slices.