Sparse output
Many of the descriptors become very sparse when using a large chemical space or when calculating derivatives. For this reason, DScribe can create the output in a sparse format, in which only the non-zero entries are stored. This can yield significant savings in both RAM and disk usage.
From version 1.0.0 onwards, the sparse output uses the sparse.COO class from the sparse library. The main benefit compared to e.g. the sparse formats provided by scipy is that sparse.COO supports n-dimensional sparse output with a convenient slicing syntax.
Persistence
To save and load the sparse output, use the sparse.save_npz and sparse.load_npz functions from the sparse library. The following example demonstrates this:
import sparse
from ase.build import molecule
from dscribe.descriptors import SOAP
# Let's create SOAP feature vectors for two structures and all positions. If
# the output sizes are the same for each structure, a single 3D array is
# created.
soap = SOAP(
    species=["C", "H", "O"],
    periodic=False,
    r_cut=5,
    n_max=8,
    l_max=8,
    average="off",
    sparse=True
)
soap_features = soap.create([molecule("H2O"), molecule("CO2")])
# Save the output to disk and load it back.
sparse.save_npz("soap.npz", soap_features)
soap_features = sparse.load_npz("soap.npz")
Note
Do not confuse sparse.save_npz/sparse.load_npz with the similarly named functions in scipy.sparse.
Conversion
Many external libraries still only support either dense numpy arrays or the 2D sparse matrices from scipy.sparse. This is mostly due to the efficient linear algebra routines that are implemented for them. Whenever you need such a format, you can simply convert the output provided by DScribe to the needed format with todense(), tocsr() or tocsc():
dense = soap_features.todense()
csr = soap_features[0, :, :].tocsr()
csc = soap_features[0, :, :].tocsc()
Note
Because scipy.sparse only supports 2D sparse arrays, you can only call the tocsr()/tocsc() functions on 2D slices.