Database¶
To store information about workflows, machine learning, DFT simulations and other structure manipulations, MongoDB is used as a permanent, shared database. MongoDB is a NoSQL database whose documents are BSON dictionaries, which are very close to Python dictionaries. If you know Python, MongoDB documents are easy to read and query.
Database format¶
In the following, we list the entries that data records need to have in each collection. A data record is assumed to be a Python dictionary, and the lists below give the keys that are expected to be found in it. The following collections are defined: IDs, simulations, workflows and machine_learning.
Collection: IDs¶
Contains the ID values that will be assigned to the next new entries in the other collections.
- _id (int)
automatic compulsory internal id for MongoDB
- simulations (int)
ID to be used by the next new entry of type simulation
- workflows (int)
ID to be used by the next new entry of type workflows
- machine_learning (int)
ID to be used by the next new entry of type machine_learning
(notice how the field name matches the collection name)
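The counters above can be read and advanced in one step. The following is a minimal sketch, assuming the IDs collection holds a single counter document and that `ids_collection` behaves like a pymongo `Collection`; the function name `reserve_next_id` is hypothetical and this is not necessarily how the platform itself allocates IDs.

```python
def reserve_next_id(ids_collection, collection_name):
    """Return the ID for the next new record and advance the stored counter.

    ids_collection is expected to behave like a pymongo Collection.
    Assumption: the IDs collection contains exactly one counter document.
    """
    # $inc adds 1 to the counter field. By default, find_one_and_update
    # returns the document as it was *before* the update, which is the
    # ID the new entry should use.
    previous = ids_collection.find_one_and_update(
        {}, {"$inc": {collection_name: 1}}
    )
    if previous is None:
        raise LookupError("the IDs collection holds no counter document")
    return previous[collection_name]

# Usage (db is a hypothetical pymongo Database handle):
# sim_id = reserve_next_id(db["IDs"], "simulations")
```

Because the field name matches the collection name, the same helper serves simulations, workflows and machine_learning.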
Collection: simulations¶
- _id (int)
unique identifier
- source_id (int)
ID of the parent simulation that originated this, -1 if none
- workflow_id (int)
ID of workflow when instance was added, -1 if none
- wf_sim_id (int)
ID of simulation (unique within the workflow this belongs to)
- atoms (ATOMS)
dictionary with information about the atoms.
- nanoclusters (NANOCLUSTER)
list of dictionaries with information about the nanocluster(s)
- adsorbates (ADSORBATE)
list of dictionaries with information about the adsorbate(s)
- substrates (SUBSTRATE)
list of dictionaries with information about the substrate(s)
- operations (list)
list of dictionaries, each describing one operation, always relative to the parent simulation where applicable
- inp (dict)
property/value pairs describing the simulation input
- output (dict)
property/value pairs output by the calculation
For custom types ATOMS, NANOCLUSTER, ADSORBATE and SUBSTRATE see below.
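As an illustration of the fields listed above, here is a sketch of a simulation record for an isolated hydrogen molecule, together with a small helper that reports missing keys. All values are made up for the example, and the helper `missing_simulation_keys` is hypothetical, not part of the platform.

```python
REQUIRED_SIMULATION_KEYS = {
    "_id", "source_id", "workflow_id", "wf_sim_id", "atoms", "nanoclusters",
    "adsorbates", "substrates", "operations", "inp", "output",
}

def missing_simulation_keys(record):
    """Return the expected keys that are absent from a simulation record."""
    return sorted(REQUIRED_SIMULATION_KEYS - set(record))

# Illustrative record: an isolated H2 molecule with no parent and no workflow.
example_simulation = {
    "_id": 42,
    "source_id": -1,      # no parent simulation
    "workflow_id": -1,    # not added by a workflow
    "wf_sim_id": 0,
    "atoms": {
        "numbers": [1, 1],
        "positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
        "pbc": False,
        "cell": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
        "celldisp": [[0.0], [0.0], [0.0]],
        "info": {},
    },
    "nanoclusters": [],
    "adsorbates": [],
    "substrates": [],
    "operations": [],
    "inp": {},       # keys are not prescribed by the format
    "output": {},    # keys are not prescribed by the format
}

print(missing_simulation_keys(example_simulation))
```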
Collection: workflows¶
- _id (int)
unique identifier
- username (str)
user who executed the workflow
- creation_time (str)
time of creation of the workflow
- parameters (dict)
workflow-specific parameters
- name (str)
custom name of workflow
- workflow_type (str)
custom type of workflow
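A workflow record with these fields could be assembled as follows. This is only a sketch: the helper name `make_workflow_record` is hypothetical, and since the format of creation_time is not specified here, an ISO 8601 UTC string is assumed.

```python
from datetime import datetime, timezone

def make_workflow_record(workflow_id, username, name, workflow_type,
                         parameters=None):
    """Build a dictionary with the keys expected in the workflows collection."""
    return {
        "_id": workflow_id,
        "username": username,
        # Assumption: the time is stored as an ISO 8601 string in UTC.
        "creation_time": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters or {}),
        "name": name,
        "workflow_type": workflow_type,
    }

# Illustrative values only.
record = make_workflow_record(
    3, "alice", "screening run", "adsorption", {"max_structures": 100}
)
```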
Collection: machine_learning¶
- _id (int)
unique identifier
- workflow_id (int)
ID of workflow which the machine learning run was part of
- method (str)
name of the ML method: krr, nn, …
- method_params (dict)
Parameters of the method
- descriptor (str)
name of the descriptor: soap, mbtr, cm, …
- descriptor_params (dict)
Parameters of the descriptor used
- training_set (int[])
list of simulation IDs used for training
- validation_set (int[])
list of simulation IDs used in validation. If empty, cross-validation was used.
- test_set (int[])
list of simulation IDs used in testing. If empty, only validation was used
- prediction_set (int[])
list of simulation IDs used for prediction.
- metrics_training (dict)
dictionary of (“metric name”: value) pairs for the training set; key (str) = name of the metric, value (float) = calculated value
- metrics_validation (dict)
dictionary of (“metric name”: value) pairs for the validation set; key (str) = name of the metric, value (float) = calculated value
- metrics_test (dict)
dictionary of (“metric name”: value) pairs for the test set; key (str) = name of the metric, value (float) = calculated value
- output (dict)
relevant training output info
Ideally, the method name corresponds to a Python class or function in the platform that is initialised with the parameter dictionary given in method_params. Similarly, the descriptor name matches a Python class that is initialised with its own set of parameters, descriptor_params. The output field is a dictionary with all the useful output values from the calculation.
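The name-to-class correspondence described above can be sketched with a simple lookup table. The classes `KernelRidge` and `SoapDescriptor`, their parameters, and the two registries are hypothetical stand-ins for illustration; the platform's actual classes may differ.

```python
class KernelRidge:
    """Hypothetical stand-in for a platform class registered as 'krr'."""
    def __init__(self, alpha=1.0, kernel="rbf"):
        self.alpha = alpha
        self.kernel = kernel

class SoapDescriptor:
    """Hypothetical stand-in for a platform class registered as 'soap'."""
    def __init__(self, rcut=5.0, nmax=8, lmax=6):
        self.rcut = rcut
        self.nmax = nmax
        self.lmax = lmax

# Assumed registries mapping the stored names to Python classes.
METHODS = {"krr": KernelRidge}
DESCRIPTORS = {"soap": SoapDescriptor}

def build_from_record(record):
    """Instantiate the method and descriptor named in a machine_learning record."""
    method = METHODS[record["method"]](**record["method_params"])
    descriptor = DESCRIPTORS[record["descriptor"]](**record["descriptor_params"])
    return method, descriptor

# Only the fields needed for the lookup are shown; values are illustrative.
ml_record = {
    "method": "krr",
    "method_params": {"alpha": 0.1},
    "descriptor": "soap",
    "descriptor_params": {"rcut": 4.0},
}
method, descriptor = build_from_record(ml_record)
```

Storing the parameters as plain dictionaries means a run can be reconstructed from its database record alone, provided the names resolve to the same classes.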
Custom Type: ATOMS¶
A dictionary describing the atoms in a system, conceptually close to an ase.Atoms object:
- numbers (int[])
atomic numbers as a numpy array [N] of ints
- positions (float[N,3])
positions as numpy matrix [Nx3] of doubles
- constraints (int[N,3])
frozen flags as a matrix [Nx3] of ints (optional): 1 = frozen, 0 = free
- pbc (bool)
use periodic boundaries
- cell (float[3,3])
3x3 matrix with the cell vectors as rows
- celldisp (float[3,1])
displacement of cell from origin
- info (dict)
field for additional information related to structure
The order of atoms in this dictionary is the one found in the simulation input file.
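The shape requirements above can be checked before a record is stored. The sketch below assumes the required keys are present and uses nested Python lists for readability; the helper `check_atoms` is hypothetical, and the water geometry is illustrative.

```python
def check_atoms(atoms):
    """Return a list of problems found in an ATOMS dictionary (empty if none)."""
    problems = []
    n = len(atoms["numbers"])

    positions = atoms["positions"]
    if len(positions) != n or any(len(p) != 3 for p in positions):
        problems.append("positions must have shape [N, 3]")

    # constraints is optional; when present it must match positions in shape.
    constraints = atoms.get("constraints")
    if constraints is not None:
        if len(constraints) != n or any(len(row) != 3 for row in constraints):
            problems.append("constraints must have shape [N, 3]")
        elif any(flag not in (0, 1) for row in constraints for flag in row):
            problems.append("constraints must contain only 0 (free) or 1 (frozen)")

    cell = atoms["cell"]
    if len(cell) != 3 or any(len(vector) != 3 for vector in cell):
        problems.append("cell must have shape [3, 3]")

    return problems

# Illustrative water molecule in a cubic cell, with the oxygen atom frozen.
water = {
    "numbers": [8, 1, 1],
    "positions": [[0.0, 0.0, 0.0], [0.76, 0.59, 0.0], [-0.76, 0.59, 0.0]],
    "constraints": [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
    "pbc": True,
    "cell": [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
    "celldisp": [[0.0], [0.0], [0.0]],
    "info": {},
}
print(check_atoms(water))
```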
Custom Type: ADSORBATE¶
- reference_id (int)
ID of the simulation to use as reference
- atom_ids (int[])
atom indices in the ATOMS dictionary of the simulation record
- site_class (str)
class of adsorption site: “top”, “bridge”, “hollow”, “4-fold hollow”
- site_ids (int[])
list of atom ids (in simulation record) that define the adsorption site
Custom Type: NANOCLUSTER¶
In general, simulation.nanoclusters is a list of dictionaries with this structure.
- reference_id (int)
ID of the simulation where this cluster was made, -1 if original
- atom_ids (int[])
atom indices in the ATOMS dictionary of the simulation record
Custom Type: SUBSTRATE¶
- reference_id (int)
ID of the parent support simulation, -1 if no parent
- atom_ids (int[])
atom indices in the corresponding ATOMS dictionary. See below
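The atom_ids and site_ids fields of the custom types point into the ATOMS dictionary of the same simulation record. A minimal sketch of resolving them, assuming the indices are zero-based (this is not stated above) and using a hypothetical helper `select_atoms` with made-up coordinates:

```python
def select_atoms(atoms, atom_ids):
    """Return the atomic numbers and positions of the atoms listed in atom_ids."""
    return {
        "numbers": [atoms["numbers"][i] for i in atom_ids],
        "positions": [list(atoms["positions"][i]) for i in atom_ids],
    }

# Illustrative fragment of a simulation record: CO adsorbed on two Pt atoms.
simulation = {
    "atoms": {
        "numbers": [78, 78, 6, 8],
        "positions": [
            [0.0, 0.0, 0.0],
            [2.8, 0.0, 0.0],
            [1.4, 0.0, 1.5],
            [1.4, 0.0, 2.65],
        ],
    },
    "adsorbates": [
        {
            "reference_id": 7,
            "atom_ids": [2, 3],     # assumption: zero-based indices
            "site_class": "bridge",
            "site_ids": [0, 1],
        }
    ],
}

adsorbate = simulation["adsorbates"][0]
molecule = select_atoms(simulation["atoms"], adsorbate["atom_ids"])
site = select_atoms(simulation["atoms"], adsorbate["site_ids"])
```

The same helper applies to NANOCLUSTER and SUBSTRATE entries, since they carry atom_ids with the same meaning.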
Database query examples¶
A few examples of how to query the database are given in the gui/ folder of the GitHub repository.
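Independently of those examples, the sketch below shows how MongoDB filter dictionaries for the collections described on this page might look. The helper names are hypothetical, and the connection string and database name in the commented usage are placeholders.

```python
def simulations_of_workflow(workflow_id):
    """Filter for all simulations added by a given workflow."""
    return {"workflow_id": workflow_id}

def root_simulations():
    """Filter for simulations that have no parent simulation."""
    return {"source_id": -1}

def ml_runs_trained_on(simulation_id):
    """Filter for machine learning runs whose training set includes a simulation."""
    # Matching a scalar against an array field selects the documents
    # whose array contains that value.
    return {"training_set": simulation_id}

def ml_runs_with(method, descriptor):
    """Filter for machine learning runs using a given method and descriptor."""
    return {"method": method, "descriptor": descriptor}

# Usage with pymongo (placeholder connection details):
# from pymongo import MongoClient
# db = MongoClient("mongodb://localhost:27017")["my_database"]
# for sim in db["simulations"].find(simulations_of_workflow(3)):
#     print(sim["_id"], sim["wf_sim_id"])
# n_runs = db["machine_learning"].count_documents(ml_runs_with("krr", "soap"))
```

Since a filter is an ordinary Python dictionary, several conditions can be combined by merging dictionaries before passing them to find.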