Database¶

For storing information about the workflows, machine learning, DFT simlations and other structure manipulations, MongoDB is used as a permament, shared database. MongoDB is a NoSQL database, with bson dictionaries as documents, which is very close to python documents. If you know python, MongoDB documents are easy to read and query.

Database format¶

In the following, we list the entries that data records need to have in each collection. A data record is assumed to be a python dictionary, and the tables give the keys that are expected to be found therein. The following collections are defined: IDs, simulations, workflows and machine_learning.

Collection: IDs¶

contains the ID values that will be assigned to the newest entries in other collections

_id (int): automatic compulsory internal id for MongoDB
simulations (int): ID to be used by the next new entry of type simulation
workflows (int): ID to be used by the next new entry of type workflows
machine_learning (int): ID to be used by the next new entry of type machine_learning

(notice how the field name matches the collection name)

Collection: simulations¶

_id (int): unique identifier
source_id (int): ID of the parent simulation that originated this, -1 if none
workflow_id (int): ID of workflow when instance was added, -1 if none
wf_sim_id (int): ID of simulation (unique within the workflow this belongs to)
atoms (ATOMS): dictionary with information about the atoms.
nanoclusters (NANOCLUSTER): list of dictionaries with information about the nanocluster(s)
adsorbates (ADSORBATE): list of dictionaries with information about the adsorbate(s)
substrates (SUBSTRATE): list of dictionaries with information about the substrate(s)
operations (list): List of dictionaries, each describing one operation. Always with respect to the parent simulation if applicable
inp (dict): property/value pairs describing the simulation input
output (dict): property/value pairs output by the calculation

For custom types ATOMS, NANOCLUSTER, ADSORBATE and SUBSTRATE see below.

Collection: workflows¶

_id (int): unique identifier
username (str): user who executed the workflow
creation_time (str): time of creation of the workflow
parameters (dict): workflow-specific parameters
name (str): custom name of workflow
workflow_type (str): custom type of workflow

Collection: machine_learning¶

_id (int): unique identifier
workflow_id (int): ID of workflow which the machine learning run was part of
method (str): name of the ML method: krr, nn, …
method_params (dict): Parameters of the method
descriptor (str): name of the descriptor: soap, mbtr, cm, …
descriptor_params (dict): Parameters of the descriptor used
training_set (int[]): list of simulation IDs used for training
validation_set (int[]): list of simulation IDs used in validation. If empty, cross-validation was used.
test_set (int[]): list of simulation IDs used in testing. If empty, only validation was used
prediction_set (int[]): list of simulation IDs used for prediction.
metrics_training (dict): dictionary of (“metric name”: value) on training set key: string = name of the metric value: float = calculated value
metrics_validation (dict): dictionary of (“metric name”: value) on validation set key: string = name of the metric value: float = calculated value
metrics_test (dict): dictionary of (“metric name”: value) on test set key: string = name of the metric value: float = calculated value
output (dict): relevant training output info

Ideally, method name corresponds to a python class/function in the platform, that is initialised with the parameter dictionary given in method_params. Similarly, descriptor name also matches a python class, to be initialised with its own given set of parameters, descriptor_params. The field output is a dictionary with all the useful output values from the calculation.

Custom Type: ATOMS¶

A dictionary for describing atoms in a system, conceptually close to ase.Atoms object:

numbers (int[]): list of atomic numbers as numpy array [N] of ints
positions (float[N,3]): positions as numpy matrix [Nx3] of doubles
constraints (int[N,3]): frozen flags a matrix [Nx3] of int [optional] 1 = frozen, 0 = free
pbc (bool): use periodic boundaries
cell (float[3,3]): matrix 3x3 with cell vectors on the rows
celldisp (float[3,1]): displacement of cell from origin
info (dict): field for additional information related to structure

The order of atoms in this dictionary is the one found in the simulation input file.

Custom Type: ADSORBATE¶

reference_id (int): ID of the simulation to use as reference
atom_ids (int[]): atom indices in the ATOMS dictionary of the simulation record
site_class (str): class of adsorption site: “top”, “bridge”, “hollow”, “4-fold hollow”
site_ids (int[]): list of atom ids (in simulation record) that define the adsorption site

Custom Type: NANOCLUSTER¶

In general, simulation.nanoclusters is a list of dictionaries with this structure.

reference_id (int): ID of the simulation where this cluster was made, -1 if original
atom_ids (int[]): atom indices in the ATOMS dictionary of the simulation record

Custom Type: SUBSTRATE¶

reference_id (int): ID of the parent support simulation, -1 if no parent
atom_ids (int[]): atom indices in the corresponding ATOMS dictionary. See below

Database query examples¶

A few examples how to query that database are given in the gui/ folder on the github repository.