Database

For storing information about the workflows, machine learning, DFT simulations and other structure manipulations, MongoDB is used as a permanent, shared database. MongoDB is a NoSQL database whose documents are BSON dictionaries, which are very close to Python dictionaries. If you know Python, MongoDB documents are easy to read and query.

Database format

In the following, we list the entries that data records need to have in each collection. A data record is assumed to be a Python dictionary, and the tables give the keys that are expected to be found in it. The following collections are defined: IDs, simulations, workflows and machine_learning.

Collection: IDs

contains the ID values that will be assigned to the newest entries in other collections

_id (int)

automatic compulsory internal id for MongoDB

simulations (int)

ID to be used by the next new entry of type simulation

workflows (int)

ID to be used by the next new entry of type workflows

machine_learning (int)

ID to be used by the next new entry of type machine_learning

(notice how the field name matches the collection name)
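As a minimal sketch of how such a counter could be consumed, the function below reserves the next ID for a collection and advances the counter in one step. It assumes that IDs holds a single counter document and that ids_collection behaves like a pymongo collection, whose find_one_and_update method returns the document as it was before the update by default. The function name reserve_next_id is hypothetical and not part of the platform.

```python
def reserve_next_id(ids_collection, kind):
    """Return the ID for the next new entry of type `kind` and advance the counter.

    `ids_collection` is expected to behave like a pymongo Collection for "IDs".
    Assumption: the collection holds exactly one counter document.
    """
    if kind not in ("simulations", "workflows", "machine_learning"):
        raise ValueError(f"unknown collection name: {kind!r}")
    # $inc adds 1 to the counter on the server; by default
    # find_one_and_update returns the document as it was *before* the update,
    # i.e. the value that the new entry should use as its _id.
    previous = ids_collection.find_one_and_update({}, {"$inc": {kind: 1}})
    if previous is None:
        raise LookupError("no counter document found in the IDs collection")
    return previous[kind]
```

Because the field name matches the collection name, the same string can be used both to reserve the ID and to pick the collection that receives the new record.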

Collection: simulations

_id (int)

unique identifier

source_id (int)

ID of the parent simulation that originated this, -1 if none

workflow_id (int)

ID of workflow when instance was added, -1 if none

wf_sim_id (int)

ID of simulation (unique within the workflow this belongs to)

atoms (ATOMS)

dictionary with information about the atoms.

nanoclusters (NANOCLUSTER)

list of dictionaries with information about the nanocluster(s)

adsorbates (ADSORBATE)

list of dictionaries with information about the adsorbate(s)

substrates (SUBSTRATE)

list of dictionaries with information about the substrate(s)

operations (list)

list of dictionaries, each describing one operation, always with respect to the parent simulation if applicable

inp (dict)

property/value pairs describing the simulation input

output (dict)

property/value pairs output by the calculation

For custom types ATOMS, NANOCLUSTER, ADSORBATE and SUBSTRATE see below.
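The keys listed for the simulations collection can be assembled into a record as follows. This is a sketch only: the helper names new_simulation_record and missing_keys are hypothetical, and the empty lists and dictionaries used as initial values are an assumption rather than a documented default.

```python
# Keys expected in every record of the "simulations" collection.
SIMULATION_KEYS = (
    "_id", "source_id", "workflow_id", "wf_sim_id", "atoms",
    "nanoclusters", "adsorbates", "substrates", "operations", "inp", "output",
)


def new_simulation_record(sim_id, wf_sim_id, atoms, source_id=-1, workflow_id=-1):
    """Build a simulation record with all expected keys present."""
    return {
        "_id": sim_id,
        "source_id": source_id,      # -1: no parent simulation
        "workflow_id": workflow_id,  # -1: not added by a workflow
        "wf_sim_id": wf_sim_id,      # unique within the workflow
        "atoms": atoms,              # ATOMS dictionary
        "nanoclusters": [],          # list of NANOCLUSTER dictionaries
        "adsorbates": [],            # list of ADSORBATE dictionaries
        "substrates": [],            # list of SUBSTRATE dictionaries
        "operations": [],            # operations relative to the parent
        "inp": {},                   # simulation input properties
        "output": {},                # properties output by the calculation
    }


def missing_keys(record):
    """Return the expected keys that are absent from `record`, in schema order."""
    return [key for key in SIMULATION_KEYS if key not in record]
```

A check such as missing_keys(record) before insertion is one way to catch incomplete records early, since MongoDB itself does not enforce this layout unless a validator is configured.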

Collection: workflows

_id (int)

unique identifier

username (str)

user who executed the workflow

creation_time (str)

time of creation of the workflow

parameters (dict)

workflow-specific parameters

name (str)

custom name of workflow

workflow_type (str)

custom type of workflow
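A workflow record with these keys might be built as in the sketch below. The helper name new_workflow_record is hypothetical, and storing creation_time as an ISO 8601 string in UTC is an assumption: the schema only states that the field is a string.

```python
from datetime import datetime, timezone


def new_workflow_record(wf_id, username, name, workflow_type,
                        parameters=None, now=None):
    """Build a workflow record; `now` can be passed in for reproducibility."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "_id": wf_id,
        "username": username,
        # Assumption: ISO 8601 text, since the schema only requires a str.
        "creation_time": now.isoformat(timespec="seconds"),
        "parameters": dict(parameters or {}),
        "name": name,
        "workflow_type": workflow_type,
    }
```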

Collection: machine_learning

_id (int)

unique identifier

workflow_id (int)

ID of workflow which the machine learning run was part of

method (str)

name of the ML method: krr, nn, …

method_params (dict)

Parameters of the method

descriptor (str)

name of the descriptor: soap, mbtr, cm, …

descriptor_params (dict)

Parameters of the descriptor used

training_set (int[])

list of simulation IDs used for training

validation_set (int[])

list of simulation IDs used in validation. If empty, cross-validation was used.

test_set (int[])

list of simulation IDs used in testing. If empty, only validation was used

prediction_set (int[])

list of simulation IDs used for prediction.

metrics_training (dict)

dictionary of (“metric name”: value) pairs on the training set; key (str) = name of the metric, value (float) = calculated value

metrics_validation (dict)

dictionary of (“metric name”: value) pairs on the validation set; key (str) = name of the metric, value (float) = calculated value

metrics_test (dict)

dictionary of (“metric name”: value) pairs on the test set; key (str) = name of the metric, value (float) = calculated value

output (dict)

relevant training output info

Ideally, the method name corresponds to a Python class or function in the platform that is initialised with the parameter dictionary given in method_params. Similarly, the descriptor name matches a Python class that is initialised with its own set of parameters, descriptor_params. The output field is a dictionary with all the useful output values from the calculation.
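The name-to-class correspondence described above can be sketched with a small lookup table. Everything in this example is hypothetical: the classes KernelRidge and Soap, their parameters, and the registries METHODS and DESCRIPTORS are stand-ins for illustration and do not reflect the platform's actual classes.

```python
class KernelRidge:
    """Hypothetical stand-in for a platform ML method class."""

    def __init__(self, alpha=1.0, kernel="rbf"):
        self.alpha = alpha
        self.kernel = kernel


class Soap:
    """Hypothetical stand-in for a platform descriptor class."""

    def __init__(self, rcut=5.0, nmax=8, lmax=6):
        self.rcut = rcut
        self.nmax = nmax
        self.lmax = lmax


# Map the strings stored in the database to the classes they stand for.
METHODS = {"krr": KernelRidge}
DESCRIPTORS = {"soap": Soap}


def build_from_record(ml_record):
    """Instantiate the method and descriptor named in a machine_learning record."""
    method_cls = METHODS[ml_record["method"]]
    descriptor_cls = DESCRIPTORS[ml_record["descriptor"]]
    # The stored parameter dictionaries are passed as keyword arguments.
    method = method_cls(**ml_record["method_params"])
    descriptor = descriptor_cls(**ml_record["descriptor_params"])
    return method, descriptor
```

With this layout, a stored record is enough to recreate the objects used in a run, provided the parameter names in the record match the constructor arguments.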

Custom Type: ATOMS

A dictionary describing the atoms in a system, conceptually close to an ase.Atoms object:

numbers (int[])

list of atomic numbers as numpy array [N] of ints

positions (float[N,3])

positions as numpy matrix [Nx3] of doubles

constraints (int[N,3])

frozen flags as a matrix [Nx3] of ints (optional): 1 = frozen, 0 = free

pbc (bool)

use periodic boundaries

cell (float[3,3])

matrix 3x3 with cell vectors on the rows

celldisp (float[3,1])

displacement of cell from origin

info (dict)

field for additional information related to structure

The order of atoms in this dictionary is the one found in the simulation input file.
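The sketch below builds an ATOMS dictionary and checks the array shapes given above. The helper name make_atoms_dict is hypothetical. Plain nested lists are used here as an assumption, so that the example needs no extra packages; numpy arrays can be converted to this form with their tolist() method.

```python
def make_atoms_dict(numbers, positions, cell, pbc=False, constraints=None,
                    celldisp=(0.0, 0.0, 0.0), info=None):
    """Build an ATOMS dictionary from plain sequences, validating shapes."""
    n = len(numbers)
    if len(positions) != n or any(len(p) != 3 for p in positions):
        raise ValueError("positions must have shape [N, 3]")
    if len(cell) != 3 or any(len(v) != 3 for v in cell):
        raise ValueError("cell must have shape [3, 3]")
    if len(celldisp) != 3:
        raise ValueError("celldisp must have three components")

    atoms = {
        "numbers": [int(z) for z in numbers],
        "positions": [[float(x) for x in p] for p in positions],
        "pbc": bool(pbc),
        "cell": [[float(x) for x in v] for v in cell],  # cell vectors on the rows
        "celldisp": [[float(x)] for x in celldisp],     # stored as shape [3, 1]
        "info": dict(info or {}),
    }
    # constraints is optional: the key is only added when flags are given.
    if constraints is not None:
        if len(constraints) != n or any(len(c) != 3 for c in constraints):
            raise ValueError("constraints must have shape [N, 3]")
        atoms["constraints"] = [[int(f) for f in c] for c in constraints]
    return atoms
```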

Custom Type: ADSORBATE

reference_id (int)

ID of the simulation to use as reference

atom_ids (int[])

atom indices in the ATOMS dictionary of the simulation record

site_class (str)

class of adsorption site: “top”, “bridge”, “hollow”, “4-fold hollow”

site_ids (int[])

list of atom ids (in simulation record) that define the adsorption site

Custom Type: NANOCLUSTER

In general, simulation.nanoclusters is a list of dictionaries with this structure.

reference_id (int)

ID of the simulation where this cluster was made, -1 if original

atom_ids (int[])

atom indices in the ATOMS dictionary of the simulation record

Custom Type: SUBSTRATE

reference_id (int)

ID of the parent support simulation, -1 if no parent

atom_ids (int[])

atom indices in the ATOMS dictionary of the simulation record
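The atom_ids and site_ids fields of the ADSORBATE, NANOCLUSTER and SUBSTRATE types all index into the ATOMS dictionary of the simulation record. A minimal sketch of resolving them is given below; the helper name select_atoms is hypothetical, and the indices are assumed to be zero-based, as in Python lists.

```python
def select_atoms(atoms, indices):
    """Return the atomic numbers and positions of the atoms at `indices`.

    `atoms` is an ATOMS dictionary; `indices` is e.g. the atom_ids or
    site_ids list of an ADSORBATE, NANOCLUSTER or SUBSTRATE dictionary.
    Assumption: indices are zero-based.
    """
    return {
        "numbers": [atoms["numbers"][i] for i in indices],
        "positions": [atoms["positions"][i] for i in indices],
    }
```

For a simulation record sim, the atoms of its first adsorbate would then be select_atoms(sim["atoms"], sim["adsorbates"][0]["atom_ids"]), and the atoms defining its adsorption site are obtained by passing site_ids instead.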

Database query examples

A few examples of how to query the database are given in the gui/ folder of the GitHub repository.
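Independently of those examples, the sketch below shows what query filters for the collections described on this page could look like. MongoDB filters are plain Python dictionaries, and dot notation reaches into nested dictionaries and lists of dictionaries. The function names are hypothetical; run_query assumes an object that behaves like a pymongo collection, whose find method returns a cursor supporting limit.

```python
def root_simulations():
    """Filter for simulations without a parent simulation."""
    return {"source_id": -1}


def simulations_of_workflow(workflow_id):
    """Filter for all simulations added by one workflow."""
    return {"workflow_id": workflow_id}


def simulations_with_site_class(site_class):
    """Filter for simulations with at least one adsorbate on the given site class."""
    # Dot notation matches if any dictionary in the adsorbates list qualifies.
    return {"adsorbates.site_class": site_class}


def ml_runs_using(method, descriptor):
    """Filter for machine_learning records with a given method and descriptor."""
    return {"method": method, "descriptor": descriptor}


def run_query(collection, query, limit=10):
    """Run `query` on a pymongo-like collection and return up to `limit` records."""
    return list(collection.find(query).limit(limit))
```

For example, run_query(db["simulations"], simulations_with_site_class("top")) would return up to ten simulation records, where db is an assumed handle to the database.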