.. _database:

Database
========

For storing information about the workflows, machine learning, DFT simulations and other structure manipulations, MongoDB is used as a permanent, shared database. MongoDB is a NoSQL database whose documents are BSON dictionaries, which closely resemble Python dictionaries. If you know Python, MongoDB documents are easy to read and query.

Database format
---------------

In the following, we list the entries that data records need to have in each collection. A data record is assumed to be a Python dictionary, and the tables give the keys that are expected to be found therein.

The following collections are defined: IDs, simulations, workflows and machine_learning.

Collection: IDs
^^^^^^^^^^^^^^^

Contains the ID values that will be assigned to the newest entries in the other collections.

:_id (int): automatic compulsory internal id for MongoDB
:simulations (int): ID to be used by the next new entry of type simulation
:workflows (int): ID to be used by the next new entry of type workflows
:machine_learning (int): ID to be used by the next new entry of type machine_learning

(Notice how each field name matches the corresponding collection name.)

Collection: simulations
^^^^^^^^^^^^^^^^^^^^^^^

:_id (int): unique identifier
:source_id (int): ID of the parent simulation that originated this, -1 if none
:workflow_id (int): ID of workflow when instance was added, -1 if none
:wf_sim_id (int): ID of simulation (unique within the workflow this belongs to)
:atoms (ATOMS): dictionary with information about the atoms
:nanoclusters (NANOCLUSTER): list of dictionaries with information about the nanocluster(s)
:adsorbates (ADSORBATE): list of dictionaries with information about the adsorbate(s)
:substrates (SUBSTRATE): list of dictionaries with information about the substrate(s)
:operations (list): list of dictionaries, each describing one operation, always with respect to the parent simulation if applicable
:inp (dict): property/value pairs describing the simulation input
:output (dict): property/value pairs output by the calculation

For custom types ATOMS, NANOCLUSTER, ADSORBATE and SUBSTRATE see below.

Collection: workflows
^^^^^^^^^^^^^^^^^^^^^

:_id (int): unique identifier
:username (str): user who executed the workflow
:creation_time (str): time of creation of the workflow
:parameters (dict): workflow-specific parameters
:name (str): custom name of workflow
:workflow_type (str): custom type of workflow

Collection: machine_learning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:_id (int): unique identifier
:workflow_id (int): ID of the workflow which the machine learning run was part of
:method (str): name of the ML method: krr, nn, ...
:method_params (dict): parameters of the method
:descriptor (str): name of the descriptor: soap, mbtr, cm, ...
:descriptor_params (dict): parameters of the descriptor used
:training_set (int[]): list of simulation IDs used for training
:validation_set (int[]): list of simulation IDs used in validation. If empty, cross-validation was used.
:test_set (int[]): list of simulation IDs used in testing. If empty, only validation was used.
:prediction_set (int[]): list of simulation IDs used for prediction
:metrics_training (dict): dictionary of ("metric name": value) pairs on the training set; each key (str) is the name of a metric and each value (float) is its calculated value
:metrics_validation (dict): dictionary of ("metric name": value) pairs on the validation set; each key (str) is the name of a metric and each value (float) is its calculated value
:metrics_test (dict): dictionary of ("metric name": value) pairs on the test set; each key (str) is the name of a metric and each value (float) is its calculated value
:output (dict): relevant training output info

Ideally, the method name corresponds to a Python class/function in the platform that is initialised with the parameter dictionary given in method_params. Similarly, the descriptor name also matches a Python class, to be initialised with its own set of parameters, descriptor_params. The field output is a dictionary with all the useful output values from the calculation.

Custom Type: ATOMS
^^^^^^^^^^^^^^^^^^

A dictionary for describing the atoms in a system, conceptually close to an ase.Atoms object:

:numbers (int[]): list of atomic numbers as numpy array [N] of ints
:positions (float[N,3]): positions as numpy matrix [Nx3] of doubles
:constraints (int[N,3]): frozen flags as a matrix [Nx3] of ints [optional]; 1 = frozen, 0 = free
:pbc (bool): use periodic boundaries
:cell (float[3,3]): matrix 3x3 with cell vectors on the rows
:celldisp (float[3,1]): displacement of cell from origin
:info (dict): field for additional information related to structure

The order of atoms in this dictionary is the one found in the simulation input file.

Custom Type: ADSORBATE
^^^^^^^^^^^^^^^^^^^^^^

:reference_id (int): ID of the simulation to use as reference
:atom_ids (int[]): atom indices in the ATOMS dictionary of the simulation record
:site_class (str): class of adsorption site: "top", "bridge", "hollow", "4-fold hollow"
:site_ids (int[]): list of atom ids (in simulation record) that define the adsorption site

Custom Type: NANOCLUSTER
^^^^^^^^^^^^^^^^^^^^^^^^

In general, simulation.nanoclusters is a list of dictionaries with this structure.

:reference_id (int): ID of the simulation where this cluster was made, -1 if original
:atom_ids (int[]): atom indices in the ATOMS dictionary of the simulation record

Custom Type: SUBSTRATE
^^^^^^^^^^^^^^^^^^^^^^

:reference_id (int): ID of the parent support simulation, -1 if no parent
:atom_ids (int[]): atom indices in the corresponding ATOMS dictionary. See below

Database query examples
-----------------------

A few examples of how to query the database are given in the gui/ folder of the GitHub repository.
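
To make the record layout more concrete, the following is a minimal sketch (not code from the platform) that assembles a simulation record as a plain Python dictionary and writes two MongoDB-style filter dictionaries. The helper names next_id and make_simulation, as well as all numerical values, are hypothetical; only the dictionary keys are taken from the field lists in this document. Plain lists are used instead of numpy arrays so that the sketch needs no third-party packages, and no database connection is opened.

```python
# Hypothetical helpers: only the dictionary keys follow the documented format.

def next_id(ids_record, collection):
    """Return the ID for a new entry and advance the counter in the IDs record."""
    new_id = ids_record[collection]
    ids_record[collection] = new_id + 1
    return new_id


def make_simulation(ids_record, atoms, workflow_id=-1, source_id=-1, wf_sim_id=0):
    """Build a record with the keys listed for the simulations collection."""
    return {
        "_id": next_id(ids_record, "simulations"),
        "source_id": source_id,      # -1: no parent simulation
        "workflow_id": workflow_id,  # -1: not part of a workflow
        "wf_sim_id": wf_sim_id,
        "atoms": atoms,
        "nanoclusters": [],
        "adsorbates": [],
        "substrates": [],
        "operations": [],
        "inp": {},
        "output": {},
    }


# In-memory stand-in for the single document of the IDs collection (made-up values).
ids = {"_id": 0, "simulations": 12, "workflows": 3, "machine_learning": 1}

# ATOMS dictionary: two Pt atoms and one H atom (made-up coordinates).
atoms = {
    "numbers": [78, 78, 1],
    "positions": [[0.0, 0.0, 0.0], [2.8, 0.0, 0.0], [1.4, 0.0, 1.0]],
    "constraints": [[1, 1, 1], [1, 1, 1], [0, 0, 0]],  # 1 = frozen, 0 = free
    "pbc": False,
    "cell": [[20.0, 0.0, 0.0], [0.0, 20.0, 0.0], [0.0, 0.0, 20.0]],
    "celldisp": [[0.0], [0.0], [0.0]],
    "info": {},
}

sim = make_simulation(ids, atoms, workflow_id=2)

# NANOCLUSTER entry: atoms 0 and 1, -1 because the cluster is original.
sim["nanoclusters"].append({"reference_id": -1, "atom_ids": [0, 1]})

# ADSORBATE entry: atom 2 on the bridge site defined by atoms 0 and 1.
# The reference_id value 7 is made up for illustration.
sim["adsorbates"].append(
    {"reference_id": 7, "atom_ids": [2], "site_class": "bridge", "site_ids": [0, 1]}
)

# MongoDB-style filters for records shaped like the one above.
by_workflow = {"workflow_id": 2}
bridge_sites = {"adsorbates.site_class": "bridge"}  # dot notation reaches into list entries
```

With a driver such as PyMongo, filter dictionaries like by_workflow and bridge_sites would typically be passed to the find method of the simulations collection. In a shared database the ID counter would also need to be advanced atomically on the server rather than in memory as done here.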