Checklist

Before starting a production run of a workflow, go through this checklist once.

Obligatory

Each of these questions must be answered with yes:

  • Is the launchpad information (username, password) correct in both the workflow submission and the launching script?

  • Are the arguments username and password correct in the workflow function?

  • Does worker_target_path exist on the computing resource?

  • Did the test run with a Lennard-Jones (LJ) potential terminate correctly?

  • Did you switch to the correct DFT template (template_path)?

  • Did you change the database to the production run database? (Workflow argument, e.g. extdb_connect = {"db_name": "ncdb"})

  • Are the walltimes and number of cores in the QueueAdapters of the launching script sensible?
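
Some of the items above can be checked automatically before submission. The following is only a minimal sketch, assuming the settings are collected in a plain dictionary keyed by the argument names used in this checklist. The function name preflight_problems, the dictionary layout, and the test_db_names parameter are hypothetical and not part of the workflow package. The path checks are only meaningful when run on the machine where the paths are supposed to exist (worker_target_path lives on the computing resource).

```python
import os


def preflight_problems(settings, test_db_names=("test",)):
    """Return a list of problems found; an empty list means all checks passed.

    `settings` is a hypothetical dict using the argument names from the checklist.
    """
    problems = []

    # Credentials must at least be present; their correctness cannot be verified offline.
    for key in ("username", "password"):
        if not settings.get(key):
            problems.append(f"{key} is missing or empty")

    # Both paths must exist on the machine where this check is executed.
    for key in ("worker_target_path", "template_path"):
        path = settings.get(key)
        if not path or not os.path.exists(path):
            problems.append(f"{key} does not exist: {path!r}")

    # A database name must be set and must not be a known test database.
    db_name = settings.get("extdb_connect", {}).get("db_name")
    if not db_name:
        problems.append("extdb_connect has no db_name")
    elif db_name in test_db_names:
        problems.append(f"db_name {db_name!r} looks like a test database")

    return problems
```

A submission script could call preflight_problems(settings) and abort if the returned list is not empty. The remaining items (a successful LJ test run, sensible walltimes and core counts) still need to be checked by hand.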

Troubleshooting

If your workflow stops ahead of time, it might have fizzled or been defused.

If it fizzled, an unexpected failure occurred. Investigate the error in the corresponding launch directory. Once you have fixed the issue (e.g. missing file, typo, walltime, bug in the code, …), you can rerun the workflow from a specific firework with

lpad -l <LAUNCHPAD> rerun_fws -i <FIREWORKID>

If the workflow has been defused, there is usually an expected reason for it: a convergence criterion may have been met, or a loop may have become static (no more changes detected), …

In most cases you can choose to continue running it until the next stop point. Use the command

lpad -l <LAUNCHPAD> reignite_wflows -i <WORKFLOWID>

If jobs are marked as RUNNING but have clearly finished, they might be lost. Use lpad -l <LAUNCHPAD> detect_lostruns to look for them. Add either --fizzle or --rerun to mark the lost runs as fizzled or to rerun them, respectively. For more options, use the --help argument.
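
When recovery is scripted, the lpad commands quoted in this section can be assembled programmatically. This is a minimal sketch that only builds the argument lists and does not run anything. The helper names (lpad_argv, rerun_firework, reignite_workflow, detect_lost_runs) and the file name my_launchpad.yaml are hypothetical placeholders.

```python
import shlex


def lpad_argv(launchpad_file, subcommand, *args):
    """Build the argument list for `lpad -l <LAUNCHPAD> <subcommand> ...`."""
    return ["lpad", "-l", str(launchpad_file), subcommand, *map(str, args)]


def rerun_firework(launchpad_file, firework_id):
    # Fizzled workflow: rerun from a specific firework.
    return lpad_argv(launchpad_file, "rerun_fws", "-i", firework_id)


def reignite_workflow(launchpad_file, workflow_id):
    # Defused workflow: continue until the next stop point.
    return lpad_argv(launchpad_file, "reignite_wflows", "-i", workflow_id)


def detect_lost_runs(launchpad_file, action=None):
    # action is None (only look for lost runs), "fizzle", or "rerun".
    if action not in (None, "fizzle", "rerun"):
        raise ValueError(f"unknown action: {action!r}")
    extra = [] if action is None else [f"--{action}"]
    return lpad_argv(launchpad_file, "detect_lostruns", *extra)


if __name__ == "__main__":
    # Print the commands instead of running them; the IDs are made-up examples.
    for argv in (
        rerun_firework("my_launchpad.yaml", 42),
        reignite_workflow("my_launchpad.yaml", 7),
        detect_lost_runs("my_launchpad.yaml", action="fizzle"),
    ):
        print(shlex.join(argv))
```

Returning a list rather than a single string avoids shell quoting problems. To execute a command, pass the list to subprocess.run(argv, check=True).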

If you have found a bug, please open an issue on GitHub. If you have also found a solution to the problem, you are welcome to submit a pull request.