Description
We need to be able to manage all the repositories and to run all the calculations needed for the Global Hazard Model and Global Risk Model with a Command Line Interface.
The idea is to use a single multi-job configuration file in TOML format to describe the workflow. The workflow file will contain references to the job.ini files of the underlying calculations and will have the ability to override parameters.
This is a generally useful feature. For instance, in sensitivity analysis studies, the workflow would consist of calculations which are variations of a base computation with different parameters. In scenarios from ShakeMaps, different calculations in the workflow could refer to different versions of the underlying ShakeMap.
There are cases when a single output requires performing multiple calculations. For instance, computing the full stochastic event set for Africa requires combining four hazard models: North Africa (NAF), Sub-Saharan Africa (SSA), West Africa (WAF) and South Africa (ZAF). In this example, one has to perform four calculations that go conceptually together and then perform a postprocessing step combining the ruptures and producing a single HDF5 file with the result. Or consider the computation of the Global Hazard Map, which requires performing 30+ calculations and several postprocessing steps. Computing the Global Risk Map is even more complex, requiring over 200 calculations.
We can manage all of those cases by introducing a single concept, called workflow. A workflow is a collection of jobs plus a few optional postprocessing steps that can be described with a TOML configuration file.
The TOML file will contain a section with a few global parameters common to all the jobs included in the workflow, and a section for each job with the job-specific parameters, starting from the location of the associated .ini file. Finally, there could be a section describing which postprocessing function to call in case of success.
A simple example configuration could be the following:
[workflow]
calculation_mode = "event_based"
ground_motion_fields = false
number_of_logic_tree_samples = 2000
ses_per_logic_tree_path = 50
minimum_magnitude = 5
[EUR]
ini = "EUR/in/job_vs30.ini"
[MIE]
ini = "MIE/in/job_vs30.ini"
[success]
func = "openquake.engine.postjobs.build_ses"
out_file = "EUR_MIE_ses.hdf5"The workflow parameters will take the precedence over the corresponding parameters in the .ini files, to guarantee uniformity of the parameters across the workflow. The engine has the freedom to process the jobs in parallel, however they are ordered and the calculation IDs are numbered as you would expected (i.e. MIE is immediately after EUR in the file, so its calculation ID is ID(MIE) = ID(EUR) + 1).
For the purpose of risk computations, simple workflows are not enough, and therefore the engine also supports multistep workflows, i.e. workflows of workflows, which are run sequentially one after the other. For instance, generating the SES for both Africa and Middle East could be done as follows:
[multi.workflow]
description = "SES for Africa and Middle East"
calculation_mode = "event_based"
ground_motion_fields = false
number_of_logic_tree_samples = 2000
ses_per_logic_tree_path = 50
minimum_magnitude = 5
[Africa.NAF]
ini = "NAF/in/job_vs30.ini"
[Africa.SSA]
ini = "SSA/in/job_vs30.ini"
[Africa.WAF]
ini = "WAF/in/job_vs30.ini"
[Africa.ZAF]
ini = "ZAF/in/job_vs30.ini"
[Africa.success]
func = "openquake.engine.postjobs.build_ses"
out_file = "Africa_ses.hdf5"
[Middle_East.ARB]
ini = "ARB/in/job_vs30.ini"
[Middle_East.MIE]
ini = "MIR/in/job_vs30.ini"
[Middle_East.success]
func = "openquake.engine.postjobs.build_ses"
out_file = "Middle_East_ses.hdf5"This multistep workflow contains a subworkflow for Africa and one for Middle_East, both with the same global parameters such as minimum_magnitude. The postprocessing step for Africa will generate the SES for Africa and the postprocessing step for Middle_East will generate the SES for Middle East, one after the other. However, inside a subworkflow the engine is free to process the jobs in parallel.
In general, success functions take as input the job IDs and the extra parameters from the .toml file (in this example, out_file); for instance, build_ses is defined more or less as follows:
def build_ses(job_ids, out_file):
    # collect the datastore filename of each job in the workflow
    fnames = [datastore.read(job_id).filename for job_id in job_ids]
    logging.warning(f'Saving {out_file}')
    with hdf5.File(out_file, 'w') as h5:
        base.import_sites_hdf5(h5, fnames)
        base.import_ruptures_hdf5(h5, fnames)
At the database level a workflow record will be a job record with calculation_mode=workflow. It will be associated to an HDF5 file named calc_XXX.hdf5 which internally will store useful information such as which jobs belong to the workflow and their status ("complete", "failed", etc.).
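The internal layout of that file is not fixed by this proposal; as a purely illustrative sketch (using plain h5py, with made-up group and attribute names), the workflow file could track the status of its jobs like this:

# Purely illustrative: one possible way for calc_XXX.hdf5 to record which jobs
# belong to the workflow and their status; not the engine's actual layout.
import h5py

def init_workflow_file(path, job_ids):
    # register the jobs belonging to the workflow, all initially 'created'
    with h5py.File(path, 'a') as h5:
        grp = h5.require_group('workflow')
        for job_id in job_ids:
            grp.attrs[str(job_id)] = 'created'

def set_status(path, job_id, status):
    # e.g. status = 'complete' or 'failed'
    with h5py.File(path, 'a') as h5:
        h5['workflow'].attrs[str(job_id)] = status

def job_statuses(path):
    with h5py.File(path, 'r') as h5:
        return dict(h5['workflow'].attrs)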
If at least one job fails, then the success function will not be called; conversely, if it is called, all the jobs in the workflow completed successfully.
The plan is to automatically remove expired workflows, unless the field relevant is manually set to true. This is important for the purpose of the Global Hazard and Risk models, where hundreds of calculations are expected to run and we need a cleanup procedure removing old calculations (say with an expiration time of 30 days or so). This will be configurable in openquake.cfg. An empty expiration_time will mean that calculations have to be removed manually or they will stay forever.
Since jobs have a foreign key to workflow calculations with ON DELETE CASCADE, removing a workflow will automatically remove all the jobs in that workflow.
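As a toy illustration of the cascade behaviour (the schema below is invented for this example and is not the engine's actual database schema):

# Toy example of ON DELETE CASCADE with sqlite3; the schema is invented for
# this illustration and is not the engine's actual database schema.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite enforces cascades only with this pragma
conn.execute('''CREATE TABLE job (
    id INTEGER PRIMARY KEY,
    calculation_mode TEXT,
    workflow_id INTEGER REFERENCES job(id) ON DELETE CASCADE)''')

# a workflow record is just a job record with calculation_mode='workflow'
conn.execute("INSERT INTO job (id, calculation_mode) VALUES (42, 'workflow')")
conn.executemany(
    "INSERT INTO job (id, calculation_mode, workflow_id) "
    "VALUES (?, 'event_based', 42)", [(1,), (2,), (3,)])

# removing the workflow automatically removes the jobs belonging to it
conn.execute('DELETE FROM job WHERE id = 42')
assert conn.execute('SELECT COUNT(*) FROM job').fetchone()[0] == 0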
Continue functionality
The essential feature that makes the workflow concept really useful is the continue functionality, which allows continuing a failed workflow without having to repeat the successful calculations.
For instance, suppose you have a workflow with 230 calculations, like in the global risk model, and assume that calculation number 229 fails for some reason (say an out-of-memory issue). Then you can fix the problem, for instance by optimising the memory consumption of the engine or by reducing the calculation, and run the workflow file again: this time it will be fast, since the previous 228 calculations will be retrieved and only the last two calculations will be performed. More concretely, you will give the following commands:
$ oq run grm.toml --workflow "Global Risk Model v2025.0"
Running workflow [42]
...
Error: failed jobs
# fix the failing job
$ oq run grm.toml --workflow 42
Running workflow [42]
...
Success!
$ oq db keep 42 # mark workflow 42 as important and delete the other workflows
NB: the commands oq ses and oq mosaic might be removed at the end of the process, since they can be replaced by the new CLI.
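Conceptually, the continue logic boils down to skipping the jobs already marked as complete; the following sketch is purely illustrative and does not reflect the engine's actual implementation or APIs:

# Hypothetical sketch of the continue logic: only the jobs not already marked
# as 'complete' are run again; names and signatures are illustrative only.
def continue_workflow(workflow_id, job_names, statuses, run_job):
    """
    :param job_names: ordered list of the jobs in the workflow
    :param statuses: dict job name -> status read from calc_XXX.hdf5
    :param run_job: callable running a single job, returning True on success
    """
    for name in job_names:
        if statuses.get(name) == 'complete':
            print(f'Reusing previous results for {name}')
            continue
        print(f'Running {name} in workflow {workflow_id}')
        statuses[name] = 'complete' if run_job(name) else 'failed'
    return all(st == 'complete' for st in statuses.values())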