PTI-Clima/data_flow


data_flow

Summary

A complete data flow process. It takes as input quality-controlled, fragmentary observational time series of meteorological variables (QC data) and produces as output gap-filled and homogenised time series (Station data) and 3D gridded datasets (Gridded data). The process integrates the following steps:


  1. Data pre-processing (pp): Taking as input the output of the quality control process, performs various tasks to prepare the data for further processing within the LCSC data flow: crop to the desired analysis period, variable aggregation and transformations, metadata augmentation (manual vs. automatic stations, spatial domain), selection of candidate and auxiliary stations, and computation of distance matrices.

  2. Gap-filling (gf): Taking as input the output of the pre-processing step, fills the gaps (NAs) in the candidate stations' series using data from other candidate or auxiliary stations.

  3. Gap-filling validation (gf_valid): Generates a cross-validation report of the gap-filling process.

  4. Homogenisation (hg): Taking as input the output of the gap-filling process, applies Alexandersson's (1986) standard normal homogeneity test (SNHT) to detect and, if requested, correct inhomogeneities in the data series.

  5. Gridding (gr): Performs spatial interpolation on the gap-filled and homogenised station data series to obtain continuous grids of the variable.

  6. Gridding validation (gr_valid): Performs a cross-validation of the spatial interpolation process.
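To make the detection part of the hg step more concrete, here is a minimal base-R sketch of the single-break SNHT statistic. It is an illustration under assumptions, not the code in functions.R: the function name `snht_statistic` and its return fields are hypothetical, and the input `q` is assumed to be a series of differences (or ratios) between a candidate and a reference series.

```r
# Single-break SNHT statistic (sketch). `q` is assumed to be a series of
# differences or ratios between a candidate and a reference series.
snht_statistic <- function(q) {
  q <- q[!is.na(q)]
  n <- length(q)
  stopifnot(n >= 3, sd(q) > 0)
  z  <- (q - mean(q)) / sd(q)        # standardised series
  k  <- seq_len(n - 1)               # candidate break positions
  cs <- cumsum(z)
  z1 <- cs[k] / k                    # mean of z[1..k]
  z2 <- (cs[n] - cs[k]) / (n - k)    # mean of z[(k+1)..n]
  t_k <- k * z1^2 + (n - k) * z2^2   # test statistic for each position
  list(T0 = max(t_k), break_after = which.max(t_k), T = t_k)
}

# Synthetic series with a shift in the mean after the 30th value.
set.seed(1)
q <- c(rnorm(30, mean = 0), rnorm(30, mean = 2))
res <- snht_statistic(q)
res$break_after   # position after which the break is most likely
res$T0            # compare against a critical value for this n
```

A break is flagged when `T0` exceeds a critical value that depends on the series length and the chosen significance level. How this project chooses that threshold, and how it handles multiple breaks, is not covered by this sketch.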

Details

The code includes a main.R file, which uses a series of files: a configuration file, config.yml, with the particulars of how to process each target variable; a functions file, functions.R, with the functions needed to run the process; and a series of R Markdown files, <proc>_report.Rmd, which are used to produce the global report for each process.

As a result, the script generates a series of .Rdata files, one for each processed variable. Each file contains a single object with all the necessary items (station metadata, data matrix, configuration options, etc.). All processes generate a global report, and most also generate individual reports (one file for each data series / station).
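Since each .Rdata file holds a single object whose name is not given here, one way to inspect an output is to load it into a fresh environment and look up whatever name it was saved under. The sketch below uses a stand-in file so that it runs on its own: the file path, the object name `example_obj` and its fields (`meta`, `data`, `config`) are all made up for illustration and do not reflect the real structure.

```r
# Stand-in for a pipeline output; path, object name and fields are made up.
out_file <- file.path(tempdir(), "tmax_example.Rdata")
example_obj <- list(
  meta   = data.frame(id = c("A", "B")),  # station metadata (illustrative)
  data   = matrix(1:4, nrow = 2),         # data matrix (illustrative)
  config = list(version = 1)              # configuration options (illustrative)
)
save(example_obj, file = out_file)

# Load into a fresh environment so the object's name can be discovered
# without overwriting anything in the workspace.
env <- new.env()
loaded <- load(out_file, envir = env)     # character vector of object names
obj <- get(loaded[1], envir = env)
str(obj, max.level = 1)
```

With a real output file, only the last four lines are needed, with `out_file` pointing at the file of interest.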

The image below summarises the flow of data and code:

Flow of data and code

Use instructions

  1. From the terminal, run Rscript R/main.R --help or Rscript R/main.R -h to receive the following use instructions:
usage: main.R [--] [--help] [--trial] [--no_global_rep]
       [--no_indiv_rep] [--opts OPTS] [--procs PROCS] [--verbosity
       VERBOSITY] [--ncores NCORES] var

Data flow

positional arguments:
  var                  variable to process: one of `tmax`, `tmin`,
                       `pr`, `hr`, `hr2`, `inso`, `p`, `p2`, `ssrd`,
                       `ws`, `ws2`, `wmax`, `wd`, `maxd`

flags:
  -h, --help           show this help message and exit
  -t, --trial          Enable Trial Mode
  -n, --no_global_rep  do not produce global report
  --no_indiv_rep       do not produce individual reports

optional arguments:
  -x, --opts           RDS file containing argument values
  -p, --procs          processes to run: one or more of `pp`, `gf`,
                       `gf_valid`, `hg`, `gr`, `gr_valid`, or all if
                       omitted
  -v, --verbosity      verbosity level. 0: print no messages; > 0:
                       print status messages [default: 1]
  -c, --ncores         number of cores: 1: one (mono-thread); n > 1:
                       multi-thread, use n cores; 0: multi-thread, use
                       all available cores [default: 1]
  1. Argument var, indicating the variable to process, is compulsory.

  2. If no process is specified with -p, all the processes are run in sequential order. One or several processes can be selected with the option -p or --procs, e.g.: Rscript R/main.R <var> -p <proc>, where the process codes are those listed above (pp, gf, gf_valid, hg, gr, gr_valid). If more than one process is specified, the program runs them in the correct order.

  3. By default, the program outputs status messages during the process. To avoid this, use -v 0 to set the verbosity level to zero.

  4. By default, the processes are applied to the complete data set. The flag -t or --trial enables the trial mode, where only a fraction of the data are used in order to speed up the process.

  5. By default, the process runs on a single core (mono-thread). Some processes, though (mostly pp and gr), benefit greatly from multi-threading. To enable it, set -c to a value greater than one to specify the number of cores to use, or set it to zero to let the program use all the available cores (minus one).
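The rules for --procs and --ncores described above can be sketched in a few lines of base R. This is not the logic in main.R: the helper names `resolve_procs` and `resolve_ncores` are hypothetical, and the "all cores minus one" behaviour follows the description in item 5.

```r
# Canonical process order, as listed in the Summary.
all_procs <- c("pp", "gf", "gf_valid", "hg", "gr", "gr_valid")

# Hypothetical helper: put the requested processes in canonical order.
# NULL means "no -p given", i.e. run everything.
resolve_procs <- function(requested = NULL) {
  if (is.null(requested)) return(all_procs)
  unknown <- setdiff(requested, all_procs)
  if (length(unknown) > 0) {
    stop("unknown process: ", paste(unknown, collapse = ", "))
  }
  all_procs[all_procs %in% requested]
}

# Hypothetical helper: translate the --ncores value into a worker count.
resolve_ncores <- function(ncores = 1) {
  if (ncores == 0) {
    available <- parallel::detectCores()    # may return NA on some systems
    if (is.na(available)) available <- 1
    return(max(1, available - 1))           # leave one core free
  }
  ncores
}

resolve_procs(c("gr", "pp"))   # "pp" "gr"
resolve_ncores(4)              # 4
```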

Configuration file

config.yml – Configuration options used in all steps other than the quality control. Named configurations (there is one for each variable used in the PTI) are:

  • tmax: maximum daily temperature (from AEMET's variable "TMAX").
  • tmin: minimum daily temperature (from AEMET's variable "TMIN").
  • pr: daily cumulative precipitation (from AEMET's variable "P").
  • hr: daily mean relative humidity, average of four synoptic measurements (from AEMET's variables "HU00", "HU07", "HU13", "HU18").
  • hr2: daily mean relative humidity, average of daily max and min (from AEMET's variables "HUMAX", "HUMIN").
  • tdew: daily mean dewpoint temperature (from AEMET's variables: NOT IMPLEMENTED).
  • p: daily mean surface level atmospheric pressure, average of four synoptic measurements (from AEMET's variables "PRES00", "PRES07", "PRES13", "PRES18").
  • p2: daily mean surface level atmospheric pressure, average of daily max and min (from AEMET's variables "PRESMAX", "PRESMIN").
  • ssrd: daily total global radiation (from AEMET's variable "RGLODIA").
  • ws: daily average wind speed, average of four synoptic measurements (from AEMET's variables "VEL_00", "VEL_07", "VEL_13", "VEL_18").
  • ws2: daily average wind speed, from daily wind travel 07-07 h (from AEMET's variable "REC77").
  • wmax: highest daily wind gust (from AEMET's variable "R_MAX_VEL").

For each variable, the configuration file contains the following items:

  version: version number (can be used to differentiate between runs with different options)

  dir: output directories
    qc_dir: quality control output directory
    pp_dir: pre-process output directory
    pre-hg_dir: output directory for homogenisation prior to gap-filling (experimental)
    gf_dir: gap-filling output directory
    hg_dir: homogenisation output directory
    gr_dir: gridding output directory

  period_analysis:
    start: starting date of the analysis period
    end: ending date of the analysis period

  var:
    name: variable's name (internal id)
    longname: variable's long name
    AEMET_name: name of the AEMET's variable(s) this variable is based on
    scale: whether the scale of the variable is absolute (e.g., precipitation) or relative (e.g., temperature in Celsius)
    unit: variable's unit
    factor: scaling factor
    range: allowed range for the variable
      lower: variable's lower limit (e.g., 0 for precipitation)
      upper: variable's upper limit (e.g., Inf)
    outlier_tolerance: percent tolerance of outliers above or below climatological maxima or minima

  cand: criteria to select candidate series
    min_years: minimum number of years to consider a series as candidate
    exclude_auto: whether to exclude automatic stations
    exclude_inactive: whether to exclude inactive stations
    min_stations_per_prov: minimum number of stations per province (the algorithm will relax other criteria until it matches this minimum, or no more stations are available)

  aux: criteria for auxiliary series
    min_years: minimum number of years to consider a series as auxiliary
    distance: type of distance matrix to use when selecting the auxiliary series for a given candidate; it can be either "correlation" or "euclidean"
    overlap_years: minimum number of overlapping years to consider an auxiliary series; used only if a correlation distance matrix is used
    max_dist_km: maximum distance allowed for an auxiliary station to be considered
    exclude: list of stations that must be excluded due to problems identified during visual inspection

  gf:
    method: which filling method to use (one of "direct", "diff", "ratio", "logratio" or "qmap")
    seasonal: whether to use seasonally varying parameters (only applies if method is one of "ratio", "diff", "direct" or "logratio")

  hg:
    correction: inhomogeneities correction type (one of "annual", "monthly", "daily", or "none" for no correction)
    method: which method should be used to correct inhomogeneities (one of "diff" or "ratio")
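To illustrate what the "diff" and "ratio" options of gf.method could mean in practice, here is a minimal base-R sketch that fills the gaps of a candidate series from a single auxiliary series. It is an assumption about the general technique, not the repository's implementation: the function `fill_gaps` is hypothetical, it uses one auxiliary station and one set of parameters for the whole series, and it does not cover the "direct", "logratio" or "qmap" methods.

```r
# Fill NAs in a candidate series x from one auxiliary series y.
# Hypothetical helper, not the function used by the gf process.
fill_gaps <- function(x, y, method = c("diff", "ratio")) {
  method <- match.arg(method)
  both <- !is.na(x) & !is.na(y)   # days observed at both stations
  gaps <- is.na(x) & !is.na(y)    # gaps that y can fill
  stopifnot(any(both))
  if (method == "diff") {
    # Additive adjustment: shift y by the mean difference over the overlap.
    x[gaps] <- y[gaps] + mean(x[both] - y[both])
  } else {
    # Multiplicative adjustment: rescale y by the ratio of totals
    # (assumes sum(y[both]) is not zero).
    x[gaps] <- y[gaps] * sum(x[both]) / sum(y[both])
  }
  x
}

x <- c(10, NA, 14, 16, NA)
y <- c( 8, 10, 12, 14, 16)
fill_gaps(x, y, "diff")                       # offset is +2: 10 12 14 16 18
fill_gaps(c(2, NA, 6), c(1, 2, 3), "ratio")   # ratio is 2: 2 4 6
```

An additive adjustment is the natural fit for variables on a relative scale (such as temperature) and a multiplicative one for variables on an absolute scale (such as precipitation). With seasonal set to true, the offset or ratio would presumably be estimated separately for each season or month rather than once for the whole series.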
