A complete data flow process. It takes as input quality-controlled, fragmentary observational time series of meteorological variables (QC data), and produces as output gap-filled and homogenised time series (Station data) and 3D gridded datasets (Gridded data). The process integrates the following steps:
- Data pre-processing (pp): takes as input the output of the quality control process and performs various tasks to prepare the data for further processing within the LCSC data flow: cropping to the desired analysis period, variable aggregation and transformations, metadata augmentation (manual vs. automatic stations, spatial domain), selection of candidate and auxiliary stations, and computation of distance matrices.
- Gap-filling (gf): takes as input the output of the pre-processing step and fills the gaps (NAs) of the candidate stations, based on the data of other candidate or auxiliary stations.
- Gap-filling validation (gf_valid): generates a cross-validation report of the gap-filling process.
- Homogenisation (hg): takes as input the output of the gap-filling step and performs Alexandersson's (1986) SNHT test to detect and, if requested, correct inhomogeneities in the data series (see the sketch after this list).
- Gridding (gr): performs spatial interpolation on the gap-filled and homogenised station data series to obtain continuous grids of the variable.
- Gridding validation (gr_valid): performs a cross-validation of the spatial interpolation process.
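As a rough illustration of the detection part of the hg step, the sketch below computes the classical single-break SNHT statistic (Alexandersson, 1986) on a standardised candidate-minus-reference series; the actual implementation lives in functions.R and may differ (reference construction, critical values, handling of multiple breaks).

```r
# Illustration only: classical single-shift SNHT statistic.
# The hg step in functions.R may differ in its details.
snht_statistic <- function(q) {
  z <- (q - mean(q, na.rm = TRUE)) / sd(q, na.rm = TRUE)  # standardise the series
  n <- length(z)
  t_k <- sapply(seq_len(n - 1), function(k) {
    z1 <- mean(z[1:k], na.rm = TRUE)        # mean before the candidate break
    z2 <- mean(z[(k + 1):n], na.rm = TRUE)  # mean after the candidate break
    k * z1^2 + (n - k) * z2^2
  })
  list(T_max = max(t_k), break_at = which.max(t_k))
}

# Toy usage: a series with an artificial shift after position 30
set.seed(1)
snht_statistic(c(rnorm(30, mean = 0), rnorm(30, mean = 1)))
```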
The code includes a main.R file, which sources a series of files: a configuration file, config.yml, with the particulars of how to process each target variable; a functions.R file with the functions needed to run the process; and a series of R Markdown <proc>_report.Rmd files, used to produce the global report of each process.
As a result, the script generates a series of .Rdata files, one for each processed variable. Each of these files contains a single object with all the necessary items (station metadata, data matrix, configuration options, etc.). Every process generates a global report, and most also generate individual reports (one file per data series / station).
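For instance, the result of any step can be inspected by loading the corresponding .Rdata file into a fresh R session; the path and element names below are hypothetical and only illustrate the single-object layout described above.

```r
# Hypothetical path: the actual file names depend on the configured output
# directories and on the processed variable.
loaded_name <- load("output/gf/tmax.Rdata")  # load() returns the name of the stored object
obj <- get(loaded_name)
str(obj, max.level = 1)  # e.g. station metadata, data matrix, configuration options
```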
The image below summarises the flow of data and code:
- From the terminal, run `Rscript R/main.R --help` or `Rscript R/main.R -h` to get the following usage instructions:

    usage: main.R [--] [--help] [--trial] [--no_global_rep]
           [--no_indiv_rep] [--opts OPTS] [--procs PROCS] [--verbosity
           VERBOSITY] [--ncores NCORES] var

    Data flow

    positional arguments:
      var                  variable to process: one of `tmax`, `tmin`,
                           `pr`, `hr`, `hr2`, `inso`, `p`, `p2`, `ssrd`,
                           `ws`, `ws2`, `wmax`, `wd`, `maxd`

    flags:
      -h, --help           show this help message and exit
      -t, --trial          Enable Trial Mode
      -n, --no_global_rep  do not produce global report
      --no_indiv_rep       do not produce individual reports

    optional arguments:
      -x, --opts           RDS file containing argument values
      -p, --procs          processes to run: one or more of `pp`, `gf`,
                           `gf_valid`, `hg`, `gr`, `gr_valid`, or all if
                           omitted
      -v, --verbosity      verbosity level. 0: print no messages; > 0:
                           print status messages [default: 1]
      -c, --ncores         number of cores: 1: one (mono-thread); n > 1:
                           multi-thread, use n cores; 0: multi-thread, use
                           all available cores [default: 1]

- The argument `var`, indicating the variable to process, is compulsory.
- If no process is provided with `-p`, all the processes are run in sequential order. One or several processes can be specified with the flag `-p` or `--procs`, e.g. `Rscript R/main.R <var> -p <proc>`, where the process codes are those listed above (`pp`, `gf`, `gf_valid`, `hg`, `gr`, `gr_valid`). If more than one process is specified, the program runs them in the correct order.
- By default, the program prints status messages while it runs. To suppress them, use `-v 0` to set the verbosity level to zero.
- By default, the processes are applied to the complete data set. The flag `-t` or `--trial` enables trial mode, in which only a fraction of the data is used in order to speed up the process.
- By default, the process runs on a single core (mono-thread). Some processes, though (mostly pp and gr), benefit greatly from multi-threading. To enable it, set `-c` to a value greater than one to specify the number of cores to use, or set it to zero to let the system use all available cores (minus one).
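The help text above follows the conventions of the argparser R package (note the automatically added `--opts` argument). The sketch below shows how such an interface can be declared, and how a user-supplied subset of processes can be put back into the canonical order; it is an assumption about how main.R might be organised, not a copy of it, and the report flags are omitted for brevity.

```r
# Sketch only (assumed structure, not the actual main.R): declare a
# command-line interface like the one documented above with argparser,
# then run the requested processes in their canonical order.
library(argparser)

p <- arg_parser("Data flow")
p <- add_argument(p, "var", help = "variable to process")
p <- add_argument(p, "--trial", help = "Enable Trial Mode", flag = TRUE)
p <- add_argument(p, "--procs", help = "processes to run", nargs = Inf)
p <- add_argument(p, "--verbosity", help = "verbosity level", default = 1)
p <- add_argument(p, "--ncores", help = "number of cores", default = 1, short = "-c")
argv <- parse_args(p)

all_procs <- c("pp", "gf", "gf_valid", "hg", "gr", "gr_valid")
procs <- if (all(is.na(argv$procs))) all_procs else all_procs[all_procs %in% argv$procs]
# `procs` now holds the processes to run, always in the canonical order
```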
`config.yml` – Configuration options used in all steps other than the quality control. Named configurations (there is one for each variable used in the PTI) are:
- `tmax`: maximum daily temperature (from AEMET's variable "TMAX").
- `tmin`: minimum daily temperature (from AEMET's variable "TMIN").
- `pr`: daily cumulative precipitation (from AEMET's variable "P").
- `hr`: daily mean relative humidity, average of four synoptic measurements (from AEMET's variables "HU00", "HU07", "HU13", "HU18").
- `hr2`: daily mean relative humidity, average of daily max and min (from AEMET's variables "HUMAX", "HUMIN").
- `tdew`: daily mean dewpoint temperature (from AEMET's variables: NOT IMPLEMENTED).
- `p`: daily mean surface-level atmospheric pressure, average of four synoptic measurements (from AEMET's variables "PRES00", "PRES07", "PRES13", "PRES18").
- `p2`: daily mean surface-level atmospheric pressure, average of daily max and min (from AEMET's variables "PRESMAX", "PRESMIN").
- `ssrd`: daily total global radiation (from AEMET's variable "RGLODIA").
- `ws`: daily average wind speed, average of four synoptic measurements (from AEMET's variables "VEL_00", "VEL_07", "VEL_13", "VEL_18").
- `ws2`: daily average wind speed, from the daily wind travel 07–07 h (from AEMET's variable "REC77").
- `wmax`: highest daily wind gust (from AEMET's variable "R_MAX_VEL").
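The "named configurations" layout suggests that the options of a single variable can be pulled out directly after reading the file, for example with the yaml package as sketched below (config::get() would be an alternative if the file follows the config-package conventions). This is an assumption about how the file is consumed, not taken from functions.R.

```r
# Assumed access pattern; the actual reading code may differ.
library(yaml)
cfg <- read_yaml("config.yml")[["tmax"]]  # options of the `tmax` configuration
str(cfg, max.level = 1)                   # top-level items are described below
```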
For each variable, the configuration file contains the following items:
- `version`: version number (can be used to differentiate between runs with different options)
- `dir`: output directories
  - `qc_dir`: quality control output directory
  - `pp_dir`: pre-processing output directory
  - `pre-hg_dir`: directory for the homogenisation prior to gap-filling (experimental)
  - `gf_dir`: gap-filling output directory
  - `hg_dir`: homogenisation output directory
  - `gr_dir`: gridding output directory
- `period_analysis`:
  - `start`: starting date of the analysis period
  - `end`: ending date of the analysis period
- `var`:
  - `name`: variable's name (internal id)
  - `longname`: variable's long name
  - `AEMET_name`: name of the AEMET variable(s) this variable is based on
  - `scale`: whether the scale of the variable is absolute (e.g., precipitation) or relative (e.g., temperature in Celsius)
  - `unit`: variable's unit
  - `factor`: scaling factor
- `range`: allowed range for the variable
  - `lower`: variable's lower limit (e.g., 0 for precipitation)
  - `upper`: variable's upper limit (e.g., Inf)
  - `outlier_tolerance`: percent tolerance for outliers above or below the climatological maxima or minima
- `cand`: criteria to select candidate series
  - `min_years`: minimum number of years to consider a series as candidate
  - `exclude_auto`: whether to exclude automatic stations
  - `exclude_inactive`: whether to exclude inactive stations
  - `min_stations_per_prov`: minimum number of stations per province (the algorithm will relax other criteria until it matches this minimum, or until no more stations are available)
- `aux`: criteria to select auxiliary series
  - `min_years`: minimum number of years to consider a series as auxiliary
  - `distance`: type of distance matrix to use when selecting the auxiliary series for a given candidate; either "correlation" or "euclidean"
  - `overlap_years`: minimum number of overlapping years to consider an auxiliary series; used only if a correlation distance matrix is used
  - `max_dist_km`: maximum distance allowed for an auxiliary station to be considered
  - `exclude`: list of stations that must be excluded due to problems identified during visual inspection
- `gf`:
  - `method`: which filling method to use (one of "direct", "diff", "ratio", "logratio" or "qmap"); see the toy sketch after this list for the usual meaning of "diff" and "ratio"
  - `seasonal`: whether to use seasonally varying parameters (only applies if the method is one of "ratio", "diff", "direct" or "logratio")
- `hg`:
  - `correction`: type of inhomogeneity correction (one of "annual", "monthly", "daily", or "none" for no correction)
  - `method`: which method should be used to correct inhomogeneities (one of "diff" or "ratio")
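The "diff" and "ratio" keywords used in both the gf and hg blocks are not defined in this document; the toy sketch below shows the usual meaning of such adjustments (an additive offset versus a multiplicative factor, estimated on the overlapping part of two series). It is an illustration only, not the implementation in functions.R.

```r
# Toy illustration of additive ("diff") vs multiplicative ("ratio")
# adjustments between a candidate series and a reference series.
# Not the actual gf/hg implementation.
adjust_series <- function(candidate, reference, method = c("diff", "ratio")) {
  method <- match.arg(method)
  overlap <- !is.na(candidate) & !is.na(reference)
  if (method == "diff") {
    reference + (mean(candidate[overlap]) - mean(reference[overlap]))
  } else {
    reference * (mean(candidate[overlap]) / mean(reference[overlap]))
  }
}

# Fill the gaps of the candidate with the adjusted reference values
fill_gaps <- function(candidate, reference, method = "ratio") {
  est <- adjust_series(candidate, reference, method)
  ifelse(is.na(candidate), est, candidate)
}
```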


