A complete data flow process. It takes as input quality-controlled, fragmentary observational time series of meteorological variables (QC data), and produces as output gap-filled and homogenised time series (Station data) and 3D gridded datasets (Gridded data). The process integrates the following steps:
- Data pre-processing (pp): takes as input the output of the quality control process and performs various tasks to prepare the data for further processing within the LCSC data flow: cropping to the desired analysis period, variable aggregation and transformations, metadata augmentation (manual vs. automatic stations, spatial domain), selection of candidate and auxiliary stations, and computation of distance matrices.
- Gap-filling (gf): takes as input the output of the pre-processing step and fills the gaps (NAs) of the candidate stations, based on the data of other candidate or auxiliary stations.
- Gap-filling validation (gf_valid): generates a cross-validation report of the gap-filling process.
- Homogenisation (hg): takes as input the output of the gap-filling step and performs Alexandersson's (1986) SNHT test to detect and, if requested, correct inhomogeneities in the data series (see the sketch after this list).
- Gridding (gr): performs spatial interpolation on the gap-filled and homogenised station data series to obtain continuous grids of the variable.
- Gridding validation (gr_valid): performs a cross-validation of the spatial interpolation process.
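As a rough illustration of the detection part of the hg step, the sketch below computes the classical single-break SNHT statistic (Alexandersson, 1986) on a standardised candidate-minus-reference series; the actual implementation lives in functions.R and may differ (reference construction, critical values, handling of multiple breaks).

```r
# Illustration only: classical single-shift SNHT statistic.
# The hg step in functions.R may differ in its details.
snht_statistic <- function(q) {
  z <- (q - mean(q, na.rm = TRUE)) / sd(q, na.rm = TRUE)  # standardise the series
  n <- length(z)
  t_k <- sapply(seq_len(n - 1), function(k) {
    z1 <- mean(z[1:k], na.rm = TRUE)        # mean before the candidate break
    z2 <- mean(z[(k + 1):n], na.rm = TRUE)  # mean after the candidate break
    k * z1^2 + (n - k) * z2^2
  })
  list(T_max = max(t_k), break_at = which.max(t_k))
}

# Toy usage: a series with an artificial shift after position 30
set.seed(1)
snht_statistic(c(rnorm(30, mean = 0), rnorm(30, mean = 1)))
```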
The code includes a main.R file, which sources a series of files: a configuration file, config.yml, with the particulars of how to process each target variable; a functions.R file with the functions needed to run the process; and a series of R Markdown <proc>_report.Rmd files, used to produce the global report of each process.
As a result, the script generates a series of .Rdata files, one for each processed variable. Each of these files contains a single object with all the necessary items (station metadata, data matrix, configuration options, etc.). Every process generates a global report, and most also generate individual reports (one file per data series / station).
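For instance, the result of any step can be inspected by loading the corresponding .Rdata file into a fresh R session; the path and element names below are hypothetical and only illustrate the single-object layout described above.

```r
# Hypothetical path: the actual file names depend on the configured output
# directories and on the processed variable.
loaded_name <- load("output/gf/tmax.Rdata")  # load() returns the name of the stored object
obj <- get(loaded_name)
str(obj, max.level = 1)  # e.g. station metadata, data matrix, configuration options
```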
The image below summarises the flow of data and code:
- From the terminal, run `Rscript R/main.R --help` or `Rscript R/main.R -h` to get the following usage instructions:

    usage: main.R [--] [--help] [--trial] [--no_global_rep]
           [--no_indiv_rep] [--opts OPTS] [--procs PROCS] [--verbosity
           VERBOSITY] [--ncores NCORES] var

    Data flow

    positional arguments:
      var                  variable to process: one of `tmax`, `tmin`,
                           `pr`, `hr`, `hr2`, `inso`, `p`, `p2`, `ssrd`,
                           `ws`, `ws2`, `wmax`, `wd`, `maxd`

    flags:
      -h, --help           show this help message and exit
      -t, --trial          Enable Trial Mode
      -n, --no_global_rep  do not produce global report
      --no_indiv_rep       do not produce individual reports

    optional arguments:
      -x, --opts           RDS file containing argument values
      -p, --procs          processes to run: one or more of `pp`, `gf`,
                           `gf_valid`, `hg`, `gr`, `gr_valid`, or all if
                           omitted
      -v, --verbosity      verbosity level. 0: print no messages; > 0:
                           print status messages [default: 1]
      -c, --ncores         number of cores: 1: one (mono-thread); n > 1:
                           multi-thread, use n cores; 0: multi-thread, use
                           all available cores [default: 1]

- The argument `var`, indicating the variable to process, is compulsory.
- If no process is provided with `-p`, all the processes are run in sequential order. One or several processes can be specified with the flag `-p` or `--procs`, e.g. `Rscript R/main.R <var> -p <proc>`, where the process codes are those listed above (`pp`, `gf`, `gf_valid`, `hg`, `gr`, `gr_valid`). If more than one process is specified, the program runs them in the correct order.
- By default, the program prints status messages while it runs. To suppress them, use `-v 0` to set the verbosity level to zero.
- By default, the processes are applied to the complete data set. The flag `-t` or `--trial` enables trial mode, in which only a fraction of the data is used in order to speed up the process.
- By default, the process runs on a single core (mono-thread). Some processes, though (mostly pp and gr), benefit greatly from multi-threading. To enable it, set `-c` to a value greater than one to specify the number of cores to use, or set it to zero to let the system use all available cores (minus one).
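The help text above follows the conventions of the argparser R package (note the automatically added `--opts` argument). The sketch below shows how such an interface can be declared, and how a user-supplied subset of processes can be put back into the canonical order; it is an assumption about how main.R might be organised, not a copy of it, and the report flags are omitted for brevity.

```r
# Sketch only (assumed structure, not the actual main.R): declare a
# command-line interface like the one documented above with argparser,
# then run the requested processes in their canonical order.
library(argparser)

p <- arg_parser("Data flow")
p <- add_argument(p, "var", help = "variable to process")
p <- add_argument(p, "--trial", help = "Enable Trial Mode", flag = TRUE)
p <- add_argument(p, "--procs", help = "processes to run", nargs = Inf)
p <- add_argument(p, "--verbosity", help = "verbosity level", default = 1)
p <- add_argument(p, "--ncores", help = "number of cores", default = 1, short = "-c")
argv <- parse_args(p)

all_procs <- c("pp", "gf", "gf_valid", "hg", "gr", "gr_valid")
procs <- if (all(is.na(argv$procs))) all_procs else all_procs[all_procs %in% argv$procs]
# `procs` now holds the processes to run, always in the canonical order
```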
`config.yml` – Configuration options used in all steps other than the quality control. Named configurations (there is one for each variable used in the PTI) are:
- `tmax`: maximum daily temperature (from AEMET's variable "TMAX").
- `tmin`: minimum daily temperature (from AEMET's variable "TMIN").
- `pr`: daily cumulative precipitation (from AEMET's variable "P").
- `hr`: daily mean relative humidity, average of four synoptic measurements (from AEMET's variables "HU00", "HU07", "HU13", "HU18").
- `hr2`: daily mean relative humidity, average of daily max and min (from AEMET's variables "HUMAX", "HUMIN").
- `tdew`: daily mean dewpoint temperature (from AEMET's variables: NOT IMPLEMENTED).
- `p`: daily mean surface-level atmospheric pressure, average of four synoptic measurements (from AEMET's variables "PRES00", "PRES07", "PRES13", "PRES18").
- `p2`: daily mean surface-level atmospheric pressure, average of daily max and min (from AEMET's variables "PRESMAX", "PRESMIN").
- `ssrd`: daily total global radiation (from AEMET's variable "RGLODIA").
- `ws`: daily average wind speed, average of four synoptic measurements (from AEMET's variables "VEL_00", "VEL_07", "VEL_13", "VEL_18").
- `ws2`: daily average wind speed, from the daily wind travel 07–07 h (from AEMET's variable "REC77").
- `wmax`: highest daily wind gust (from AEMET's variable "R_MAX_VEL").
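The "named configurations" layout suggests that the options of a single variable can be pulled out directly after reading the file, for example with the yaml package as sketched below (config::get() would be an alternative if the file follows the config-package conventions). This is an assumption about how the file is consumed, not taken from functions.R.

```r
# Assumed access pattern; the actual reading code may differ.
library(yaml)
cfg <- read_yaml("config.yml")[["tmax"]]  # options of the `tmax` configuration
str(cfg, max.level = 1)                   # top-level items are described below
```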
For each variable, the configuration file contains the following items:
- `version`: version number (can be used to differentiate between runs with different options)
- `dir`: output directories
  - `qc_dir`: quality control output directory
  - `pp_dir`: pre-processing output directory
  - `pre-hg_dir`: directory for the homogenisation prior to gap-filling (experimental)
  - `gf_dir`: gap-filling output directory
  - `hg_dir`: homogenisation output directory
  - `gr_dir`: gridding output directory
- `period_analysis`:
  - `start`: starting date of the analysis period
  - `end`: ending date of the analysis period
- `var`:
  - `name`: variable's name (internal id)
  - `longname`: variable's long name
  - `AEMET_name`: name of the AEMET variable(s) this variable is based on
  - `scale`: whether the scale of the variable is absolute (e.g., precipitation) or relative (e.g., temperature in Celsius)
  - `unit`: variable's unit
  - `factor`: scaling factor
- `range`: allowed range for the variable
  - `lower`: variable's lower limit (e.g., 0 for precipitation)
  - `upper`: variable's upper limit (e.g., Inf)
  - `outlier_tolerance`: percent tolerance for outliers above or below the climatological maxima or minima
- `cand`: criteria to select candidate series
  - `min_years`: minimum number of years to consider a series as candidate
  - `exclude_auto`: whether to exclude automatic stations
  - `exclude_inactive`: whether to exclude inactive stations
  - `min_stations_per_prov`: minimum number of stations per province (the algorithm will relax other criteria until it matches this minimum, or until no more stations are available)
- `aux`: criteria to select auxiliary series
  - `min_years`: minimum number of years to consider a series as auxiliary
  - `distance`: type of distance matrix to use when selecting the auxiliary series for a given candidate; either "correlation" or "euclidean"
  - `overlap_years`: minimum number of overlapping years to consider an auxiliary series; used only if a correlation distance matrix is used
  - `max_dist_km`: maximum distance allowed for an auxiliary station to be considered
  - `exclude`: list of stations that must be excluded due to problems identified during visual inspection
- `gf`:
  - `method`: which filling method to use (one of "direct", "diff", "ratio", "logratio" or "qmap"); see the toy sketch after this list for the usual meaning of "diff" and "ratio"
  - `seasonal`: whether to use seasonally varying parameters (only applies if the method is one of "ratio", "diff", "direct" or "logratio")
- `hg`:
  - `correction`: type of inhomogeneity correction (one of "annual", "monthly", "daily", or "none" for no correction)
  - `method`: which method should be used to correct inhomogeneities (one of "diff" or "ratio")
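The "diff" and "ratio" keywords used in both the gf and hg blocks are not defined in this document; the toy sketch below shows the usual meaning of such adjustments (an additive offset versus a multiplicative factor, estimated on the overlapping part of two series). It is an illustration only, not the implementation in functions.R.

```r
# Toy illustration of additive ("diff") vs multiplicative ("ratio")
# adjustments between a candidate series and a reference series.
# Not the actual gf/hg implementation.
adjust_series <- function(candidate, reference, method = c("diff", "ratio")) {
  method <- match.arg(method)
  overlap <- !is.na(candidate) & !is.na(reference)
  if (method == "diff") {
    reference + (mean(candidate[overlap]) - mean(reference[overlap]))
  } else {
    reference * (mean(candidate[overlap]) / mean(reference[overlap]))
  }
}

# Fill the gaps of the candidate with the adjusted reference values
fill_gaps <- function(candidate, reference, method = "ratio") {
  est <- adjust_series(candidate, reference, method)
  ifelse(is.na(candidate), est, candidate)
}
```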


