Implement R2D2 Ingest Suite #675

ftgoktas · 2025-12-11T21:14:20Z

Description

This PR introduces a new observation ingestion suite for Swell, enabling automated ingestion of observation and background data into R2D2 v3, similar to the Skylab/Ewok architecture.

Key Features:

Modular YAML configuration: Each observation type (e.g., adt_cryosat2n, adt_sentinel6a) has its own standalone YAML file with retrieval method and metadata.
R2D2 duplicate detection: Automatically checks if observations already exist in R2D2 before ingestion to avoid unnecessary copies.
Dry-run mode: Test ingestion workflows without actually storing files to R2D2.

Usage:
swell create ingest_obs_marine
swell launch <experiment_path>

The suite automatically ingests all observations listed in obs_to_ingest for each cycle point, skipping files already present in R2D2.
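
The per-cycle behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ingest_cycle`, `exists_in_store`, and `store` are hypothetical names, and the two callables stand in for whatever R2D2 lookup and store calls the tasks actually use.

```python
from typing import Callable, Dict, Iterable, List


def ingest_cycle(
    obs_to_ingest: Iterable[str],
    cycle: str,
    exists_in_store: Callable[[str, str], bool],  # hypothetical duplicate check
    store: Callable[[str, str], None],            # hypothetical store call
) -> Dict[str, List[str]]:
    """Ingest each listed observation type for one cycle, skipping duplicates."""
    result: Dict[str, List[str]] = {"stored": [], "skipped": []}
    for obs_type in obs_to_ingest:
        if exists_in_store(obs_type, cycle):
            # Already present for this cycle, so no copy is made.
            result["skipped"].append(obs_type)
            continue
        store(obs_type, cycle)
        result["stored"].append(obs_type)
    return result
```

Passing the lookup and store steps in as callables keeps the loop testable without a live R2D2 instance.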

src/swell/tasks/ingest_background.py

src/swell/tasks/ingest_obs.py

ashiklom

See my comments. TLDR:

Glob logic can be simplified.
Add type hints.
Don't catch generic Exception if we don't immediately re-raise it.
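
As a rough illustration of these three points (not the code under review), a file lookup could use a single `pathlib` glob, carry type hints, and catch only the exception it expects. `find_matching_files` and `read_first_match` are hypothetical names used for this sketch.

```python
from pathlib import Path
from typing import List


def find_matching_files(directory: Path, pattern: str) -> List[Path]:
    """Return the files in `directory` that match `pattern`, sorted by path."""
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory does not exist: {directory}")
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def read_first_match(directory: Path, pattern: str) -> str:
    """Read the first matching file, handling only the 'no match' case."""
    try:
        first = find_matching_files(directory, pattern)[0]
    except IndexError:
        # Only the expected failure is caught, and it is re-raised with context
        # instead of being swallowed by a generic `except Exception`.
        raise FileNotFoundError(
            f"No files match {pattern!r} in {directory}"
        ) from None
    return first.read_text()
```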

mer-a-o · 2025-12-15T18:30:23Z

Thanks @ftgoktas for adding this feature. Some general comments before getting into details:

1. I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else, as dry_run means something else in the context of cycling?
2. I suggest adding observation ingest in a separate PR. How would you add a step that runs the ioda-converter to this task?
3. Please add a clear description of each of the options in the yaml.

ftgoktas · 2025-12-15T19:24:33Z

> Thanks @ftgoktas for adding this feature. Some general comments before getting into details:
>
> 1. I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else, as dry_run means something else in the context of cycling?
> 2. I suggest adding observation ingest in a separate PR. How would you add a step that runs the ioda-converter to this task?
> 3. Please add a clear description of each of the options in the yaml.

1. dry_run mode validates that the files exist and logs what would be ingested, but skips r2d2.store(). It's for testing ingestion logic before committing data to R2D2. It's not a search functionality: it still runs the full task logic (file pattern matching, existence checks, metadata lookup), just without the final storage step. We can rename it to preview_mode or test_ingest if dry_run is confusing in the Cylc context.
2. In Ewok/Skylab's pattern, ingestion is a separate task (storeObservations) that runs after conversion (convertObservations). Swell already has conversion tasks (BufrToIoda), so IngestObs will complete the workflow by handling the R2D2 storage step.
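
The dry_run behavior described above might look like the following sketch. It is not the PR's implementation: `ingest_files` is a hypothetical function, and the `store` callable stands in for the `r2d2.store()` call mentioned above.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def ingest_files(
    files: Iterable[Path],
    store: Callable[[Path], None],  # hypothetical stand-in for r2d2.store()
    dry_run: bool = False,
) -> List[Path]:
    """Validate every file, then store it unless dry_run is set.

    Returns the files that were (or, in a dry run, would be) ingested.
    """
    handled: List[Path] = []
    for path in files:
        # Validation runs in both modes, so a dry run surfaces the same errors.
        if not path.is_file():
            raise FileNotFoundError(f"Missing input file: {path}")
        if dry_run:
            logger.info("[dry run] would ingest %s", path)
        else:
            store(path)
            logger.info("ingested %s", path)
        handled.append(path)
    return handled
```

The only branch that differs between the two modes is the final store call, which matches the "full task logic, minus storage" description.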

ashiklom · 2025-12-15T20:00:43Z

> I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else as dry_run in the context of cycling means something else.

FWIW, dry-run is a pretty common flag in software engineering with a widely understood definition of "print what the command would have done but don't actually do it". E.g., rsync, aws s3 sync, and lots of other commands that do big/many file modifications have a --dry-run option to first check that the files are going to end up where you expect them to before you actually move/copy/delete anything. So my vote would be to keep this name!

In the context of rsync: Can you predict exactly what each of these commands will do? They do different things, but are you 100% sure you know the result?

rsync -av path/to/source path/to/destination
rsync -av path/to/source path/to/destination/
rsync -av path/to/source/ path/to/destination
rsync -av path/to/source/ path/to/destination/

It's very easy to mess up a directory or even to delete files if you accidentally use the wrong combination of trailing slashes. rsync --dry-run is handy because it will print each individual copy operation it would do (and raise any errors/warnings it might encounter) but will not actually do the operations, so you can quickly confirm that you're doing what you intend.

src/swell/configuration/jedi/interfaces/geos_marine/ingest_backgrounds/geos_restart.yaml

src/swell/tasks/ingest_background.py

src/swell/configuration/jedi/interfaces/geos_marine/ingest_observations/adt_cryosat2n.yaml

src/swell/configuration/jedi/interfaces/geos_marine/ingest_observations/adt_sentinel6a.yaml

src/swell/tasks/ingest_obs.py

src/swell/utilities/scripts/search_ingested.py

mer-a-o · 2025-12-16T00:42:58Z

Thanks @ftgoktas and @ashiklom for the clarification. dry-run makes sense now.

mer-a-o · 2025-12-16T00:53:39Z

> 2. In Ewok/Skylab's pattern, ingestion is a separate task (storeObservations) that runs after conversion (convertObservations). Swell already has conversion tasks (BufrToIoda), so IngestObs will complete the workflow by handling the R2D2 storage step.

@ftgoktas, in Skylab, the observation ingest suite includes downloading (or copying) observations, running the ioda converter, which is unique to each observation type (we don't always use the BufrToIoda converter), and then storing to r2d2. In Swell, I don't see the conversion being part of ingest_obs/suite_config.py right now.

Having the conversion task as part of the observation ingest suite would streamline the whole process. We can skip conversion if the files are already in ioda format. I assume you are going to add downloading the files from S3 buckets or FTP later. That would be a good place to add the conversion step to the suite.
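
One way to picture the suite ordering suggested here is the sketch below. Everything in it is an assumption for illustration: `ObsFile`, its `is_ioda` flag, `ingest_observation`, and the three callables are hypothetical and do not correspond to existing Swell or Skylab tasks.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class ObsFile:
    path: Path
    is_ioda: bool  # hypothetical flag: True if the file is already in ioda format


def ingest_observation(
    source: str,
    fetch: Callable[[str], ObsFile],       # download or copy the raw file
    convert: Callable[[ObsFile], ObsFile],  # converter specific to the obs type
    store: Callable[[ObsFile], None],       # final storage step
) -> ObsFile:
    """Fetch, convert only when needed, then store."""
    obs = fetch(source)
    if not obs.is_ioda:
        # Conversion is skipped for files that already arrive in ioda format.
        obs = convert(obs)
    store(obs)
    return obs
```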

ftgoktas added 30 commits July 8, 2025 13:06

Script to setup new r2d2 credentials

2980f48

Create a new swell task to test new r2d2

d502482

Adapt get_observations to new R2D2

29a42d8

Remove exit()

85d7124

Add r2d2 configs (#318)

88fd221

Update swell tasks to new R2D2 (#318)

da1c318

Update r2d2 version of save obs diagnostics #318

56c1c47

Remove unused files #318

0149e4a

Merge remote-tracking branch 'origin/develop' into feature/r2d2_v3

e83db93

Create r2d2 file register script #318

958ec06

Add scripts for manual setup for R2D2 #318

1240447

Clean up files (#318)

7c0be9d

Clean up the files (#318)

f8fc151

Update Python coding norms (#318)

a75bf9c

Fix pycode styles

c2b91d2

Remove redundant lines

f2c4d08

Load R2D2 credentials under TaskBase (#318)

6a614eb

Load credentials under create R2D2 config (#318)

591f570

make R2D2 host/compiler detection support dynamic (#318)

53ded1b

Add docs for credential setup (#318)

44c7b17

Update r2d2_config for cascade (#318)

d035da4

Move credentials under create_task (#318)

bbb953d

Move scripts under utilities (#318)

fa927ae

Fix pylint errors

98cf89d

Merge branch 'develop' of https://github.com/GEOS-ESM/swell into feature/r2d2_v3

18d9e0d

Merge branch 'feature/r2d2_v3' of https://github.com/GEOS-ESM/swell into feature/r2d2_v3

a1828f5

Fix AttributeError when fetching bias correction files (#318)

eafae45

Fix bias correction arguments (#318)

9384fc9

Fix bias correction argument (#318)

d380589

Add file type argument (#318)

0cf308c