@ftgoktas
Member

Description

This PR introduces a new observation ingestion suite for Swell that automates ingestion of observation and background data into R2D2 v3, following an approach similar to the Skylab/Ewok architecture.

Key Features:

  • Modular YAML configuration: Each observation type (e.g., adt_cryosat2n, adt_sentinel6a) has its own standalone YAML file with retrieval method and metadata.
  • R2D2 duplicate detection: Automatically checks if observations already exist in R2D2 before ingestion to avoid unnecessary copies.
  • Dry-run mode: Test ingestion workflows without actually storing files to R2D2.

Usage:
swell create ingest_obs_marine
swell launch <experiment_path>

The suite automatically ingests all observations listed in obs_to_ingest for each cycle point, skipping files already present in R2D2.
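
As a rough illustration of the skip-if-present behavior described above, here is a minimal sketch. It assumes a hypothetical in-memory `FakeStore` in place of R2D2; the `exists` and `store` methods, the `ingest_cycle` helper, and the cycle string are illustrative names, not the real R2D2 or Swell API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for R2D2 (not the real client API)."""

    keys: set[tuple[str, str]] = field(default_factory=set)

    def exists(self, obs_type: str, cycle: str) -> bool:
        return (obs_type, cycle) in self.keys

    def store(self, obs_type: str, cycle: str) -> None:
        self.keys.add((obs_type, cycle))


def ingest_cycle(
    store: FakeStore, obs_to_ingest: list[str], cycle: str
) -> dict[str, list[str]]:
    """Ingest every listed observation type for one cycle point.

    Entries that are already present in the store are skipped.
    """
    result: dict[str, list[str]] = {"stored": [], "skipped": []}
    for obs_type in obs_to_ingest:
        if store.exists(obs_type, cycle):
            result["skipped"].append(obs_type)
            continue
        store.store(obs_type, cycle)
        result["stored"].append(obs_type)
    return result


# One observation type is already present for this (made-up) cycle point.
cycle = "20210701T120000Z"
store = FakeStore(keys={("adt_cryosat2n", cycle)})
summary = ingest_cycle(store, ["adt_cryosat2n", "adt_sentinel6a"], cycle)
print(summary)
```

Running the same cycle a second time would skip both observation types, which is the property that makes re-launching the suite safe.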

Contributor

@ashiklom ashiklom left a comment

See my comments. TLDR:

  • Glob logic can be simplified.
  • Add type hints.
  • Don't catch generic Exception if we don't immediately re-raise it.
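
To make the three review points concrete, here is a small sketch of what they might look like in practice. It is not the code under review: `find_obs_files`, `file_size`, and the file names are hypothetical, and only standard-library calls are used.

```python
from __future__ import annotations

import logging
import tempfile
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def find_obs_files(directory: Path, pattern: str) -> list[Path]:
    """Return the sorted matches for a glob pattern in one directory."""
    return sorted(directory.glob(pattern))


def file_size(path: Path) -> Optional[int]:
    """Return the file size in bytes, or None if the file is missing."""
    try:
        return path.stat().st_size
    except FileNotFoundError:
        # Only the expected failure is handled here; any other
        # exception propagates to the caller instead of being hidden.
        logger.warning("Missing file: %s", path)
        return None


# Demo against a temporary directory with one made-up observation file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "adt_cryosat2n_20210701.nc").write_text("data")
    matches = find_obs_files(root, "adt_*.nc")
    sizes = [file_size(p) for p in matches]
    missing = file_size(root / "does_not_exist.nc")
```

Catching `FileNotFoundError` rather than `Exception` means a permissions problem or a typo in the code still surfaces as a traceback instead of a silent `None`.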

@mer-a-o
Contributor

mer-a-o commented Dec 15, 2025

Thanks @ftgoktas for adding this feature. Some general comments before getting into details:

  • I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else as dry_run in the context of cycling means something else.
  • I suggest adding observation ingest in a separate PR. How would you add running ioda-converter step to this task?
  • Please add a clear description about each of the options in the yaml.

@ftgoktas
Member Author

> Thanks @ftgoktas for adding this feature. Some general comments before getting into details:
>
>   • I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else as dry_run in the context of cycling means something else.
>   • I suggest adding observation ingest in a separate PR. How would you add running ioda-converter step to this task?
>   • Please add a clear description about each of the options in the yaml.

  1. dry_run mode validates that the files exist and logs what would be ingested, but skips r2d2.store(). It's for testing ingestion logic before committing data to R2D2.
    It's not a search functionality: it still runs the full task logic (file pattern matching, existence checks, metadata lookup), just without the final storage step. We can rename it to preview_mode or test_ingest if dry_run is confusing in the Cylc context.

  2. In Ewok/Skylab's pattern, ingestion is a separate task (storeObservations) that runs after conversion (convertObservations). Swell already has conversion tasks (BufrToIoda), so IngestObs will complete the workflow by handling the R2D2 storage step.
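
The dry_run behavior described in point 1 could be sketched roughly as follows. This is an assumption-laden illustration, not the task's actual code: `ingest_files` is a hypothetical helper, the `store` callable stands in for the real storage call, and raising on a missing file is one possible choice among several.

```python
from __future__ import annotations

import logging
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger("ingest_sketch")


def ingest_files(
    paths: list[Path],
    store: Callable[[Path], None],
    dry_run: bool = False,
) -> list[Path]:
    """Validate each file, then store it unless dry_run is set.

    Returns the files that were (or, in dry-run mode, would be) ingested.
    """
    ingested: list[Path] = []
    for path in paths:
        # The existence check runs in both modes.
        if not path.is_file():
            raise FileNotFoundError(f"Observation file not found: {path}")
        if dry_run:
            logger.info("[dry run] would store %s", path)
        else:
            store(path)
        ingested.append(path)
    return ingested


# Demo: a list's append method plays the role of the storage backend.
with tempfile.TemporaryDirectory() as tmp:
    obs_file = Path(tmp) / "adt_sentinel6a.nc"
    obs_file.write_text("placeholder")
    stored: list[Path] = []
    preview = ingest_files([obs_file], stored.append, dry_run=True)
    stored_after_preview = list(stored)  # still empty after the dry run
    ingest_files([obs_file], stored.append)
```

Because the only branch that differs between the two modes is the final storage call, a dry run exercises the same validation path as a real run.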

@ashiklom
Contributor

> I don't understand the point of dry_run. Is it more like a search functionality? If yes, can we change the name to something else as dry_run in the context of cycling means something else.

FWIW, dry-run is a pretty common flag in software engineering with a widely understood definition of "print what the command would have done but don't actually do it". E.g., rsync, aws s3 sync, and lots of other commands that do big/many file modifications have a --dry-run option to first check that the files are going to end up where you expect them to before you actually move/copy/delete anything. So my vote would be to keep this name!

In the context of rsync: Can you predict exactly what each of these commands will do? They do different things, but are you 100% sure you know the result?

rsync -av path/to/source path/to/destination
rsync -av path/to/source path/to/destination/
rsync -av path/to/source/ path/to/destination
rsync -av path/to/source/ path/to/destination/

It's very easy to mess up a directory or even to delete files if you accidentally use the wrong combination of trailing slashes. rsync --dry-run is handy because it will print each individual copy operation it would do (and raise any errors/warnings it might encounter) but will not actually do the operations, so you can quickly confirm that you're doing what you intend.

@mer-a-o
Contributor

mer-a-o commented Dec 16, 2025

Thanks @ftgoktas and @ashiklom for the clarification. dry-run makes sense now.

@mer-a-o
Contributor

mer-a-o commented Dec 16, 2025

> 2. In Ewok/Skylab's pattern, ingestion is a separate task (storeObservations) that runs after conversion (convertObservations). Swell already has conversion tasks (BufrToIoda), so IngestObs will complete the workflow by handling the R2D2 storage step.

@ftgoktas, in Skylab, the observation ingest suite includes downloading (or copying) observations, running the ioda converter, which is unique to each observation type (we don't always use the BufrToIoda converter), and then storing to R2D2. In Swell, I don't see the conversion being part of ingest_obs/suite_config.py right now.

Having the conversion task as part of the observation ingest suite would streamline the whole process. We can skip conversion if the files are already in ioda format. I assume you are going to add downloading the files from S3 buckets or FTP later. That would be a good place to add the conversion step to the suite.
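
One way to picture the suggested flow (fetch, convert only when needed, then store) is the sketch below. The task names, the `ObsFile` class, and the format labels are all hypothetical and do not correspond to existing Swell or Skylab tasks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ObsFile:
    """Hypothetical description of one observation file."""

    name: str
    fmt: str  # illustrative labels such as "ioda" or "bufr"


def plan_tasks(obs: ObsFile) -> list[str]:
    """Return the ordered task names for one observation file.

    Conversion is skipped when the file is already in ioda format.
    """
    tasks = ["fetch"]
    if obs.fmt != "ioda":
        tasks.append("convert")
    tasks.append("store")
    return tasks


already_ioda = plan_tasks(ObsFile(name="adt_cryosat2n", fmt="ioda"))
needs_conversion = plan_tasks(ObsFile(name="some_bufr_obs", fmt="bufr"))
print(already_ioda)
print(needs_conversion)
```

Since each observation type may need a different converter, a real suite would likely map the type to its converter in the per-observation YAML rather than branching on a single format flag.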

@ftgoktas ftgoktas marked this pull request as ready for review January 30, 2026 19:10
Development

Successfully merging this pull request may close these issues.

Implement R2D2 Ingestion Suite