@PGijsbers (Contributor) commented Dec 19, 2025

I played myself. Applying my commits to main instead and then closing the dev branch.

Summary by Sourcery

Align routing, URLs, and flow schemas with the legacy PHP API and make the Docker/CI setup spin up a complete test stack for Python and PHP services.

New Features:

  • Introduce configurable routing settings (server and MinIO URLs) and use them across dataset, parquet, and task URL formatting.
  • Add a database setup service and Elasticsearch service to docker-compose to bootstrap the test database and search stack.

Bug Fixes:

  • Fix flow subflow schema and API response structure to match the PHP API by inlining subflows instead of wrapping them with identifiers.
  • Correct dataset status and ID constants and tests to reflect actual database state and status transitions.
  • Normalize URL expectations in migration tests to account for port-less HTTP URLs and updated internal service hostnames.
  • Use the routing configuration for task base URLs instead of a hard-coded test URL.

Enhancements:

  • Refactor configuration loading into a shared helper and separate database and routing configuration accessors.
  • Simplify flow subflow retrieval by returning child flows directly from the database query.
  • Relax JsonLDGraph context typing to avoid HttpUrl type mismatches in JSON-LD serialization.

Build:

  • Update docker-compose to use versioned images, service profiles, health checks, and a database initialization script, and to standardize container names and environment configuration for PHP and Python APIs.
  • Add a database update script to rewrite file URLs to MinIO, sync dataset URLs, ensure required tables exist, and adjust dataset statuses for tests.
  • Tidy Docker docs by removing stale references to Elasticsearch8 prebuilt images and fixing the multi-platform build docs URL.
  • Document MySQL CLI connection parameters for the local test database container.

CI:

  • Update GitHub Actions to use newer checkout/setup-python actions, compose profiles, and renamed containers while running coverage in the Python API container and uploading to Codecov.

Documentation:

  • Clarify Docker and MySQL usage documentation and remove outdated instructions for building legacy Elasticsearch images.

Tests:

  • Adjust tests and migration checks to the new routing configuration, URL formats, flow subflow shape, dataset statuses, and PHP API base URL, including shared constants and fixtures.

Summary by CodeRabbit

  • New Features

    • Added Elasticsearch service integration support
  • Refactor

    • Updated API endpoint URLs configuration for consistency across services
    • Improved Docker deployment with health checks and service profiles
    • Enhanced configuration loading with environment variable overrides
  • Tests

    • Updated test data and fixtures to align with new endpoint configuration
  • Documentation

    • Updated Docker setup documentation and examples


We currently still maintain the relevant definition files in this repository so that they can change independently for a little while, while the server is under most active development. We can then consider which changes should be merged to services to reduce duplication again.
@coderabbitai bot commented Dec 19, 2025

Walkthrough

The PR updates the CI/CD pipeline to use GitHub Actions v6, refactors Docker orchestration with service profiles and healthchecks, introduces environment-driven configuration for routing URLs, restructures flow subflows from dict-based to direct Flow objects, and synchronizes test data with updated service URLs and database states.

Changes

Cohort / File(s) Summary
CI/CD and GitHub Actions
.github/workflows/tests.yml
Upgraded actions/checkout and actions/setup-python to v6; replaced sequential docker-compose steps with profile-based services (python, php) using docker compose up with detach/wait flags; updated container targets for test execution.
Docker Infrastructure
docker-compose.yaml
Added database healthcheck and database-setup service with schema initialization; introduced elasticsearch service with healthcheck and resource limits; applied profiles ("python", "php", "all") to api services; updated php-api image to v1.2.2 and python-api container name.
Docker Configuration & Setup
docker/database/update.sh, docker/elasticsearch/.env, docker/mysql/README.md, docker/php/.env
New database update script with MySQL operations for file path synchronization and table creation; new elasticsearch environment configuration (single-node, security disabled); PHP environment variables for API, database, and Elasticsearch connectivity; MySQL client connection documentation.
Docker Documentation
docker/readme.md
Removed elasticsearch8-prebuilt and obsolete php-api references; updated multi-platform build instructions with docker buildx and optional Hub push option.
Configuration System
src/config.py, src/config.toml
Added CONFIG_PATH constant, _load_configuration helper, load_routing_configuration, and enhanced load_configuration; updated load_database_configuration to use dotenv and environment variable overrides; introduced [routing] section in config with minio_url and server_url.
API Routing & Formatting
src/core/formatting.py, src/routers/openml/tasks.py
Replaced hardcoded URLs with dynamic resolution from load_routing_configuration; updated URL construction logic for parquet and dataset endpoints; refactored base URL handling in task template substitution.
Flow Schema & API
src/schemas/flows.py, src/routers/openml/flows.py
Changed Flow.subflows from list[Subflow] to list[Self]; removed Subflow TypedDict; updated get_flow to build subflows as flat Flow object list; changed get_subflows parameter from for_flow (keyword) to flow_id (positional).
Main & Misc Code
src/main.py, src/schemas/datasets/mldcat_ap.py
Assigned add_argument return values to dummy variable in argparse setup; added type: ignore[arg-type] to JsonLDGraph.context field.
Test Fixtures & Constants
tests/conftest.py, tests/constants.py
Added PHP_API_URL module constant; refactored php_api fixture to return Client directly with configurable base_url; updated IN_PREPARATION_ID from {1, 33} to {33} and DEACTIVATED_DATASETS from {2, 131} to {131}.
Test Data & Assertions
tests/routers/openml/datasets_test.py, tests/routers/openml/flows_test.py, tests/routers/openml/task_test.py, tests/routers/openml/migration/datasets_migration_test.py, tests/routers/openml/migration/flows_migration_test.py
Updated expected URLs to reflect php-api and minio service endpoints; changed dataset status to "active" and IDs to use constants; flattened subflow structure removing nested "identifier" and "flow" keys; refactored flow migration subflow recursion to direct in-place calls with post-processing normalization; normalized port in URLs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Docker and configuration system refactoring: Verify dotenv integration in config loading, database healthcheck logic, and service profile dependencies.
  • Flow/subflows structural change: Confirm that the schema change from dict-based to direct Flow object subflows is consistent across routers, tests, and serialization.
  • URL routing migration: Validate that hardcoded URLs are fully replaced with config-driven values and environment fallbacks work correctly.
  • Test data synchronization: Review test assertions for updated database IDs and service URLs to ensure they align with docker-compose profiles and fixtures.
  • Migration test logic in flows_migration_test.py: The refactored subflow recursion and component field normalization require careful review of the control flow change.

Poem

🐰 Our services now dance in profiles so neat,
Routing configuration, a synchronous feat,
Subflows flattened like clover-leaf beds,
URLs dynamic—no hardcoding dread!
Tests realigned with the docker compose crew. 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 4.55%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check ❓ Inconclusive: The title 'Move extra dev commits into main' is vague and does not clearly describe the substantive changes in this PR, which include infrastructure updates, configuration refactoring, API modifications, and test adjustments across numerous files. Provide a more specific title that highlights the main change, such as 'Update GitHub Actions, Docker configuration, and routing setup' or similar based on the primary objective.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

@sourcery-ai bot commented Dec 19, 2025

Reviewer's Guide

Aligns the Python REST API’s routing, flow/task schemas, and tests with the PHP API and new Docker/CI stack by centralizing routing configuration, simplifying flow subflow representation, updating dataset/task URL generation, and modernizing docker-compose and GitHub Actions workflows.

Sequence diagram for recursive flow retrieval with inlined subflows

sequenceDiagram
  actor Client
  participant PythonAPI as Python_API
  participant DB as Database

  Client->>PythonAPI: GET /flows/{flow_id}
  activate PythonAPI
  PythonAPI->>DB: get_flow(flow_id)
  DB-->>PythonAPI: FlowRow
  PythonAPI->>DB: get_subflows(flow_id)
  DB-->>PythonAPI: SubflowRows

  loop For each subflow in SubflowRows
    PythonAPI->>PythonAPI: get_flow(child_id)
    note over PythonAPI: Recursive call building Flow
  end

  PythonAPI-->>Client: Flow (subflows: list[Flow])
  deactivate PythonAPI

Updated class diagram for config and flow schemas

classDiagram
  direction LR

  class ConfigModule {
    +CONFIG_PATH : Path
    +_load_configuration(file Path) TomlTable
    +load_routing_configuration(file Path) TomlTable
    +load_database_configuration(file Path) TomlTable
  }

  class Flow {
    +int id_
    +int uploader
    +datetime created
    +str name
    +str full_name
    +str external_version
    +str description
    +str model
    +str[] components
    +str[] parameters_instructions
    +str[] parameters_meta_info
    +str binary_url
    +str binary_format
    +str library
    +str language
    +str dependencies
    +Parameter[] parameter
    +Flow[] subflows
    +str[] tag
  }

  class Parameter {
    +str name
    +str data_type
    +str default_value
    +bool is_required
  }

  class JsonLDGraph {
    +str | dict~str, HttpUrl~ context
    +list~Distribution | DataService | Dataset | Quality | Feature | Agent | MD5Checksum~ graph
  }

  class Subflow {
    <<removed>>
    +str identifier
    +Flow flow
  }

  ConfigModule <.. Flow : used_by
  ConfigModule <.. JsonLDGraph : type_config
  Flow "1" o-- "*" Parameter : has
  Flow "1" o-- "*" Flow : subflows

File-Level Changes

Centralize configuration loading and introduce routing configuration used by URL and task formatting.
  • Extract shared _load_configuration helper and CONFIG_PATH constant in src/config.py
  • Add load_routing_configuration to expose [routing] section from config.toml
  • Update load_database_configuration to use shared loader
  • Extend src/config.toml with [routing] section containing minio_url and server_url values
src/config.py
src/config.toml
Align dataset and task URL formatting and related tests with the new routing (PHP API and MinIO).
  • Change parquet URL and dataset URL formatting to use routing configuration and updated base URLs
  • Switch dataset test expectations to http://php-api and http://minio URLs and adjust expected status to active
  • Update task test to expect PHP API-based data_splits_url
  • Normalize HttpURL serialization behavior in migration tests by stripping :80 and relaxing JsonLDGraph context typing
src/core/formatting.py
tests/routers/openml/datasets_test.py
tests/routers/openml/task_test.py
tests/routers/openml/migration/datasets_migration_test.py
src/schemas/datasets/mldcat_ap.py
Simplify flow subflow handling to be recursive Flow objects instead of wrapped Subflow structures and align migration tests.
  • Change Flow schema subflows field to list[Flow] and remove Subflow TypedDict
  • Update flows router to fetch subflows directly and return nested Flow instances
  • Adjust flow tests to expect bare subflow Flow objects instead of identifier/flow wrappers
  • Update migration flow comparison to treat subflows as components and adapt to PHP API component structure
src/schemas/flows.py
src/routers/openml/flows.py
tests/routers/openml/flows_test.py
tests/routers/openml/migration/flows_migration_test.py
Modernize docker-compose stack with profiles, health checks, dedicated DB setup container, and service renaming.
  • Add compose profiles for python, php, and all workflows
  • Version images, add health checks and resource limits, and rename services/containers (e.g., openml-python-rest-api, openml-php-rest-api, openml-elasticsearch)
  • Introduce database-setup service that runs docker/database/update.sh to patch MySQL schema/data
  • Provide Elasticsearch and PHP env files and remove legacy elasticsearch service definition
docker-compose.yaml
docker/database/update.sh
docker/elasticsearch/.env
docker/php/.env
Update CI workflows to use newer GitHub Actions and compose profiles with health-aware startup.
  • Upgrade actions/checkout and actions/setup-python to v6
  • Replace manual docker build/up and container names with compose --profile usage and new container names
  • Use compose --wait with an exit-code aggregation workaround to fail fast when services fail to start
  • Run pytest inside openml-python-rest-api container for both php_api and non-php_api test suites and generate coverage
.github/workflows/tests.yml
Align tests and constants with new dataset IDs, routing, and PHP API base URL.
  • Adjust dataset status tests to use shared IN_PREPARATION_ID and DEACTIVATED_DATASETS constants and update sets
  • Introduce shared PHP_API_URL constant for tests and simplify php_api fixture to use new container name
  • Fix private dataset migration test to reference tests.constants.PRIVATE_DATASET_ID
  • Tweak MySQL helper README to show concrete mysql connection command
tests/constants.py
tests/conftest.py
tests/routers/openml/datasets_test.py
tests/routers/openml/migration/datasets_migration_test.py
docker/mysql/README.md
Miscellaneous cleanups to CLI argument parsing and Docker docs.
  • Ignore unused return values from argparse add_argument calls in src/main.py
  • Simplify Docker README by removing prebuilt Elasticsearch image build instructions and updating multi-platform docs link
src/main.py
docker/readme.md


@sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In docker/database/update.sh, the shebang is currently #/bin/bash; this should be #!/bin/bash so the script runs correctly when executed directly.
  • Unlike load_database_configuration, load_routing_configuration is not cached and calls _load_configuration on every use; if this is invoked frequently at runtime, consider applying functools.cache (or reusing the cached config) to avoid repeated file reads and parsing. A sketch follows below this list.
  • In tests/routers/openml/migration/datasets_migration_test.py, the line [private_dataset] = tests.constants.PRIVATE_DATASET_ID now references tests.constants, but only constants appears to be imported previously; either import tests or revert to using the existing constants alias to avoid a NameError.
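To make the caching suggestion above concrete, here is a minimal sketch of a cached accessor, assuming the names this PR introduces in src/config.py (CONFIG_PATH, _load_configuration); the TomlTable alias below is a stand-in for the repository's actual type:

import functools
import tomllib
import typing
from pathlib import Path

TomlTable = dict[str, typing.Any]  # stand-in for the repository's type alias
CONFIG_PATH = Path(__file__).parent / "config.toml"

@functools.cache
def _load_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    # Parse the TOML file once per distinct path; Path is hashable,
    # so functools.cache keys on the argument.
    return tomllib.loads(file.read_text())

@functools.cache
def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    # The section lookup is also cached; the file itself is only read
    # once thanks to the cached helper above.
    return typing.cast(TomlTable, _load_configuration(file)["routing"])

Note that caching interacts with the environment-override concern raised further down in this thread: a cached accessor keeps returning the values computed on first call even if environment variables change afterwards.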

Comment on lines +13 to +16
start_interval: 1s
timeout: 3s
interval: 5s
retries: 10
issue (bug_risk): The start_interval key is only honored by recent Docker Engines and is ignored on older ones.

start_interval is part of the Compose healthcheck specification but requires a recent engine (Docker Engine 25.0 or newer); older engines only honor test, interval, timeout, retries, and start_period. If the runners may use an older engine, control check frequency with interval (and optionally start_period) instead, and remove start_interval here and in the Elasticsearch service or replace it with a supported option.

Comment on lines +51 to +53
start_interval: 5s
timeout: 3s
interval: 10s
issue (bug_risk): Same start_interval caveat for the Elasticsearch healthcheck configuration.

As with the database service, start_interval is ignored by engines that predate its addition to the Compose healthcheck specification. To control timing there, rely on start_period with interval, or just adjust interval.

Comment on lines +5 to +8
mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1)) WHERE extension="arff";'

# Update openml.expdb.dataset with the same url
mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
issue (bug_risk): The MinIO dataset URL pattern here is inconsistent with the application’s routing logic and will fail for larger dataset IDs.

In src/core/formatting.py, dataset URLs use a computed ten-thousands prefix (dataset.did // 10_000:04d) plus a padded ID. This SQL instead hardcodes datasets/0000/ and only pads id to 4 digits, so id >= 10000 will map to the wrong path. If this script should work for all datasets, please compute both the prefix and padded ID in SQL (e.g., using integer division and LPAD) to match the Python logic.
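For reference, the Python layout the SQL needs to reproduce, as quoted above from src/core/formatting.py, can be checked with a few lines (the helper name is illustrative):

def minio_dataset_dir(did: int) -> str:
    # Mirrors the routing logic cited above: a ten-thousands prefix
    # directory plus the zero-padded dataset id.
    return f"datasets/{did // 10_000:04d}/{did:04d}"

assert minio_dataset_dir(1) == "datasets/0000/0001"
assert minio_dataset_dir(9999) == "datasets/0000/9999"
assert minio_dataset_dir(12345) == "datasets/0001/12345"  # not under 0000/

A SQL port would need FLOOR(id / 10000) for the prefix, and note that MySQL's LPAD truncates strings longer than the target length, so padding the id to exactly four characters would corrupt five-digit ids.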

@coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/config.py (1)

32-55: Critical caching bug: environment variables are ignored after first call.

The function is decorated with @functools.cache, but it reads environment variables (lines 39-54) and calls load_dotenv() (line 38) inside the cached function. After the first invocation, the cached result is returned without re-reading environment variables or re-executing load_dotenv().

This breaks environment-based configuration:

# First call
config = load_database_configuration()  # Uses current env vars

# Environment changes
os.environ["OPENML_DATABASES_OPENML_USERNAME"] = "new_user"

# Second call returns stale cached result
config = load_database_configuration()  # Still has old username ❌
🔎 Recommended fix: Remove caching or restructure

Option 1 (Recommended): Remove caching from this function

-@functools.cache
 def load_database_configuration(file: Path = CONFIG_PATH) -> TomlTable:
     configuration = _load_configuration(file)

Option 2: Cache only the file-based configuration, apply env overrides separately

@functools.cache
def _load_base_database_configuration(file: Path) -> TomlTable:
    configuration = _load_configuration(file)
    return _apply_defaults_to_siblings(configuration["databases"])

def load_database_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    database_configuration = _load_base_database_configuration(file).copy()
    load_dotenv()
    database_configuration["openml"]["username"] = os.environ.get(
        "OPENML_DATABASES_OPENML_USERNAME", "root"
    )
    # ... rest of env overrides
    return database_configuration

Option 2 maintains performance benefits of caching the file I/O while correctly handling environment variable overrides on each call.

🧹 Nitpick comments (13)
src/schemas/datasets/mldcat_ap.py (1)

278-278: Consider fixing the type mismatch instead of suppressing it.

The # type: ignore[arg-type] suppresses a legitimate type error: default_factory=dict returns an untyped dict, not dict[str, HttpUrl]. While this works at runtime, it reduces type safety.

🔎 Proposed fix to properly type the default factory
-    context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context")  # type: ignore[arg-type]
+    context: str | dict[str, HttpUrl] = Field(default_factory=lambda: {}, serialization_alias="@context")

Alternatively, if the context should always be a properly typed dict when initialized as a dict:

-    context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context")  # type: ignore[arg-type]
+    context: str | dict[str, HttpUrl] = Field(default_factory=lambda: dict[str, HttpUrl](), serialization_alias="@context")
src/routers/openml/tasks.py (1)

143-144: Consider caching the routing configuration to avoid repeated file I/O.

The load_routing_configuration() call is executed within _fill_json_template, which is called recursively for nested template structures. This means the configuration file is read from disk multiple times during a single request.

For better performance, consider loading the routing configuration once at the module level, or passing it as a parameter through the call chain.

🔎 Proposed refactor options

Option 1: Module-level constant (if config is static during runtime)

+_ROUTING_CONFIG = config.load_routing_configuration()
+
 def _fill_json_template(
     template: dict[str, Any],
     task: RowMapping,
     task_inputs: dict[str, str],
     fetched_data: dict[str, Any],
     connection: Connection,
 ) -> dict[str, Any] | list[dict[str, Any]] | str:
     # ... existing code ...
     template = template.replace("[TASK:id]", str(task.task_id))
-    server_url = config.load_routing_configuration()["server_url"]
+    server_url = _ROUTING_CONFIG["server_url"]
     return template.replace("[CONSTANT:base_url]", server_url)

Option 2: Pass as parameter (if config might change)

 def fill_template(
     template: str,
     task: RowMapping,
     task_inputs: dict[str, str],
     connection: Connection,
 ) -> Any:  # noqa: ANN401
     json_template = convert_template_xml_to_json(template)
+    routing_config = config.load_routing_configuration()
     return _fill_json_template(
         json_template,
         task,
         task_inputs,
         fetched_data={},
         connection=connection,
+        routing_config=routing_config,
     )
.github/workflows/tests.yml (2)

24-24: Consider shell script improvements for robustness.

The static analysis tool suggests improvements to the shell command for better reliability.

🔎 Proposed improvements
-      - run: docker compose --profile "python" --profile "php" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
+      - run: docker compose --profile "python" --profile "php" up --detach --wait --remove-orphans || exit "$(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -vc '^0')"

Changes:

  • Quote the command substitution to prevent word splitting
  • Use grep -c instead of grep | wc -l for cleaner syntax

38-38: Consider shell script improvements for robustness.

The static analysis tool suggests improvements to the shell command for better reliability.

🔎 Proposed improvements
-      - run: docker compose --profile "python" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
+      - run: docker compose --profile "python" up --detach --wait --remove-orphans || exit "$(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -vc '^0')"

Changes:

  • Quote the command substitution to prevent word splitting
  • Use grep -c instead of grep | wc -l for cleaner syntax
src/core/formatting.py (2)

28-31: Consider normalizing trailing slashes in URL construction.

If minio_url from config doesn't end with /, the resulting URL will be malformed (e.g., http://minio:9000datasets/...). The same applies to server_url on lines 35-37.

Proposed fix to ensure trailing slash
 def _format_parquet_url(dataset: Row) -> str | None:
     if dataset.format.lower() != DatasetFileFormat.ARFF:
         return None

-    minio_base_url = load_routing_configuration()["minio_url"]
+    minio_base_url = load_routing_configuration()["minio_url"].rstrip("/") + "/"
     ten_thousands_prefix = f"{dataset.did // 10_000:04d}"
     padded_id = f"{dataset.did:04d}"
     return f"{minio_base_url}datasets/{ten_thousands_prefix}/{padded_id}/dataset_{dataset.did}.pq"

28-31: Repeated config loading on every call.

load_routing_configuration() is called on each invocation of these formatting functions. If these are called frequently (e.g., in a loop formatting many datasets), consider caching the config or passing it as a parameter to avoid repeated file I/O.

Also applies to: 35-37

docker/database/update.sh (2)

1-8: Add error handling to the script.

The script lacks error handling. If any MySQL command fails, subsequent commands will still execute, potentially leaving the database in an inconsistent state.

Proposed fix
 #!/bin/bash
+set -e  # Exit immediately if a command exits with a non-zero status
+
 # Change the filepath of openml.file
 # from "https://www.openml.org/data/download/1666876/phpFsFYVN"
 # to "http://minio:9000/datasets/0000/0001/phpFsFYVN"

10-13: Remove excessive blank lines.

Multiple consecutive blank lines reduce readability.

Proposed fix
 mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'

-
-
-
-
 # Create the data_feature_description TABLE. TODO: can we make sure this table exists already?
src/routers/openml/flows.py (1)

52-53: N+1 query pattern for subflows.

Each subflow triggers a separate get_flow call with its own database queries (flow, parameters, tags, subflows). For flows with many nested subflows, this could result in significant database load. Consider batching queries or lazy loading if performance becomes an issue.
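If this does become a bottleneck, one possible shape is per-request memoization keyed by flow id. A rough sketch, using hypothetical stand-ins (fetch_flow, fetch_child_ids) rather than the repository's actual database API:

from typing import Any, Callable

def fetch_flow_tree(
    flow_id: int,
    fetch_flow: Callable[[int], dict[str, Any]],   # hypothetical stand-in for database.flows.get
    fetch_child_ids: Callable[[int], list[int]],   # hypothetical stand-in for get_subflows
    cache: dict[int, dict[str, Any]] | None = None,
) -> dict[str, Any]:
    # Memoize by flow id: a flow reused in several places in the tree
    # is fetched once, and an already-seen id never recurses again.
    if cache is None:
        cache = {}
    if flow_id in cache:
        return cache[flow_id]
    flow = fetch_flow(flow_id)
    cache[flow_id] = flow
    flow["subflows"] = [
        fetch_flow_tree(child, fetch_flow, fetch_child_ids, cache)
        for child in fetch_child_ids(flow_id)
    ]
    return flow

As a side effect, the cache also bounds the recursion that the circular-reference comment further down warns about.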

docker-compose.yaml (2)

70-74: Inconsistent depends_on conditions.

php-api uses condition: service_healthy for elasticsearch but condition: service_started for database, even though database now has a healthcheck. Consider using service_healthy for database as well to ensure it's ready before php-api starts.

Proposed fix
     depends_on:
       elasticsearch:
         condition: service_healthy
       database:
-        condition: service_started
+        condition: service_healthy

92-93: python-api depends_on lacks explicit condition.

Unlike other services, python-api uses a simple list syntax for depends_on instead of explicit conditions. Consider adding condition: service_healthy for consistency with the database healthcheck.

Proposed fix
     depends_on:
-      - database
+      database:
+        condition: service_healthy
src/config.py (2)

28-29: Consider adding error handling for missing configuration section.

Direct dictionary access will raise a generic KeyError if the "routing" section is missing from config.toml. Consider wrapping with a try-except to provide a more descriptive error message.

🔎 Proposed enhancement for better error messages
 def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
-    return typing.cast(TomlTable, _load_configuration(file)["routing"])
+    config = _load_configuration(file)
+    if "routing" not in config:
+        raise ValueError(f"Missing 'routing' section in configuration file: {file}")
+    return typing.cast(TomlTable, config["routing"])

58-59: Consider using the new _load_configuration helper for consistency.

This function directly calls tomllib.loads(file.read_text()), while the new pattern introduced in this PR uses the cached _load_configuration helper. Additionally, it defines the default path inline rather than using the CONFIG_PATH constant.

🔎 Suggested refactor for consistency
-def load_configuration(file: Path = Path(__file__).parent / "config.toml") -> TomlTable:
-    return tomllib.loads(file.read_text())
+def load_configuration(file: Path = CONFIG_PATH) -> TomlTable:
+    return _load_configuration(file)

This improves consistency across the module and leverages the caching benefit of _load_configuration.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 217d62d and 813927a.

📒 Files selected for processing (23)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • docker-compose.yaml (3 hunks)
  • docker/database/update.sh (1 hunks)
  • docker/elasticsearch/.env (1 hunks)
  • docker/mysql/README.md (1 hunks)
  • docker/php/.env (1 hunks)
  • docker/readme.md (1 hunks)
  • src/config.py (2 hunks)
  • src/config.toml (1 hunks)
  • src/core/formatting.py (2 hunks)
  • src/main.py (1 hunks)
  • src/routers/openml/flows.py (1 hunks)
  • src/routers/openml/tasks.py (2 hunks)
  • src/schemas/datasets/mldcat_ap.py (1 hunks)
  • src/schemas/flows.py (2 hunks)
  • tests/conftest.py (2 hunks)
  • tests/constants.py (1 hunks)
  • tests/routers/openml/datasets_test.py (5 hunks)
  • tests/routers/openml/flows_test.py (1 hunks)
  • tests/routers/openml/migration/datasets_migration_test.py (3 hunks)
  • tests/routers/openml/migration/flows_migration_test.py (2 hunks)
  • tests/routers/openml/task_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/routers/openml/tasks.py (1)
src/config.py (1)
  • load_routing_configuration (28-29)
src/core/formatting.py (1)
src/config.py (1)
  • load_routing_configuration (28-29)
src/routers/openml/flows.py (2)
src/database/flows.py (1)
  • get_subflows (7-20)
tests/conftest.py (1)
  • flow (79-89)
🪛 actionlint (1.7.9)
.github/workflows/tests.yml

24-24: shellcheck reported issue in this script: SC2046:warning:1:95: Quote this to prevent word splitting

(shellcheck)


24-24: shellcheck reported issue in this script: SC2126:style:1:168: Consider using 'grep -c' instead of 'grep|wc -l'

(shellcheck)


38-38: shellcheck reported issue in this script: SC2046:warning:1:79: Quote this to prevent word splitting

(shellcheck)


38-38: shellcheck reported issue in this script: SC2126:style:1:152: Consider using 'grep -c' instead of 'grep|wc -l'

(shellcheck)

🪛 dotenv-linter (4.0.0)
docker/php/.env

[warning] 4-4: [UnorderedKey] The DB_HOST_OPENML key should go before the MINIO_URL key

(UnorderedKey)


[warning] 5-5: [UnorderedKey] The DB_HOST_EXPDB key should go before the DB_HOST_OPENML key

(UnorderedKey)


[warning] 6-6: [UnorderedKey] The DB_USER_OPENML key should go before the MINIO_URL key

(UnorderedKey)


[warning] 7-7: [UnorderedKey] The DB_PASS_OPENML key should go before the DB_USER_OPENML key

(UnorderedKey)


[warning] 8-8: [UnorderedKey] The DB_USER_EXPDB_WRITE key should go before the DB_USER_OPENML key

(UnorderedKey)


[warning] 9-9: [UnorderedKey] The DB_PASS_EXPDB_WRITE key should go before the DB_PASS_OPENML key

(UnorderedKey)


[warning] 10-10: [UnorderedKey] The DB_USER_EXPDB_READ key should go before the DB_USER_EXPDB_WRITE key

(UnorderedKey)


[warning] 11-11: [UnorderedKey] The DB_PASS_EXPDB_READ key should go before the DB_PASS_EXPDB_WRITE key

(UnorderedKey)


[warning] 12-12: [UnorderedKey] The ES_URL key should go before the MINIO_URL key

(UnorderedKey)


[warning] 13-13: [UnorderedKey] The ES_PASSWORD key should go before the ES_URL key

(UnorderedKey)


[warning] 14-14: [UnorderedKey] The INDEX_ES_DURING_STARTUP key should go before the MINIO_URL key

(UnorderedKey)

docker/elasticsearch/.env

[warning] 2-2: [IncorrectDelimiter] The discovery.type key has incorrect delimiter

(IncorrectDelimiter)


[warning] 2-2: [LowercaseKey] The discovery.type key should be in uppercase

(LowercaseKey)


[warning] 3-3: [IncorrectDelimiter] The xpack.security.enabled key has incorrect delimiter

(IncorrectDelimiter)


[warning] 3-3: [LowercaseKey] The xpack.security.enabled key should be in uppercase

(LowercaseKey)

🪛 Shellcheck (0.11.0)
docker/database/update.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Sourcery review
  • GitHub Check: Sourcery review
  • GitHub Check: compare-php
🔇 Additional comments (24)
.gitignore (1)

2-2: LGTM!

Adding .DS_Store to the ignore list is a standard best practice. macOS system files should never be tracked in version control.

docker/mysql/README.md (1)

19-19: LGTM!

The concrete MySQL connection example improves the documentation by showing users exactly how to connect to the test database.

docker/elasticsearch/.env (1)

1-3: LGTM! Static analysis warnings are false positives.

The dotenv-linter warnings about uppercase keys and delimiters can be safely ignored. These are Elasticsearch-specific configuration keys that must use the exact case and format shown (e.g., discovery.type, xpack.security.enabled). The disabled security settings are appropriate for a test/development environment.

docker/readme.md (1)

17-17: LGTM!

The updated link to Docker's official documentation is more concise and maintainable than the previous longer URL.

src/schemas/flows.py (2)

28-28: The recursive Flow reference is well-typed.

Using Self for recursive Flow references is the appropriate pattern and makes the schema cleaner than the previous TypedDict approach.


4-4: Structural change to subflows is complete and all consumers have been updated.

The change from list[Subflow] (TypedDict with "identifier" and "flow" keys) to list[Self] (direct Flow objects) is confirmed as a breaking API change. Verification shows all production code and tests have been properly updated:

  • Router builds subflows recursively via get_flow() calls, creating nested Flow objects
  • No old dictionary access patterns remain in the codebase
  • All unit tests expect and validate the new nested Flow object structure
  • Migration test properly handles conversion between old ("component"/"flow" format) and new structure for backward compatibility

The implementation is complete and consistent across the codebase.

tests/routers/openml/task_test.py (1)

22-22: LGTM!

The updated URL correctly reflects the new routing configuration using the internal Docker service name (http://php-api:80/) instead of the external production URL. This aligns with the routing config introduced in src/config.toml.

src/main.py (1)

24-40: LGTM!

Assigning the add_argument return values to _ is a common Python idiom to explicitly indicate these values are intentionally unused. This likely addresses linter warnings while maintaining the same functional behavior.

tests/constants.py (1)

2-3: LGTM! Test constants updated to reflect current database state.

The removal of dataset IDs 1 and 2 from IN_PREPARATION_ID and DEACTIVATED_DATASETS respectively aligns with the test data changes in tests/routers/openml/datasets_test.py where dataset 1 is now marked as "active" (line 55).

tests/routers/openml/migration/datasets_migration_test.py (2)

8-8: LGTM! Import standardization improves consistency.

The updated import to use the fully qualified tests.constants module path aligns with the pattern used across other test files and improves code clarity.

Also applies to: 133-133


46-47: Good addition of explanatory comment for URL normalization.

The inline comment clearly explains why the port is being removed, documenting Pydantic's HttpURL serialization behavior for future maintainers.

tests/routers/openml/datasets_test.py (2)

48-49: LGTM! Test data updated to reflect routing configuration and database state.

The URL updates to use service names (http://php-api and http://minio:9000) align with the new routing configuration approach. The status change from "in_preparation" to "active" for dataset 1 is consistent with its removal from IN_PREPARATION_ID in tests/constants.py.

Also applies to: 55-55


226-226: LGTM! Using constants improves test maintainability.

The use of next(iter(constants.IN_PREPARATION_ID)) and next(iter(constants.DEACTIVATED_DATASETS)) correctly extracts the single dataset ID from each set, making tests more maintainable when test data changes.

Also applies to: 236-236, 246-246

tests/conftest.py (1)

19-19: LGTM! Centralized PHP API URL improves maintainability.

Defining PHP_API_URL as a module-level constant makes it easier to update the URL in one place and improves consistency across fixtures.

tests/routers/openml/flows_test.py (1)

224-358: LGTM! Subflow structure simplified for better API ergonomics.

The flattened subflow structure (removing the nested {"identifier": None, "flow": {...}} wrapper) provides a cleaner, more intuitive API response. This aligns with the backend changes where get_subflows now returns Flow objects directly.

.github/workflows/tests.yml (2)

25-26: LGTM! Updated to use the correct Python API container.

The container reference has been updated from a generic name to openml-python-rest-api, which aligns with the docker-compose service definitions.

Also applies to: 39-40


18-19: Both actions/checkout@v6 and actions/setup-python@v6 are available and widely documented. checkout@v6 stores credentials under $RUNNER_TEMP instead of in the git config, which requires Actions Runner v2.329.0 or newer. Verify your runners meet this requirement; otherwise no special compatibility considerations apply.

docker/database/update.sh (1)

5-5: Hardcoded URL should align with routing configuration.

The minio URL http://minio:9000/ is hardcoded here but also defined in src/config.toml as minio_url. Consider either documenting this dependency or extracting the URL to avoid drift between the database script and the application config.

tests/routers/openml/migration/flows_migration_test.py (2)

63-72: LGTM on the recursive conversion logic.

The in-place recursive call to convert_flow_naming_and_defaults(subflow) correctly handles the new flat Flow structure where subflows are Flow objects instead of dicts with "identifier" and "flow" keys.


78-79: Remove invalid concern about single vs. multiple subflows handling.

Lines 78-79 correctly normalize the PHP API's single-component dict structure by extracting the nested "flow" value. The code properly handles the PHP API response format where component is a dict with a "flow" key, not a list. The iteration over multiple subflows occurs in the new API's "subflows" list (line 67), which is distinct from the PHP API's "component" field. The nested_remove_single_element_list call on line 75 then normalizes single-element lists appropriately. No issue exists with this normalization.

docker-compose.yaml (1)

18-27: LGTM on database-setup service.

The service correctly waits for database to be healthy before running the setup script, and the volume mount provides the script at the expected path.

src/routers/openml/flows.py (1)

52-53: Identifier field exclusion from subflows is intentional.

The identifier field is retrieved from the database in get_subflows() but intentionally not included in the Flow response schema. This change is confirmed by the migration test, which explicitly notes the identifier is omitted (similar to how parameter default values are handled). No code in the repository references subflow identifiers, confirming no downstream consumers are affected by this omission.

src/config.py (2)

11-12: LGTM!

Clean introduction of a module-level constant for configuration file path using proper Path handling.


24-26: LGTM!

Excellent use of caching for file I/O operations. The cached helper reduces redundant file reads when the same configuration file is accessed multiple times.

Comment on lines +48 to +53
healthcheck:
test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
start_period: 30s
start_interval: 5s
timeout: 3s
interval: 10s
⚠️ Potential issue | 🟡 Minor

Elasticsearch healthcheck may be too strict for single-node clusters.

The healthcheck only accepts "green" status, but single-node Elasticsearch clusters typically report "yellow" because replicas cannot be allocated. This may cause the healthcheck to fail indefinitely.

Proposed fix to accept both green and yellow
     healthcheck:
-      test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
+      test: curl -s 127.0.0.1:9200/_cluster/health | grep -E '"status":"(green|yellow)"'
       start_period: 30s

@@ -0,0 +1,31 @@
#/bin/bash
⚠️ Potential issue | 🔴 Critical

Fix the shebang syntax.

The shebang is missing the ! character. This will cause the script to not execute with bash semantics and may fail or behave unexpectedly.

Proposed fix
-#/bin/bash
+#!/bin/bash

Comment on lines +1 to +14
API_KEY=AD000000000000000000000000000000
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=database:3306
DB_HOST_EXPDB=database:3306
DB_USER_OPENML=root
DB_PASS_OPENML=ok
DB_USER_EXPDB_WRITE=root
DB_PASS_EXPDB_WRITE=ok
DB_USER_EXPDB_READ=root
DB_PASS_EXPDB_READ=ok
ES_URL=elasticsearch:9200
ES_PASSWORD=default
INDEX_ES_DURING_STARTUP=false
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if .env files are properly ignored
echo "=== Checking .gitignore for .env patterns ==="
if [ -f .gitignore ]; then
    grep -n "\.env" .gitignore || echo "No .env patterns found in .gitignore"
else
    echo ".gitignore not found"
fi

echo -e "\n=== Checking if this .env file is tracked by git ==="
git ls-files docker/php/.env && echo "WARNING: .env file is tracked!" || echo "OK: .env file is not tracked"

echo -e "\n=== Searching for .env.example or similar template files ==="
find . -name ".env*" -type f | grep -v ".git" | head -20

echo -e "\n=== Checking git status of docker/php/.env ==="
git status docker/php/.env 2>/dev/null || echo "File not found or not in git"

Repository: openml/server-api

Length of output: 438


🏁 Script executed:

# Check the .gitignore content around line 126 to see the pattern
echo "=== .gitignore content around line 126 ==="
sed -n '120,130p' .gitignore

echo -e "\n=== Full .env patterns in .gitignore ==="
grep -n "env" .gitignore

echo -e "\n=== Check if docker directory has any .gitignore ==="
cat docker/.gitignore 2>/dev/null || echo "No docker/.gitignore"

echo -e "\n=== Check if docker/php has any .gitignore ==="
cat docker/php/.gitignore 2>/dev/null || echo "No docker/php/.gitignore"

Repository: openml/server-api

Length of output: 745


Remove docker/php/.env from version control and create a template file.

This file contains sensitive credentials (API keys, database passwords) and is currently tracked by git despite having .env in .gitignore. The pattern .env only matches files at the repository root, not docker/php/.env in subdirectories.

Actions needed:

  1. Update .gitignore to use **/.env or add explicit docker/**/.env to catch all .env files
  2. Create docker/php/.env.example with placeholder values for each variable (one possible template is sketched after this list)
  3. Remove the committed file from git history: git rm --cached docker/php/.env
  4. Document the setup process in your README with instructions to copy the example file and populate real values
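One possible docker/php/.env.example for step 2, using only the variable names from the committed file; the placeholder values are illustrative:

API_KEY=<32-character-admin-api-key>
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=<host:port>
DB_HOST_EXPDB=<host:port>
DB_USER_OPENML=<username>
DB_PASS_OPENML=<password>
DB_USER_EXPDB_WRITE=<username>
DB_PASS_EXPDB_WRITE=<password>
DB_USER_EXPDB_READ=<username>
DB_PASS_EXPDB_READ=<password>
ES_URL=<host:port>
ES_PASSWORD=<password>
INDEX_ES_DURING_STARTUP=false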

Comment on lines +19 to +21
[routing]
minio_url="http://minio:9000/"
server_url="http://php-api:80/"
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find Python files that might load configuration
find . -name "*.py" -type f | head -20

# Search for toml loading patterns
rg -n "toml|config" --type=py -i | head -30

# Check the src/config.toml file
if [ -f "src/config.toml" ]; then
  cat -n src/config.toml
fi

# Search for environment variable patterns in Python files
rg -n "environ|getenv|os.env" --type=py | head -20

Repository: openml/server-api

Length of output: 4101


Routing configuration lacks environment variable override support.

The routing URLs use Docker service names (minio:9000, php-api:80), which are appropriate for containerized environments. However, unlike the database configuration which supports environment variable overrides (e.g., OPENML_DATABASES_OPENML_USERNAME), the load_routing_configuration() function does not implement similar environment variable override support for the routing URLs. For non-containerized deployments (local development, production), consider adding environment variable override support for minio_url and server_url to the configuration loading logic.

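A minimal sketch of such an override layer, assuming the environment variable names proposed above (OPENML_ROUTING_MINIO_URL, OPENML_ROUTING_SERVER_URL) and the CONFIG_PATH and _load_configuration helpers from src/config.py; TomlTable is again a stand-in alias:

import os
import typing
from pathlib import Path

TomlTable = dict[str, typing.Any]  # stand-in for the repository's type alias

def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    # Copy so the cached parse held by _load_configuration is never mutated.
    routing: TomlTable = dict(_load_configuration(file)["routing"])
    overrides = {
        "minio_url": "OPENML_ROUTING_MINIO_URL",
        "server_url": "OPENML_ROUTING_SERVER_URL",
    }
    for key, env_var in overrides.items():
        value = os.environ.get(env_var)
        if value:  # a set, non-empty env var wins over config.toml
            routing[key] = value
    return routing

Basic URL validation (for example requiring an http or https scheme) could be layered on top, as the comment suggests.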

Comment on lines +52 to +53
flow_rows = database.flows.get_subflows(flow_id, expdb)
subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
⚠️ Potential issue | 🟠 Major

Potential infinite recursion with circular subflow references.

The recursive call to get_flow for each subflow has no cycle detection. If the database contains a circular reference (flow A → flow B → flow A), this will cause infinite recursion and a stack overflow.

Proposed fix with cycle detection
 @router.get("/{flow_id}")
-def get_flow(flow_id: int, expdb: Annotated[Connection, Depends(expdb_connection)] = None) -> Flow:
+def get_flow(
+    flow_id: int,
+    expdb: Annotated[Connection, Depends(expdb_connection)] = None,
+    _visited: set[int] | None = None,
+) -> Flow:
+    if _visited is None:
+        _visited = set()
+    if flow_id in _visited:
+        raise HTTPException(status_code=HTTPStatus.INTERNAL_SERVER_ERROR, detail="Circular subflow reference detected")
+    _visited.add(flow_id)
+
     flow = database.flows.get(flow_id, expdb)
     ...
     flow_rows = database.flows.get_subflows(flow_id, expdb)
-    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
+    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb, _visited=_visited.copy()) for flow in flow_rows]

Committable suggestion skipped: line range outside the PR's diff.


Comment on lines +44 to 46
def php_api() -> httpx.Client:
with httpx.Client(base_url=PHP_API_URL) as client:
yield client
⚠️ Potential issue | 🟡 Minor

Correct the return type annotation for the generator fixture.

The php_api fixture uses yield, making it a generator function. The return type should be Iterator[httpx.Client] rather than httpx.Client.

🔎 Proposed fix
 @pytest.fixture
-def php_api() -> httpx.Client:
+def php_api() -> Iterator[httpx.Client]:
     with httpx.Client(base_url=PHP_API_URL) as client:
         yield client

Committable suggestion skipped: line range outside the PR's diff.


@codecov bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 92.10526% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@217d62d). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/main.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #218   +/-   ##
=======================================
  Coverage        ?   76.01%           
=======================================
  Files           ?       51           
  Lines           ?     1851           
  Branches        ?      144           
=======================================
  Hits            ?     1407           
  Misses          ?      408           
  Partials        ?       36           

☔ View full report in Codecov by Sentry.

@PGijsbers merged commit 6ac2991 into main Dec 19, 2025 (7 checks passed)
@PGijsbers deleted the dev-rebased branch December 19, 2025 08:57
@coderabbitai bot mentioned this pull request Dec 19, 2025