Move extra dev commits into main #218
Conversation
We currently still maintain the relevant definition files in this repository so that they can change independently for a little while, while the server is under most active development. We can then consider which changes should be merged back into services to reduce duplication again.
Walkthrough

The PR updates the CI/CD pipeline to use GitHub Actions v6, refactors Docker orchestration with service profiles and healthchecks, introduces environment-driven configuration for routing URLs, restructures flow subflows from dict-based to direct Flow objects, and synchronizes test data with updated service URLs and database states.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Reviewer's Guide

Aligns the Python REST API's routing, flow/task schemas, and tests with the PHP API and the new Docker/CI stack by centralizing routing configuration, simplifying the flow subflow representation, updating dataset/task URL generation, and modernizing docker-compose and GitHub Actions workflows.

Sequence diagram for recursive flow retrieval with inlined subflows

```mermaid
sequenceDiagram
actor Client
participant PythonAPI as Python_API
participant DB as Database
Client->>PythonAPI: GET /flows/{flow_id}
activate PythonAPI
PythonAPI->>DB: get_flow(flow_id)
DB-->>PythonAPI: FlowRow
PythonAPI->>DB: get_subflows(flow_id)
DB-->>PythonAPI: SubflowRows
loop For each subflow in SubflowRows
PythonAPI->>PythonAPI: get_flow(child_id)
note over PythonAPI: Recursive call building Flow
end
PythonAPI-->>Client: Flow (subflows: list[Flow])
deactivate PythonAPI
```

Updated class diagram for config and flow schemas

```mermaid
classDiagram
direction LR
class ConfigModule {
+CONFIG_PATH : Path
+_load_configuration(file: Path) TomlTable
+load_routing_configuration(file: Path) TomlTable
+load_database_configuration(file: Path) TomlTable
}
class Flow {
+int id_
+int uploader
+datetime created
+str name
+str full_name
+str external_version
+str description
+str model
+str[] components
+str[] parameters_instructions
+str[] parameters_meta_info
+str binary_url
+str binary_format
+str library
+str language
+str dependencies
+Parameter[] parameter
+Flow[] subflows
+str[] tag
}
class Parameter {
+str name
+str data_type
+str default_value
+bool is_required
}
class JsonLDGraph {
+str | dict~str, HttpUrl~ context
+list~Distribution | DataService | Dataset | Quality | Feature | Agent | MD5Checksum~ graph
}
class Subflow {
<<removed>>
+str identifier
+Flow flow
}
ConfigModule <.. Flow : used_by
ConfigModule <.. JsonLDGraph : type_config
Flow "1" o-- "*" Parameter : has
Flow "1" o-- "*" Flow : subflows
```

File-Level Changes
Hey there - I've reviewed your changes - here's some feedback:
- In `docker/database/update.sh`, the shebang is currently `#/bin/bash`; this should be `#!/bin/bash` so the script runs correctly when executed directly.
- Unlike `load_database_configuration`, `load_routing_configuration` is not cached and calls `_load_configuration` on every use; if this is invoked frequently at runtime, consider applying `functools.cache` (or reusing the cached config) to avoid repeated file reads and parsing (a sketch follows this list).
- In `tests/routers/openml/migration/datasets_migration_test.py`, the line `[private_dataset] = tests.constants.PRIVATE_DATASET_ID` now references `tests.constants`, but only `constants` appears to be imported previously; either import `tests` or revert to using the existing `constants` alias to avoid a `NameError`.
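As a rough sketch of that caching suggestion — with `TomlTable` and `CONFIG_PATH` written out as stand-ins for the project's own definitions in `src/config.py`, which should be checked before applying this:

```python
import functools
import tomllib
import typing
from pathlib import Path

TomlTable = dict[str, typing.Any]  # stand-in for the project's alias
CONFIG_PATH = Path(__file__).parent / "config.toml"


@functools.cache
def _load_configuration(file: Path) -> TomlTable:
    # Parse the TOML file once; functools.cache memoizes per path.
    return tomllib.loads(file.read_text())


@functools.cache
def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    # Also cached, mirroring load_database_configuration's decorator.
    return typing.cast(TomlTable, _load_configuration(file)["routing"])
```

Since the routing section is purely file-based (unlike the database section, which layers environment variables on top), caching here does not have the staleness problem discussed further down in this thread.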
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `docker/database/update.sh`, the shebang is currently `#/bin/bash`; this should be `#!/bin/bash` so the script runs correctly when executed directly.
- Unlike `load_database_configuration`, `load_routing_configuration` is not cached and calls `_load_configuration` on every use; if this is invoked frequently at runtime, consider applying `functools.cache` (or reusing the cached config) to avoid repeated file reads and parsing.
- In `tests/routers/openml/migration/datasets_migration_test.py`, the line `[private_dataset] = tests.constants.PRIVATE_DATASET_ID` now references `tests.constants`, but only `constants` appears to be imported previously; either import `tests` or revert to using the existing `constants` alias to avoid a `NameError`.
## Individual Comments
### Comment 1
<location> `docker-compose.yaml:13-16` </location>
<code_context>
+ healthcheck:
+ test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
+ start_period: 30s
+ start_interval: 1s
+ timeout: 3s
+ interval: 5s
+ retries: 10
+
+ database-setup:
</code_context>
<issue_to_address>
**issue (bug_risk):** The `start_interval` key is not a valid Docker healthcheck option and will be ignored.
Docker healthchecks only support `test`, `interval`, `timeout`, `retries`, and `start_period`; `start_interval` is ignored. To control check frequency, use `interval` (and optionally `start_period`) instead, and remove `start_interval` here and in the Elasticsearch service or replace it with a supported option.
</issue_to_address>
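For illustration, the database healthcheck with `start_interval` simply dropped would look like this (values copied from the diff above; no other changes intended):

```yaml
healthcheck:
  test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
  start_period: 30s
  timeout: 3s
  interval: 5s
  retries: 10
```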
### Comment 2
<location> `docker-compose.yaml:51-53` </location>
<code_context>
+ healthcheck:
+ test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
+ start_period: 30s
+ start_interval: 5s
+ timeout: 3s
+ interval: 10s
+ deploy:
+ resources:
</code_context>
<issue_to_address>
**issue (bug_risk):** Same `start_interval` issue in the Elasticsearch healthcheck configuration.
As with the database service, `start_interval` isn’t a valid Docker healthcheck option and will be ignored. To control timing, rely on `start_period` with `interval`, or just adjust `interval`.
</issue_to_address>
### Comment 3
<location> `docker/database/update.sh:5-8` </location>
<code_context>
+# Change the filepath of openml.file
+# from "https://www.openml.org/data/download/1666876/phpFsFYVN"
+# to "http://minio:9000/datasets/0000/0001/phpFsFYVN"
+mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1)) WHERE extension="arff";'
+
+# Update openml.expdb.dataset with the same url
+mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** The MinIO dataset URL pattern here is inconsistent with the application’s routing logic and will fail for larger dataset IDs.
In `src/core/formatting.py`, dataset URLs use a computed ten-thousands prefix (`dataset.did // 10_000:04d`) plus a padded ID. This SQL instead hardcodes `datasets/0000/` and only pads `id` to 4 digits, so `id >= 10000` will map to the wrong path. If this script should work for all datasets, please compute both the prefix and padded ID in SQL (e.g., using integer division and `LPAD`) to match the Python logic.
</issue_to_address>

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```yaml
start_interval: 1s
timeout: 3s
interval: 5s
retries: 10
```
issue (bug_risk): The `start_interval` key is not a valid Docker healthcheck option and will be ignored.

Docker healthchecks only support `test`, `interval`, `timeout`, `retries`, and `start_period`; `start_interval` is ignored. To control check frequency, use `interval` (and optionally `start_period`) instead, and remove `start_interval` here and in the Elasticsearch service or replace it with a supported option.
```yaml
start_interval: 5s
timeout: 3s
interval: 10s
```
issue (bug_risk): Same `start_interval` issue in the Elasticsearch healthcheck configuration.

As with the database service, `start_interval` isn't a valid Docker healthcheck option and will be ignored. To control timing, rely on `start_period` with `interval`, or just adjust `interval`.
```bash
mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1)) WHERE extension="arff";'

# Update openml.expdb.dataset with the same url
mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
```
issue (bug_risk): The MinIO dataset URL pattern here is inconsistent with the application's routing logic and will fail for larger dataset IDs.

In `src/core/formatting.py`, dataset URLs use a computed ten-thousands prefix (`dataset.did // 10_000:04d`) plus a padded ID. This SQL instead hardcodes `datasets/0000/` and only pads `id` to 4 digits, so `id >= 10000` will map to the wrong path. If this script should work for all datasets, please compute both the prefix and padded ID in SQL (e.g., using integer division and `LPAD`) to match the Python logic.
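A sketch of what that could look like in SQL, assuming MySQL and the same `openml.file` columns used by the original script; note that `LPAD` truncates values longer than the target length, hence the `GREATEST` guard (untested):

```sql
-- Hypothetical rewrite matching the Python logic (prefix = did // 10_000,
-- id zero-padded to at least four digits).
UPDATE openml.file
SET filepath = CONCAT(
        "http://minio:9000/datasets/",
        LPAD(id DIV 10000, 4, "0"), "/",
        LPAD(id, GREATEST(4, CHAR_LENGTH(id)), "0"), "/",
        SUBSTRING_INDEX(filepath, "/", -1))
WHERE extension = "arff";
```

With that change the script would agree with `_format_parquet_url` for IDs of any size.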
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/config.py (1)
32-55: Critical caching bug: environment variables are ignored after first call.

The function is decorated with `@functools.cache`, but it reads environment variables (lines 39-54) and calls `load_dotenv()` (line 38) inside the cached function. After the first invocation, the cached result is returned without re-reading environment variables or re-executing `load_dotenv()`.

This breaks environment-based configuration:
```python
# First call
config = load_database_configuration()  # Uses current env vars

# Environment changes
os.environ["OPENML_DATABASES_OPENML_USERNAME"] = "new_user"

# Second call returns stale cached result
config = load_database_configuration()  # Still has old username ❌
```

🔎 Recommended fix: Remove caching or restructure
Option 1 (Recommended): Remove caching from this function

```diff
-@functools.cache
 def load_database_configuration(file: Path = CONFIG_PATH) -> TomlTable:
     configuration = _load_configuration(file)
```

Option 2: Cache only the file-based configuration, apply env overrides separately
```python
@functools.cache
def _load_base_database_configuration(file: Path) -> TomlTable:
    configuration = _load_configuration(file)
    return _apply_defaults_to_siblings(configuration["databases"])


def load_database_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    database_configuration = _load_base_database_configuration(file).copy()
    load_dotenv()
    database_configuration["openml"]["username"] = os.environ.get(
        "OPENML_DATABASES_OPENML_USERNAME", "root"
    )
    # ... rest of env overrides
    return database_configuration
```

Option 2 maintains the performance benefits of caching the file I/O while correctly handling environment variable overrides on each call.
🧹 Nitpick comments (13)
src/schemas/datasets/mldcat_ap.py (1)
278-278: Consider fixing the type mismatch instead of suppressing it.

The `# type: ignore[arg-type]` suppresses a legitimate type error: `default_factory=dict` returns an untyped `dict`, not `dict[str, HttpUrl]`. While this works at runtime, it reduces type safety.

🔎 Proposed fix to properly type the default factory
```diff
- context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context")  # type: ignore[arg-type]
+ context: str | dict[str, HttpUrl] = Field(default_factory=lambda: {}, serialization_alias="@context")
```

Alternatively, if the context should always be a properly typed dict when initialized as a dict:
```diff
- context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context")  # type: ignore[arg-type]
+ context: str | dict[str, HttpUrl] = Field(default_factory=lambda: dict[str, HttpUrl](), serialization_alias="@context")
```

src/routers/openml/tasks.py (1)
143-144: Consider caching the routing configuration to avoid repeated file I/O.

The `load_routing_configuration()` call is executed within `_fill_json_template`, which is called recursively for nested template structures. This means the configuration file is read from disk multiple times during a single request.

For better performance, consider loading the routing configuration once at the module level, or passing it as a parameter through the call chain.
🔎 Proposed refactor options
Option 1: Module-level constant (if config is static during runtime)
```diff
+_ROUTING_CONFIG = config.load_routing_configuration()
+
 def _fill_json_template(
     template: dict[str, Any],
     task: RowMapping,
     task_inputs: dict[str, str],
     fetched_data: dict[str, Any],
     connection: Connection,
 ) -> dict[str, Any] | list[dict[str, Any]] | str:
     # ... existing code ...
     template = template.replace("[TASK:id]", str(task.task_id))
-    server_url = config.load_routing_configuration()["server_url"]
+    server_url = _ROUTING_CONFIG["server_url"]
     return template.replace("[CONSTANT:base_url]", server_url)
```

Option 2: Pass as parameter (if config might change)
```diff
 def fill_template(
     template: str,
     task: RowMapping,
     task_inputs: dict[str, str],
     connection: Connection,
 ) -> Any:  # noqa: ANN401
     json_template = convert_template_xml_to_json(template)
+    routing_config = config.load_routing_configuration()
     return _fill_json_template(
         json_template,
         task,
         task_inputs,
         fetched_data={},
         connection=connection,
+        routing_config=routing_config,
     )
```

.github/workflows/tests.yml (2)
24-24: Consider shell script improvements for robustness.

The static analysis tool suggests improvements to the shell command for better reliability.

🔎 Proposed improvements

```diff
- - run: docker compose --profile "python" --profile "php" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
+ - run: docker compose --profile "python" --profile "php" up --detach --wait --remove-orphans || exit "$(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -vc '^0')"
```

Changes:

- Quote the command substitution to prevent word splitting
- Use `grep -c` instead of `grep | wc -l` for cleaner syntax
38-38: Consider shell script improvements for robustness.

The static analysis tool suggests improvements to the shell command for better reliability.

🔎 Proposed improvements

```diff
- - run: docker compose --profile "python" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
+ - run: docker compose --profile "python" up --detach --wait --remove-orphans || exit "$(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -vc '^0')"
```

Changes:

- Quote the command substitution to prevent word splitting
- Use `grep -c` instead of `grep | wc -l` for cleaner syntax

src/core/formatting.py (2)
28-31: Consider normalizing trailing slashes in URL construction.

If `minio_url` from config doesn't end with `/`, the resulting URL will be malformed (e.g., `http://minio:9000datasets/...`). The same applies to `server_url` on lines 35-37.

Proposed fix to ensure trailing slash

```diff
 def _format_parquet_url(dataset: Row) -> str | None:
     if dataset.format.lower() != DatasetFileFormat.ARFF:
         return None
-    minio_base_url = load_routing_configuration()["minio_url"]
+    minio_base_url = load_routing_configuration()["minio_url"].rstrip("/") + "/"
     ten_thousands_prefix = f"{dataset.did // 10_000:04d}"
     padded_id = f"{dataset.did:04d}"
     return f"{minio_base_url}datasets/{ten_thousands_prefix}/{padded_id}/dataset_{dataset.did}.pq"
```
28-31: Repeated config loading on every call.

`load_routing_configuration()` is called on each invocation of these formatting functions. If these are called frequently (e.g., in a loop formatting many datasets), consider caching the config or passing it as a parameter to avoid repeated file I/O.

Also applies to: 35-37
docker/database/update.sh (2)
1-8: Add error handling to the script.

The script lacks error handling. If any MySQL command fails, subsequent commands will still execute, potentially leaving the database in an inconsistent state.

Proposed fix

```diff
 #!/bin/bash
+set -e  # Exit immediately if a command exits with a non-zero status
+
 # Change the filepath of openml.file
 # from "https://www.openml.org/data/download/1666876/phpFsFYVN"
 # to "http://minio:9000/datasets/0000/0001/phpFsFYVN"
```
10-13: Remove excessive blank lines.

Multiple consecutive blank lines reduce readability.

Proposed fix

```diff
 mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
-
-
-
 # Create the data_feature_description TABLE. TODO: can we make sure this table exists already?
```

src/routers/openml/flows.py (1)
52-53: N+1 query pattern for subflows.

Each subflow triggers a separate `get_flow` call with its own database queries (flow, parameters, tags, subflows). For flows with many nested subflows, this could result in significant database load. Consider batching queries or lazy loading if performance becomes an issue.

docker-compose.yaml (2)
70-74: Inconsistent depends_on conditions.

`php-api` uses `condition: service_healthy` for elasticsearch but `condition: service_started` for database, even though database now has a healthcheck. Consider using `service_healthy` for database as well to ensure it's ready before php-api starts.

Proposed fix

```diff
     depends_on:
       elasticsearch:
         condition: service_healthy
       database:
-        condition: service_started
+        condition: service_healthy
```
92-93: python-api depends_on lacks explicit condition.

Unlike other services, `python-api` uses a simple list syntax for `depends_on` instead of explicit conditions. Consider adding `condition: service_healthy` for consistency with the database healthcheck.

Proposed fix

```diff
     depends_on:
-      - database
+      database:
+        condition: service_healthy
```

src/config.py (2)
28-29: Consider adding error handling for missing configuration section.

Direct dictionary access will raise a generic `KeyError` if the "routing" section is missing from `config.toml`. Consider wrapping with a try-except to provide a more descriptive error message.

🔎 Proposed enhancement for better error messages

```diff
 def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
-    return typing.cast(TomlTable, _load_configuration(file)["routing"])
+    config = _load_configuration(file)
+    if "routing" not in config:
+        raise ValueError(f"Missing 'routing' section in configuration file: {file}")
+    return typing.cast(TomlTable, config["routing"])
```
58-59: Consider using the new `_load_configuration` helper for consistency.

This function directly calls `tomllib.loads(file.read_text())`, while the new pattern introduced in this PR uses the cached `_load_configuration` helper. Additionally, it defines the default path inline rather than using the `CONFIG_PATH` constant.

🔎 Suggested refactor for consistency

```diff
-def load_configuration(file: Path = Path(__file__).parent / "config.toml") -> TomlTable:
-    return tomllib.loads(file.read_text())
+def load_configuration(file: Path = CONFIG_PATH) -> TomlTable:
+    return _load_configuration(file)
```

This improves consistency across the module and leverages the caching benefit of `_load_configuration`.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- `.github/workflows/tests.yml` (1 hunks)
- `.gitignore` (1 hunks)
- `docker-compose.yaml` (3 hunks)
- `docker/database/update.sh` (1 hunks)
- `docker/elasticsearch/.env` (1 hunks)
- `docker/mysql/README.md` (1 hunks)
- `docker/php/.env` (1 hunks)
- `docker/readme.md` (1 hunks)
- `src/config.py` (2 hunks)
- `src/config.toml` (1 hunks)
- `src/core/formatting.py` (2 hunks)
- `src/main.py` (1 hunks)
- `src/routers/openml/flows.py` (1 hunks)
- `src/routers/openml/tasks.py` (2 hunks)
- `src/schemas/datasets/mldcat_ap.py` (1 hunks)
- `src/schemas/flows.py` (2 hunks)
- `tests/conftest.py` (2 hunks)
- `tests/constants.py` (1 hunks)
- `tests/routers/openml/datasets_test.py` (5 hunks)
- `tests/routers/openml/flows_test.py` (1 hunks)
- `tests/routers/openml/migration/datasets_migration_test.py` (3 hunks)
- `tests/routers/openml/migration/flows_migration_test.py` (2 hunks)
- `tests/routers/openml/task_test.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/routers/openml/tasks.py (1)
src/config.py (1)
load_routing_configuration(28-29)
src/core/formatting.py (1)
src/config.py (1)
load_routing_configuration(28-29)
src/routers/openml/flows.py (2)
src/database/flows.py (1)
get_subflows (7-20)
tests/conftest.py (1)
flow(79-89)
🪛 actionlint (1.7.9)
.github/workflows/tests.yml
24-24: shellcheck reported issue in this script: SC2046:warning:1:95: Quote this to prevent word splitting
(shellcheck)
24-24: shellcheck reported issue in this script: SC2126:style:1:168: Consider using 'grep -c' instead of 'grep|wc -l'
(shellcheck)
38-38: shellcheck reported issue in this script: SC2046:warning:1:79: Quote this to prevent word splitting
(shellcheck)
38-38: shellcheck reported issue in this script: SC2126:style:1:152: Consider using 'grep -c' instead of 'grep|wc -l'
(shellcheck)
🪛 dotenv-linter (4.0.0)
docker/php/.env
[warning] 4-4: [UnorderedKey] The DB_HOST_OPENML key should go before the MINIO_URL key
(UnorderedKey)
[warning] 5-5: [UnorderedKey] The DB_HOST_EXPDB key should go before the DB_HOST_OPENML key
(UnorderedKey)
[warning] 6-6: [UnorderedKey] The DB_USER_OPENML key should go before the MINIO_URL key
(UnorderedKey)
[warning] 7-7: [UnorderedKey] The DB_PASS_OPENML key should go before the DB_USER_OPENML key
(UnorderedKey)
[warning] 8-8: [UnorderedKey] The DB_USER_EXPDB_WRITE key should go before the DB_USER_OPENML key
(UnorderedKey)
[warning] 9-9: [UnorderedKey] The DB_PASS_EXPDB_WRITE key should go before the DB_PASS_OPENML key
(UnorderedKey)
[warning] 10-10: [UnorderedKey] The DB_USER_EXPDB_READ key should go before the DB_USER_EXPDB_WRITE key
(UnorderedKey)
[warning] 11-11: [UnorderedKey] The DB_PASS_EXPDB_READ key should go before the DB_PASS_EXPDB_WRITE key
(UnorderedKey)
[warning] 12-12: [UnorderedKey] The ES_URL key should go before the MINIO_URL key
(UnorderedKey)
[warning] 13-13: [UnorderedKey] The ES_PASSWORD key should go before the ES_URL key
(UnorderedKey)
[warning] 14-14: [UnorderedKey] The INDEX_ES_DURING_STARTUP key should go before the MINIO_URL key
(UnorderedKey)
docker/elasticsearch/.env
[warning] 2-2: [IncorrectDelimiter] The discovery.type key has incorrect delimiter
(IncorrectDelimiter)
[warning] 2-2: [LowercaseKey] The discovery.type key should be in uppercase
(LowercaseKey)
[warning] 3-3: [IncorrectDelimiter] The xpack.security.enabled key has incorrect delimiter
(IncorrectDelimiter)
[warning] 3-3: [LowercaseKey] The xpack.security.enabled key should be in uppercase
(LowercaseKey)
🪛 Shellcheck (0.11.0)
docker/database/update.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Sourcery review
- GitHub Check: Sourcery review
- GitHub Check: compare-php
🔇 Additional comments (24)
.gitignore (1)
2-2: LGTM!

Adding `.DS_Store` to the ignore list is a standard best practice. macOS system files should never be tracked in version control.

docker/mysql/README.md (1)
19-19: LGTM!

The concrete MySQL connection example improves the documentation by showing users exactly how to connect to the test database.
docker/elasticsearch/.env (1)
1-3: LGTM! Static analysis warnings are false positives.

The dotenv-linter warnings about uppercase keys and delimiters can be safely ignored. These are Elasticsearch-specific configuration keys that must use the exact case and format shown (e.g., `discovery.type`, `xpack.security.enabled`). The disabled security settings are appropriate for a test/development environment.

docker/readme.md (1)
17-17: LGTM!

The updated link to Docker's official documentation is more concise and maintainable than the previous longer URL.
src/schemas/flows.py (2)
28-28: The recursive Flow reference is well-typed.

Using `Self` for recursive Flow references is the appropriate pattern and makes the schema cleaner than the previous TypedDict approach.
4-4: Structural change to subflows is complete and all consumers have been updated.

The change from `list[Subflow]` (a TypedDict with "identifier" and "flow" keys) to `list[Self]` (direct Flow objects) is confirmed as a breaking API change. Verification shows all production code and tests have been properly updated:
- Router builds subflows recursively via `get_flow()` calls, creating nested Flow objects
- No old dictionary access patterns remain in the codebase
- All unit tests expect and validate the new nested Flow object structure
- Migration test properly handles conversion between old ("component"/"flow" format) and new structure for backward compatibility
The implementation is complete and consistent across the codebase.
tests/routers/openml/task_test.py (1)
22-22: LGTM!

The updated URL correctly reflects the new routing configuration using the internal Docker service name (`http://php-api:80/`) instead of the external production URL. This aligns with the routing config introduced in `src/config.toml`.

src/main.py (1)
24-40: LGTM!Assigning the
add_argumentreturn values to_is a common Python idiom to explicitly indicate these values are intentionally unused. This likely addresses linter warnings while maintaining the same functional behavior.tests/constants.py (1)
2-3: LGTM! Test constants updated to reflect current database state.The removal of dataset IDs 1 and 2 from
IN_PREPARATION_IDandDEACTIVATED_DATASETSrespectively aligns with the test data changes intests/routers/openml/datasets_test.pywhere dataset 1 is now marked as "active" (line 55).tests/routers/openml/migration/datasets_migration_test.py (2)
8-8: LGTM! Import standardization improves consistency.The updated import to use the fully qualified
tests.constantsmodule path aligns with the pattern used across other test files and improves code clarity.Also applies to: 133-133
46-47: Good addition of explanatory comment for URL normalization.The inline comment clearly explains why the port is being removed, documenting Pydantic's HttpURL serialization behavior for future maintainers.
tests/routers/openml/datasets_test.py (2)
48-49: LGTM! Test data updated to reflect routing configuration and database state.

The URL updates to use service names (`http://php-api` and `http://minio:9000`) align with the new routing configuration approach. The status change from "in_preparation" to "active" for dataset 1 is consistent with its removal from `IN_PREPARATION_ID` in `tests/constants.py`.

Also applies to: 55-55
226-226: LGTM! Using constants improves test maintainability.

The use of `next(iter(constants.IN_PREPARATION_ID))` and `next(iter(constants.DEACTIVATED_DATASETS))` correctly extracts the single dataset ID from each set, making tests more maintainable when test data changes.

Also applies to: 236-236, 246-246
tests/conftest.py (1)
19-19: LGTM! Centralized PHP API URL improves maintainability.

Defining `PHP_API_URL` as a module-level constant makes it easier to update the URL in one place and improves consistency across fixtures.

tests/routers/openml/flows_test.py (1)
224-358: LGTM! Subflow structure simplified for better API ergonomics.

The flattened subflow structure (removing the nested `{"identifier": None, "flow": {...}}` wrapper) provides a cleaner, more intuitive API response. This aligns with the backend changes where `get_subflows` now returns Flow objects directly.

.github/workflows/tests.yml (2)
25-26: LGTM! Updated to use the correct Python API container.

The container reference has been updated from a generic name to `openml-python-rest-api`, which aligns with the docker-compose service definitions.

Also applies to: 39-40
18-19: Both `actions/checkout@v6` and `actions/setup-python@v6` are available and widely documented. The checkout@v6 version stores credentials under `$RUNNER_TEMP` instead of the git config, requiring a minimum Actions Runner version of v2.329.0 for compatibility. Verify your runner meets this requirement; otherwise no special compatibility considerations apply.

docker/database/update.sh (1)
5-5: Hardcoded URL should align with routing configuration.

The minio URL `http://minio:9000/` is hardcoded here but also defined in `src/config.toml` as `minio_url`. Consider either documenting this dependency or extracting the URL to avoid drift between the database script and the application config.

tests/routers/openml/migration/flows_migration_test.py (2)
63-72: LGTM on the recursive conversion logic.

The in-place recursive call to `convert_flow_naming_and_defaults(subflow)` correctly handles the new flat Flow structure where subflows are Flow objects instead of dicts with "identifier" and "flow" keys.
78-79: Remove invalid concern about single vs. multiple subflows handling.

Lines 78-79 correctly normalize the PHP API's single-component dict structure by extracting the nested `"flow"` value. The code properly handles the PHP API response format where `component` is a dict with a `"flow"` key, not a list. The iteration over multiple subflows occurs in the new API's `"subflows"` list (line 67), which is distinct from the PHP API's `"component"` field. The `nested_remove_single_element_list` call on line 75 then normalizes single-element lists appropriately. No issue exists with this normalization.

docker-compose.yaml (1)
18-27: LGTM on database-setup service.

The service correctly waits for database to be healthy before running the setup script, and the volume mount provides the script at the expected path.
src/routers/openml/flows.py (1)
52-53: Identifier field exclusion from subflows is intentional.

The identifier field is retrieved from the database in `get_subflows()` but intentionally not included in the Flow response schema. This change is confirmed by the migration test, which explicitly notes the identifier is omitted (similar to how parameter default values are handled). No code in the repository references subflow identifiers, confirming no downstream consumers are affected by this omission.

src/config.py (2)
11-12: LGTM!

Clean introduction of a module-level constant for configuration file path using proper Path handling.
24-26: LGTM!

Excellent use of caching for file I/O operations. The cached helper reduces redundant file reads when the same configuration file is accessed multiple times.
```yaml
healthcheck:
  test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
  start_period: 30s
  start_interval: 5s
  timeout: 3s
  interval: 10s
```
Elasticsearch healthcheck may be too strict for single-node clusters.
The healthcheck only accepts "green" status, but single-node Elasticsearch clusters typically report "yellow" because replicas cannot be allocated. This may cause the healthcheck to fail indefinitely.
Proposed fix to accept both green and yellow
```diff
 healthcheck:
-  test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
+  test: curl -s 127.0.0.1:9200/_cluster/health | grep -E '"status":"(green|yellow)"'
   start_period: 30s
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 healthcheck:
-  test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
+  test: curl -s 127.0.0.1:9200/_cluster/health | grep -E '"status":"(green|yellow)"'
   start_period: 30s
   start_interval: 5s
   timeout: 3s
   interval: 10s
```
🤖 Prompt for AI Agents
In docker-compose.yaml around lines 48 to 53, the healthcheck currently only
accepts "green" which causes single-node Elasticsearch clusters reporting
"yellow" to fail; update the healthcheck command to consider both "green" and
"yellow" as healthy (for example use a silent curl and a regex match for
green|yellow or parse the JSON "status" field and check it is either "green" or
"yellow") so the healthcheck returns success for single-node scenarios while
still failing for red.
```diff
@@ -0,0 +1,31 @@
+#/bin/bash
```
Fix the shebang syntax.
The shebang is missing the ! character. This will cause the script to not execute with bash semantics and may fail or behave unexpectedly.
Proposed fix
```diff
-#/bin/bash
+#!/bin/bash
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-#/bin/bash
+#!/bin/bash
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
🤖 Prompt for AI Agents
In docker/database/update.sh around line 1, the shebang is written as
"#/bin/bash" and is missing the "!" — replace the incorrect shebang with a valid
bash shebang (#!/bin/bash) on the first line and ensure the script is executable
(chmod +x) so it runs with bash semantics.
```
API_KEY=AD000000000000000000000000000000
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=database:3306
DB_HOST_EXPDB=database:3306
DB_USER_OPENML=root
DB_PASS_OPENML=ok
DB_USER_EXPDB_WRITE=root
DB_PASS_EXPDB_WRITE=ok
DB_USER_EXPDB_READ=root
DB_PASS_EXPDB_READ=ok
ES_URL=elasticsearch:9200
ES_PASSWORD=default
INDEX_ES_DURING_STARTUP=false
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check if .env files are properly ignored
echo "=== Checking .gitignore for .env patterns ==="
if [ -f .gitignore ]; then
  grep -n "\.env" .gitignore || echo "No .env patterns found in .gitignore"
else
  echo ".gitignore not found"
fi
echo -e "\n=== Checking if this .env file is tracked by git ==="
git ls-files docker/php/.env && echo "WARNING: .env file is tracked!" || echo "OK: .env file is not tracked"
echo -e "\n=== Searching for .env.example or similar template files ==="
find . -name ".env*" -type f | grep -v ".git" | head -20
echo -e "\n=== Checking git status of docker/php/.env ==="
git status docker/php/.env 2>/dev/null || echo "File not found or not in git"
```

Repository: openml/server-api
Length of output: 438
🏁 Script executed:
```bash
# Check the .gitignore content around line 126 to see the pattern
echo "=== .gitignore content around line 126 ==="
sed -n '120,130p' .gitignore
echo -e "\n=== Full .env patterns in .gitignore ==="
grep -n "env" .gitignore
echo -e "\n=== Check if docker directory has any .gitignore ==="
cat docker/.gitignore 2>/dev/null || echo "No docker/.gitignore"
echo -e "\n=== Check if docker/php has any .gitignore ==="
cat docker/php/.gitignore 2>/dev/null || echo "No docker/php/.gitignore"
```

Repository: openml/server-api
Length of output: 745
Remove docker/php/.env from version control and create a template file.
This file contains sensitive credentials (API keys, database passwords) and is currently tracked by git despite having .env in .gitignore. The pattern .env only matches files at the repository root, not docker/php/.env in subdirectories.
Actions needed:
- Update .gitignore to use `**/.env` or add an explicit `docker/**/.env` to catch all .env files
- Create `docker/php/.env.example` with placeholder values for each variable (a sketch follows this list)
- Remove the committed file from git history: `git rm --cached docker/php/.env`
- Document the setup process in your README with instructions to copy the example file and populate real values
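A possible `docker/php/.env.example`, keeping the non-secret service addresses from the committed file and swapping credentials for placeholders (the placeholder strings are illustrative, not prescribed):

```
API_KEY=<your-api-key>
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=database:3306
DB_HOST_EXPDB=database:3306
DB_USER_OPENML=<openml-db-user>
DB_PASS_OPENML=<openml-db-password>
DB_USER_EXPDB_WRITE=<expdb-write-user>
DB_PASS_EXPDB_WRITE=<expdb-write-password>
DB_USER_EXPDB_READ=<expdb-read-user>
DB_PASS_EXPDB_READ=<expdb-read-password>
ES_URL=elasticsearch:9200
ES_PASSWORD=<elasticsearch-password>
INDEX_ES_DURING_STARTUP=false
```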
🧰 Tools
🪛 dotenv-linter (4.0.0)
[warning] 4-4: [UnorderedKey] The DB_HOST_OPENML key should go before the MINIO_URL key
(UnorderedKey)
[warning] 5-5: [UnorderedKey] The DB_HOST_EXPDB key should go before the DB_HOST_OPENML key
(UnorderedKey)
[warning] 6-6: [UnorderedKey] The DB_USER_OPENML key should go before the MINIO_URL key
(UnorderedKey)
[warning] 7-7: [UnorderedKey] The DB_PASS_OPENML key should go before the DB_USER_OPENML key
(UnorderedKey)
[warning] 8-8: [UnorderedKey] The DB_USER_EXPDB_WRITE key should go before the DB_USER_OPENML key
(UnorderedKey)
[warning] 9-9: [UnorderedKey] The DB_PASS_EXPDB_WRITE key should go before the DB_PASS_OPENML key
(UnorderedKey)
[warning] 10-10: [UnorderedKey] The DB_USER_EXPDB_READ key should go before the DB_USER_EXPDB_WRITE key
(UnorderedKey)
[warning] 11-11: [UnorderedKey] The DB_PASS_EXPDB_READ key should go before the DB_PASS_EXPDB_WRITE key
(UnorderedKey)
[warning] 12-12: [UnorderedKey] The ES_URL key should go before the MINIO_URL key
(UnorderedKey)
[warning] 13-13: [UnorderedKey] The ES_PASSWORD key should go before the ES_URL key
(UnorderedKey)
[warning] 14-14: [UnorderedKey] The INDEX_ES_DURING_STARTUP key should go before the MINIO_URL key
(UnorderedKey)
🤖 Prompt for AI Agents
In docker/php/.env lines 1-14: this tracked .env contains secrets and must be
removed from VCS and replaced with a template; update .gitignore to match env
files in subdirectories (add pattern **/.env or docker/**/.env), create
docker/php/.env.example with placeholder values for each variable shown, run git
rm --cached docker/php/.env and commit the removal, and if you need to purge
secrets from repository history use a history-rewrite tool (git filter-repo or
BFG) to remove docker/php/.env from past commits; finally add README
instructions telling developers to copy docker/php/.env.example to
docker/php/.env and populate real values.
```toml
[routing]
minio_url="http://minio:9000/"
server_url="http://php-api:80/"
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find Python files that might load configuration
find . -name "*.py" -type f | head -20
# Search for toml loading patterns
rg -n "toml|config" --type=py -i | head -30
# Check the src/config.toml file
if [ -f "src/config.toml" ]; then
  cat -n src/config.toml
fi
# Search for environment variable patterns in Python files
rg -n "environ|getenv|os.env" --type=py | head -20
```

Repository: openml/server-api
Length of output: 4101
Routing configuration lacks environment variable override support.
The routing URLs use Docker service names (`minio:9000`, `php-api:80`), which are appropriate for containerized environments. However, unlike the database configuration, which supports environment variable overrides (e.g., `OPENML_DATABASES_OPENML_USERNAME`), the `load_routing_configuration()` function does not implement similar environment variable override support for the routing URLs. For non-containerized deployments (local development, production), consider adding environment variable override support for `minio_url` and `server_url` to the configuration loading logic.
🤖 Prompt for AI Agents
In src/config.toml around lines 19–21 and the corresponding
load_routing_configuration() loader, the routing block hardcodes minio_url and
server_url and lacks environment variable override support; update the loader to
check for environment variables (e.g. OPENML_ROUTING_MINIO_URL and
OPENML_ROUTING_SERVER_URL) and, if present and non-empty, use their values
instead of the toml values, validating basic URL format and preserving the toml
defaults when the env vars are absent or invalid so non-containerized
deployments can override these endpoints easily.
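A minimal sketch of such an override, using the hypothetical `OPENML_ROUTING_*` variable names suggested in the prompt above; `TomlTable` and `CONFIG_PATH` are written out as stand-ins for the project's own definitions in `src/config.py` (untested):

```python
import os
import tomllib
import typing
from pathlib import Path

TomlTable = dict[str, typing.Any]  # stand-in for the project's alias
CONFIG_PATH = Path(__file__).parent / "config.toml"


def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    routing = dict(tomllib.loads(file.read_text())["routing"])
    # Hypothetical env vars; the toml values remain the defaults.
    for key, env_var in (
        ("minio_url", "OPENML_ROUTING_MINIO_URL"),
        ("server_url", "OPENML_ROUTING_SERVER_URL"),
    ):
        if value := os.environ.get(env_var):
            routing[key] = value
    return routing
```

Under this scheme, containerized deployments keep the toml defaults, while a local run can simply export the two variables.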
```python
flow_rows = database.flows.get_subflows(flow_id, expdb)
subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
```
Potential infinite recursion with circular subflow references.
The recursive call to `get_flow` for each subflow has no cycle detection. If the database contains a circular reference (flow A → flow B → flow A), this will cause infinite recursion and a stack overflow.
Proposed fix with cycle detection
```diff
 @router.get("/{flow_id}")
-def get_flow(flow_id: int, expdb: Annotated[Connection, Depends(expdb_connection)] = None) -> Flow:
+def get_flow(
+    flow_id: int,
+    expdb: Annotated[Connection, Depends(expdb_connection)] = None,
+    _visited: set[int] | None = None,
+) -> Flow:
+    if _visited is None:
+        _visited = set()
+    if flow_id in _visited:
+        raise HTTPException(status_code=HTTPStatus.INTERNAL_SERVER_ERROR, detail="Circular subflow reference detected")
+    _visited.add(flow_id)
+
     flow = database.flows.get(flow_id, expdb)
     ...
     flow_rows = database.flows.get_subflows(flow_id, expdb)
-    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
+    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb, _visited=_visited.copy()) for flow in flow_rows]
```
🤖 Prompt for AI Agents
In src/routers/openml/flows.py around lines 52 to 53, the list comprehension
calls get_flow recursively for each subflow with no cycle detection which can
cause infinite recursion for circular subflow references; update the code to
track visited flow IDs (e.g., pass a visited set or list into get_flow or
maintain one in the caller), check each child_id against visited before
recursing, skip or return a placeholder for already-seen IDs (and optionally log
a warning), and ensure the visited set is copied or restored appropriately so
sibling branches are handled correctly to prevent stack overflow while
preserving the existing return shape.
```python
def php_api() -> httpx.Client:
    with httpx.Client(base_url=PHP_API_URL) as client:
        yield client
```
Correct the return type annotation for the generator fixture.
The `php_api` fixture uses `yield`, making it a generator function. The return type should be `Iterator[httpx.Client]` rather than `httpx.Client`.
🔎 Proposed fix
```diff
 @pytest.fixture
-def php_api() -> httpx.Client:
+def php_api() -> Iterator[httpx.Client]:
     with httpx.Client(base_url=PHP_API_URL) as client:
         yield client
```
🤖 Prompt for AI Agents
In tests/conftest.py around lines 44 to 46, the php_api fixture is a generator
that yields an httpx.Client but is annotated with return type httpx.Client;
change the annotation to Iterator[httpx.Client] (import Iterator from typing) so
the fixture's type matches its generator behavior and update imports
accordingly.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #218   +/-   ##
=======================================
  Coverage        ?   76.01%
=======================================
  Files           ?       51
  Lines           ?     1851
  Branches        ?      144
=======================================
  Hits            ?     1407
  Misses          ?      408
  Partials        ?       36
```

☔ View full report in Codecov by Sentry.
I played myself. Applying my commits to main instead and then closing the dev branch.
Summary by Sourcery
Align routing, URLs, and flow schemas with the legacy PHP API and make the Docker/CI setup spin up a complete test stack for Python and PHP services.