30 changes: 11 additions & 19 deletions .github/workflows/tests.yml
@@ -15,37 +15,29 @@ jobs:
compare-php:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: 3.x

# A naive `docker compose up` would first build the `python-api` container and then
# start all services, which kickstarts Elastic Search and building indices.
# But since those two steps are independent, we can parallelize them to save time.
- run: |
docker compose build python-api
docker compose up -d --wait python-api php-api
- run: docker container ls && docker image ls
- run: docker exec python-api python -m pip freeze
- run: docker exec python-api coverage run -m pytest -xv -m "php_api"
- run: docker exec python-api coverage xml
# https://github.com/docker/compose/issues/10596
- run: docker compose --profile "python" --profile "php" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
- run: docker exec openml-python-rest-api coverage run -m pytest -v -m "php_api"
- run: docker exec openml-python-rest-api coverage xml
- name: Upload results to Codecov
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
python:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: 3.x
- run: docker compose up -d --wait database python-api
- run: docker container ls && docker image ls
- run: docker exec python-api python -m pip freeze
- run: docker exec python-api coverage run -m pytest -xv -m "not php_api"
- run: docker exec python-api coverage xml
- run: docker compose --profile "python" up --detach --wait --remove-orphans || exit $(docker compose ps -q | xargs docker inspect -f '{{.State.ExitCode}}' | grep -v '^0' | wc -l)
- run: docker exec openml-python-rest-api coverage run -m pytest -v -m "not php_api"
- run: docker exec openml-python-rest-api coverage xml
- name: Upload results to Codecov
uses: codecov/codecov-action@v4
with:
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
docker/mysql/data
.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
71 changes: 51 additions & 20 deletions docker-compose.yaml
@@ -1,13 +1,33 @@
services:
database:
image: "openml/test-database"
profiles: ["python", "php", "all"]
image: "openml/test-database:20240105"
container_name: "openml-test-database"
environment:
MYSQL_ROOT_PASSWORD: ok
ports:
- "3306:3306"
healthcheck:
test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
start_period: 30s
start_interval: 1s
timeout: 3s
interval: 5s
retries: 10
Comment on lines +13 to +16
issue (bug_risk): The start_interval healthcheck key is only honored by Docker Engine 25.0+ and recent Compose releases; older engines silently ignore it.

On engines that predate 25.0, healthchecks only support test, interval, timeout, retries, and start_period, so start_interval has no effect there. If older engines must be supported, control check frequency with interval (and optionally start_period) instead, and remove start_interval here and in the Elasticsearch service.


database-setup:
profiles: ["python", "php", "all"]
image: mysql
container_name: "openml-test-database-setup"
volumes:
- ./docker/database/update.sh:/database-update.sh
command: /bin/sh -c "/database-update.sh"
depends_on:
database:
condition: service_healthy

docs:
profiles: ["all"]
build:
context: .
dockerfile: docker/docs/Dockerfile
@@ -16,8 +36,35 @@ services:
volumes:
- .:/docs

elasticsearch:
profiles: ["php", "all"]
image: docker.elastic.co/elasticsearch/elasticsearch:6.8.23
container_name: "openml-elasticsearch"
platform: "linux/amd64"
ports:
- "9200:9200" # also known as /es (nginx)
- "9300:9300"
env_file: docker/elasticsearch/.env
healthcheck:
test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
start_period: 30s
start_interval: 5s
timeout: 3s
interval: 10s
Comment on lines +51 to +53
issue (bug_risk): Same start_interval caveat applies to the Elasticsearch healthcheck configuration.

As with the database service, start_interval is only honored on Docker Engine 25.0+ and is ignored on older engines; there, rely on start_period together with interval, or just adjust interval.

Comment on lines +48 to +53
⚠️ Potential issue | 🟡 Minor

Elasticsearch healthcheck may be too strict for single-node clusters.

The healthcheck only accepts "green" status, but single-node Elasticsearch clusters typically report "yellow" because replicas cannot be allocated. This may cause the healthcheck to fail indefinitely.

Proposed fix to accept both green and yellow
     healthcheck:
-      test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
+      test: curl -s 127.0.0.1:9200/_cluster/health | grep -E '"status":"(green|yellow)"'
       start_period: 30s

deploy:
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.2'
memory: 250M

php-api:
image: "openml/php-rest-api"
profiles: ["php", "all"]
image: "openml/php-rest-api:v1.2.2"
container_name: "openml-php-rest-api"
env_file: docker/php/.env
ports:
- "8002:80"
depends_on:
@@ -33,7 +80,8 @@ services:
interval: 1m

python-api:
container_name: "python-api"
profiles: ["python", "all"]
container_name: "openml-python-rest-api"
build:
context: .
dockerfile: docker/python/Dockerfile
@@ -43,20 +91,3 @@ services:
- .:/python-api
depends_on:
- database

elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:6.8.23
container_name: "elasticsearch"
ports:
- "9200:9200"
- "9300:9300"
environment:
- ELASTIC_PASSWORD=default
- discovery.type=single-node
- xpack.security.enabled=false
healthcheck:
test: curl 127.0.0.1:9200/_cluster/health | grep -e "green"
start_period: 30s
start_interval: 5s
timeout: 3s
interval: 1m
31 changes: 31 additions & 0 deletions docker/database/update.sh
@@ -0,0 +1,31 @@
#/bin/bash
⚠️ Potential issue | 🔴 Critical

Fix the shebang syntax.

The shebang is missing the ! character. This will cause the script to not execute with bash semantics and may fail or behave unexpectedly.

Proposed fix
-#/bin/bash
+#!/bin/bash
🧰 Tools
🪛 Shellcheck (0.11.0): [error] SC1113 (line 1) — Use #!, not just #, for the shebang.


# Change the filepath of openml.file
# from "https://www.openml.org/data/download/1666876/phpFsFYVN"
# to "http://minio:9000/datasets/0000/0001/phpFsFYVN"
mysql -hdatabase -uroot -pok -e 'UPDATE openml.file SET filepath = CONCAT("http://minio:9000/datasets/0000/", LPAD(id, 4, "0"), "/", SUBSTRING_INDEX(filepath, "/", -1)) WHERE extension="arff";'

# Update openml.expdb.dataset with the same url
mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
Comment on lines +5 to +8
issue (bug_risk): The MinIO dataset URL pattern here is inconsistent with the application’s routing logic and will fail for larger dataset IDs.

In src/core/formatting.py, dataset URLs use a computed ten-thousands prefix (dataset.did // 10_000:04d) plus a padded ID. This SQL instead hardcodes datasets/0000/ and only pads id to 4 digits, so id >= 10000 will map to the wrong path. If this script should work for all datasets, please compute both the prefix and padded ID in SQL (e.g., using integer division and LPAD) to match the Python logic.
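
For reference, a minimal Python sketch (not part of this PR) of the prefix logic in src/core/formatting.py that the SQL would need to reproduce; the function name and the SQL fragment in the comments are illustrative only, and the SQL is an untested suggestion:

```python
def minio_dataset_prefix(did: int) -> str:
    """Directory prefix as computed by _format_parquet_url: ten-thousands bucket + padded id."""
    ten_thousands_prefix = f"{did // 10_000:04d}"  # 0-9999 -> "0000", 10000-19999 -> "0001", ...
    padded_id = f"{did:04d}"                       # never truncates; only left-pads short ids
    return f"datasets/{ten_thousands_prefix}/{padded_id}"

# The hardcoded "datasets/0000/" in update.sh only matches the first bucket, and
# MySQL's LPAD(id, 4, "0") additionally truncates ids longer than four digits:
assert minio_dataset_prefix(1) == "datasets/0000/0001"       # agrees with the current SQL
assert minio_dataset_prefix(12345) == "datasets/0001/12345"  # the SQL would emit datasets/0000/1234

# A possible (untested) SQL expression that mirrors the Python logic:
#   CONCAT("http://minio:9000/datasets/", LPAD(FLOOR(id / 10000), 4, "0"), "/",
#          LPAD(id, GREATEST(CHAR_LENGTH(id), 4), "0"), "/", SUBSTRING_INDEX(filepath, "/", -1))
```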






# Create the data_feature_description TABLE. TODO: can we make sure this table exists already?
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'CREATE TABLE IF NOT EXISTS `data_feature_description` (
`did` int unsigned NOT NULL,
`index` int unsigned NOT NULL,
`uploader` mediumint unsigned NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`description_type` enum("plain", "ontology") NOT NULL,
`value` varchar(256) NOT NULL,
KEY `did` (`did`,`index`),
CONSTRAINT `data_feature_description_ibfk_1` FOREIGN KEY (`did`, `index`) REFERENCES `data_feature` (`did`, `index`) ON DELETE CASCADE ON UPDATE CASCADE
)'

# SET dataset 1 to active (used in unittests java)
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'INSERT IGNORE INTO dataset_status VALUES (1, "active", "2024-01-01 00:00:00", 1)'
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'DELETE FROM dataset_status WHERE did = 2 AND status = "deactivated";'

# Temporary fix in case the database missed the kaggle table. The PHP Rest API expects the table to be there, while indexing.
mysql -hdatabase -uroot -pok -Dopenml_expdb -e 'CREATE TABLE IF NOT EXISTS `kaggle` (`dataset_id` int(11) DEFAULT NULL, `kaggle_link` varchar(500) DEFAULT NULL)'
3 changes: 3 additions & 0 deletions docker/elasticsearch/.env
@@ -0,0 +1,3 @@
ELASTIC_PASSWORD=default
discovery.type=single-node
xpack.security.enabled=false
2 changes: 1 addition & 1 deletion docker/mysql/README.md
@@ -16,7 +16,7 @@ which sets:

You should be able to connect to it using `mysql`:
```bash

mysql --host 127.0.0.1 --port 3306 -uroot -pok
```
If you do not have `mysql` installed, you may refer to the MySQL image documentation on
how to use the image instead to connect over a docker network if you want to connect
14 changes: 14 additions & 0 deletions docker/php/.env
@@ -0,0 +1,14 @@
API_KEY=AD000000000000000000000000000000
BASE_URL=http://php-api:80/
MINIO_URL=http://minio:9000/
DB_HOST_OPENML=database:3306
DB_HOST_EXPDB=database:3306
DB_USER_OPENML=root
DB_PASS_OPENML=ok
DB_USER_EXPDB_WRITE=root
DB_PASS_EXPDB_WRITE=ok
DB_USER_EXPDB_READ=root
DB_PASS_EXPDB_READ=ok
ES_URL=elasticsearch:9200
ES_PASSWORD=default
INDEX_ES_DURING_STARTUP=false
Comment on lines +1 to +14
⚠️ Potential issue | 🔴 Critical

Remove docker/php/.env from version control and create a template file.

This file contains sensitive credentials (API keys, database passwords) and is tracked by git even though .env is listed in .gitignore — ignore rules only apply to untracked files, so they have no effect on a file that has already been committed.

Actions needed:

  1. Make the intent explicit for nested env files by adding docker/**/.env to .gitignore (the root .env pattern already matches subdirectories, but only for untracked files)
  2. Create docker/php/.env.example with placeholder values for each variable
  3. Remove the committed file from the index with git rm --cached docker/php/.env, and purge it from history (git filter-repo or BFG) if the credentials are real
  4. Document the setup process in your README with instructions to copy the example file and populate real values
🧰 Tools
🪛 dotenv-linter (4.0.0) — UnorderedKey warnings:

[warning] 4: DB_HOST_OPENML should go before MINIO_URL
[warning] 5: DB_HOST_EXPDB should go before DB_HOST_OPENML
[warning] 6: DB_USER_OPENML should go before MINIO_URL
[warning] 7: DB_PASS_OPENML should go before DB_USER_OPENML
[warning] 8: DB_USER_EXPDB_WRITE should go before DB_USER_OPENML
[warning] 9: DB_PASS_EXPDB_WRITE should go before DB_PASS_OPENML
[warning] 10: DB_USER_EXPDB_READ should go before DB_USER_EXPDB_WRITE
[warning] 11: DB_PASS_EXPDB_READ should go before DB_PASS_EXPDB_WRITE
[warning] 12: ES_URL should go before MINIO_URL
[warning] 13: ES_PASSWORD should go before ES_URL
[warning] 14: INDEX_ES_DURING_STARTUP should go before MINIO_URL

27 changes: 1 addition & 26 deletions docker/readme.md
@@ -7,39 +7,14 @@ This directory contains the files and information to build the following 5 image
- docs: the official [mkdocs-material](https://hub.docker.com/r/squidfunk/mkdocs-material)
image but with additional plugins installed required for building the documentation
in this project's `/doc` directory.
- [openml/php-rest-api](https://hub.docker.com/r/openml/php-rest-api): image with the
php back-end code, but ran on [feature/elasticsearch8](https://github.com/openml/openml/tree/feature/elasticsearch8)
branch.
- python-api: an image of this project, to facilitate development on any platform.
- [openml/elasticsearch8-prebuilt](https://hub.docker.com/r/openml/elasticsearch8-prebuilt):
the default elasticsearch image, but with indices already built on the test database
through invocation of the old php code.

Between the prebuilt indices and the baked-in database, when all images have already been
pulled, a `docker compose up` step should only take seconds. 🚀

## Building `openml/elasticsearch8-prebuilt`
The `openml/elasticsearch8-prebuilt` is not made with a Dockerfile, because it requires
steps of running containers, which to the best of my knowledge is not facilitated by
docker (not even through [multi-stage builds](https://docs.docker.com/build/building/multi-stage/)).
So, instead we build the container state locally and then use [`docker commit`](https://docs.docker.com/engine/reference/commandline/commit/).

1. run `docker compose up`, but with the `elasticsearch` service pointing to
`docker.elastic.co/elasticsearch/elasticsearch:8.10.4` instead of `openml/elasticsearch8-prebuilt`.
2. build the indices from the `php-api` container:

1. Connect to the container: `docker exec -it server-api-php-api-1 /bin/bash`
2. (optional) Edit `/var/www/openml/index.php` and set L56 to `development` instead of `production`,
this will show progress of building the indices, or print out any error that may occur.
3. Build the indices: `php /var/www/openml/index.php cron build_es_indices`
4. Exit the container with `exit`.

3. Make a commit of the elastic search container with prebuilt indices: `docker commit elasticsearch openml/elasticsearch8-prebuilt`
4. Push the image created by the commit: `docker push openml/elasticsearch8-prebuilt`

## Building for multiple platforms

Following Docker's "[multi-platform images](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiTutyczsOCAxUUhv0HHe_VA6QQFnoECBAQAQ&url=https%3A%2F%2Fdocs.docker.com%2Fbuild%2Fbuilding%2Fmulti-platform%2F&usg=AOvVaw0YP_mkj5WTYD-0weEfrfDv&opi=89978449)"
Following Docker's "[multi-platform images](https://docs.docker.com/build/building/multi-platform/)"
documentation, we can build multi-platform images in a few simple steps:

1. Only the first time, create a docker-container driver: `docker buildx create --name container --driver=docker-container`
14 changes: 12 additions & 2 deletions src/config.py
@@ -8,6 +8,8 @@

TomlTable = dict[str, typing.Any]

CONFIG_PATH = Path(__file__).parent / "config.toml"


def _apply_defaults_to_siblings(configuration: TomlTable) -> TomlTable:
defaults = configuration["defaults"]
@@ -19,9 +19,17 @@ def _apply_defaults_to_siblings(configuration: TomlTable) -> TomlTable:


@functools.cache
def load_database_configuration(file: Path = Path(__file__).parent / "config.toml") -> TomlTable:
configuration = tomllib.loads(file.read_text())
def _load_configuration(file: Path) -> TomlTable:
return typing.cast(TomlTable, tomllib.loads(file.read_text()))


def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
return typing.cast(TomlTable, _load_configuration(file)["routing"])


@functools.cache
def load_database_configuration(file: Path = CONFIG_PATH) -> TomlTable:
configuration = _load_configuration(file)
database_configuration = _apply_defaults_to_siblings(
configuration["databases"],
)
4 changes: 4 additions & 0 deletions src/config.toml
@@ -15,3 +15,7 @@ database="openml_expdb"

[databases.openml]
database="openml"

[routing]
minio_url="http://minio:9000/"
server_url="http://php-api:80/"
Comment on lines +19 to +21
⚠️ Potential issue | 🔴 Critical


Routing configuration lacks environment variable override support.

The routing URLs use Docker service names (minio:9000, php-api:80), which are appropriate for containerized environments. However, unlike the database configuration which supports environment variable overrides (e.g., OPENML_DATABASES_OPENML_USERNAME), the load_routing_configuration() function does not implement similar environment variable override support for the routing URLs. For non-containerized deployments (local development, production), consider adding environment variable override support for minio_url and server_url to the configuration loading logic.
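
One possible shape for such an override, reusing the helpers this PR already defines in src/config.py (CONFIG_PATH, _load_configuration, TomlTable); the OPENML_ROUTING_* variable names are purely illustrative, not an existing convention in this repository, and no URL validation is attempted:

```python
import os
import typing
from pathlib import Path

# Sketch only: assumes it lives in src/config.py next to CONFIG_PATH,
# _load_configuration and TomlTable as introduced in this diff.
def load_routing_configuration(file: Path = CONFIG_PATH) -> TomlTable:
    routing = typing.cast(TomlTable, _load_configuration(file)["routing"])
    for key in ("minio_url", "server_url"):
        override = os.environ.get(f"OPENML_ROUTING_{key.upper()}")
        if override:  # fall back to config.toml when unset or empty
            routing[key] = override
    return routing
```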


13 changes: 7 additions & 6 deletions src/core/formatting.py
@@ -2,7 +2,7 @@

from sqlalchemy.engine import Row

from config import load_configuration
from config import load_routing_configuration
from core.errors import DatasetError
from schemas.datasets.openml import DatasetFileFormat

@@ -25,15 +25,16 @@ def _format_parquet_url(dataset: Row) -> str | None:
if dataset.format.lower() != DatasetFileFormat.ARFF:
return None

minio_base_url = load_configuration()["minio_base_url"]
prefix = dataset.did // 10_000
return f"{minio_base_url}/datasets/{prefix:04d}/{dataset.did:04d}/dataset_{dataset.did}.pq"
minio_base_url = load_routing_configuration()["minio_url"]
ten_thousands_prefix = f"{dataset.did // 10_000:04d}"
padded_id = f"{dataset.did:04d}"
return f"{minio_base_url}datasets/{ten_thousands_prefix}/{padded_id}/dataset_{dataset.did}.pq"


def _format_dataset_url(dataset: Row) -> str:
base_url = load_configuration()["arff_base_url"]
base_url = load_routing_configuration()["server_url"]
filename = f"{html.escape(dataset.name)}.{dataset.format.lower()}"
return f"{base_url}/data/v1/download/{dataset.file_id}/{filename}"
return f"{base_url}data/v1/download/{dataset.file_id}/{filename}"


def _safe_unquote(text: str | None) -> str | None:
6 changes: 3 additions & 3 deletions src/main.py
@@ -21,18 +21,18 @@ def _parse_args() -> argparse.Namespace:
"uvicorn",
"arguments forwarded to uvicorn",
)
uvicorn_options.add_argument(
_ = uvicorn_options.add_argument(
"--reload",
action="store_true",
help="Enable auto-reload",
)
uvicorn_options.add_argument(
_ = uvicorn_options.add_argument(
"--host",
default="127.0.0.1",
type=str,
help="Bind socket to this host.",
)
uvicorn_options.add_argument(
_ = uvicorn_options.add_argument(
"--port",
default=8000,
type=int,
10 changes: 2 additions & 8 deletions src/routers/openml/flows.py
@@ -49,14 +49,8 @@ def get_flow(flow_id: int, expdb: Annotated[Connection, Depends(expdb_connection
]

tags = database.flows.get_tags(flow_id, expdb)
flow_rows = database.flows.get_subflows(for_flow=flow_id, expdb=expdb)
subflows = [
{
"identifier": flow.identifier,
"flow": get_flow(flow_id=flow.child_id, expdb=expdb),
}
for flow in flow_rows
]
flow_rows = database.flows.get_subflows(flow_id, expdb)
subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
Comment on lines +52 to +53
⚠️ Potential issue | 🟠 Major

Potential infinite recursion with circular subflow references.

The recursive call to get_flow for each subflow has no cycle detection. If the database contains a circular reference (flow A → flow B → flow A), this will cause infinite recursion and a stack overflow.

Proposed fix with cycle detection
 @router.get("/{flow_id}")
-def get_flow(flow_id: int, expdb: Annotated[Connection, Depends(expdb_connection)] = None) -> Flow:
+def get_flow(
+    flow_id: int,
+    expdb: Annotated[Connection, Depends(expdb_connection)] = None,
+    _visited: set[int] | None = None,
+) -> Flow:
+    if _visited is None:
+        _visited = set()
+    if flow_id in _visited:
+        raise HTTPException(status_code=HTTPStatus.INTERNAL_SERVER_ERROR, detail="Circular subflow reference detected")
+    _visited.add(flow_id)
+
     flow = database.flows.get(flow_id, expdb)
     ...
     flow_rows = database.flows.get_subflows(flow_id, expdb)
-    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb) for flow in flow_rows]
+    subflows = [get_flow(flow_id=flow.child_id, expdb=expdb, _visited=_visited.copy()) for flow in flow_rows]

Committable suggestion skipped: line range outside the PR's diff.
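
A self-contained sketch of the visited-set approach, decoupled from the FastAPI route for clarity; collect_flow_tree and fetch_subflow_ids are illustrative stand-ins for get_flow and database.flows.get_subflows, not code from this PR:

```python
from typing import Callable

def collect_flow_tree(
    flow_id: int,
    fetch_subflow_ids: Callable[[int], list[int]],
    _visited: frozenset[int] = frozenset(),
) -> dict:
    """Build a nested flow structure, refusing to follow circular references."""
    if flow_id in _visited:
        raise ValueError(f"circular subflow reference detected at flow {flow_id}")
    seen = _visited | {flow_id}  # new frozenset keeps sibling branches independent
    return {
        "id": flow_id,
        "subflows": [
            collect_flow_tree(child_id, fetch_subflow_ids, seen)
            for child_id in fetch_subflow_ids(flow_id)
        ],
    }

# Example: flow 1 -> flow 2 -> flow 1 raises instead of recursing until stack exhaustion.
edges = {1: [2], 2: [1]}
try:
    collect_flow_tree(1, lambda fid: edges.get(fid, []))
except ValueError as err:
    print(err)  # circular subflow reference detected at flow 1
```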



return Flow(
id_=flow.id,
4 changes: 3 additions & 1 deletion src/routers/openml/tasks.py
@@ -7,6 +7,7 @@
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import Connection, RowMapping, text

import config
import database.datasets
import database.tasks
from routers.dependencies import expdb_connection
@@ -139,7 +140,8 @@ def _fill_json_template(
# I believe that the operations below are always part of string output, so
# we don't need to be careful to avoid losing typedness
template = template.replace("[TASK:id]", str(task.task_id))
return template.replace("[CONSTANT:base_url]", "https://test.openml.org/")
server_url = config.load_routing_configuration()["server_url"]
return template.replace("[CONSTANT:base_url]", server_url)


@router.get("/{task_id}")
2 changes: 1 addition & 1 deletion src/schemas/datasets/mldcat_ap.py
@@ -275,7 +275,7 @@ class DataService(JsonLDObject):


class JsonLDGraph(BaseModel):
context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context")
context: str | dict[str, HttpUrl] = Field(default_factory=dict, serialization_alias="@context") # type: ignore[arg-type]
graph: list[Distribution | DataService | Dataset | Quality | Feature | Agent | MD5Checksum] = (
Field(default_factory=list, serialization_alias="@graph")
)