From ab922a23de90368d024fb33a9385c8373d658beb Mon Sep 17 00:00:00 2001
From: Debanjan Maji
Date: Mon, 17 Nov 2025 08:48:59 +0000
Subject: [PATCH] Increase PHP upload/post limits to 5G and extend timeout; add
 docs for large dataset uploads; Fixes #12223

---
 LARGE_DATASET_UPLOAD_FIX.md | 265 ++++++++++++++++++++++++++++++++++++
 QUICK_FIX_OVERFLOW_ERROR.md |  79 +++++++++++
 docker/README.md            |  11 ++
 docker/config/php.ini       |   6 +-
 4 files changed, 358 insertions(+), 3 deletions(-)
 create mode 100644 LARGE_DATASET_UPLOAD_FIX.md
 create mode 100644 QUICK_FIX_OVERFLOW_ERROR.md

diff --git a/LARGE_DATASET_UPLOAD_FIX.md b/LARGE_DATASET_UPLOAD_FIX.md
new file mode 100644
index 00000000..ebd65c92
--- /dev/null
+++ b/LARGE_DATASET_UPLOAD_FIX.md
@@ -0,0 +1,265 @@
+# Large Dataset Upload Fix
+
+## Problem Summary
+Users attempting to upload large datasets (2.7GB+) were encountering:
+1. **Server-side rejection**: PHP upload limits were too restrictive (2MB max)
+2. **Client-side OverflowError**: Python SSL limitation when sending >2GB as a single buffer
+
+## Server-Side Fix (COMPLETED ✓)
+
+### Changes Made to `/docker/config/php.ini`:
+
+| Setting | Old Value | New Value | Purpose |
+|---------|-----------|-----------|---------|
+| `upload_max_filesize` | 2M | **5G** | Maximum size per uploaded file |
+| `post_max_size` | 8M | **5G** | Maximum total POST request size |
+| `max_execution_time` | 30 | **3600** | Maximum script runtime (1 hour) |
+| `memory_limit` | 16G | 16G | Already sufficient ✓ |
+
+### Deployment Required
+After making these changes, you **must restart** the OpenML Docker container:
+```bash
+docker-compose down
+docker-compose up -d --build
+```
+
+Or if using plain Docker:
+```bash
+docker stop <container-name>
+docker start <container-name>
+```
+
+---
+
+## Client-Side Issue (Still Needs Addressing)
+
+### The OverflowError Explained
+```
+OverflowError: string longer than 2147483647 bytes
+```
+
+**Root cause**: Python's SSL layer uses a signed 32-bit integer for the write buffer length. This limits a single `send()` call to 2,147,483,647 bytes (2^31-1 ≈ 2GB).
+
+**Why it happens**: The `openml-python` client or the `requests` library may be:
+1. Reading the entire 2.7GB file into memory as one bytes object
+2. Building the entire multipart POST body in memory
+3. Attempting to send it in one SSL write operation
+
+### Solutions for Client-Side
+
+#### Option 1: Stream the Upload (RECOMMENDED)
+Modify how the file is passed to the OpenML client. Instead of:
+```python
+# BAD - loads entire file into memory
+with open('dataset.arff', 'rb') as f:
+    data = f.read()  # 2.7GB in RAM!
+    openml_dataset.publish()  # triggers OverflowError
+```
+
+Use streaming (requires patching openml-python or using direct requests):
+```python
+# GOOD - streams in chunks
+import requests
+
+with open('dataset.arff', 'rb') as f:
+    files = {'dataset': ('dataset.arff', f)}  # Pass the file handle, not bytes
+    response = requests.post(
+        'https://openml.org/api/v1/data',
+        files=files,
+        data={'api_key': 'YOUR_KEY', 'description': xml_description}  # xml_description as in the workflow below
+    )
+```
+
+**Note**: If `openml-python` internally calls `f.read()`, you'll need to patch it or use Option 2/3.
+
+#### Option 2: Compress Before Upload
+Reduce the file size below 2GB:
+```bash
+# ARFF supports gzip compression
+gzip dataset.arff
+# Result: dataset.arff.gz (often 10-50x smaller for sparse data)
+```
+
+Then upload the `.arff.gz` file. OpenML should accept compressed ARFF.
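+
+The same compression can also be done from a Python publishing script; a minimal sketch (the file paths are placeholders, not part of any OpenML tooling):
+```python
+import gzip
+import shutil
+
+# Stream-compress the ARFF so the whole file is never held in memory at once.
+with open('dataset.arff', 'rb') as src, gzip.open('dataset.arff.gz', 'wb', compresslevel=9) as dst:
+    shutil.copyfileobj(src, dst)  # copies in fixed-size chunks
+```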
+
+#### Option 3: Host Externally and Register by URL
+Upload to a service that handles large files:
+- **Zenodo**: Free, DOI-based, handles 50GB+
+- **AWS S3**: Pay-per-use, unlimited size
+- **Institutional repository**: Check whether your university offers one
+
+Then register the dataset in OpenML by URL:
+```python
+import openml
+
+dataset = openml.datasets.OpenMLDataset(
+    name="My Large Dataset",
+    description="...",
+    url="https://zenodo.org/record/12345/files/dataset.arff.gz",
+    format="arff",
+    version_label="1.0"
+)
+dataset.publish()
+```
+
+#### Option 4: Patch openml-python
+If you control the client environment, patch the library to use streaming:
+
+**File to patch**: `/openml/_api_calls.py`
+
+Find the section that builds `file_elements` and ensure it passes file handles, not bytes:
+```python
+# In _perform_api_call or _read_url_files
+# BEFORE (bad):
+file_data = open(filepath, 'rb').read()  # Loads everything into memory
+file_elements = {'dataset': (filename, file_data)}
+
+# AFTER (good):
+file_handle = open(filepath, 'rb')  # Keep the handle open
+file_elements = {'dataset': (filename, file_handle)}
+```
+
+---
+
+## Testing Your Fix
+
+### Server-Side Test
+1. Check that the PHP configuration is loaded:
+   ```bash
+   docker exec <container-name> php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'
+   ```
+   Should show: `upload_max_filesize => 5G`, `post_max_size => 5G`, `max_execution_time => 3600`
+
+2. Try a test upload via curl:
+   ```bash
+   curl -X POST https://your-openml-server.org/api/v1/data \
+     -F "api_key=YOUR_KEY" \
+     -F "description=@description.xml" \
+     -F "dataset=@test_large_file.arff"
+   ```
+
+### Client-Side Test
+1. Try uploading a 1GB file first (below the 2GB SSL limit)
+2. Monitor memory usage with `htop` or Task Manager
+3. If the upload succeeds, the client is streaming properly
+4. For 2.7GB files, use compression or external hosting
+
+---
+
+## Recommended Workflow for a 2.7GB Dataset
+
+**Best approach combining all solutions:**
+
+1. **Compress the dataset** (reduces transfer time and bypasses the SSL limit):
+   ```bash
+   gzip -9 dataset.arff  # Maximum compression
+   ```
+
+2. **Verify the server config** (already fixed in this repo):
+   - Restart the Docker container to load the new php.ini
+
+3. **Upload via direct HTTP streaming**, bypassing the openml-python client (see also the streaming note after this list):
+   ```python
+   import requests
+
+   api_key = "YOUR_API_KEY"
+   url = "https://openml.org/api/v1/data"
+
+   # Prepare the XML description (oml:data_set_description; see the OpenML API docs for the full schema)
+   xml_desc = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
+     <oml:name>Dataset Name</oml:name>
+     <oml:description>Description here</oml:description>
+     <oml:format>arff</oml:format>
+   </oml:data_set_description>"""
+
+   # Stream upload
+   with open('dataset.arff.gz', 'rb') as f:
+       response = requests.post(
+           url,
+           data={'api_key': api_key, 'description': xml_desc},
+           files={'dataset': ('dataset.arff.gz', f)},
+           timeout=3600  # 1 hour timeout for large uploads
+       )
+
+   print(response.text)
+   ```
+
+4. **Monitor upload progress** (optional):
+   ```python
+   import os
+
+   import requests
+   from tqdm import tqdm
+
+   # Wrapper that reports progress as the file is read during upload
+   class TqdmUploader:
+       def __init__(self, filename):
+           self.filename = filename
+           self.size = os.path.getsize(filename)
+           self.progress = tqdm(total=self.size, unit='B', unit_scale=True)
+
+       def __enter__(self):
+           self.f = open(self.filename, 'rb')
+           return self
+
+       def __exit__(self, *args):
+           self.f.close()
+           self.progress.close()
+
+       def read(self, size=-1):
+           chunk = self.f.read(size)
+           self.progress.update(len(chunk))
+           return chunk
+
+   with TqdmUploader('dataset.arff.gz') as uploader:
+       response = requests.post(url, files={'dataset': uploader}, ...)
+   ```
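+
+**Streaming caveat for step 3**: depending on your `requests` version, passing an open file handle via `files=` may still assemble the entire multipart body in memory before sending, which would hit the same 2GB limit if the compressed file is still large. One hedged workaround is the optional `requests_toolbelt` package, which encodes the multipart body lazily. The sketch below is an assumption about your environment (extra dependency, placeholder paths), not part of the OpenML tooling:
+```python
+import requests
+from requests_toolbelt.multipart.encoder import MultipartEncoder
+
+api_key = "YOUR_API_KEY"
+url = "https://openml.org/api/v1/data"
+xml_desc = "..."  # same XML description as in step 3
+
+# MultipartEncoder reads the file lazily, so no single >2GB buffer is ever built
+encoder = MultipartEncoder(fields={
+    'api_key': api_key,
+    'description': xml_desc,
+    'dataset': ('dataset.arff.gz', open('dataset.arff.gz', 'rb'), 'application/octet-stream'),
+})
+
+response = requests.post(
+    url,
+    data=encoder,
+    headers={'Content-Type': encoder.content_type},
+    timeout=3600,
+)
+print(response.status_code)
+```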
+ ``` + +--- + +## Additional Considerations + +### Web Server Configuration +If you're using **nginx** as a reverse proxy (not present in current setup), also add: +```nginx +client_max_body_size 5G; +proxy_read_timeout 3600s; +``` + +### Network Timeouts +For very large uploads over slow connections: +- **Client timeout**: Set `timeout=7200` in requests (2 hours) +- **Server timeout**: Already set via `max_execution_time = 3600` +- **Load balancer timeout**: Check cloud provider settings (AWS ALB, GCP LB, etc.) + +### Storage Space +Uploading 2.7GB datasets requires adequate disk space: +- **Temporary space**: `/tmp` needs ~2.7GB during upload +- **Final storage**: `DATA_PATH` needs ~2.7GB per dataset +- **Recommend**: 50GB+ free space on server + +### Alternative: Split Dataset +If all else fails, consider splitting into multiple smaller datasets: +```python +# Split dataset into chunks +import pandas as pd + +df = pd.read_csv('dataset.csv') +chunk_size = 1_000_000 # 1M rows per chunk + +for i, start in enumerate(range(0, len(df), chunk_size)): + chunk = df[start:start + chunk_size] + chunk.to_csv(f'dataset_part{i}.arff', index=False, header=True) + # Upload each part separately +``` + +--- + +## Summary + +✅ **Server-side limits fixed** (this repo) +⚠️ **Client-side requires**: +- File compression (easiest) +- Streaming upload (most robust) +- External hosting (most flexible) + +**For your 2.7GB file**: Compress with gzip first, should reduce to <500MB for typical datasets. diff --git a/QUICK_FIX_OVERFLOW_ERROR.md b/QUICK_FIX_OVERFLOW_ERROR.md new file mode 100644 index 00000000..dea08b16 --- /dev/null +++ b/QUICK_FIX_OVERFLOW_ERROR.md @@ -0,0 +1,79 @@ +# Quick Fix: OverflowError on Large Dataset Upload + +## Error You're Seeing +``` +OverflowError: string longer than 2147483647 bytes +``` + +## Immediate Solutions (Pick One) + +### Solution 1: Compress Your Dataset (EASIEST) ⭐ +```bash +gzip -9 your_dataset.arff +``` +This typically reduces file size by 80-95% for sparse datasets. Upload the `.arff.gz` file instead. + +### Solution 2: Use Direct HTTP Upload (MOST RELIABLE) +Replace your `publish_dataset.py` with this: + +```python +import requests +import os + +# Configuration +API_KEY = "your_api_key_here" +DATASET_FILE = "your_dataset.arff" # or .arff.gz +DATASET_NAME = "Your Dataset Name" +DATASET_DESCRIPTION = "Description of your dataset" + +# Create XML description +xml_description = f""" + + {DATASET_NAME} + {DATASET_DESCRIPTION} + arff +""" + +# Upload with streaming (no memory overflow) +print(f"Uploading {DATASET_FILE} ({os.path.getsize(DATASET_FILE) / 1e9:.2f} GB)...") +with open(DATASET_FILE, 'rb') as f: + response = requests.post( + 'https://www.openml.org/api/v1/data', + data={ + 'api_key': API_KEY, + 'description': xml_description + }, + files={'dataset': (os.path.basename(DATASET_FILE), f)}, + timeout=7200 # 2 hour timeout + ) + +print(response.status_code) +print(response.text) +``` + +### Solution 3: Host Externally (BEST FOR VERY LARGE FILES) +1. Upload to Zenodo, Figshare, or S3 +2. Get the permanent URL +3. Register in OpenML: + +```python +import openml + +dataset = openml.datasets.OpenMLDataset( + name="Your Dataset Name", + description="Your description", + url="https://zenodo.org/record/XXXXX/files/dataset.arff.gz", + format="arff" +) +dataset.publish() +``` + +## Why This Happens + +1. **Python limitation**: SSL write buffer cannot exceed 2GB (signed 32-bit int max) +2. **Client bug**: openml-python loads entire file into memory instead of streaming +3. 
+
+## Need More Help?
+
+See [LARGE_DATASET_UPLOAD_FIX.md](./LARGE_DATASET_UPLOAD_FIX.md) for complete details.
diff --git a/docker/README.md b/docker/README.md
index 5454e95e..b254d615 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -26,3 +26,14 @@ Note that the protocol is `http` not `https`.
 ```bash
 docker build --tag openml/php-rest-api -f docker/Dockerfile .
 ```
+
+## Upload Limits
+
+The server is configured to support large dataset uploads:
+- **Maximum upload size**: 5GB per file
+- **Maximum POST size**: 5GB
+- **Execution timeout**: 3600 seconds (1 hour)
+
+These limits are set in `docker/config/php.ini`. If you need to change them, modify the file and rebuild the container.
+
+For uploading very large datasets (>2GB), see [LARGE_DATASET_UPLOAD_FIX.md](../LARGE_DATASET_UPLOAD_FIX.md) for client-side considerations.
diff --git a/docker/config/php.ini b/docker/config/php.ini
index 6f97900d..c5b80e66 100644
--- a/docker/config/php.ini
+++ b/docker/config/php.ini
@@ -416,7 +416,7 @@ expose_php = On
 ; Maximum execution time of each script, in seconds
 ; https://php.net/max-execution-time
 ; Note: This directive is hardcoded to 0 for the CLI SAPI
-max_execution_time = 30
+max_execution_time = 3600
 
 ; Maximum amount of time each script may spend parsing request data. It's a good
 ; idea to limit this time on productions servers in order to eliminate unexpectedly
@@ -710,7 +710,7 @@ auto_globals_jit = On
 ; Its value may be 0 to disable the limit. It is ignored if POST data reading
 ; is disabled through enable_post_data_reading.
 ; https://php.net/post-max-size
-post_max_size = 8M
+post_max_size = 5G
 
 ; Automatically add files before PHP document.
 ; https://php.net/auto-prepend-file
@@ -862,7 +862,7 @@ file_uploads = On
 
 ; Maximum allowed size for uploaded files.
 ; https://php.net/upload-max-filesize
-upload_max_filesize = 2M
+upload_max_filesize = 5G
 
 ; Maximum number of files that can be uploaded via a single request
 max_file_uploads = 20