Automated testing and recommendation system for deploying classic ML models efficiently in an air-gapped enterprise Kubernetes/OpenShift environment.
Last updated: February 2026
MIOF helps central ML operations teams evaluate models submitted by various departments and automatically recommend the best inference backend, precision, and hardware configuration for each model.
Supported use cases:
- Latency-critical real-time inference
- Throughput-oriented batch scoring
- Models built with scikit-learn, XGBoost, TensorFlow, or PyTorch
The system:
- Accepts model artifacts + test dataset via shared storage paths
- Converts models to ONNX where applicable (see the sketch after this list)
- Tests across multiple backends (native, ONNX Runtime CPU/CUDA/TensorRT, etc.)
- Measures quality (fidelity) and performance (latency / throughput / memory)
- Ranks configurations according to the user-specified optimization goal
- Outputs ranked recommendations with basic Kubernetes YAML snippets
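To make the conversion and fidelity steps concrete, here is a minimal sketch of what they look like for a scikit-learn model, using skl2onnx and ONNX Runtime directly. The artifact path, feature count, and synthetic test data are placeholders; internally this is handled by `converter.py` and `tester.py`:

```python
import joblib
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Load the submitted artifact (path and feature count are placeholders)
model = joblib.load("/path/to/model.joblib")
n_features = 10

# Declare the input signature and export to ONNX
initial_types = [("input", FloatTensorType([None, n_features]))]
onx = convert_sklearn(model, initial_types=initial_types)
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Fidelity check: ONNX predictions must match the native model
# (synthetic data for illustration; MIOF uses the submitted test dataset;
# assumes numeric labels or regression output)
X = np.random.rand(100, n_features).astype(np.float32)
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_pred = sess.run(None, {"input": X})[0]
assert np.allclose(model.predict(X).ravel(), onnx_pred.ravel(), atol=1e-4)
```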
Current status: MVP feature-complete with Triton and INT8 support. Full inference testing, structured logging, GPU monitoring, and file-based storage are implemented.
| Feature | Status | Notes |
|---|---|---|
| FastAPI submission endpoint | ✅ Done | /submit accepts JSON with paths and goal |
| Model loading (sklearn, xgboost, TF, Torch) | ✅ Done | Full loading with validation |
| ONNX conversion | ✅ Done | sklearn & xgboost working; TF/Torch supported |
| Backend config generation | ✅ Done | Dynamic based on goal, hardware, INT8, Triton |
| Quality & performance testing | ✅ Done | Full inference with fidelity checking |
| TensorRT compilation | ✅ Done | Calls trtexec with FP16/INT8 support |
| Triton Inference Server | ✅ Done | FIL backend for tree models, ONNX backend |
| INT8 quantization | ✅ Done | Static/dynamic ONNX quantization with calibration (see sketch below) |
| Goal-based ranking (latency/throughput) | ✅ Done | Weighted scoring with SLA filtering |
| Kubernetes YAML recommendation | ✅ Done | Deployment snippets for each config |
| Structured logging | ✅ Done | JSON output with structlog |
| GPU memory monitoring | ✅ Done | pynvml-based VRAM tracking |
| Persistent results storage | ✅ Done | Timestamped output folders with JSON/YAML |
| Air-gapped compatibility | ✅ Done | No internet; pre-installed deps only |
| Parallel backend testing | Planned | Switch to Celery + Redis when volume increases |
| Auto-deployment | Planned | Future extension using kubernetes client-python |
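The INT8 path (see the quantization row above) uses ONNX Runtime's quantization tooling. Below is a minimal sketch of the dynamic variant; the static variant additionally feeds a `CalibrationDataReader` built from the submitted calibration dataset. File names here are placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are converted to INT8 offline,
# activation ranges are computed at runtime (no calibration set needed).
quantize_dynamic(
    model_input="model.onnx",        # placeholder path
    model_output="model.int8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
)
```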
```
miof/
├── miof/
│   ├── __init__.py
│   ├── api.py                 # FastAPI application
│   ├── backends.py            # BackendConfig generator
│   ├── converter.py           # Model loading & ONNX export
│   ├── evaluator.py           # Ranking & YAML generation
│   ├── gpu_monitor.py         # GPU memory monitoring
│   ├── int8_calibration.py    # INT8 quantization utilities
│   ├── logging_config.py      # Structured logging (structlog)
│   ├── models.py              # Submission & config dataclasses
│   ├── orchestrator.py        # Main processing logic
│   ├── storage.py             # File-based output storage
│   ├── tester.py              # Quality & performance measurement
│   └── triton_client.py       # Triton Inference Server client
├── tests/                     # Pytest test suite
├── docker/
│   ├── Dockerfile.gpu         # GPU testing with CUDA/TensorRT
│   ├── Dockerfile.cpu         # CPU-only for CI/CD
│   └── docker-compose.gpu.yml # Compose for GPU testing
├── deploy/
│   ├── miof-deployment.yaml   # Basic OpenShift/K8s deployment
│   └── README-deploy.md       # Deployment instructions
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
All dependencies must be pre-downloaded and installed from wheels or an internal mirror. For example, build a wheelhouse on a connected machine with `pip download -r requirements.txt -d wheels/`, then install offline with `pip install --no-index --find-links wheels/ -r requirements.txt`.
Core packages (versions approximate; match your environment):
- fastapi
- uvicorn
- pydantic
- pandas
- numpy
- scikit-learn
- xgboost
- tensorflow (or tensorflow-cpu if no GPU needed for conversion)
- torch (with appropriate CUDA version if testing locally)
- onnxruntime (with GPU support: onnxruntime-gpu)
- onnxmltools
- skl2onnx
- tf2onnx (optional, for TF → ONNX)
- psutil (for memory monitoring)
See requirements.txt for a starting point.
MIOF provides Docker images for isolated testing environments.
```bash
# Build
docker build -f docker/Dockerfile.cpu -t miof-cpu .

# Run all tests
docker run -v $(pwd):/workspace miof-cpu

# Run specific tests
docker run -v $(pwd):/workspace miof-cpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run -it -v $(pwd):/workspace miof-cpu bash
```

GPU testing requires the NVIDIA Docker runtime (nvidia-container-toolkit).
```bash
# Build
docker build -f docker/Dockerfile.gpu -t miof-gpu .

# Run all tests with GPU
docker run --gpus all -v $(pwd):/workspace miof-gpu

# Run E2E tests with verbose output
docker run --gpus all -v $(pwd):/workspace miof-gpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run --gpus all -it -v $(pwd):/workspace miof-gpu bash
```

Using docker-compose:

```bash
# Run all GPU tests
docker-compose -f docker/docker-compose.gpu.yml up --build
# Run with Triton Inference Server
docker-compose -f docker/docker-compose.gpu.yml --profile triton up --build
```

To verify GPU visibility inside the image, check ONNX Runtime's available providers:

```bash
docker run --gpus all miof-gpu python -c "
import onnxruntime as ort
print('Available providers:', ort.get_available_providers())
"
# Expected: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
```
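The same image can also be used to spot-check the pynvml-based VRAM tracking that `gpu_monitor.py` relies on. A minimal standalone probe (device index 0 assumed) looks roughly like this:

```python
import pynvml

# Query used/total VRAM for GPU 0 via NVML
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 1024**2:.0f} MiB of {mem.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
```

For local development outside Docker:

```bash
# Create virtual environment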
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the API server
uvicorn miof.api:app --reload --host 0.0.0.0 --port 8000
```

Submit a model for evaluation (latency-optimized, CPU-only):

```bash
curl -X POST http://localhost:8000/submit \
-H "Content-Type: application/json" \
-d '{
"model_artifact_path": "/path/to/model.joblib",
"framework": "sklearn",
"test_dataset_path": "/path/to/data.csv",
"optimization_goal": "latency",
"sla_config": {"max_latency_ms": 50},
"target_hardware": ["cpu"]
}'
```

To enable INT8 quantization, include a calibration dataset (throughput-optimized, GPU):

```bash
curl -X POST http://localhost:8000/submit \
-H "Content-Type: application/json" \
-d '{
"model_artifact_path": "/path/to/model.joblib",
"framework": "sklearn",
"test_dataset_path": "/path/to/data.csv",
"calibration_dataset_path": "/path/to/calibration.csv",
"enable_int8": true,
"optimization_goal": "throughput",
"target_hardware": ["gpu"]
}'
```
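For scripted submissions, the same endpoint can be called from Python. A small sketch with the requests library (assuming `requests` is available in your environment; it is not in the core package list above, and the exact response schema is defined in `models.py`, not shown here):

```python
import requests

# Same payload as the first curl example above, submitted programmatically
payload = {
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "optimization_goal": "latency",
    "sla_config": {"max_latency_ms": 50},
    "target_hardware": ["cpu"],
}
resp = requests.post("http://localhost:8000/submit", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # ranked recommendations, per the schema in models.py
```

To build and deploy MIOF itself:

```bash
# Using pre-built GPU image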
docker build -f docker/Dockerfile.gpu -t miof-gpu .
docker tag miof-gpu your-registry/miof-gpu:latest
docker push your-registry/miof-gpu:latest
# Deploy
kubectl apply -f deploy/miof-deployment.yaml
```
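The per-configuration Deployment snippets MIOF emits follow the usual `apps/v1` shape. The sketch below is illustrative only: names, image, and resource values are hypothetical, and PyYAML is assumed to be available:

```python
import yaml  # PyYAML, assumed available

# Illustrative shape of a Deployment snippet for a GPU-backed config;
# names, image, and resource values are hypothetical placeholders.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "my-model-onnx-gpu"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "my-model"}},
        "template": {
            "metadata": {"labels": {"app": "my-model"}},
            "spec": {
                "containers": [
                    {
                        "name": "inference",
                        "image": "your-registry/inference-runtime:latest",
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }
                ]
            },
        },
    },
}
print(yaml.safe_dump(deployment, sort_keys=False))
```

To run the test suite locally:

```bash
# Create virtual environment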
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dependencies
pip install -r requirements.txt
# Run all tests
pytest tests/ -v
# Run E2E pipeline tests
pytest tests/test_e2e_pipeline.py -v -s
# Run with coverage
pytest --cov=miof --cov-report=html
```

Planned extensions:

- Celery + Redis: Parallel backend testing for higher volume (see the sketch after this list)
- Dashboard: Streamlit or Grafana for historical recommendations
- Auto-deployment: kubernetes client-python for automated deployment
- Model Registry: MLflow, Kubeflow integration
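The Celery + Redis extension could fan each backend test out to a worker pool. A hypothetical wiring sketch follows; the broker URLs and task signature are placeholders, not MIOF's actual API:

```python
from celery import Celery

# Placeholder broker/result-backend URLs for an in-cluster Redis
app = Celery("miof", broker="redis://redis:6379/0", backend="redis://redis:6379/1")

@app.task
def test_backend(submission_id: str, backend_name: str) -> dict:
    # One worker runs one backend's quality/performance tests and
    # returns its metrics for the orchestrator to rank.
    return {"submission_id": submission_id, "backend": backend_name}
```

Fan-out would then be `[test_backend.delay(sid, b) for b in backends]`, with the orchestrator collecting results before ranking.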
Extension points:

- Add new backends in `backends.py`
- Extend conversion logic in `converter.py`
- Customize scoring weights in `evaluator.py` (see the scoring sketch below)
- Add Triton model configs in `triton_client.py`
- Integrate with existing monitoring (Prometheus, Grafana)
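Scoring is weight-based with SLA filtering (per the ranking row in the status table). A hypothetical shape of such a function, assuming metrics are pre-normalized to comparable scales; the real weights and field names live in `evaluator.py`:

```python
def score(metrics: dict, goal: str, max_latency_ms: float | None = None) -> float:
    """Hypothetical goal-weighted score; field names are illustrative."""
    # SLA filtering: configurations violating the latency SLA are discarded
    if max_latency_ms is not None and metrics["p95_latency_ms"] > max_latency_ms:
        return float("-inf")
    # Goal-dependent weights over normalized metrics (higher score is better)
    weights = (
        {"latency": -0.7, "throughput": 0.2, "memory": -0.1}
        if goal == "latency"
        else {"latency": -0.2, "throughput": 0.7, "memory": -0.1}
    )
    return (
        weights["latency"] * metrics["p95_latency_ms"]
        + weights["throughput"] * metrics["throughput_rps"]
        + weights["memory"] * metrics["memory_mb"]
    )
```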
Internal company use only; no external license applied.
Questions / improvements → reach out to the ML Ops team.
Happy optimizing!