MIOF - ML Inference Optimization Framework

Automated testing and recommendation system for deploying classic ML models efficiently in an air-gapped enterprise Kubernetes/OpenShift environment.

Status as of: February 2026

Overview

MIOF helps central ML operations teams evaluate models submitted by various departments and automatically recommend the best inference backend, precision, and hardware configuration for each model.

Supported use cases:

  • Latency-critical real-time inference
  • Throughput-oriented batch scoring
  • Models built with scikit-learn, XGBoost, TensorFlow, or PyTorch

The system:

  • Accepts model artifacts + test dataset via shared storage paths
  • Converts models to ONNX where applicable
  • Tests across multiple backends (native, ONNX Runtime CPU/CUDA/TensorRT, etc.)
  • Measures quality (fidelity) and performance (latency / throughput / memory)
  • Ranks configurations according to the user-specified optimization goal
  • Outputs ranked recommendations with basic Kubernetes YAML snippets (a hypothetical example follows this list)
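
To make the output concrete, a single recommendation entry might look roughly like the example below; every field name and value is hypothetical, shown only to illustrate the shape of a result, not MIOF's actual output schema.

# Hypothetical shape of one ranked recommendation entry. All field names and
# values are illustrative only, not MIOF's actual result format.
example_recommendation = {
    "rank": 1,
    "backend": "onnxruntime-cuda",      # backend that was tested
    "precision": "fp16",
    "hardware": "gpu",
    "latency_ms_p95": 4.2,              # measured performance
    "throughput_rps": 1850.0,
    "gpu_memory_mb": 310,
    "fidelity": 0.999,                  # agreement with the native model's predictions
    "score": 0.93,                      # goal-weighted ranking score
    "k8s_yaml": "deployment-onnxruntime-cuda-fp16.yaml",
}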

Current status: MVP feature-complete, with Triton and INT8 support. Full inference testing, structured logging, GPU monitoring, and file-based results storage are implemented.

Key Features

| Feature | Status | Notes |
|---------|--------|-------|
| FastAPI submission endpoint | ✅ Done | /submit accepts JSON with paths and goal |
| Model loading (sklearn, xgboost, TF, Torch) | ✅ Done | Full loading with validation |
| ONNX conversion | ✅ Done | sklearn & xgboost working; TF/Torch supported |
| Backend config generation | ✅ Done | Dynamic based on goal, hardware, INT8, Triton |
| Quality & performance testing | ✅ Done | Full inference with fidelity checking |
| TensorRT compilation | ✅ Done | Calls trtexec with FP16/INT8 support |
| Triton Inference Server | ✅ Done | FIL backend for tree models, ONNX backend |
| INT8 quantization | ✅ Done | Static/dynamic ONNX quantization with calibration |
| Goal-based ranking (latency/throughput) | ✅ Done | Weighted scoring with SLA filtering |
| Kubernetes YAML recommendation | ✅ Done | Deployment snippets for each config |
| Structured logging | ✅ Done | JSON output with structlog |
| GPU memory monitoring | ✅ Done | pynvml-based VRAM tracking |
| Persistent results storage | ✅ Done | Timestamped output folders with JSON/YAML |
| Air-gapped compatibility | ✅ Done | No internet; pre-installed deps only |
| Parallel backend testing | Planned | Switch to Celery + Redis when volume increases |
| Auto-deployment | Planned | Future extension using kubernetes client-python |
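
For orientation, here is a minimal sketch of the submission fields the /submit endpoint expects, based on the curl examples in Quick Start below; the real dataclasses live in miof/models.py and may differ in names and defaults.

# Illustrative sketch only: field names follow the curl examples in this README;
# the actual submission dataclasses in miof/models.py may differ.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SubmissionRequest:
    model_artifact_path: str                        # e.g. /path/to/model.joblib on shared storage
    framework: str                                  # "sklearn", "xgboost", "tensorflow", or "pytorch"
    test_dataset_path: str                          # CSV used for fidelity and performance testing
    optimization_goal: str = "latency"              # "latency" or "throughput"
    target_hardware: list = field(default_factory=lambda: ["cpu"])
    sla_config: Optional[dict] = None               # e.g. {"max_latency_ms": 50}
    calibration_dataset_path: Optional[str] = None  # calibration data for static INT8 quantization
    enable_int8: bool = False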

Project Structure

miof/
├── miof/
│   ├── __init__.py
│   ├── api.py                  # FastAPI application
│   ├── backends.py             # BackendConfig generator
│   ├── converter.py            # Model loading & ONNX export
│   ├── evaluator.py            # Ranking & YAML generation
│   ├── gpu_monitor.py          # GPU memory monitoring
│   ├── int8_calibration.py     # INT8 quantization utilities
│   ├── logging_config.py       # Structured logging (structlog)
│   ├── models.py               # Submission & config dataclasses
│   ├── orchestrator.py         # Main processing logic
│   ├── storage.py              # File-based output storage
│   ├── tester.py               # Quality & performance measurement
│   └── triton_client.py        # Triton Inference Server client
├── tests/                      # Pytest test suite
├── docker/
│   ├── Dockerfile.gpu          # GPU testing with CUDA/TensorRT
│   ├── Dockerfile.cpu          # CPU-only for CI/CD
│   └── docker-compose.gpu.yml  # Compose for GPU testing
├── deploy/
│   ├── miof-deployment.yaml    # Basic OpenShift/K8s deployment
│   └── README-deploy.md        # Deployment instructions
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Requirements (air-gapped)

All dependencies must be pre-downloaded and installed from wheels or an internal mirror.

Core packages (versions approximate — match your environment):

  • fastapi
  • uvicorn
  • pydantic
  • pandas
  • numpy
  • scikit-learn
  • xgboost
  • tensorflow (or tensorflow-cpu if no GPU needed for conversion)
  • torch (with appropriate CUDA version if testing locally)
  • onnxruntime (with GPU support: onnxruntime-gpu)
  • onnxmltools
  • skl2onnx
  • tf2onnx (optional, for TF → ONNX)
  • psutil (for memory monitoring)
  • pynvml (for GPU memory monitoring)
  • structlog (for structured JSON logging)

See requirements.txt for a starting point.

Docker Usage

MIOF provides Docker images for isolated testing environments.

CPU-Only Testing (CI/CD, Mac, no GPU)

# Build
docker build -f docker/Dockerfile.cpu -t miof-cpu .

# Run all tests
docker run -v $(pwd):/workspace miof-cpu

# Run specific tests
docker run -v $(pwd):/workspace miof-cpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run -it -v $(pwd):/workspace miof-cpu bash

GPU Testing (NVIDIA)

Requires NVIDIA Docker runtime (nvidia-container-toolkit).

# Build
docker build -f docker/Dockerfile.gpu -t miof-gpu .

# Run all tests with GPU
docker run --gpus all -v $(pwd):/workspace miof-gpu

# Run E2E tests with verbose output
docker run --gpus all -v $(pwd):/workspace miof-gpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run --gpus all -it -v $(pwd):/workspace miof-gpu bash

Docker Compose (GPU)

# Run all GPU tests
docker-compose -f docker/docker-compose.gpu.yml up --build

# Run with Triton Inference Server
docker-compose -f docker/docker-compose.gpu.yml --profile triton up --build

Verify GPU Providers

docker run --gpus all miof-gpu python -c "
import onnxruntime as ort
print('Available providers:', ort.get_available_providers())
"
# Expected: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
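
GPU memory monitoring in miof/gpu_monitor.py is pynvml-based (see the feature table). A minimal sketch of that pattern, assuming a single GPU at index 0 and a working NVIDIA driver:

# Rough sketch of pynvml-based VRAM tracking (the pattern gpu_monitor.py builds on).
# Assumes one GPU at index 0 and an available NVIDIA driver.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM used: {mem.used / 1024**2:.0f} MiB of {mem.total / 1024**2:.0f} MiB")
finally:
    pynvml.nvmlShutdown()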

Quick Start

Local Development

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the API server
uvicorn miof.api:app --reload --host 0.0.0.0 --port 8000

Submit a Test Job

curl -X POST http://localhost:8000/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "optimization_goal": "latency",
    "sla_config": {"max_latency_ms": 50},
    "target_hardware": ["cpu"]
  }'
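
The same submission can be made from Python; a sketch assuming the API is running locally on port 8000 and that the requests package is available (it is not in the core dependency list above):

# Equivalent submission from Python. Assumes the API runs on localhost:8000
# and that the requests package is installed.
import requests

payload = {
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "optimization_goal": "latency",
    "sla_config": {"max_latency_ms": 50},
    "target_hardware": ["cpu"],
}
response = requests.post("http://localhost:8000/submit", json=payload, timeout=30)
response.raise_for_status()
print(response.json())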

With INT8 Quantization

curl -X POST http://localhost:8000/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "calibration_dataset_path": "/path/to/calibration.csv",
    "enable_int8": true,
    "optimization_goal": "throughput",
    "target_hardware": ["gpu"]
  }'
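
For reference, the static/dynamic quantization with calibration listed in the feature table follows the standard onnxruntime.quantization pattern. A simplified sketch; the paths, input name, and reader class are illustrative, not MIOF's actual int8_calibration.py code:

# Simplified sketch of static and dynamic INT8 quantization via onnxruntime.
# Paths, the input name, and CsvCalibrationReader are illustrative only.
import numpy as np
import pandas as pd
from onnxruntime.quantization import (
    CalibrationDataReader, QuantType, quantize_dynamic, quantize_static,
)

class CsvCalibrationReader(CalibrationDataReader):
    """Feeds rows from a calibration CSV to the quantizer, one sample per batch."""
    def __init__(self, csv_path: str, input_name: str = "input"):
        rows = pd.read_csv(csv_path).to_numpy(dtype=np.float32)
        self._batches = iter([{input_name: row.reshape(1, -1)} for row in rows])

    def get_next(self):
        return next(self._batches, None)

# Static quantization: uses calibration data to compute activation ranges.
quantize_static(
    model_input="model.onnx",
    model_output="model_int8_static.onnx",
    calibration_data_reader=CsvCalibrationReader("calibration.csv"),
    weight_type=QuantType.QInt8,
)

# Dynamic quantization: quantizes weights only, no calibration dataset needed.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8_dynamic.onnx",
    weight_type=QuantType.QInt8,
)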

Deploy to Kubernetes

# Using pre-built GPU image
docker build -f docker/Dockerfile.gpu -t miof-gpu .
docker tag miof-gpu your-registry/miof-gpu:latest
docker push your-registry/miof-gpu:latest

# Deploy
kubectl apply -f deploy/miof-deployment.yaml

Running Tests

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Run all tests
pytest tests/ -v

# Run E2E pipeline tests
pytest tests/test_e2e_pipeline.py -v -s

# Run with coverage
pytest --cov=miof --cov-report=html

Future Enhancements

  • Celery + Redis: Parallel backend testing for higher volume
  • Dashboard: Streamlit or Grafana for historical recommendations
  • Auto-deployment: automated rollout via the Kubernetes Python client (kubernetes client-python)
  • Model Registry: MLflow, Kubeflow integration

Contributing / Extending

  • Add new backends in backends.py
  • Extend conversion logic in converter.py
  • Customize scoring weights in evaluator.py (see the sketch after this list)
  • Add Triton model configs in triton_client.py
  • Integrate with existing monitoring (Prometheus, Grafana)
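
As a starting point for tuning, the goal-based weighted scoring with SLA filtering described in the feature table can be thought of along these lines; the weights, field names, and formula below are assumptions, not the actual contents of evaluator.py:

# Illustrative goal-weighted scoring with SLA filtering. The weights, field
# names, and formula are assumptions, not the actual logic in miof/evaluator.py.
from typing import Optional

GOAL_WEIGHTS = {
    "latency":    {"latency": 0.7, "throughput": 0.2, "memory": 0.1},
    "throughput": {"latency": 0.2, "throughput": 0.7, "memory": 0.1},
}

def score_config(result: dict, goal: str, max_latency_ms: Optional[float] = None) -> Optional[float]:
    """Score one backend result; return None if it violates the latency SLA."""
    if max_latency_ms is not None and result["latency_ms"] > max_latency_ms:
        return None  # SLA filtering: drop configurations that miss the target
    w = GOAL_WEIGHTS[goal]
    # Lower latency/memory is better, higher throughput is better.
    return (
        w["latency"] * (1.0 / result["latency_ms"])
        + w["throughput"] * result["throughput_rps"]
        + w["memory"] * (1.0 / result["memory_mb"])
    )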

License

Internal company use only — no external license applied.

Questions / improvements → reach out to the ML Ops team.

Happy optimizing!
