MIOF - ML Inference Optimization Framework

Automated testing and recommendation system for deploying classic ML models efficiently in an air-gapped enterprise Kubernetes/OpenShift environment.

Status as of: February 2026

Overview

MIOF helps central ML operations teams evaluate models submitted by various departments and automatically recommend the best inference backend, precision, and hardware configuration for each model.

Supported use cases:

  • Latency-critical real-time inference
  • Throughput-oriented batch scoring
  • Models built with scikit-learn, XGBoost, TensorFlow, or PyTorch

The system:

  • Accepts model artifacts + test dataset via shared storage paths
  • Converts models to ONNX where applicable
  • Tests across multiple backends (native, ONNX Runtime CPU/CUDA/TensorRT, etc.)
  • Measures quality (fidelity) and performance (latency / throughput / memory)
  • Ranks configurations according to the user-specified optimization goal
  • Outputs ranked recommendations with basic Kubernetes YAML snippets (a hypothetical example follows this list)
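
To make the output concrete, a single recommendation entry might look roughly like the example below; every field name and value is hypothetical, shown only to illustrate the shape of a result, not MIOF's actual output schema.

# Hypothetical shape of one ranked recommendation entry. All field names and
# values are illustrative only, not MIOF's actual result format.
example_recommendation = {
    "rank": 1,
    "backend": "onnxruntime-cuda",      # backend that was tested
    "precision": "fp16",
    "hardware": "gpu",
    "latency_ms_p95": 4.2,              # measured performance
    "throughput_rps": 1850.0,
    "gpu_memory_mb": 310,
    "fidelity": 0.999,                  # agreement with the native model's predictions
    "score": 0.93,                      # goal-weighted ranking score
    "k8s_yaml": "deployment-onnxruntime-cuda-fp16.yaml",
}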

Current status: MVP feature-complete, with Triton and INT8 support. Full inference testing, structured logging, GPU monitoring, and file-based results storage are implemented.

Key Features

| Feature | Status | Notes |
|---------|--------|-------|
| FastAPI submission endpoint | ✅ Done | /submit accepts JSON with paths and goal |
| Model loading (sklearn, xgboost, TF, Torch) | ✅ Done | Full loading with validation |
| ONNX conversion | ✅ Done | sklearn & xgboost working; TF/Torch supported |
| Backend config generation | ✅ Done | Dynamic based on goal, hardware, INT8, Triton |
| Quality & performance testing | ✅ Done | Full inference with fidelity checking |
| TensorRT compilation | ✅ Done | Calls trtexec with FP16/INT8 support |
| Triton Inference Server | ✅ Done | FIL backend for tree models, ONNX backend |
| INT8 quantization | ✅ Done | Static/dynamic ONNX quantization with calibration |
| Goal-based ranking (latency/throughput) | ✅ Done | Weighted scoring with SLA filtering |
| Kubernetes YAML recommendation | ✅ Done | Deployment snippets for each config |
| Structured logging | ✅ Done | JSON output with structlog |
| GPU memory monitoring | ✅ Done | pynvml-based VRAM tracking |
| Persistent results storage | ✅ Done | Timestamped output folders with JSON/YAML |
| Air-gapped compatibility | ✅ Done | No internet; pre-installed deps only |
| Parallel backend testing | Planned | Switch to Celery + Redis when volume increases |
| Auto-deployment | Planned | Future extension using kubernetes client-python |
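
For orientation, here is a minimal sketch of the submission fields the /submit endpoint expects, based on the curl examples in Quick Start below; the real dataclasses live in miof/models.py and may differ in names and defaults.

# Illustrative sketch only: field names follow the curl examples in this README;
# the actual submission dataclasses in miof/models.py may differ.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SubmissionRequest:
    model_artifact_path: str                        # e.g. /path/to/model.joblib on shared storage
    framework: str                                  # "sklearn", "xgboost", "tensorflow", or "pytorch"
    test_dataset_path: str                          # CSV used for fidelity and performance testing
    optimization_goal: str = "latency"              # "latency" or "throughput"
    target_hardware: list = field(default_factory=lambda: ["cpu"])
    sla_config: Optional[dict] = None               # e.g. {"max_latency_ms": 50}
    calibration_dataset_path: Optional[str] = None  # calibration data for static INT8 quantization
    enable_int8: bool = False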

Project Structure

miof/
├── miof/
│   ├── __init__.py
│   ├── api.py                  # FastAPI application
│   ├── backends.py             # BackendConfig generator
│   ├── converter.py            # Model loading & ONNX export
│   ├── evaluator.py            # Ranking & YAML generation
│   ├── gpu_monitor.py          # GPU memory monitoring
│   ├── int8_calibration.py     # INT8 quantization utilities
│   ├── logging_config.py       # Structured logging (structlog)
│   ├── models.py               # Submission & config dataclasses
│   ├── orchestrator.py         # Main processing logic
│   ├── storage.py              # File-based output storage
│   ├── tester.py               # Quality & performance measurement
│   └── triton_client.py        # Triton Inference Server client
├── tests/                      # Pytest test suite
├── docker/
│   ├── Dockerfile.gpu          # GPU testing with CUDA/TensorRT
│   ├── Dockerfile.cpu          # CPU-only for CI/CD
│   └── docker-compose.gpu.yml  # Compose for GPU testing
├── deploy/
│   ├── miof-deployment.yaml    # Basic OpenShift/K8s deployment
│   └── README-deploy.md        # Deployment instructions
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Requirements (air-gapped)

All dependencies must be pre-downloaded and installed from wheels or an internal mirror.

Core packages (versions approximate — match your environment):

  • fastapi
  • uvicorn
  • pydantic
  • pandas
  • numpy
  • scikit-learn
  • xgboost
  • tensorflow (or tensorflow-cpu if no GPU needed for conversion)
  • torch (with appropriate CUDA version if testing locally)
  • onnxruntime (with GPU support: onnxruntime-gpu)
  • onnxmltools
  • skl2onnx
  • tf2onnx (optional, for TF → ONNX)
  • psutil (for memory monitoring)
  • pynvml (for GPU memory monitoring)
  • structlog (for structured JSON logging)

See requirements.txt for a starting point.

Docker Usage

MIOF provides Docker images for isolated testing environments.

CPU-Only Testing (CI/CD, Mac, no GPU)

# Build
docker build -f docker/Dockerfile.cpu -t miof-cpu .

# Run all tests
docker run -v $(pwd):/workspace miof-cpu

# Run specific tests
docker run -v $(pwd):/workspace miof-cpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run -it -v $(pwd):/workspace miof-cpu bash

GPU Testing (NVIDIA)

Requires NVIDIA Docker runtime (nvidia-container-toolkit).

# Build
docker build -f docker/Dockerfile.gpu -t miof-gpu .

# Run all tests with GPU
docker run --gpus all -v $(pwd):/workspace miof-gpu

# Run E2E tests with verbose output
docker run --gpus all -v $(pwd):/workspace miof-gpu pytest tests/test_e2e_pipeline.py -v -s

# Interactive shell
docker run --gpus all -it -v $(pwd):/workspace miof-gpu bash

Docker Compose (GPU)

# Run all GPU tests
docker-compose -f docker/docker-compose.gpu.yml up --build

# Run with Triton Inference Server
docker-compose -f docker/docker-compose.gpu.yml --profile triton up --build

Verify GPU Providers

docker run --gpus all miof-gpu python -c "
import onnxruntime as ort
print('Available providers:', ort.get_available_providers())
"
# Expected: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
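
GPU memory monitoring in miof/gpu_monitor.py is pynvml-based (see the feature table). A minimal sketch of that pattern, assuming a single GPU at index 0 and a working NVIDIA driver:

# Rough sketch of pynvml-based VRAM tracking (the pattern gpu_monitor.py builds on).
# Assumes one GPU at index 0 and an available NVIDIA driver.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM used: {mem.used / 1024**2:.0f} MiB of {mem.total / 1024**2:.0f} MiB")
finally:
    pynvml.nvmlShutdown()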

Quick Start

Local Development

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the API server
uvicorn miof.api:app --reload --host 0.0.0.0 --port 8000

Submit a Test Job

curl -X POST http://localhost:8000/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "optimization_goal": "latency",
    "sla_config": {"max_latency_ms": 50},
    "target_hardware": ["cpu"]
  }'
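
The same submission can be made from Python; a sketch assuming the API is running locally on port 8000 and that the requests package is available (it is not in the core dependency list above):

# Equivalent submission from Python. Assumes the API runs on localhost:8000
# and that the requests package is installed.
import requests

payload = {
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "optimization_goal": "latency",
    "sla_config": {"max_latency_ms": 50},
    "target_hardware": ["cpu"],
}
response = requests.post("http://localhost:8000/submit", json=payload, timeout=30)
response.raise_for_status()
print(response.json())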

With INT8 Quantization

curl -X POST http://localhost:8000/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model_artifact_path": "/path/to/model.joblib",
    "framework": "sklearn",
    "test_dataset_path": "/path/to/data.csv",
    "calibration_dataset_path": "/path/to/calibration.csv",
    "enable_int8": true,
    "optimization_goal": "throughput",
    "target_hardware": ["gpu"]
  }'
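
For reference, the static/dynamic quantization with calibration listed in the feature table follows the standard onnxruntime.quantization pattern. A simplified sketch; the paths, input name, and reader class are illustrative, not MIOF's actual int8_calibration.py code:

# Simplified sketch of static and dynamic INT8 quantization via onnxruntime.
# Paths, the input name, and CsvCalibrationReader are illustrative only.
import numpy as np
import pandas as pd
from onnxruntime.quantization import (
    CalibrationDataReader, QuantType, quantize_dynamic, quantize_static,
)

class CsvCalibrationReader(CalibrationDataReader):
    """Feeds rows from a calibration CSV to the quantizer, one sample per batch."""
    def __init__(self, csv_path: str, input_name: str = "input"):
        rows = pd.read_csv(csv_path).to_numpy(dtype=np.float32)
        self._batches = iter([{input_name: row.reshape(1, -1)} for row in rows])

    def get_next(self):
        return next(self._batches, None)

# Static quantization: uses calibration data to compute activation ranges.
quantize_static(
    model_input="model.onnx",
    model_output="model_int8_static.onnx",
    calibration_data_reader=CsvCalibrationReader("calibration.csv"),
    weight_type=QuantType.QInt8,
)

# Dynamic quantization: quantizes weights only, no calibration dataset needed.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8_dynamic.onnx",
    weight_type=QuantType.QInt8,
)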

Deploy to Kubernetes

# Using pre-built GPU image
docker build -f docker/Dockerfile.gpu -t miof-gpu .
docker tag miof-gpu your-registry/miof-gpu:latest
docker push your-registry/miof-gpu:latest

# Deploy
kubectl apply -f deploy/miof-deployment.yaml

Running Tests

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Run all tests
pytest tests/ -v

# Run E2E pipeline tests
pytest tests/test_e2e_pipeline.py -v -s

# Run with coverage
pytest --cov=miof --cov-report=html

Future Enhancements

  • Celery + Redis: Parallel backend testing for higher volume
  • Dashboard: Streamlit or Grafana for historical recommendations
  • Auto-deployment: automated rollout via the Kubernetes Python client (kubernetes client-python)
  • Model Registry: MLflow, Kubeflow integration

Contributing / Extending

  • Add new backends in backends.py
  • Extend conversion logic in converter.py
  • Customize scoring weights in evaluator.py (see the sketch after this list)
  • Add Triton model configs in triton_client.py
  • Integrate with existing monitoring (Prometheus, Grafana)
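
As a starting point for tuning, the goal-based weighted scoring with SLA filtering described in the feature table can be thought of along these lines; the weights, field names, and formula below are assumptions, not the actual contents of evaluator.py:

# Illustrative goal-weighted scoring with SLA filtering. The weights, field
# names, and formula are assumptions, not the actual logic in miof/evaluator.py.
from typing import Optional

GOAL_WEIGHTS = {
    "latency":    {"latency": 0.7, "throughput": 0.2, "memory": 0.1},
    "throughput": {"latency": 0.2, "throughput": 0.7, "memory": 0.1},
}

def score_config(result: dict, goal: str, max_latency_ms: Optional[float] = None) -> Optional[float]:
    """Score one backend result; return None if it violates the latency SLA."""
    if max_latency_ms is not None and result["latency_ms"] > max_latency_ms:
        return None  # SLA filtering: drop configurations that miss the target
    w = GOAL_WEIGHTS[goal]
    # Lower latency/memory is better, higher throughput is better.
    return (
        w["latency"] * (1.0 / result["latency_ms"])
        + w["throughput"] * result["throughput_rps"]
        + w["memory"] * (1.0 / result["memory_mb"])
    )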

License

Internal company use only — no external license applied.

Questions / improvements → reach out to the ML Ops team.

Happy optimizing!
