A production-ready demonstration of GPU-accelerated AI inference using LangChain4j with ONNX Runtime and CUDA, deployed to Azure Container Apps with GPU support.
- GPU-Accelerated Inference: CUDA 12.2 with ONNX Runtime for high-performance AI
- Stable Diffusion Image Generation: Powered by Oracle's SD4J (Stable Diffusion for Java) 🎉
- Complete CLIP tokenizer, U-Net, VAE decoder, and scheduler implementation
- Multiple scheduler algorithms (LMS, Euler Ancestral)
- Optional NSFW safety checker
- Generate high-quality cartoon-style images from text prompts
- Text Embeddings: Semantic similarity with All-MiniLM-L6-v2 model via LangChain4j
- Modern Stack: Java 21 virtual threads, Spring Boot 3.2.5, LangChain4j 0.34.0, SD4J
- Cloud-Ready: Containerized with Docker, deployable to Azure Container Apps
- Interactive UI: Web-based interface with real-time metrics
- Production-Grade: Health checks, graceful shutdown, monitoring endpoints
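The Text Embeddings feature above reduces to cosine similarity between embedding vectors. A minimal, dependency-free sketch of that computation (the vectors here are made up for illustration; real All-MiniLM-L6-v2 embeddings have 384 dimensions):

```java
// Cosine similarity between two embedding vectors -- the core of the
// /embeddings endpoint's similarity score. Vectors are illustrative only.
public class CosineSimilarity {

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] v1 = {0.2f, 0.8f, 0.1f};   // hypothetical embedding of text1
        float[] v2 = {0.25f, 0.7f, 0.2f};  // hypothetical embedding of text2
        System.out.printf("similarity = %.4f%n", cosine(v1, v2));
    }
}
```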
👉 See SETUP.md for detailed step-by-step installation instructions, including:
- Prerequisites and build tools installation
- Model downloads and directory structure
- ONNX Runtime Extensions build process
- Local development and Azure deployment
- Comprehensive troubleshooting guide
- Architecture
- Prerequisites
- Quick Start
- Local Development
- Azure Deployment
- API Documentation
- Configuration
- Troubleshooting
- Performance
- Contributing
```
┌─────────────────────────────────────────────────────────────┐
│                           Web UI                            │
│              (HTML + Tailwind CSS + JavaScript)             │
└──────────────────────┬──────────────────────────────────────┘
                       │ REST API
┌──────────────────────▼──────────────────────────────────────┐
│                  Spring Boot Controller                     │
│                  (ImageController.java)                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                 LangChain4j GPU Service                     │
│               (LangChain4jGpuService.java)                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                   ONNX Runtime (GPU)                        │
│          • Stable Diffusion v1.5 (Image Gen)                │
│          • All-MiniLM-L6-v2 (Embeddings)                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                    NVIDIA CUDA 12.2                         │
│                 (GPU Acceleration Layer)                    │
└─────────────────────────────────────────────────────────────┘
```
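In code, the layering above can be sketched roughly as follows. Only `ImageController` and `LangChain4jGpuService` are class names from the diagram; the `InferenceBackend` interface is a hypothetical stand-in for the ONNX Runtime session:

```java
// Rough sketch of the layering in the architecture diagram: the controller
// delegates to the service, which hides the inference backend.
public class ArchitectureSketch {

    interface InferenceBackend {               // hypothetical stand-in for ONNX Runtime
        byte[] runStableDiffusion(String prompt);
        float[] runEmbedding(String text);
    }

    static class LangChain4jGpuService {
        private final InferenceBackend backend;
        LangChain4jGpuService(InferenceBackend backend) { this.backend = backend; }
        byte[] generateImage(String prompt) { return backend.runStableDiffusion(prompt); }
        float[] embed(String text) { return backend.runEmbedding(text); }
    }

    static class ImageController {             // maps REST calls onto the service
        private final LangChain4jGpuService service;
        ImageController(LangChain4jGpuService service) { this.service = service; }
        byte[] postImage(String prompt) { return service.generateImage(prompt); }
    }

    public static void main(String[] args) {
        // Wire the layers with a fake backend instead of a real GPU session.
        InferenceBackend fake = new InferenceBackend() {
            public byte[] runStableDiffusion(String prompt) { return new byte[]{'P', 'N', 'G'}; }
            public float[] runEmbedding(String text) { return new float[]{0.1f, 0.2f}; }
        };
        ImageController controller = new ImageController(new LangChain4jGpuService(fake));
        System.out.println("image bytes: " + controller.postImage("a friendly robot").length);
    }
}
```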
- Java 21 LTS (with the `--enable-preview` flag)
- Maven 3.9+
- NVIDIA GPU with CUDA 12.2+ support
- CUDA Toolkit 12.2
- Docker (optional, for containerization)
- Git
- Azure CLI
- Azure Subscription with GPU quota enabled
- Docker for building images
📘 For detailed setup instructions, see SETUP.md
```bash
git clone https://github.com/your-org/gpuonazure.git
cd gpuonazure
```

Install build tools:

```bash
# Linux
sudo apt-get update && sudo apt-get install -y cmake build-essential

# macOS
brew install cmake
```

Download the models:

```bash
./download-missing-models.sh
```

Downloads the Stable Diffusion v1.5 models (~5.2 GB):
- Text Encoder, U-Net, VAE Decoder
```bash
./download-ortextensions.sh
```

Builds `libortextensions.so` (~3 MB, required for the CLIP tokenizer).
```bash
mvn spring-boot:run
```

Open your browser: http://localhost:8080
🎨 Generate your first image!
📘 For troubleshooting and advanced configuration, see SETUP.md
```bash
mvn spring-boot:run
```

Build and run with Docker:

```bash
# Build image
docker build -t gpu-langchain4j-demo:latest .

# Run with GPU (if CUDA available)
docker run --gpus all -p 8080:8080 \
  -v $(pwd)/models:/app/models \
  gpu-langchain4j-demo:latest
```

Edit `src/main/resources/application.yml`:
```yaml
gpu:
  langchain4j:
    gpu:
      enabled: true   # Set to false for CPU-only mode
      device-id: 0
    model:
      dir: ./models
```

Verify the service is healthy:

```bash
curl http://localhost:8080/actuator/health
```

Expected Response:
```json
{
  "status": "UP",
  "gpuAvailable": false,
  "modelsLoaded": true,
  "stableDiffusion": "Ready (SD4J)"
}
```

Deploy to Azure Container Apps with GPU in one command:
```bash
./deploy-azure-aca.sh
```

This script will:
- ✅ Create Azure Container Registry
- ✅ Build and push Docker image (~5GB)
- ✅ Create Container Apps Environment with GPU
- ✅ Deploy application
- ✅ Output application URL
Total time: 15-20 minutes
Customize deployment with environment variables:
```bash
export RESOURCE_GROUP="gpu-demo-rg"
export LOCATION="eastus"
export ACR_NAME="gpudemoregistry"
export GPU_PROFILE="NC8as_T4_v3"   # or NC24ads_A100_v4
./deploy-azure-aca.sh
```

GPU Options:
- `NC8as_T4_v3` - NVIDIA T4 (16 GB), $0.526/hour
- `NC24ads_A100_v4` - NVIDIA A100 (80 GB), $3.672/hour
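At those hourly rates, a rough cost estimate if a GPU profile is left running continuously (assuming ~730 hours per month):

```java
// Back-of-the-envelope monthly cost for the two GPU profiles above.
public class GpuCost {

    static double monthly(double hourlyUsd) {
        return hourlyUsd * 730;   // ~730 hours in a month
    }

    public static void main(String[] args) {
        System.out.printf("NC8as_T4_v3:     $%.0f/month%n", monthly(0.526));  // ~$384
        System.out.printf("NC24ads_A100_v4: $%.0f/month%n", monthly(3.672)); // ~$2681
    }
}
```

Scaling to zero when idle (or deleting the resource group, as shown below) avoids these always-on costs.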
For detailed instructions, see:
- AZURE-DEPLOYMENT-GUIDE.md - Complete deployment guide
- AZURE-QUICK-REFERENCE.md - Quick reference commands
```bash
# Get application URL
APP_URL=$(az containerapp show --name gpu-langchain4j-demo --resource-group gpu-demo-rg --query properties.configuration.ingress.fqdn -o tsv)

# Health check
curl https://$APP_URL/actuator/health

# Generate test image
curl -X POST https://$APP_URL/api/langchain4j/image \
  -H "Content-Type: application/json" \
  -d '{"prompt": "sunset over mountains", "style": "CLASSIC"}' \
  --output test-azure.png
```

Tear down all resources when finished:

```bash
az group delete --name gpu-demo-rg --yes
```

Base URL: `http://localhost:8080/api/langchain4j`
POST /image
Generate a cartoon-style image from text prompt using Stable Diffusion.
Request:
```json
{
  "prompt": "A friendly robot helping with Azure deployment",
  "style": "HAPPY"
}
```

Response: `image/png` (binary)
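The same request can be issued from Java with the JDK's built-in `java.net.http` client, no third-party dependencies needed (a sketch; error handling kept minimal, and the `ImageClient` class name is made up):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class ImageClient {

    // Builds the POST /image request with the JSON body documented above.
    static HttpRequest buildRequest(String baseUrl, String prompt, String style) {
        String body = String.format("{\"prompt\": \"%s\", \"style\": \"%s\"}", prompt, style);
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/image"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
                "http://localhost:8080/api/langchain4j",
                "A friendly robot helping with Azure deployment",
                "HAPPY");
        try {
            // Requires the server from the Quick Start to be running locally.
            HttpResponse<Path> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("generated-image.png")));
            System.out.println("HTTP " + response.statusCode() + " -> " + response.body());
        } catch (IOException | InterruptedException e) {
            System.out.println("Server not reachable: " + e.getMessage());
        }
    }
}
```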
cURL Example:
```bash
curl -X POST http://localhost:8080/api/langchain4j/image \
  -H "Content-Type: application/json" \
  -d '{"prompt":"A friendly robot at a computer","style":"CLASSIC"}' \
  --output generated-image.png
```

POST /embeddings
Compare semantic similarity between two text snippets.
Request:
```json
{
  "text1": "Azure Container Apps",
  "text2": "Cloud container platform"
}
```

Response:
```json
{
  "text1": "Azure Container Apps",
  "text2": "Cloud container platform",
  "similarity": 0.8532,
  "processingTimeMs": 45
}
```

cURL Example:
```bash
curl -X POST http://localhost:8080/api/langchain4j/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "text1": "GPU acceleration",
    "text2": "CUDA processing"
  }'
```

GET /health
Get system health and GPU status.
Response:
```json
{
  "status": "UP",
  "gpuAvailable": true,
  "modelsLoaded": true,
  "timestamp": "2025-09-29T12:00:00Z"
}
```

GET /models
Get information about loaded models.
Response:
```json
{
  "models": {
    "stableDiffusion": {
      "name": "Stable Diffusion v1.5",
      "path": "/app/models/stable-diffusion/model.onnx",
      "available": true,
      "sizeBytes": 3442332160
    },
    "allMiniLmL6V2": {
      "name": "All-MiniLM-L6-v2",
      "path": "/app/models/all-MiniLM-L6-v2/model.onnx",
      "available": true,
      "sizeBytes": 90000000
    }
  },
  "gpuAvailable": true,
  "modelsLoaded": true
}
```

| Variable | Default | Description |
|---|---|---|
| `GPU_LANGCHAIN4J_GPU_ENABLED` | `true` | Enable GPU acceleration |
| `GPU_LANGCHAIN4J_GPU_DEVICE_ID` | `0` | CUDA device ID |
| `GPU_LANGCHAIN4J_MODEL_DIR` | `./models` | Model directory path |
| `JAVA_OPTS` | `-Xmx8g -XX:+UseZGC` | JVM options |
| `SPRING_PROFILES_ACTIVE` | `default` | Spring profile |
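A plain-Java sketch of how a service might read these variables with the documented defaults (in the actual app, Spring Boot binds them to the `application.yml` keys via relaxed binding; this helper is illustrative only):

```java
// Illustrative env-var lookup with the documented defaults.
public class EnvConfig {

    // Reads an environment variable, falling back to a default when unset.
    static String envOrDefault(String name, String defaultValue) {
        String value = System.getenv(name);
        return (value == null || value.isBlank()) ? defaultValue : value;
    }

    public static void main(String[] args) {
        System.out.println("gpu.enabled = " + envOrDefault("GPU_LANGCHAIN4J_GPU_ENABLED", "true"));
        System.out.println("device.id   = " + envOrDefault("GPU_LANGCHAIN4J_GPU_DEVICE_ID", "0"));
        System.out.println("model.dir   = " + envOrDefault("GPU_LANGCHAIN4J_MODEL_DIR", "./models"));
    }
}
```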
```yaml
gpu:
  langchain4j:
    gpu:
      enabled: true
      device-id: 0
      inter-op-threads: 4
      intra-op-threads: 8
```

Model download configuration:

```yaml
gpu:
  langchain4j:
    model:
      dir: /app/models
      download:
        enabled: true
        azure-storage-account: mystorageaccount
        azure-storage-container: onnx-models
```

📘 For comprehensive troubleshooting, see SETUP.md
Error: Failed to load library ./libortextensions.so
Solution:
```bash
# Build the library (takes ~10 minutes)
./download-ortextensions.sh
```

Error: Stable Diffusion model not found
Solution:
```bash
# Download models and organize structure
./download-missing-models.sh

# Verify structure
ls -R models/stable-diffusion/
```

Error: OutOfMemoryError during generation
Solution:
```bash
# Increase JVM heap (requires 6-8 GB for CPU mode)
export JAVA_OPTS="-Xmx8g -XX:+UseZGC"
mvn spring-boot:run
```

CPU Mode Expected Times:
- First generation: ~60 seconds (model loading)
- Subsequent: ~30-45 seconds per image
Optimization:
```yaml
# Reduce inference steps for faster generation
gpu:
  langchain4j:
    inference-steps: 20   # Default 40
```

Viewing logs:

```bash
# Application logs
mvn spring-boot:run --debug

# Azure Container Apps logs
az containerapp logs show --name gpu-langchain4j-app \
  --resource-group gpu-langchain4j-rg --follow
```

| Operation | Average Time | Throughput |
|---|---|---|
| Image Generation (512x512) | 2.3s | ~0.43 img/s |
| Text Embedding (single) | 15ms | ~66 req/s |
| Text Embedding (batch of 10) | 45ms | ~222 req/s |

| Operation | Average Time | Throughput |
|---|---|---|
| Image Generation (512x512) | 0.8s | ~1.25 img/s |
| Text Embedding (single) | 8ms | ~125 req/s |
| Text Embedding (batch of 10) | 25ms | ~400 req/s |
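The throughput columns follow directly from the average latencies: requests per second = 1000 / latency in ms, times the batch size. A quick sanity check over the first table's numbers:

```java
// Sanity-check the throughput columns: throughput = batch * 1000 / latency_ms.
public class Throughput {

    static double perSecond(double avgMillis, int batchSize) {
        return batchSize * 1000.0 / avgMillis;
    }

    public static void main(String[] args) {
        System.out.printf("image @ 2.3s       -> %.2f img/s%n", perSecond(2300, 1)); // ~0.43
        System.out.printf("image @ 0.8s       -> %.2f img/s%n", perSecond(800, 1));  // ~1.25
        System.out.printf("embed @ 15ms       -> %.1f req/s%n", perSecond(15, 1));   // ~66.7
        System.out.printf("embed batch @ 45ms -> %.1f req/s%n", perSecond(45, 10));  // ~222.2
    }
}
```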
- HTTPS: Enabled by default in Azure Container Apps
- CORS: Configured for web UI access
- Secrets: Use Azure Key Vault for sensitive configuration
- Authentication: Add Azure AD authentication for production
This project is licensed under the MIT License - see LICENSE file.
Contributions welcome! Please read CONTRIBUTING.md first.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing`)
- Open a Pull Request
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@example.com
- LangChain4j - Java AI framework
- ONNX Runtime - Cross-platform inference engine
- Stable Diffusion - Image generation model
- Sentence Transformers - Text embeddings
- Add more ONNX models (BERT, GPT, etc.)
- Implement model quantization for faster inference
- Add batch processing endpoints
- Support for multi-GPU deployment
- Implement caching layer with Redis
- Add Prometheus metrics export
- Create Helm chart for Kubernetes deployment
Made with ❤️ using Java 21, Spring Boot, LangChain4j, and Azure
