
Prerequisites and System Requirements
You'll need a Rocky Linux 9 server with at least 4GB RAM and a CUDA-compatible GPU for optimal performance. This tutorial assumes you have root access and basic familiarity with Linux command line operations.
First, ensure your system has the latest updates:
sudo dnf update -y
sudo dnf groupinstall "Development Tools" -y
Rocky Linux does not ship NVIDIA drivers in its default repositories, so install the Python toolchain, then add NVIDIA's CUDA repository before installing the driver and toolkit:
sudo dnf install python3 python3-pip python3-devel -y
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install nvidia-driver:latest-dkms -y
sudo dnf install cuda-toolkit -y
Verify your NVIDIA GPU is properly detected:
nvidia-smi
The command should display your GPU information. Without proper GPU drivers, the inference pipeline will fall back to CPU processing, which significantly impacts performance for large models.
Install Python Dependencies and Virtual Environment
Create a dedicated directory for your ML inference project:
mkdir ~/ml-inference-pipeline
cd ~/ml-inference-pipeline
Set up a Python virtual environment to isolate dependencies:
python3 -m venv venv
source venv/bin/activate
Install the core dependencies for our ONNX Runtime and FastAPI setup:
pip install fastapi "uvicorn[standard]" onnxruntime-gpu numpy pillow python-multipart
pip install transformers torch torchvision
The onnxruntime-gpu package provides CUDA acceleration. If you encounter installation issues, fall back to onnxruntime for CPU-only inference.
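Rather than hard-coding the fallback, you can choose providers at startup based on what the installed ONNX Runtime actually reports. A minimal sketch (the helper name is illustrative; the commented lines show how it would be wired to a real session):

```python
def pick_providers(available):
    """Return a preferred execution-provider list for this machine.

    Prefers CUDA when present and always keeps the CPU fallback last.
    """
    preferred = []
    if "CUDAExecutionProvider" in available:
        preferred.append("CUDAExecutionProvider")
    preferred.append("CPUExecutionProvider")
    return preferred

# At startup you would call, for example:
# import onnxruntime as ort
# providers = pick_providers(ort.get_available_providers())
# session = ort.InferenceSession("resnet50.onnx", providers=providers)
```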
Convert Your Model to ONNX Format
Most ML frameworks can export models to ONNX format. Here's how to convert a PyTorch image classification model:
import torch
import torchvision.models as models

# Load a pre-trained ResNet model (recent torchvision releases deprecate
# the `pretrained` flag in favor of explicit weights enums)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Create a dummy input matching the model's expected NCHW shape
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    export_params=True,
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
)
Save this as convert_model.py and run it to generate your ONNX model file. The resulting resnet50.onnx contains the serialized model graph and weights, ready for inference.
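Before serving the exported model, it is worth confirming that ONNX Runtime reproduces PyTorch's output for the same input. A minimal numerical-comparison helper (pure NumPy; the commented lines sketch how you might wire it to the real model and session):

```python
import numpy as np

def outputs_close(torch_out, onnx_out, atol=1e-4):
    """Return True when two output arrays agree within an absolute tolerance."""
    return bool(np.allclose(np.asarray(torch_out), np.asarray(onnx_out), atol=atol))

# Typical usage after running convert_model.py:
# torch_out = model(dummy_input).detach().numpy()
# sess = ort.InferenceSession("resnet50.onnx")
# onnx_out = sess.run(None, {"input": dummy_input.numpy()})[0]
# assert outputs_close(torch_out, onnx_out)
```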
Build the FastAPI Inference Service
Create main.py with the core inference logic:
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware
import onnxruntime as ort
import numpy as np
from PIL import Image
import io

app = FastAPI(title="AI/ML Inference Pipeline")

# Enable CORS for web applications (restrict allow_origins in production)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize ONNX Runtime session with the GPU provider, falling back to CPU
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("resnet50.onnx", providers=providers)
def preprocess_image(image_bytes):
    """Preprocess an uploaded image for model inference."""
    image = Image.open(io.BytesIO(image_bytes))
    image = image.convert('RGB')
    image = image.resize((224, 224))
    # Scale pixel values to [0, 1]
    image_array = np.array(image).astype(np.float32) / 255.0
    # Apply ImageNet normalization; mean/std must be float32, otherwise NumPy
    # promotes the result to float64, which the ONNX model will reject
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    image_array = (image_array - mean) / std
    # Transpose to NCHW format and add a batch dimension
    image_array = np.transpose(image_array, (2, 0, 1))
    image_array = np.expand_dims(image_array, axis=0)
    return image_array
@app.get("/")
async def root():
    return {"message": "AI/ML Inference Pipeline is running"}

@app.get("/health")
async def health_check():
    return {"status": "healthy", "gpu_available": "CUDAExecutionProvider" in session.get_providers()}
@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    try:
        # Read and preprocess the image
        image_bytes = await file.read()
        input_tensor = preprocess_image(image_bytes)
        # Run inference
        input_name = session.get_inputs()[0].name
        result = session.run(None, {input_name: input_tensor})
        # Convert raw logits to probabilities with a softmax, then take the top 5
        logits = result[0][0]
        exp = np.exp(logits - np.max(logits))
        probabilities = exp / exp.sum()
        top_indices = np.argsort(probabilities)[::-1][:5]
        return {
            "predictions": [
                {"class_id": int(idx), "confidence": float(probabilities[idx])}
                for idx in top_indices
            ]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
This FastAPI application handles image uploads, preprocesses them for the model, and returns predictions with confidence scores. The ONNX Runtime automatically uses GPU acceleration when available.
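The top-5 selection in /predict boils down to an argsort over the model's output vector. Isolated as a small standalone helper (a sketch for illustration, not part of the service above):

```python
import numpy as np

def top_k(scores, k=5):
    """Return (class_index, score) pairs for the k highest-scoring classes,
    sorted from best to worst."""
    scores = np.asarray(scores)
    idx = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in idx]
```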
Configure Production-Ready Deployment
Create a systemd service file for automatic startup and process management. Save this as /etc/systemd/system/ml-inference.service:
[Unit]
Description=ML Inference Pipeline
After=network.target
[Service]
User=root
WorkingDirectory=/root/ml-inference-pipeline
Environment=PATH=/root/ml-inference-pipeline/venv/bin
ExecStart=/root/ml-inference-pipeline/venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable ml-inference
sudo systemctl start ml-inference
Check the service status to ensure it's running correctly, and follow the logs if anything fails:
sudo systemctl status ml-inference
sudo journalctl -u ml-inference -f
Set up Nginx Reverse Proxy with SSL
Install Nginx and obtain SSL certificates for secure HTTPS access:
sudo dnf install nginx certbot python3-certbot-nginx -y
Create an Nginx configuration file at /etc/nginx/conf.d/ml-inference.conf:
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Increase timeouts for large model inference
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
        # Handle large file uploads
        client_max_body_size 50M;
    }
}
Enable Nginx and obtain SSL certificates:
sudo systemctl enable nginx
sudo systemctl start nginx
sudo certbot --nginx -d your-domain.com
Certbot obtains a Let's Encrypt certificate, rewrites the Nginx configuration for HTTPS, and installs a timer for automatic renewal.
Performance Optimization and Monitoring
Monitor GPU utilization and memory usage during inference operations:
watch -n 1 nvidia-smi
For high-throughput scenarios, configure multiple Uvicorn worker processes by editing the --workers parameter in the systemd service file. Note that each worker loads its own copy of the model, so size the worker count against your available GPU memory.
Enable request logging by adding middleware to your FastAPI application:
import time
from fastapi import Request

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    print(f"{request.method} {request.url} - {response.status_code} - {process_time:.4f}s")
    return response
This helps identify performance bottlenecks and monitor inference latency across different model inputs.
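If per-request prints are too noisy, a rolling window of recent latencies gives percentile summaries instead. A minimal sketch (the class name is illustrative; you would call record() from the middleware above):

```python
from collections import deque

class LatencyTracker:
    """Keep the last `window` request latencies and report simple percentiles."""

    def __init__(self, window=100):
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def percentile(self, p):
        """Approximate p-th percentile of recorded latencies (None if empty)."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        i = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[i]
```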
For production deployments handling high request volumes, consider an additional process supervisor such as Supervisor if systemd alone does not provide the restart policies or log handling you need.
Edge Computing Optimizations
Edge deployments require careful resource management. Configure ONNX Runtime to use specific GPU memory allocation:
# Add to your main.py before creating the session
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Limit GPU memory usage for edge devices
cuda_provider_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kNextPowerOfTwo',
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GB limit, in bytes
}
providers = [('CUDAExecutionProvider', cuda_provider_options), 'CPUExecutionProvider']
session = ort.InferenceSession("resnet50.onnx", session_options, providers=providers)
This configuration prevents memory exhaustion on resource-constrained edge devices while maintaining acceptable inference performance.
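Since gpu_mem_limit is expressed in bytes, wrapping the provider setup in a small helper keeps the arithmetic readable and the cap easy to tune per device. A sketch (the function name is illustrative):

```python
def cuda_provider_with_limit(mem_gib, device_id=0):
    """Build an ONNX Runtime provider list that caps CUDA arena memory.

    mem_gib: GPU memory cap in GiB, converted to bytes as ONNX Runtime expects.
    """
    options = {
        'device_id': device_id,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': int(mem_gib * 1024 ** 3),
    }
    return [('CUDAExecutionProvider', options), 'CPUExecutionProvider']

# e.g. session = ort.InferenceSession("resnet50.onnx",
#                                     providers=cuda_provider_with_limit(2))
```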
Implement model quantization for reduced memory footprint:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize model weights to 8-bit integers for edge deployment
quantize_dynamic(
    "resnet50.onnx",
    "resnet50_quantized.onnx",
    weight_type=QuantType.QUInt8,
)
Quantized models use significantly less memory and provide faster inference on edge hardware, though with slight accuracy trade-offs.
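A quick way to see the memory savings is to compare the two model files on disk. A small helper for that (the function name is illustrative):

```python
import os

def size_mb(path):
    """File size in mebibytes, rounded to one decimal for display."""
    return round(os.path.getsize(path) / (1024 * 1024), 1)

# e.g. print(size_mb("resnet50.onnx"), "->", size_mb("resnet50_quantized.onnx"))
```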
Testing Your Inference Pipeline
Create a simple test script to validate your deployment:
import requests

# Test the health endpoint
response = requests.get("https://your-domain.com/health")
print(f"Health check: {response.json()}")

# Test prediction with a sample image
with open("test_image.jpg", "rb") as f:
    files = {"file": f}
    response = requests.post("https://your-domain.com/predict", files=files)
print(f"Prediction: {response.json()}")
Successful responses indicate your pipeline is processing requests correctly. Monitor the response times to ensure they meet your application requirements.
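To put a number on response times, a tiny timing helper averaged over several calls is usually enough for a first look. A sketch (the function name is illustrative; pass it any callable, such as a lambda wrapping the requests.post call above):

```python
import time

def time_call(fn, repeats=10):
    """Average wall-clock seconds per call of fn() over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# e.g. avg = time_call(lambda: requests.post(url, files=files), repeats=5)
```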
Ready to deploy your AI/ML inference pipeline in production? HostMyCode VPS provides GPU-enabled servers with Rocky Linux 9 support, perfect for machine learning workloads. Our managed VPS hosting includes 24/7 support to help optimize your AI applications.
Frequently Asked Questions
What GPU memory is required for ONNX Runtime inference?
Memory requirements depend on your model size. A ResNet-50 model typically needs 1-2GB GPU memory, while larger transformer models may require 8GB or more. Monitor GPU memory usage with nvidia-smi during inference operations.
Can I run multiple models simultaneously on the same server?
Yes, but GPU memory becomes the limiting factor. Load models on-demand or use model rotation strategies. Configure separate FastAPI instances on different ports for isolated model serving.
How do I handle different input formats beyond images?
Modify the preprocessing function in your FastAPI application. For text inputs, implement tokenization. For numerical data, ensure proper normalization and tensor reshaping match your model's expected input format.
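As a concrete illustration of that point, a tabular-input preprocessor might look like the following (function name and scaling values are hypothetical; your model's expected shape and normalization will differ):

```python
import numpy as np

def preprocess_tabular(values, means, stds):
    """Normalize a flat feature vector to float32 and add a batch dimension."""
    x = (np.asarray(values, dtype=np.float32) - means) / stds
    return np.expand_dims(x, axis=0)

# e.g. preprocess_tabular([2.0, 4.0],
#                         np.array([1.0, 2.0], dtype=np.float32),
#                         np.array([1.0, 2.0], dtype=np.float32))
```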
What's the performance difference between GPU and CPU inference?
GPU acceleration typically provides 5-10x speedup for deep learning models, especially with batch processing. CPU inference may be sufficient for simple models or low-throughput applications on edge devices.
How can I optimize inference latency for real-time applications?
Use model quantization, batch processing for multiple requests, and keep models loaded in memory. Consider TensorRT optimization for NVIDIA GPUs or ONNX Runtime execution providers specific to your hardware.