
Setting Up Your AI Infrastructure Foundation
Building a self-hosted AI model inference server puts you in control of your machine learning workloads. No more expensive cloud APIs or rate limits strangling your applications. You can deploy models like Llama 2, Code Llama, or Mistral directly on your hardware, cutting costs while boosting privacy.
Your server will process inference requests smoothly and deliver consistent response times. We'll use Ollama for model management, NVIDIA Container Toolkit for GPU acceleration, and configure load balancing to spread requests across multiple inference instances.
You'll need a HostMyCode VPS with at least 16GB of RAM and, ideally, GPU support. This setup scales from single-model deployments to multi-model serving clusters.
Installing Ollama for Self-Hosted AI Model Management
Ollama makes running large language models locally straightforward. It downloads models, handles quantization, and serves a REST API for inference requests.
Install Ollama on your Ubuntu server:
curl -fsSL https://ollama.ai/install.sh | sh
Check the installation:
ollama --version
Configure Ollama to start automatically:
sudo systemctl enable ollama
sudo systemctl start ollama
Download your first model. Llama 2 7B strikes a good balance between performance and resource usage:
ollama pull llama2:7b
Test the model:
ollama run llama2:7b "Write a Python function to calculate factorial"
The model loads into memory and processes your request. Response speed depends on your hardware specs and model size.
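Besides the CLI, Ollama serves the same model through its REST API on port 11434 (the default). The following is a minimal Python sketch, assuming Ollama is running locally and llama2:7b has been pulled; `build_payload` and `generate` are illustrative helper names, not part of Ollama itself:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default API port

def build_payload(prompt, model="llama2:7b", stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt, model="llama2:7b"):
    """POST a prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(generate("Write a Python function to calculate factorial"))
```

With `stream` set to False, the server returns one JSON object whose `response` field holds the full completion.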
Configuring NVIDIA Container Toolkit for GPU Acceleration
GPU acceleration transforms inference speed. The NVIDIA Container Toolkit lets Docker containers tap into GPU resources efficiently.
Install Docker if you haven't already:
sudo apt update
sudo apt install -y docker.io docker-compose
sudo systemctl enable docker
sudo systemctl start docker
Add the NVIDIA Container Toolkit repository (the older nvidia-docker repository and apt-key workflow are deprecated):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Install the toolkit:
sudo apt update
sudo apt install -y nvidia-container-toolkit
Configure Docker to use NVIDIA runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access works in containers:
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
This should display your GPU details, confirming proper toolkit configuration.
Deploying Ollama with Docker and GPU Support
Docker containers give you better isolation and resource management for Ollama. Create a Docker Compose setup:
mkdir ~/ollama-cluster
cd ~/ollama-cluster
nano docker-compose.yml
Add this configuration:
version: '3.8'

services:
  ollama-1:
    image: ollama/ollama:latest
    container_name: ollama-instance-1
    ports:
      - "11434:11434"
    volumes:
      - ollama-data-1:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0

  ollama-2:
    image: ollama/ollama:latest
    container_name: ollama-instance-2
    ports:
      - "11435:11434"
    volumes:
      - ollama-data-2:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0

volumes:
  ollama-data-1:
  ollama-data-2:
Launch the containers:
docker-compose up -d
Download models to each instance:
docker exec ollama-instance-1 ollama pull llama2:7b
docker exec ollama-instance-2 ollama pull llama2:7b
Test both instances:
curl http://localhost:11434/api/tags
curl http://localhost:11435/api/tags
Each instance maintains its own model cache and inference queue. This provides redundancy and scales your inference capacity.
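Before wiring up a load balancer, it helps to see the failover idea in code. This is a minimal sketch, assuming the two instance ports from the compose file above; `is_healthy` and `pick_backend` are hypothetical helper names. It probes each instance's /api/tags endpoint (which returns the model list when the instance is up) and returns the first healthy one:

```python
import json
import urllib.request

BACKENDS = ["http://localhost:11434", "http://localhost:11435"]

def is_healthy(base_url, timeout=2):
    """Probe an Ollama instance: /api/tags responds with JSON when it's up."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.loads(resp.read())
            return True
    except Exception:
        return False

def pick_backend(backends, probe=is_healthy):
    """Return the first backend whose probe succeeds, or None if all are down."""
    for url in backends:
        if probe(url):
            return url
    return None

# The probe is injectable, so the failover logic can be exercised
# without live servers:
# pick_backend(BACKENDS, probe=lambda u: u.endswith("11435"))
```

Nginx, configured in the next section, performs this kind of health tracking for you; the sketch just shows what "removing a failed instance from rotation" means.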
Implementing Load Balancing with Nginx
Load balancing spreads incoming requests across multiple Ollama instances. Open-source Nginx delivers efficient upstream load balancing with passive health checks: backends that repeatedly fail are temporarily pulled out of rotation.
Install Nginx:
sudo apt install -y nginx
Create the load balancer config:
sudo nano /etc/nginx/sites-available/ollama-lb
Add this configuration:
upstream ollama_backend {
    least_conn;
    server 127.0.0.1:11434 weight=1 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:11435 weight=1 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name your-domain.com;

    client_max_body_size 100M;
    proxy_read_timeout 300s;
    proxy_connect_timeout 75s;

    location /api/ {
        proxy_pass http://ollama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # HTTP/1.1 with connection upgrades, needed for streaming responses
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /health {
        access_log off;
        return 200 "healthy";
        add_header Content-Type text/plain;
    }
}
Activate the configuration:
sudo ln -s /etc/nginx/sites-available/ollama-lb /etc/nginx/sites-enabled/
sudo rm /etc/nginx/sites-enabled/default
sudo nginx -t
sudo systemctl reload nginx
Test load balancing:
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2:7b",
"prompt": "Explain quantum computing",
"stream": false
}'
Nginx automatically routes requests between backend instances using the least connections algorithm. Failed instances get temporarily removed from rotation.
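The least-connections policy always hands the next request to the backend with the fewest in-flight requests. This is an illustrative Python model of the decision, not Nginx's actual implementation:

```python
def least_conn(active_counts):
    """Pick the backend with the fewest in-flight requests.

    active_counts maps backend address -> current open connections,
    mirroring the counter Nginx tracks internally for least_conn.
    """
    return min(active_counts, key=active_counts.get)

# The instance on 11435 is busier, so 11434 gets the next request:
counts = {"127.0.0.1:11434": 2, "127.0.0.1:11435": 5}
print(least_conn(counts))
```

For LLM inference this usually beats plain round-robin, because individual requests have wildly different durations and a long generation can otherwise pile requests onto a busy instance.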
For production deployments, HostMyCode managed VPS hosting includes server monitoring and automated backups for your AI infrastructure.
Monitoring and Performance Optimization
Keep tabs on your self-hosted AI model inference server to maintain peak performance. Watch request latency, GPU utilization, memory usage, and error rates.
Install htop for basic system monitoring:
sudo apt install -y htop
Monitor GPU usage:
nvidia-smi -l 1
Check Docker container resources:
docker stats
Create a monitoring script (it uses jq, so install it first with sudo apt install -y jq):
nano monitor-ollama.sh
#!/bin/bash
echo "=== Ollama Health Check ==="
curl -s http://localhost/health
echo ""
echo "=== Backend Status ==="
curl -s http://localhost:11434/api/tags | jq '.models | length'
curl -s http://localhost:11435/api/tags | jq '.models | length'
echo "=== GPU Memory Usage ==="
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
echo "=== Container Memory Usage ==="
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
Make it executable and run:
chmod +x monitor-ollama.sh
./monitor-ollama.sh
For deeper monitoring, integrate tools like Beszel for modern server monitoring with clean dashboards and minimal resource overhead.
Securing Your AI Inference Server
Security matters when exposing AI services publicly. Add authentication, rate limiting, and access controls to protect your infrastructure.
Set up basic authentication in Nginx:
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd admin
Update the Nginx configuration to require authentication:
location /api/ {
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://ollama_backend;
    # ... rest of proxy configuration
}
Add rate limiting (the limit_req_zone directive belongs in the http block of /etc/nginx/nginx.conf):
http {
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/m;

    server {
        location /api/ {
            limit_req zone=api burst=5 nodelay;
            # ... rest of configuration
        }
    }
}
Configure the UFW firewall. Allow SSH before enabling the firewall so you don't lock yourself out of the server:
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
For comprehensive security, follow the complete guide on hardening your Linux VPS for production which covers SSH, UFW, Fail2Ban, and additional security measures.
Scaling and Multi-Model Deployment
Your self-hosted AI model inference server can handle multiple models at once. Different models shine at specific tasks—Code Llama excels at programming, Mistral handles general queries well, and specialized models tackle domain-specific work.
Download additional models:
docker exec ollama-instance-1 ollama pull codellama:7b
docker exec ollama-instance-1 ollama pull mistral:7b
docker exec ollama-instance-2 ollama pull codellama:7b
docker exec ollama-instance-2 ollama pull mistral:7b
Set up model-specific endpoint paths in Nginx (note that the model itself is still selected by the "model" field in the request body; these paths just give clients distinct URLs to target):
location /api/llama/ {
    rewrite ^/api/llama/(.*)$ /api/$1 break;
    proxy_pass http://ollama_backend;
    # ... proxy headers
}

location /api/codellama/ {
    rewrite ^/api/codellama/(.*)$ /api/$1 break;
    proxy_pass http://ollama_backend;
    # ... proxy headers
}
Test different models:
# General query
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2:7b", "prompt": "Explain machine learning", "stream": false}'
# Code generation
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "codellama:7b", "prompt": "Write a Python class for user management", "stream": false}'
Track resource usage across models. Some models demand more VRAM or processing power. Adjust your instance allocation based on actual usage patterns.
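One practical way to track VRAM headroom is to parse the nvidia-smi query already used in the monitoring section. This is a small sketch; `parse_gpu_memory` and `vram_headroom` are illustrative helper names:

```python
def parse_gpu_memory(csv_output):
    """Parse the output of:
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    into a list of (used_mib, total_mib) tuples, one per GPU."""
    gpus = []
    for line in csv_output.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        gpus.append((used, total))
    return gpus

def vram_headroom(csv_output):
    """Free MiB per GPU -- worth checking before loading another model."""
    return [total - used for used, total in parse_gpu_memory(csv_output)]

# Sample output from a single 24 GB GPU with one 7B model loaded:
sample = "5213, 24576"
print(vram_headroom(sample))  # [19363]
```

Run this periodically (or wire it into the monitoring script) and you'll notice quickly when an instance no longer has room for an additional model.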
API Integration and Client Applications
Your self-hosted AI model inference server exposes Ollama's native REST API, and Ollama also offers an OpenAI-compatible endpoint at /v1/chat/completions. This compatibility makes integration with existing applications painless.
Create a Python client example (install the requests library first with pip3 install requests):
nano test-client.py
import requests

def query_ollama(prompt, model="llama2:7b"):
    url = "http://your-domain.com/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 100
        }
    }
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        return response.json()["response"]
    else:
        return f"Error: {response.status_code}"

# Test the client
result = query_ollama("Write a haiku about servers")
print(result)
Run the client:
python3 test-client.py
Node.js applications work just as well with similar integration. The standardized API format ensures compatibility across programming languages and frameworks.
Consider deploying client applications using HostMyCode Node.js hosting for complete full-stack AI-powered solutions.
Ready to build your own self-hosted AI model inference server? HostMyCode VPS hosting provides the computational power and flexibility you need for AI workloads. Our managed VPS solutions include server monitoring, automated backups, and 24/7 support to keep your AI infrastructure running smoothly.
What are the hardware requirements for running AI models locally?
You need at least 16GB RAM and a modern CPU with AVX2 support. For better performance, use GPUs with 8GB+ VRAM. Larger models like Llama 2 13B require 32GB+ system RAM or high-memory GPUs. NVMe SSDs speed up model loading significantly.
How does Ollama compare to other AI model serving solutions?
Ollama wins on simplicity compared to solutions like TensorRT-LLM or Triton Inference Server. It handles model quantization automatically, provides a clean REST API, and manages memory efficiently. Enterprise solutions offer more advanced features like batching and multi-GPU scaling, but with added complexity.
Can I run multiple different AI models simultaneously?
Yes, Ollama supports running multiple models at once. Each model uses GPU memory, so make sure you have enough VRAM. You can load models on-demand or keep popular models resident in memory. Load balancing spreads requests across available model instances.
What's the difference between streaming and non-streaming responses?
Streaming responses show real-time token generation, making long outputs feel faster. Non-streaming waits for complete generation before returning results. Streaming requires clients to handle chunked HTTP responses (Ollama emits newline-delimited JSON), which adds complexity but creates a better user experience.
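When streaming is enabled, Ollama sends one JSON object per line, each carrying a fragment of the answer in its "response" field, with "done": true on the final chunk. A minimal sketch of reassembling the full text (`assemble_stream` is an illustrative helper name):

```python
import json

def assemble_stream(ndjson_lines):
    """Reassemble a complete response from Ollama's streaming output.

    Each chunk is a JSON object on its own line with a partial
    "response" string; the final chunk sets "done": true.
    """
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated chunks, shaped like the server's streamed lines:
chunks = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "!", "done": true}',
]
print(assemble_stream(chunks))  # Hello, world!
```

In a real client you would iterate over the HTTP response line by line (for example with requests' iter_lines) instead of a pre-built list.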
How do I troubleshoot GPU memory errors?
GPU memory errors happen when models exceed available VRAM. Try smaller model variants (7B instead of 13B), enable model quantization, or add more GPU memory. Monitor usage with nvidia-smi and consider model offloading for memory management.
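A back-of-envelope estimate helps predict these errors before they happen: weight memory is roughly parameters times bits per weight divided by 8, plus overhead for the KV cache and runtime buffers. This sketch uses a rough 20% overhead factor, which is an assumption, not a measured constant:

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate for serving a model.

    bits_per_weight: 16 for fp16, 4 for common Q4 quantization.
    overhead: guessed multiplier for KV cache and runtime buffers.
    """
    weight_gb = params_billion * bits_per_weight / 8  # weights alone
    return weight_gb * overhead

# A 7B model: ~16.8 GB at fp16, but only ~4.2 GB at 4-bit quantization
print(round(estimate_vram_gb(7, 16), 1))  # 16.8
print(round(estimate_vram_gb(7, 4), 1))   # 4.2
```

The arithmetic explains why a quantized 7B model fits comfortably on an 8GB GPU while the fp16 variant does not.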
What security considerations apply to self-hosted AI servers?
Add authentication, rate limiting, and network firewalls. AI models can generate harmful content, so include content filtering. Secure API endpoints with HTTPS, monitor usage patterns, and limit access to authorized users. Keep security updates current.
How do I optimize inference performance?
Performance optimization includes proper model quantization, GPU acceleration, tuned batch sizes, and efficient load balancing. Monitor GPU utilization, memory usage, and request latency. Consider model-specific optimizations like speculative decoding for faster generation speeds.
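For latency monitoring specifically, percentiles are more informative than averages, since one slow generation can dominate the mean. A small sketch using the nearest-rank method (`percentile` is an illustrative helper; timings would come from wrapping your inference calls with a timer):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of latency samples (in seconds)."""
    ordered = sorted(samples)
    # Index of the pct-th percentile value, clamped to valid range
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical per-request latencies; one slow outlier skews the mean
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 0.7, 1.3, 0.9, 1.2, 1.0]
print(percentile(latencies, 50))  # median: typical user experience
print(percentile(latencies, 95))  # tail: captures the slow requests
```

Tracking p50 and p95 side by side makes regressions visible: a rising p95 with a flat p50 usually points at queueing on one overloaded instance rather than a model-wide slowdown.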