
Setting Up Your AI Infrastructure Foundation
Building a self-hosted AI model inference server puts you in control of your machine learning workloads. No more expensive cloud APIs or rate limits strangling your applications. You can deploy models like Llama 2, Code Llama, or Mistral directly on your hardware, cutting costs while boosting privacy.
Your server will process inference requests smoothly and deliver consistent response times. We'll use Ollama for model management, NVIDIA Container Toolkit for GPU acceleration, and configure load balancing to spread requests across multiple inference instances.
You'll need a HostMyCode VPS with at least 16GB of RAM and, ideally, GPU support. This setup scales from single-model deployments to multi-model serving clusters.
Installing Ollama for Self-Hosted AI Model Management
Ollama makes running large language models locally straightforward. It downloads models, handles quantization, and serves a REST API for inference requests.
Install Ollama on your Ubuntu server:
curl -fsSL https://ollama.ai/install.sh | sh
Check the installation:
ollama --version
Configure Ollama to start automatically:
sudo systemctl enable ollama
sudo systemctl start ollama
Download your first model. Llama 2 7B strikes a good balance between performance and resource usage:
ollama pull llama2:7b
Test the model:
ollama run llama2:7b "Write a Python function to calculate factorial"
The model loads into memory and processes your request. Response speed depends on your hardware specs and model size.
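Besides the CLI, Ollama serves the same model through its REST API on port 11434 (the default). The following is a minimal Python sketch, assuming Ollama is running locally and llama2:7b has been pulled; `build_payload` and `generate` are illustrative helper names, not part of Ollama itself:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default API port

def build_payload(prompt, model="llama2:7b", stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt, model="llama2:7b"):
    """POST a prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(generate("Write a Python function to calculate factorial"))
```

With `stream` set to False, the server returns one JSON object whose `response` field holds the full completion.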
Configuring NVIDIA Container Toolkit for GPU Acceleration
GPU acceleration transforms inference speed. The NVIDIA Container Toolkit lets Docker containers tap into GPU resources efficiently.
Install Docker if you haven't already:
sudo apt update
sudo apt install -y docker.io docker-compose
sudo systemctl enable docker
sudo systemctl start docker
Add the NVIDIA Container Toolkit repository (the older nvidia-docker repository and apt-key workflow are deprecated):
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Install the toolkit:
sudo apt update
sudo apt install -y nvidia-container-toolkit
Configure Docker to use NVIDIA runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU access works in containers:
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
This should display your GPU details, confirming proper toolkit configuration.
Deploying Ollama with Docker and GPU Support
Docker containers give you better isolation and resource management for Ollama. Create a Docker Compose setup:
mkdir ~/ollama-cluster
cd ~/ollama-cluster
nano docker-compose.yml
Add this configuration:
version: '3.8'

services:
  ollama-1:
    image: ollama/ollama:latest
    container_name: ollama-instance-1
    ports:
      - "11434:11434"
    volumes:
      - ollama-data-1:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0

  ollama-2:
    image: ollama/ollama:latest
    container_name: ollama-instance-2
    ports:
      - "11435:11434"
    volumes:
      - ollama-data-2:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0

volumes:
  ollama-data-1:
  ollama-data-2:
Launch the containers:
docker-compose up -d
Download models to each instance:
docker exec ollama-instance-1 ollama pull llama2:7b
docker exec ollama-instance-2 ollama pull llama2:7b
Test both instances:
curl http://localhost:11434/api/tags
curl http://localhost:11435/api/tags
Each instance maintains its own model cache and inference queue. This provides redundancy and scales your inference capacity.
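Before wiring up a load balancer, it helps to see the failover idea in code. This is a minimal sketch, assuming the two instance ports from the compose file above; `is_healthy` and `pick_backend` are hypothetical helper names. It probes each instance's /api/tags endpoint (which returns the model list when the instance is up) and returns the first healthy one:

```python
import json
import urllib.request

BACKENDS = ["http://localhost:11434", "http://localhost:11435"]

def is_healthy(base_url, timeout=2):
    """Probe an Ollama instance: /api/tags responds with JSON when it's up."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.loads(resp.read())
            return True
    except Exception:
        return False

def pick_backend(backends, probe=is_healthy):
    """Return the first backend whose probe succeeds, or None if all are down."""
    for url in backends:
        if probe(url):
            return url
    return None

# The probe is injectable, so the failover logic can be exercised
# without live servers:
# pick_backend(BACKENDS, probe=lambda u: u.endswith("11435"))
```

Nginx, configured in the next section, performs this kind of health tracking for you; the sketch just shows what "removing a failed instance from rotation" means.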
Implementing Load Balancing with Nginx
Load balancing spreads incoming requests across multiple Ollama instances. Open-source Nginx delivers efficient upstream load balancing with passive health checks: backends that repeatedly fail are temporarily pulled out of rotation.
Install Nginx:
sudo apt install -y nginx
Create the load balancer config:
sudo nano /etc/nginx/sites-available/ollama-lb
Add this configuration:
upstream ollama_backend {
    least_conn;
    server 127.0.0.1:11434 weight=1 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:11435 weight=1 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name your-domain.com;

    client_max_body_size 100M;
    proxy_read_timeout 300s;
    proxy_connect_timeout 75s;

    location /api/ {
        proxy_pass http://ollama_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # HTTP/1.1 with connection upgrades, needed for streaming responses
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /health {
        access_log off;
        return 200 "healthy";
        add_header Content-Type text/plain;
    }
}
Activate the configuration:
sudo ln -s /etc/nginx/sites-available/ollama-lb /etc/nginx/sites-enabled/
sudo rm /etc/nginx/sites-enabled/default
sudo nginx -t
sudo systemctl reload nginx
Test load balancing:
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2:7b",
"prompt": "Explain quantum computing",
"stream": false
}'
Nginx automatically routes requests between backend instances using the least connections algorithm. Failed instances get temporarily removed from rotation.
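The least-connections policy always hands the next request to the backend with the fewest in-flight requests. This is an illustrative Python model of the decision, not Nginx's actual implementation:

```python
def least_conn(active_counts):
    """Pick the backend with the fewest in-flight requests.

    active_counts maps backend address -> current open connections,
    mirroring the counter Nginx tracks internally for least_conn.
    """
    return min(active_counts, key=active_counts.get)

# The instance on 11435 is busier, so 11434 gets the next request:
counts = {"127.0.0.1:11434": 2, "127.0.0.1:11435": 5}
print(least_conn(counts))
```

For LLM inference this usually beats plain round-robin, because individual requests have wildly different durations and a long generation can otherwise pile requests onto a busy instance.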
For production deployments, HostMyCode managed VPS hosting includes server monitoring and automated backups for your AI infrastructure.
Monitoring and Performance Optimization
Keep tabs on your self-hosted AI model inference server to maintain peak performance. Watch request latency, GPU utilization, memory usage, and error rates.
Install htop for basic system monitoring:
sudo apt install -y htop
Monitor GPU usage:
nvidia-smi -l 1
Check Docker container resources:
docker stats
Create a monitoring script (it uses jq, so install it first with sudo apt install -y jq):
nano monitor-ollama.sh
#!/bin/bash
echo "=== Ollama Health Check ==="
curl -s http://localhost/health
echo ""
echo "=== Backend Status ==="
curl -s http://localhost:11434/api/tags | jq '.models | length'
curl -s http://localhost:11435/api/tags | jq '.models | length'
echo "=== GPU Memory Usage ==="
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
echo "=== Container Memory Usage ==="
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
Make it executable and run:
chmod +x monitor-ollama.sh
./monitor-ollama.sh
For deeper monitoring, integrate tools like Beszel for modern server monitoring with clean dashboards and minimal resource overhead.
Securing Your AI Inference Server
Security matters when exposing AI services publicly. Add authentication, rate limiting, and access controls to protect your infrastructure.
Set up basic authentication in Nginx:
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd admin
Update the Nginx configuration to require authentication:
location /api/ {
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://ollama_backend;
    # ... rest of proxy configuration
}
Add rate limiting (the limit_req_zone directive belongs in the http block of /etc/nginx/nginx.conf):
http {
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/m;

    server {
        location /api/ {
            limit_req zone=api burst=5 nodelay;
            # ... rest of configuration
        }
    }
}
Configure the UFW firewall. Allow SSH before enabling the firewall so you don't lock yourself out of the server:
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
For comprehensive security, follow the complete guide on hardening your Linux VPS for production which covers SSH, UFW, Fail2Ban, and additional security measures.
Scaling and Multi-Model Deployment
Your self-hosted AI model inference server can handle multiple models at once. Different models shine at specific tasks—Code Llama excels at programming, Mistral handles general queries well, and specialized models tackle domain-specific work.
Download additional models:
docker exec ollama-instance-1 ollama pull codellama:7b
docker exec ollama-instance-1 ollama pull mistral:7b
docker exec ollama-instance-2 ollama pull codellama:7b
docker exec ollama-instance-2 ollama pull mistral:7b
Set up model-specific endpoint paths in Nginx (note that the model itself is still selected by the "model" field in the request body; these paths just give clients distinct URLs to target):
location /api/llama/ {
    rewrite ^/api/llama/(.*)$ /api/$1 break;
    proxy_pass http://ollama_backend;
    # ... proxy headers
}

location /api/codellama/ {
    rewrite ^/api/codellama/(.*)$ /api/$1 break;
    proxy_pass http://ollama_backend;
    # ... proxy headers
}
Test different models:
# General query
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2:7b", "prompt": "Explain machine learning", "stream": false}'
# Code generation
curl -X POST http://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "codellama:7b", "prompt": "Write a Python class for user management", "stream": false}'
Track resource usage across models. Some models demand more VRAM or processing power. Adjust your instance allocation based on actual usage patterns.
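One practical way to track VRAM headroom is to parse the nvidia-smi query already used in the monitoring section. This is a small sketch; `parse_gpu_memory` and `vram_headroom` are illustrative helper names:

```python
def parse_gpu_memory(csv_output):
    """Parse the output of:
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    into a list of (used_mib, total_mib) tuples, one per GPU."""
    gpus = []
    for line in csv_output.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        gpus.append((used, total))
    return gpus

def vram_headroom(csv_output):
    """Free MiB per GPU -- worth checking before loading another model."""
    return [total - used for used, total in parse_gpu_memory(csv_output)]

# Sample output from a single 24 GB GPU with one 7B model loaded:
sample = "5213, 24576"
print(vram_headroom(sample))  # [19363]
```

Run this periodically (or wire it into the monitoring script) and you'll notice quickly when an instance no longer has room for an additional model.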
API Integration and Client Applications
Your self-hosted AI model inference server exposes Ollama's native REST API, and Ollama also offers an OpenAI-compatible endpoint at /v1/chat/completions. This compatibility makes integration with existing applications painless.
Create a Python client example (install the requests library first with pip3 install requests):
nano test-client.py
import requests

def query_ollama(prompt, model="llama2:7b"):
    url = "http://your-domain.com/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 100
        }
    }
    response = requests.post(url, json=payload)
    if response.status_code == 200:
        return response.json()["response"]
    else:
        return f"Error: {response.status_code}"

# Test the client
result = query_ollama("Write a haiku about servers")
print(result)
Run the client:
python3 test-client.py
Node.js applications work just as well with similar integration. The standardized API format ensures compatibility across programming languages and frameworks.
Consider deploying client applications using HostMyCode Node.js hosting for complete full-stack AI-powered solutions.
Ready to build your own self-hosted AI model inference server? HostMyCode VPS hosting provides the computational power and flexibility you need for AI workloads. Our managed VPS solutions include server monitoring, automated backups, and 24/7 support to keep your AI infrastructure running smoothly.
What are the hardware requirements for running AI models locally?
You need at least 16GB RAM and a modern CPU with AVX2 support. For better performance, use GPUs with 8GB+ VRAM. Larger models like Llama 2 13B require 32GB+ system RAM or high-memory GPUs. NVMe SSDs speed up model loading significantly.
How does Ollama compare to other AI model serving solutions?
Ollama wins on simplicity compared to solutions like TensorRT-LLM or Triton Inference Server. It handles model quantization automatically, provides a clean REST API, and manages memory efficiently. Enterprise solutions offer more advanced features like batching and multi-GPU scaling, but with added complexity.
Can I run multiple different AI models simultaneously?
Yes, Ollama supports running multiple models at once. Each model uses GPU memory, so make sure you have enough VRAM. You can load models on-demand or keep popular models resident in memory. Load balancing spreads requests across available model instances.
What's the difference between streaming and non-streaming responses?
Streaming responses show real-time token generation, making long outputs feel faster. Non-streaming waits for complete generation before returning results. Streaming requires clients to handle chunked HTTP responses (Ollama emits newline-delimited JSON), which adds complexity but creates a better user experience.
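When streaming is enabled, Ollama sends one JSON object per line, each carrying a fragment of the answer in its "response" field, with "done": true on the final chunk. A minimal sketch of reassembling the full text (`assemble_stream` is an illustrative helper name):

```python
import json

def assemble_stream(ndjson_lines):
    """Reassemble a complete response from Ollama's streaming output.

    Each chunk is a JSON object on its own line with a partial
    "response" string; the final chunk sets "done": true.
    """
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated chunks, shaped like the server's streamed lines:
chunks = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "!", "done": true}',
]
print(assemble_stream(chunks))  # Hello, world!
```

In a real client you would iterate over the HTTP response line by line (for example with requests' iter_lines) instead of a pre-built list.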
How do I troubleshoot GPU memory errors?
GPU memory errors happen when models exceed available VRAM. Try smaller model variants (7B instead of 13B), enable model quantization, or add more GPU memory. Monitor usage with nvidia-smi and consider model offloading for memory management.
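A back-of-envelope estimate helps predict these errors before they happen: weight memory is roughly parameters times bits per weight divided by 8, plus overhead for the KV cache and runtime buffers. This sketch uses a rough 20% overhead factor, which is an assumption, not a measured constant:

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate for serving a model.

    bits_per_weight: 16 for fp16, 4 for common Q4 quantization.
    overhead: guessed multiplier for KV cache and runtime buffers.
    """
    weight_gb = params_billion * bits_per_weight / 8  # weights alone
    return weight_gb * overhead

# A 7B model: ~16.8 GB at fp16, but only ~4.2 GB at 4-bit quantization
print(round(estimate_vram_gb(7, 16), 1))  # 16.8
print(round(estimate_vram_gb(7, 4), 1))   # 4.2
```

The arithmetic explains why a quantized 7B model fits comfortably on an 8GB GPU while the fp16 variant does not.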
What security considerations apply to self-hosted AI servers?
Add authentication, rate limiting, and network firewalls. AI models can generate harmful content, so include content filtering. Secure API endpoints with HTTPS, monitor usage patterns, and limit access to authorized users. Keep security updates current.
How do I optimize inference performance?
Performance optimization includes proper model quantization, GPU acceleration, tuned batch sizes, and efficient load balancing. Monitor GPU utilization, memory usage, and request latency. Consider model-specific optimizations like speculative decoding for faster generation speeds.
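For latency monitoring specifically, percentiles are more informative than averages, since one slow generation can dominate the mean. A small sketch using the nearest-rank method (`percentile` is an illustrative helper; timings would come from wrapping your inference calls with a timer):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of latency samples (in seconds)."""
    ordered = sorted(samples)
    # Index of the pct-th percentile value, clamped to valid range
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical per-request latencies; one slow outlier skews the mean
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 0.7, 1.3, 0.9, 1.2, 1.0]
print(percentile(latencies, 50))  # median: typical user experience
print(percentile(latencies, 95))  # tail: captures the slow requests
```

Tracking p50 and p95 side by side makes regressions visible: a rising p95 with a flat p50 usually points at queueing on one overloaded instance rather than a model-wide slowdown.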