
A single VPS is easy to live with—right up until it falls over. A kernel panic, a bad upgrade, a noisy neighbor, or routine provider maintenance can turn “a quick reboot” into real downtime. If your service can run on two modest servers, you can buy a lot of resilience with a proven pattern: a floating IP managed by VRRP.
This post walks through Linux VPS failover with keepalived in an operations-first way. You’ll set up two VPS nodes to advertise one shared virtual IP, add health checks so only a healthy node can hold it, verify behavior under stress, and keep a rollback plan ready. The example uses Debian 12, Nginx, and systemd, but the same design translates cleanly to Ubuntu and most VPS/cloud networks that support an additional IP routed to your instances.
What you’re building (and what it’s not)
You’re setting up active/passive failover for one public endpoint. One node serves traffic at a time (MASTER). The other waits (BACKUP) and takes over the same IP if the MASTER becomes unhealthy.
- Good fit: a small API, landing page, webhook receiver, internal tool, or edge proxy.
- Not included: database replication. If your app writes to a DB, you still need a DB HA plan (or a managed DB).
- Reality check: failover can be quick (1–5 seconds), but not instant. Long-lived connections will reconnect.
If you want better visibility into the exact moment failover happens, pair this with observability stack architecture for microservices or a lighter-weight option like OpenTelemetry Collector monitoring agent setup.
Prerequisites (specific, not aspirational)
- Two VPS instances on the same L2 network or a provider-supported VRRP segment (often labeled “private network”, “VPC”, or “shared VLAN”).
- A floating IP / additional routed IP you can move between nodes, or an IP the provider routes to whichever MAC/instance announces it. Some platforms require an API call instead of VRRP; keepalived still works for internal VIPs.
- Debian 12 on both nodes (commands shown). Root or sudo access.
- Nginx installed on both nodes (or any service you’ll health-check).
- VRRP traffic allowed between the two nodes on the private interface. Note: VRRP is IP protocol 112, not a UDP or TCP port.
Hosting note: for predictable networking and stable performance, run this on a VPS with dedicated resources and a provider that supports additional IPs. A HostMyCode VPS works well for small HA pairs, and you can move up to managed VPS hosting if you want patching and baseline hardening handled consistently.
Scenario details (so you can follow along exactly)
We’ll use:
- node-a: 10.10.40.11 (private), public IP irrelevant
- node-b: 10.10.40.12 (private)
- VIP (floating IP on private network): 10.10.40.50/24
- Private interface: ens6 on both nodes
- Service: Nginx on port 80 (health check verifies local HTTP)
- VRRP instance name: VI_40
- Virtual router ID: 40 (must match on both nodes)
Adjust interface names and addresses to match your environment. Don’t guess the interface name—confirm it.
Step 1: Confirm networking and interface names
On both nodes:
ip -br a
ip r
Expected output looks like this (example):
lo UNKNOWN 127.0.0.1/8 ::1/128
ens6 UP 10.10.40.11/24
ens3 UP 203.0.113.10/24
default via 203.0.113.1 dev ens3
10.10.40.0/24 dev ens6 proto kernel scope link src 10.10.40.11
You need a reachable private subnet on both nodes (10.10.40.0/24 here). VRRP heartbeats should go over that private interface, not the public NIC.
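It's also worth confirming the private path works in both directions before touching keepalived (addresses from this walkthrough):

```shell
# From node-a: verify the private link to node-b.
ping -c 3 10.10.40.12
# From node-b: verify the reverse path.
ping -c 3 10.10.40.11
```

If either direction drops packets, fix the private network first; VRRP heartbeats will be unreliable otherwise.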
Step 2: Install keepalived (and a small toolbox)
On both nodes:
sudo apt-get update
sudo apt-get install -y keepalived curl iproute2
Check version and service status:
keepalived -v
systemctl status keepalived --no-pager
If keepalived shows as inactive, that’s fine. It won’t do anything useful until you add a config.
Step 3: Create a health check that controls failover
VRRP alone only tells you whether a host is present. That’s not the same as “able to serve traffic.” You want failover to trigger when the service is broken, even if the kernel is still responding.
Create a script that exits non-zero if the local web endpoint fails. On both nodes:
sudo install -d -m 0755 /etc/keepalived/scripts
sudo tee /etc/keepalived/scripts/check_nginx_local.sh >/dev/null <<'EOF'
#!/bin/sh
# Fail if Nginx isn't active
systemctl is-active --quiet nginx || exit 2
# Fail if local HTTP doesn't return 200
code=$(curl -s -m 1 -o /dev/null -w '%{http_code}' http://127.0.0.1/healthz)
[ "$code" = "200" ] || exit 3
exit 0
EOF
sudo chmod 0755 /etc/keepalived/scripts/check_nginx_local.sh
Add a lightweight health endpoint in Nginx on both nodes. Create:
sudo tee /etc/nginx/conf.d/healthz.conf >/dev/null <<'EOF'
server {
    listen 127.0.0.1:80;
    server_name _;

    location = /healthz {
        access_log off;
        default_type text/plain;
        return 200 "ok\n";
    }
}
EOF
sudo nginx -t
sudo systemctl reload nginx
Verify the endpoint:
curl -i http://127.0.0.1/healthz
Expected:
HTTP/1.1 200 OK
Content-Type: text/plain
ok
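With the endpoint in place, you can exercise the check script's exit codes by hand before keepalived ever runs it:

```shell
# Healthy path: unit active and /healthz returns 200.
sudo /etc/keepalived/scripts/check_nginx_local.sh; echo "exit=$?"   # expect exit=0
# Failure path: with Nginx stopped, the first check fails.
sudo systemctl stop nginx
sudo /etc/keepalived/scripts/check_nginx_local.sh; echo "exit=$?"   # expect exit=2
sudo systemctl start nginx
```

If you see exit=3 instead, the unit is running but the HTTP check failed; re-check the healthz.conf from above.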
Step 4: Configure keepalived on node-a (MASTER)
Edit /etc/keepalived/keepalived.conf on node-a:
sudo tee /etc/keepalived/keepalived.conf >/dev/null <<'EOF'
! Configuration File for keepalived
global_defs {
    router_id node-a
    enable_script_security
    script_user root
}

vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx_local.sh"
    interval 2
    timeout 2
    fall 2
    rise 2
    weight -30
}

vrrp_instance VI_40 {
    state MASTER
    interface ens6
    virtual_router_id 40
    priority 120
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 9c7b3a2f0f
    }
    virtual_ipaddress {
        10.10.40.50/24 dev ens6 label ens6:vip
    }
    track_script {
        chk_nginx
    }
    notify_master "/usr/local/sbin/vrrp_notify.sh MASTER"
    notify_backup "/usr/local/sbin/vrrp_notify.sh BACKUP"
    notify_fault "/usr/local/sbin/vrrp_notify.sh FAULT"
}
EOF
Create the notify script (used for logging and optional actions):
sudo tee /usr/local/sbin/vrrp_notify.sh >/dev/null <<'EOF'
#!/bin/sh
state="$1"
logger -t keepalived-notify "VRRP state changed to: $state on $(hostname)"
# Optional: place extra actions here.
# Example: warm cache, flip feature flags, etc.
EOF
sudo chmod 0755 /usr/local/sbin/vrrp_notify.sh
Step 5: Configure keepalived on node-b (BACKUP)
On node-b, keep the VRRP details identical, but set a lower priority and BACKUP state:
sudo tee /etc/keepalived/keepalived.conf >/dev/null <<'EOF'
! Configuration File for keepalived
global_defs {
    router_id node-b
    enable_script_security
    script_user root
}

vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx_local.sh"
    interval 2
    timeout 2
    fall 2
    rise 2
    weight -30
}

vrrp_instance VI_40 {
    state BACKUP
    interface ens6
    virtual_router_id 40
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 9c7b3a2f0f
    }
    virtual_ipaddress {
        10.10.40.50/24 dev ens6 label ens6:vip
    }
    track_script {
        chk_nginx
    }
    notify_master "/usr/local/sbin/vrrp_notify.sh MASTER"
    notify_backup "/usr/local/sbin/vrrp_notify.sh BACKUP"
    notify_fault "/usr/local/sbin/vrrp_notify.sh FAULT"
}
EOF
sudo tee /usr/local/sbin/vrrp_notify.sh >/dev/null <<'EOF'
#!/bin/sh
state="$1"
logger -t keepalived-notify "VRRP state changed to: $state on $(hostname)"
EOF
sudo chmod 0755 /usr/local/sbin/vrrp_notify.sh
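Before starting anything, it's worth a syntax check on both nodes. keepalived 2.x (the version Debian 12 ships) has a config-test mode:

```shell
# Parse the config without touching running state; reports errors
# and exits non-zero if the file is invalid.
sudo keepalived --config-test -f /etc/keepalived/keepalived.conf && echo "config OK"
```

Catching a typo here is much cheaper than debugging a silent FAULT state later.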
Step 6: Allow VRRP (IP protocol 112) on the private interface
If you run nftables, add a narrow rule allowing VRRP only on the private subnet. Example ruleset snippet:
sudo tee /etc/nftables.d/20-vrrp.nft >/dev/null <<'EOF'
add rule inet filter input iifname "ens6" ip saddr 10.10.40.0/24 ip protocol vrrp accept
EOF
Then include it from your main config (layout varies). If you already use an includes directory, reload:
sudo nft -f /etc/nftables.conf
If you don’t know how your firewall config is assembled, don’t freestyle on a production host. Follow your existing patterns, or use a safe migration plan like iptables to nftables migration plan.
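If failover later misbehaves, first confirm VRRP adverts actually cross this link. On the BACKUP node you should see one multicast packet per second from the MASTER (tcpdump is an extra install: apt-get install -y tcpdump):

```shell
# VRRP is IP protocol 112; adverts go to multicast 224.0.0.18.
sudo tcpdump -ni ens6 -c 5 'ip proto 112'
```

If nothing appears here after Step 7, suspect the firewall rule above or provider-level filtering.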
Step 7: Start keepalived and watch it claim the VIP
Start on both nodes:
sudo systemctl enable --now keepalived
sudo systemctl status keepalived --no-pager
Now check which node owns the VIP:
ip -br a show dev ens6 | sed 's/\s\+/ /g'
Expected on node-a (MASTER): you should see 10.10.40.50/24 as a secondary address. On node-b (BACKUP): you should not see the VIP.
Also check logs:
sudo journalctl -u keepalived -n 80 --no-pager
You’re looking for clear MASTER/BACKUP transitions and health-check script results.
Step 8: Verification from a third host (the test that matters)
From any machine that can reach 10.10.40.50 (a bastion, a third VPS on the same private network, or a temporary toolbox container), run:
for i in 1 2 3; do curl -s -o /dev/null -w '%{http_code}\n' http://10.10.40.50/; done
You should consistently get 200. (The /healthz endpoint is bound to 127.0.0.1 for the local health check, so test a public path via the VIP.) After that, prove which node is serving the response.
One simple approach: write the hostname to a file on each node and fetch it via the VIP. Example:
echo "node-a" | sudo tee /var/www/html/whoami.txt
# on node-b: echo "node-b" ...
curl -s http://10.10.40.50/whoami.txt
Step 9: Simulate a real failure (service-level, not host-level)
Stop Nginx on node-a and watch what happens:
# On node-a:
sudo systemctl stop nginx
ip -br a show dev ens6
# On node-b:
ip -br a show dev ens6
Expected behavior:
- Within a few seconds, node-a should lose 10.10.40.50.
- Node-b should gain 10.10.40.50 and start answering requests.
From the third host, run a tight loop to see the client impact:
while true; do date +%H:%M:%S; curl -s -m 1 -o /dev/null -w '%{http_code}\n' http://10.10.40.50/ || echo "fail"; sleep 1; done
In a healthy setup, you'll typically see only a few failed requests: detection alone takes interval × fall = 4 seconds with these settings, plus a moment for ARP updates to propagate. Keep the thresholds short if you need tighter failover.
Bring Nginx back:
sudo systemctl start nginx
Because node-a has higher priority, it will usually preempt and reclaim the VIP once it’s healthy again. If you don’t want automatic failback, add nopreempt under vrrp_instance on both nodes. Do that only if you have a clear manual failback procedure.
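If you go the nopreempt route, a sketch of the change (per keepalived's nopreempt semantics, both instances must be configured with state BACKUP; the initial MASTER is then decided by priority):

```
vrrp_instance VI_40 {
    state BACKUP        # required on both nodes when using nopreempt
    nopreempt           # a recovering higher-priority node will not reclaim the VIP
    ...                 # rest of the instance block unchanged
}
```

With this in place, failback only happens when the current holder itself fails, or when you deliberately restart keepalived on it.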
Common pitfalls (the stuff that burns time)
- Provider network blocks VRRP: If IP protocol 112 or multicast (224.0.0.18) is filtered, the nodes won't see each other. Some providers require "unicast VRRP" configuration. Keepalived supports unicast peers; check your provider's docs.
- Wrong interface: Configuring VRRP on the public NIC is a common mistake. Use the private interface where both nodes can reliably talk.
- VIP not routed: An “additional IP” often needs to be attached in a control panel before ARP announcements work. If ARP changes don’t propagate, the VIP can appear “up” locally but stay unreachable.
- Health check too naive: Checking only systemctl is-active misses cases where Nginx runs but returns 500. Keep the HTTP check.
- Preemption surprise: If node-a recovers and immediately takes the VIP back, you can create a second blip. Decide up front whether that's acceptable.
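On networks that filter multicast, unicast VRRP usually works. A sketch for node-a using this guide's addresses (mirror the values on node-b):

```
vrrp_instance VI_40 {
    # ...existing settings...
    unicast_src_ip 10.10.40.11      # this node's private address
    unicast_peer {
        10.10.40.12                 # the other node
    }
}
```

Adverts are then sent directly between the two private IPs instead of to 224.0.0.18, so only plain point-to-point reachability is required.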
If you want a tighter hardening baseline around this setup, follow Linux VPS hardening checklist, then re-run the failover simulation.
Rollback plan (so you can bail out fast)
If failover acts weird—or your provider doesn’t route the VIP consistently—rollback should be dull and predictable.
- Pick the node that should serve traffic (say node-a). Ensure your service is running:
sudo systemctl start nginx
sudo systemctl status nginx --no-pager
- Stop keepalived on both nodes to prevent VIP flapping:
sudo systemctl stop keepalived
- Manually add the VIP on your chosen node (temporary):
sudo ip addr add 10.10.40.50/24 dev ens6 label ens6:vip
- Verify reachability from a third host:
curl -s -o /dev/null -w '%{http_code}\n' http://10.10.40.50/
- Remove keepalived configs only after you're stable. Keep the files for later, but disable auto-start if you're not using it:
sudo systemctl disable keepalived
And yes: if your provider’s “floating IP” is API-attached, the quickest rollback may be attaching it back in the dashboard and using VRRP only for internal VIPs.
Operational polish (small changes that pay off)
- Lower noise, better logs: keepalived state changes should be rare. Log them clearly. The notify script already uses logger, which lands in journald.
- Protect against split brain: split brain is uncommon with VRRP, but partial partitions do happen. Use a private network with stable MTU, and avoid flaky overlay links.
- Combine with snapshot strategy: HA doesn’t replace backups. If you want quick recovery after a bad deploy, schedule snapshots. See Linux VPS snapshot backups.
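To make VIP ownership visible to monitoring without scraping journald, a minimal sketch you could run from cron or a systemd timer (the script name and log tag are assumptions; the VIP and interface are the ones from this guide):

```shell
#!/bin/sh
# vip-watch.sh (hypothetical name): log whether this node currently
# holds the VIP, so your log stack can chart ownership over time.
VIP="10.10.40.50/24"
IFACE="ens6"

# holds_vip OUTPUT: returns 0 if the given `ip -br addr` output contains the VIP.
holds_vip() {
    printf '%s\n' "$1" | grep -qF "$VIP"
}

# Live check; skipped gracefully where iproute2/logger are unavailable.
if command -v ip >/dev/null 2>&1 && command -v logger >/dev/null 2>&1; then
    addrs=$(ip -br addr show dev "$IFACE" 2>/dev/null)
    if holds_vip "$addrs"; then
        logger -t vip-watch "HOLDS $VIP on $(hostname)"
    else
        logger -t vip-watch "no $VIP on $(hostname)"
    fi
fi
```

Ship the vip-watch tag with the rest of your logs, or adapt the branch to write a textfile metric if you run a metrics agent.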
Next steps (where to go after you’ve proven failover)
- Put a real reverse proxy in front: Terminate TLS, set timeouts, and route to your app processes. If you’re running multiple services on the same VIP, use this guide: Nginx reverse proxy on a VPS.
- Add monitoring for VIP ownership: Export ip addr state or ship keepalived logs into your log stack. Loki works well for this kind of event log ingestion: VPS log shipping with Loki.
- Plan the database story: If your app writes data, pick one: managed DB, replication, or a single DB with strong backups and clear RTO/RPO.
- Test quarterly: Put a five-minute “stop Nginx” game day on the calendar. Untested failover isn’t a plan—it’s a guess.
Summary
Linux VPS failover with keepalived remains one of the most practical ways to eliminate single-node outages without buying a full platform. Two small servers, a routed VIP, and a health check that reflects real traffic get you from "down hard" to "brief blip."
If you want to run this pattern on predictable infrastructure, HostMyCode gives you the right building blocks: stable networking, predictable compute, and VPS plans that scale cleanly. Start with a HostMyCode VPS, step up to managed VPS hosting if you want patching and baseline hardening handled for you, and use HostMyCode migrations if you're moving an existing service and want a clean cutover plan.
FAQ
Does keepalived work on every VPS provider?
No. It depends on whether the provider's network allows VRRP (IP protocol 112) and whether the additional IP can be moved by ARP announcements. If VRRP is blocked, you may need unicast VRRP or a provider API-based floating IP approach.
How fast is failover in practice?
With advert_int 1, a dead host is declared down after roughly three missed adverts (~3 seconds); a service-level failure takes interval × fall = 4 seconds to detect with the check settings above. Either way, expect failover in about 3–5 seconds. Client retries and DNS caching don't matter here because the IP stays the same.
Can I use this for HTTPS?
Yes. Terminate TLS on the active node (Nginx/Caddy/HAProxy) and keep identical certificates on both nodes. If you automate certificates, DNS-based ACME is the least fragile approach for dual nodes.
How do I prevent “fail back” when the primary returns?
Use nopreempt in the vrrp_instance block so the current MASTER keeps the VIP until it fails. Document the manual failback process so you don’t end up stuck on the backup indefinitely.
What’s the simplest way to validate VIP ownership during an incident?
Run ip a show dev ens6 on both nodes, and check keepalived logs with journalctl -u keepalived. From a third host, curl -i http://10.10.40.50/ confirms the client view.