
Why Database High Availability Matters for Production VPS
Database downtime costs businesses an average of $5,600 per minute. A single PostgreSQL crash wipes out hours of transactions, corrupts user sessions, and damages your reputation.
PostgreSQL high availability setup eliminates single points of failure. It maintains multiple synchronized database servers that automatically take over when the primary fails.
This tutorial builds a robust PostgreSQL cluster using repmgr on Ubuntu 24.04. You'll configure streaming replication, automatic failover, and monitoring across multiple HostMyCode VPS instances.
We'll use three servers: one primary, one standby, and one witness node. The witness prevents split-brain scenarios without running a full database instance.
Prerequisites and Server Specifications
You need three Linux VPS instances with these minimum specs:
- Primary and Standby: 4GB RAM, 2 CPU cores, 40GB SSD
- Witness: 1GB RAM, 1 CPU core, 10GB SSD
- Ubuntu 24.04 LTS on all nodes
- PostgreSQL 16 (we'll install this)
- Private network connectivity between servers
Configure hostnames in /etc/hosts on all servers:
192.168.1.10 pg-primary
192.168.1.11 pg-standby
192.168.1.12 pg-witness
Replace these IP addresses with your actual private network IPs. Test connectivity with ping pg-primary from each node.
Install PostgreSQL 16 and Configure Base Settings
Run these commands on the primary and standby servers (not the witness):
sudo apt update
sudo apt install -y wget ca-certificates
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt.postgresql.org/pub/repos/apt/ jammy-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt update
sudo apt install -y postgresql-16 postgresql-client-16 postgresql-16-repmgr
Stop PostgreSQL to configure it properly:
sudo systemctl stop postgresql
Edit /etc/postgresql/16/main/postgresql.conf on the primary server:
listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/16/archive/%f && cp %p /var/lib/postgresql/16/archive/%f'
shared_preload_libraries = 'repmgr'
Create the archive directory:
sudo mkdir -p /var/lib/postgresql/16/archive
sudo chown postgres:postgres /var/lib/postgresql/16/archive
Configure authentication in /etc/postgresql/16/main/pg_hba.conf:
# Replication connections
host replication repmgr 192.168.1.0/24 trust
host repmgr repmgr 192.168.1.0/24 trust
# Application connections
host all all 192.168.1.0/24 md5
Start PostgreSQL on the primary:
sudo systemctl start postgresql
sudo systemctl enable postgresql
Create Repmgr User and Database
Connect to PostgreSQL as the postgres user:
sudo -u postgres psql
Create the repmgr user and database:
CREATE USER repmgr WITH REPLICATION SUPERUSER LOGIN;
CREATE DATABASE repmgr OWNER repmgr;
ALTER USER repmgr SET search_path TO repmgr, "$user", public;
\q
This dedicated user manages replication and cluster operations. Superuser privilege allows repmgr to modify system settings during failover.
Configure Repmgr on Primary Server
Create /etc/repmgr.conf on the primary server:
node_id=1
node_name='pg-primary'
conninfo='host=pg-primary port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/16/main'
config_directory='/etc/postgresql/16/main'
replication_user='repmgr'
replication_type=physical
location='datacenter1'
use_replication_slots=yes
witness_sync_interval=15
reconnect_attempts=6
reconnect_interval=10
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
degradation_threshold=5
unhealthy_threshold=3
Register the primary node:
sudo -u postgres repmgr -f /etc/repmgr.conf primary register
Verify registration:
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
You should see the primary node listed with status "running."
Set Up Standby Server with Streaming Replication
On the standby server, stop PostgreSQL if it's running:
sudo systemctl stop postgresql
Remove the existing data directory:
sudo rm -rf /var/lib/postgresql/16/main/*
Clone the primary database to the standby:
sudo -u postgres repmgr -h pg-primary -U repmgr -d repmgr -f /etc/repmgr.conf standby clone
This command copies the entire database from the primary and configures streaming replication. The process takes 2-10 minutes depending on database size.
Create /etc/repmgr.conf on the standby server:
node_id=2
node_name='pg-standby'
conninfo='host=pg-standby port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/16/main'
config_directory='/etc/postgresql/16/main'
replication_user='repmgr'
replication_type=physical
location='datacenter1'
use_replication_slots=yes
witness_sync_interval=15
reconnect_attempts=6
reconnect_interval=10
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
degradation_threshold=5
unhealthy_threshold=3
Start PostgreSQL on the standby:
sudo systemctl start postgresql
sudo systemctl enable postgresql
Register the standby node:
sudo -u postgres repmgr -f /etc/repmgr.conf standby register
Configure Witness Node for Split-Brain Prevention
The witness node prevents split-brain scenarios when network partitions occur. Install only the repmgr client on the witness server:
sudo apt update
sudo apt install -y postgresql-client-16 postgresql-16-repmgr
Create /etc/repmgr.conf on the witness server:
node_id=3
node_name='pg-witness'
conninfo='host=pg-witness port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/witness'
replication_user='repmgr'
replication_type=physical
location='datacenter1'
witness_sync_interval=15
reconnect_attempts=6
reconnect_interval=10
Register the witness node:
sudo -u postgres repmgr -f /etc/repmgr.conf witness register -h pg-primary
Verify the complete cluster:
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
You should see three nodes: primary (running), standby (running), and witness (running).
Enable Automatic Failover with Repmgrd
The repmgrd daemon monitors cluster health and triggers automatic failover. Start it on all nodes.
Create systemd service files. On Ubuntu, copy the example service:
sudo cp /usr/share/doc/postgresql-16-repmgr/examples/repmgrd.service /etc/systemd/system/
sudo systemctl daemon-reload
Edit /etc/systemd/system/repmgrd.service to use the correct config path:
[Unit]
Description=A replication manager, and failover management tool for PostgreSQL
After=postgresql.service
[Service]
Type=forking
User=postgres
ExecStart=/usr/bin/repmgrd -f /etc/repmgr.conf --pid-file=/var/run/postgresql/repmgrd.pid --daemonize
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/postgresql/repmgrd.pid
[Install]
WantedBy=multi-user.target
Start repmgrd on all nodes:
sudo systemctl enable repmgrd
sudo systemctl start repmgrd
Check repmgrd status:
sudo systemctl status repmgrd
For detailed insights on PostgreSQL monitoring, check our database performance monitoring guide. It covers essential metrics and alerting strategies.
Test Replication and Verify Data Synchronization
Create test data on the primary to verify replication works:
sudo -u postgres psql -c "CREATE DATABASE testdb;"
sudo -u postgres psql testdb -c "CREATE TABLE users (id SERIAL PRIMARY KEY, name VARCHAR(50));"
sudo -u postgres psql testdb -c "INSERT INTO users (name) VALUES ('Alice'), ('Bob'), ('Charlie');"
sudo -u postgres psql testdb -c "SELECT * FROM users;"
Check if data appears on the standby (read-only):
sudo -u postgres psql -h pg-standby testdb -c "SELECT * FROM users;"
You should see the same three records. If not, check replication lag:
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
Healthy replication shows minimal lag values. Look for under 1MB for LSN differences.
Test Automatic Failover Scenarios
Simulate primary server failure to test automatic failover:
# On primary server - simulate crash
sudo systemctl stop postgresql
Monitor the standby server logs:
sudo tail -f /var/log/postgresql/postgresql-16-main.log
Within 30-60 seconds, repmgrd detects the failure and promotes the standby to primary. Check cluster status:
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show
The former standby should now show as "primary" and the old primary as "failed." Test write operations on the new primary:
sudo -u postgres psql -h pg-standby testdb -c "INSERT INTO users (name) VALUES ('David');"
This should work without errors. When you restart the failed server, it rejoins as a standby:
# On the recovered server
sudo -u postgres repmgr node rejoin -f /etc/repmgr.conf -d 'host=pg-standby user=repmgr dbname=repmgr' --force-rewind
For comprehensive backup strategies that complement high availability, see our Linux VPS MySQL backup tutorial. It covers similar recovery principles.
Monitor Cluster Health and Performance
Set up monitoring to catch issues before they cause outages:
# View detailed cluster information
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show --compact
# Check replication lag
sudo -u postgres repmgr -f /etc/repmgr.conf cluster monitor
Monitor repmgrd logs for warnings:
sudo journalctl -u repmgrd -f
Key metrics to track:
- Replication lag (should be under 1MB LSN difference)
- Connection count to both servers
- Disk space on archive directory
- Network connectivity between nodes
Create a simple monitoring script at /home/monitor/check_repmgr.sh:
#!/bin/bash
LAG=$(sudo -u postgres psql -t -c "SELECT CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() ELSE 0 END;" | tr -d ' ')
if [ "$LAG" -gt 16777216 ]; then # 16MB in bytes
echo "WARNING: Replication lag is ${LAG} bytes"
exit 1
else
echo "OK: Replication lag is ${LAG} bytes"
exit 0
fi
Run this script every 5 minutes via cron to detect replication issues early.
Troubleshoot Common High Availability Issues
When replication stops working, check these common causes:
Connection refused errors: Verify pg_hba.conf allows replication connections from the standby IP. Check firewall rules allow port 5432 between servers.
Archive directory full: The /var/lib/postgresql/16/archive directory can fill up quickly. Set up log rotation or move old WAL files to backup storage.
Replication slot conflicts: If failover fails, you may have conflicting replication slots:
sudo -u postgres psql -c "SELECT slot_name, active FROM pg_replication_slots;"
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('slot_name');"
Split-brain prevention: Always use a witness node or external fencing mechanism. Never run two primaries simultaneously as this corrupts data.
Our VPS troubleshooting checklist covers network and performance issues that can affect database replication.
Performance Tuning for High Availability
Optimize PostgreSQL settings for your managed VPS hosting environment:
In /etc/postgresql/16/main/postgresql.conf:
# Memory settings for 4GB VPS
shared_buffers = 1GB
effective_cache_size = 3GB
work_mem = 16MB
maintenance_work_mem = 256MB
# WAL settings for replication
wal_buffers = 16MB
checkpoint_completion_target = 0.9
wal_compression = on
# Connection settings
max_connections = 200
max_worker_processes = 4
Restart PostgreSQL after configuration changes:
sudo systemctl restart postgresql
Monitor performance with:
sudo -u postgres psql -c "SELECT datname, xact_commit, xact_rollback, blks_read, blks_hit, temp_files, temp_bytes FROM pg_stat_database;"
High blks_read values indicate insufficient shared_buffers. Frequent temp_files suggest low work_mem.
Ready to deploy PostgreSQL high availability on production-ready infrastructure? HostMyCode VPS hosting provides the reliable network connectivity and performance you need for database clusters. Our managed VPS plans include 24/7 monitoring and expert support to keep your setup running smoothly.
Frequently Asked Questions
How long does PostgreSQL failover take with repmgr?
Automatic failover typically completes in 30-60 seconds. Detection takes 10-30 seconds based on your monitoring_history and degradation_threshold settings. Promotion itself takes 5-15 seconds. Applications experience a brief connection interruption during the switchover.
Can I run more than one standby server in the cluster?
Yes, you can add multiple standbys for read scaling and additional redundancy. Each standby needs a unique node_id in repmgr.conf. Use repmgr standby clone to create additional replicas from the primary or an existing standby.
What happens if the witness node goes down?
The cluster continues operating normally. The witness only prevents split-brain during network partitions. Without it, you risk having two primaries if the network splits but both database servers remain operational. Always restore the witness quickly.
How much additional storage does streaming replication require?
Each standby needs the same storage as the primary for the database itself. Additionally, configure at least 2-5GB for WAL archives on the primary. Monitor archive directory size and implement cleanup policies based on your backup retention needs.
Can I perform schema changes on a PostgreSQL HA cluster?
Yes, execute DDL statements on the primary and they replicate automatically to standbys. Avoid long-running schema migrations during peak hours as they can increase replication lag. Test major changes on a staging cluster first to verify timing and impact.