MariaDB Galera Cluster Auto-Recovery Tutorial: Complete Failover and Split-Brain Prevention Setup for Production VPS in 2026

Understanding Galera Cluster Auto-Recovery Mechanisms

MariaDB Galera cluster auto-recovery solves one of the most challenging problems in multi-master database deployments: handling node failures gracefully without manual intervention. When a cluster node crashes or becomes unreachable, traditional Galera clusters require careful manual steps to restore service.

That is a problem in production environments, where every minute of downtime is expensive.

Auto-recovery encompasses three core mechanisms: automatic node startup sequencing, split-brain detection and prevention, and intelligent quorum management. These systems work together to maintain cluster integrity during partial failures.

Node failures are a routine event in production deployments. Without proper auto-recovery, each incident requires immediate database administrator attention, often during off-hours.

A well-configured system can reduce mean time to recovery from 15-30 minutes to under 2 minutes.

Prerequisites and Environment Setup

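You'll need three Ubuntu 22.04 or 24.04 VPS instances with at least 4GB RAM each. Install MariaDB 10.11 or later on all nodes. Ubuntu 24.04 ships 10.11 in its default repositories; Ubuntu 22.04 ships 10.6, so add the MariaDB package repository there first: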

sudo apt update
sudo apt install mariadb-server mariadb-backup -y
sudo systemctl stop mariadb

Configure each node's /etc/mysql/mariadb.conf.d/60-galera.cnf:

[galera]
bind-address=0.0.0.0
wsrep_on=ON
wsrep_cluster_name="production_cluster"
wsrep_cluster_address="gcomm://10.0.1.10,10.0.1.11,10.0.1.12"
wsrep_node_name="node1"
wsrep_node_address="10.0.1.10"
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=mariabackup
wsrep_sst_auth="backup:secure_password"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

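Adjust IP addresses and node names for each server. Bootstrap the first node with sudo galera_new_cluster and create the backup user there before starting the other nodes, since they need it for their initial state transfer; the account then replicates to every node: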

CREATE USER 'backup'@'%' IDENTIFIED BY 'secure_password';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'backup'@'%';
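
Only wsrep_node_name and wsrep_node_address differ between servers, so the per-node file can be generated rather than edited by hand. The following is a minimal sketch under that assumption; render_galera_cnf is a hypothetical helper name, and the remaining settings simply repeat the configuration shown above.

```shell
#!/bin/bash
# Sketch: print a 60-galera.cnf body for one node.
# render_galera_cnf is a hypothetical helper; all settings mirror the
# configuration from this tutorial.
render_galera_cnf() {
    local node_name=$1
    local node_ip=$2
    cat <<EOF
[galera]
bind-address=0.0.0.0
wsrep_on=ON
wsrep_cluster_name="production_cluster"
wsrep_cluster_address="gcomm://10.0.1.10,10.0.1.11,10.0.1.12"
wsrep_node_name="${node_name}"
wsrep_node_address="${node_ip}"
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=mariabackup
wsrep_sst_auth="backup:secure_password"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
EOF
}

# Example: preview the file for the second node
render_galera_cnf node2 10.0.1.11
```

Redirecting the output to /etc/mysql/mariadb.conf.d/60-galera.cnf on each server (with sudo tee, for example) keeps the three files identical apart from the two node-specific lines.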

For production VPS environments, HostMyCode managed VPS hosting provides pre-configured MariaDB environments that simplify this initial setup process.

Configuring Automatic Node Recovery Scripts

Create the primary auto-recovery script at /usr/local/bin/galera-auto-recovery.sh:

#!/bin/bash

LOG_FILE="/var/log/galera-recovery.log"
CLUSTER_NODES=("10.0.1.10" "10.0.1.11" "10.0.1.12")
MY_IP=$(hostname -I | awk '{print $1}')
QUORUM_SIZE=2

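log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Returns 0 when the node answers a trivial query.
# Assumes a 'monitor' account exists on every node, with its
# credentials supplied through ~/.my.cnf.
check_node_status() {
    local node_ip=$1
    # Discard stdout as well, otherwise the query result ends up
    # in the output of count_active_nodes
    mysql -h "$node_ip" -u monitor --connect-timeout=5 -e "SELECT 1" >/dev/null 2>&1
}

count_active_nodes() {
    local active=0
    for node in "${CLUSTER_NODES[@]}"; do
        if check_node_status "$node"; then
            active=$((active + 1))
        fi
    done
    echo "$active"
}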

start_as_primary() {
    log_message "Starting as primary component"
    sudo systemctl stop mariadb
    sudo galera_new_cluster
    sleep 5
    if systemctl is-active --quiet mariadb; then
        log_message "Successfully started as primary"
        return 0
    else
        log_message "Failed to start as primary"
        return 1
    fi
}

The script implements intelligent quorum detection. It counts active nodes before making recovery decisions.

This prevents split-brain scenarios where multiple nodes attempt primary startup simultaneously.
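
The QUORUM_SIZE=2 setting is the strict majority of a three-node cluster. As a rough illustration of that arithmetic, the sketch below computes the majority for any cluster size. The names quorum_for and has_quorum are hypothetical, and Galera's own quorum calculation additionally accounts for node weights and graceful departures, which this sketch ignores.

```shell
#!/bin/bash
# Sketch: majority arithmetic for an N-node cluster.
# quorum_for and has_quorum are hypothetical helper names.

# Smallest number of nodes that forms a strict majority of $1 nodes.
quorum_for() {
    local total=$1
    echo $(( total / 2 + 1 ))
}

# Succeeds when $1 reachable nodes are a majority of $2 total nodes.
has_quorum() {
    local active=$1
    local total=$2
    [ "$active" -ge "$(quorum_for "$total")" ]
}

for active in 1 2 3; do
    if has_quorum "$active" 3; then
        echo "$active of 3 nodes: quorum"
    else
        echo "$active of 3 nodes: no quorum"
    fi
done
```

Deriving the value from the node list, for example with quorum_for "${#CLUSTER_NODES[@]}", avoids having to update a hard-coded number when the cluster grows.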

Split-Brain Prevention and Detection

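Add split-brain detection logic to your recovery script, then call detect_split_brain at the end of the file so the logic runs when the monitor invokes the script: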

detect_split_brain() {
    local active_nodes=$(count_active_nodes)
    # grastate.dat holds seqno -1 after an unclean shutdown
    local my_seqno=$(awk '/^seqno:/ {print $2}' /var/lib/mysql/grastate.dat)

    if [ "$active_nodes" -ge "$QUORUM_SIZE" ]; then
        log_message "Sufficient nodes active ($active_nodes), joining cluster"
        sudo systemctl start mariadb
    elif [ "$my_seqno" = "-1" ]; then
        log_message "Unsafe shutdown detected, checking other nodes"
        check_recovery_options
    else
        log_message "Split-brain risk detected, manual intervention required"
        send_alert "Split-brain condition detected on $MY_IP" "CRITICAL"
    fi
}

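check_recovery_options() {
    local highest_seqno=-1
    local safest_node=""
    local reachable=0
    local remote_seqno

    for node in "${CLUSTER_NODES[@]}"; do
        if [ "$node" != "$MY_IP" ]; then
            remote_seqno=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" "awk '/^seqno:/ {print \$2}' /var/lib/mysql/grastate.dat" 2>/dev/null)
            # Skip peers that are unreachable or return no usable number
            [[ "$remote_seqno" =~ ^-?[0-9]+$ ]] || continue
            reachable=$((reachable + 1))
            if [ "$remote_seqno" -gt "$highest_seqno" ]; then
                highest_seqno=$remote_seqno
                safest_node=$node
            fi
        fi
    done

    if [ -n "$safest_node" ]; then
        log_message "Node $safest_node has highest seqno ($highest_seqno), waiting for cluster"
        wait_for_cluster  # helper not shown here; it must wait for $safest_node to come up
    elif [ "$reachable" -eq $(( ${#CLUSTER_NODES[@]} - 1 )) ]; then
        # Every peer answered and none is ahead of this node
        log_message "No safe node found, attempting bootstrap"
        start_as_primary
    else
        # An isolated node must not bootstrap on its own
        log_message "Only $reachable peers reachable, refusing to bootstrap"
        send_alert "Bootstrap refused on $MY_IP: peers unreachable" "CRITICAL"
    fi
}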

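This detection mechanism compares the sequence numbers recorded in each node's grastate.dat to identify the most recent data state. It is meant to ensure that only the node with the highest sequence number bootstraps the cluster. Keep two limits in mind: a crashed node records seqno -1, so its real position has to be recovered first (for example with mysqld --wsrep-recover), and galera_new_cluster still refuses to start while safe_to_bootstrap is 0 in grastate.dat.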
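
To make the comparison concrete, here is a minimal sketch that reads fields from grastate.dat copies and picks the most advanced one. The helper names grastate_field and pick_highest_seqno are hypothetical, and the sample files contain only the two fields used here; it assumes the copies have already been collected on one machine.

```shell
#!/bin/bash
# Sketch: read fields from grastate.dat and pick the most advanced copy.
# grastate_field and pick_highest_seqno are hypothetical helper names.

# Print the value of one "key: value" field from a grastate.dat file.
grastate_field() {
    local file=$1
    local key=$2
    awk -v k="${key}:" '$1 == k {print $2}' "$file"
}

# Given several grastate.dat copies, print the path with the highest seqno.
# Copies with a missing or negative seqno are skipped as "unknown".
pick_highest_seqno() {
    local best_file=""
    local best_seqno=-1
    local file seqno
    for file in "$@"; do
        seqno=$(grastate_field "$file" seqno)
        [[ "$seqno" =~ ^[0-9]+$ ]] || continue
        if [ "$seqno" -gt "$best_seqno" ]; then
            best_seqno=$seqno
            best_file=$file
        fi
    done
    [ -n "$best_file" ] && echo "$best_file"
}

# Example with three local copies (for instance fetched with scp)
tmp=$(mktemp -d)
printf 'seqno:   1500\nsafe_to_bootstrap: 0\n' > "$tmp/node1"
printf 'seqno:   -1\nsafe_to_bootstrap: 0\n'   > "$tmp/node2"
printf 'seqno:   1498\nsafe_to_bootstrap: 0\n' > "$tmp/node3"
pick_highest_seqno "$tmp"/node*    # prints the node1 path
rm -rf "$tmp"
```

Treating -1 as unknown rather than as a low number is a deliberate choice: a crashed node may in fact hold the newest data, so it should be recovered and re-read before it is ruled out.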

Implementing Failover Automation

Configure systemd service for automated failover at /etc/systemd/system/galera-monitor.service:

[Unit]
Description=Galera Cluster Auto-Recovery Monitor
After=network.target
Wants=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/galera-monitor.sh
Restart=always
RestartSec=30
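# Runs as root (the systemd default): the scripts call systemctl and
# galera_new_cluster, which the mysql user is not allowed to do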

[Install]
WantedBy=multi-user.target

Create the monitoring daemon at /usr/local/bin/galera-monitor.sh:

#!/bin/bash

LOG_FILE="/var/log/galera-recovery.log"
MONITOR_INTERVAL=10
FAILURE_THRESHOLD=3
failure_count=0

# Same helper as in galera-auto-recovery.sh; this script runs on its own
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

while true; do
    if ! systemctl is-active --quiet mariadb; then
        failure_count=$((failure_count + 1))
        log_message "MariaDB service down, failure count: $failure_count"

        if [ "$failure_count" -ge "$FAILURE_THRESHOLD" ]; then
            log_message "Threshold reached, initiating auto-recovery"
            /usr/local/bin/galera-auto-recovery.sh
            failure_count=0
        fi
    else
        # Check cluster status; fall back to 0 when the query fails
        # ('local' is only valid inside a function, so none is used here)
        cluster_size=$(mysql -N -s -e "SHOW STATUS LIKE 'wsrep_cluster_size'" 2>/dev/null | awk '{print $2}')
        cluster_size=${cluster_size:-0}

        if [ "$cluster_size" -lt 2 ]; then
            log_message "Cluster size below threshold: $cluster_size"
            failure_count=$((failure_count + 1))
        else
            failure_count=0
        fi
    fi

    sleep "$MONITOR_INTERVAL"
done

Enable and start the monitoring service:

sudo chmod +x /usr/local/bin/galera-*.sh
sudo systemctl enable galera-monitor
sudo systemctl start galera-monitor

Configuring Cluster State Monitoring

Implement comprehensive cluster health monitoring:

#!/bin/bash
# /usr/local/bin/cluster-health-check.sh
# Expects log_message from galera-auto-recovery.sh to be defined

# Print the value of one wsrep status variable (empty if the query fails)
wsrep_status() {
    mysql -N -s -e "SHOW STATUS LIKE '$1'" 2>/dev/null | awk '{print $2}'
}

check_cluster_health() {
    local wsrep_ready=$(wsrep_status wsrep_ready)
    local wsrep_connected=$(wsrep_status wsrep_connected)
    local cluster_status=$(wsrep_status wsrep_cluster_status)

    if [ "$wsrep_ready" = "ON" ] && [ "$wsrep_connected" = "ON" ] && [ "$cluster_status" = "Primary" ]; then
        return 0
    else
        log_message "Cluster health check failed: ready=$wsrep_ready, connected=$wsrep_connected, status=$cluster_status"
        return 1
    fi
}

check_replication_lag() {
    local flow_control=$(wsrep_status wsrep_flow_control_paused)
    local local_queue=$(wsrep_status wsrep_local_recv_queue)

    # Compare numerically; missing values are reported as lag
    if ! awk -v fc="$flow_control" -v q="$local_queue" \
        'BEGIN { exit !(fc != "" && q != "" && fc + 0 == 0 && q + 0 <= 100) }'; then
        log_message "Replication lag detected: flow_control=$flow_control, queue=$local_queue"
        return 1
    fi
    return 0
}

This monitoring tracks multiple Galera-specific metrics beyond basic service status.

Flow control monitoring prevents performance degradation from becoming cluster-wide failures.

High-traffic applications running on HostMyCode dedicated servers particularly benefit from this comprehensive monitoring approach. It identifies issues before they trigger failover events.
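
Status values such as wsrep_flow_control_paused arrive as text, so threshold checks are easier to get right with a numeric comparison than with string matching. The following is a minimal sketch of such a check; exceeds_threshold is a hypothetical helper name and the limits in the example are illustrative, not recommended values.

```shell
#!/bin/bash
# Sketch: numeric threshold check for Galera status values.
# exceeds_threshold is a hypothetical helper; thresholds are illustrative.

# Succeeds when $1 is a number strictly greater than $2.
# Empty or non-numeric input is treated as "not exceeded" (returns 1).
exceeds_threshold() {
    local value=$1
    local limit=$2
    awk -v v="$value" -v l="$limit" 'BEGIN {
        if (v !~ /^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) exit 1
        exit (v + 0 > l + 0) ? 0 : 1
    }'
}

# Example values as they might be read from SHOW STATUS
exceeds_threshold "0.184353" "0.1" && echo "flow control paused more than 10% of the time"
exceeds_threshold "150" "100" && echo "receive queue above 100"
```

Whether an empty value should count as healthy or unhealthy is a design choice; this sketch reports "not exceeded", so callers that want to fail closed need to test for an empty value separately.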

Graceful Shutdown and Bootstrap Procedures

Configure proper shutdown sequencing to maintain cluster state:

#!/bin/bash
# /usr/local/bin/graceful-shutdown.sh
# Expects log_message and count_active_nodes from galera-auto-recovery.sh

graceful_node_shutdown() {
    log_message "Initiating graceful shutdown"

    # Check if we're the last node
    local active_nodes=$(count_active_nodes)
    if [ "$active_nodes" -le "1" ]; then
        # Galera writes safe_to_bootstrap: 1 to grastate.dat itself when the
        # last node shuts down cleanly, so the file is not edited here
        log_message "Last active node, Galera will mark it safe_to_bootstrap"
    fi

    # Graceful shutdown
    mysql -e "SET GLOBAL wsrep_desync=ON;"
    sleep 2
    sudo systemctl stop mariadb

    log_message "Graceful shutdown completed"
}

bootstrap_primary_component() {
    # Verify it's safe to bootstrap
    local safe_bootstrap=$(awk '/^safe_to_bootstrap:/ {print $2}' /var/lib/mysql/grastate.dat)

    if [ "$safe_bootstrap" = "1" ]; then
        log_message "Bootstrapping primary component"
        # Galera resets safe_to_bootstrap to 0 once the node is running,
        # so grastate.dat is left alone while the server is up
        sudo galera_new_cluster
    else
        log_message "Bootstrap not safe, joining existing cluster"
        sudo systemctl start mariadb
    fi
}

Alert Integration and Notification Setup

Implement alerting for critical cluster events:

#!/bin/bash
# Alert functions

send_alert() {
    local message="$1"
    local severity="${2:-WARNING}"

    # Slack webhook notification
    # (the message is interpolated as-is, so it must not contain double quotes)
    if [ -n "$SLACK_WEBHOOK_URL" ]; then
        curl -s -m 10 -X POST -H 'Content-type: application/json' \
             --data "{\"text\":\"[$severity] Galera Cluster: $message\"}" \
             "$SLACK_WEBHOOK_URL"
    fi
    
    # Email notification
    echo "$message" | mail -s "Galera Cluster Alert [$severity]" admin@yourcompany.com
    
    # Log to syslog
    logger -t galera-cluster "[$severity] $message"
}

# Example usage in recovery script
# (recovery_attempt is a counter the recovery script would have to maintain)
if [ "${recovery_attempt:-0}" -ge 3 ]; then
    send_alert "Auto-recovery failed after 3 attempts on $MY_IP" "CRITICAL"
fi
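
One caveat with building the Slack payload by string interpolation is that a message containing double quotes, backslashes, or newlines produces invalid JSON. The sketch below shows one way to escape those characters in plain bash. The names json_escape and slack_payload are hypothetical, and other control characters are not handled; a tool such as jq would be the more robust choice where it is installed.

```shell
#!/bin/bash
# Sketch: escape a message before embedding it in a JSON string.
# json_escape and slack_payload are hypothetical helper names.

json_escape() {
    local s=$1
    s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}      # double quote
    s=${s//$'\n'/\\n}    # newline
    s=${s//$'\t'/\\t}    # tab
    printf '%s' "$s"
}

# Build the same payload shape that send_alert posts to the webhook.
slack_payload() {
    local severity=$1
    local message=$2
    printf '{"text":"[%s] Galera Cluster: %s"}' \
        "$(json_escape "$severity")" "$(json_escape "$message")"
}

slack_payload "CRITICAL" 'Node "node2" left the cluster'
echo
```

The result can be passed to curl with --data "$(slack_payload "$severity" "$message")" in place of the inline string.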

Testing Auto-Recovery Scenarios

Test your configuration with controlled failure scenarios:

# Test 1: Single node failure
sudo systemctl stop mariadb
# Wait 2 minutes, verify other nodes maintain service
# Check auto-recovery logs

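# Test 2: Network partition simulation
sudo iptables -A INPUT -s 10.0.1.11 -j DROP
sudo iptables -A INPUT -s 10.0.1.12 -j DROP
# Verify split-brain prevention, then remove the rules again
sudo iptables -D INPUT -s 10.0.1.11 -j DROP
sudo iptables -D INPUT -s 10.0.1.12 -j DROP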

# Test 3: Majority node failure
# Stop MariaDB on 2 of 3 nodes simultaneously
# Verify remaining node doesn't bootstrap without quorum

Document each test result and recovery time. Successful auto-recovery should complete within 60-120 seconds for single node failures.
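
Recovery time is easier to document when it is measured rather than read off a clock. The sketch below is one possible way to do that: wait_until is a hypothetical helper that polls a command once per second and prints the elapsed time, and cluster_is_full is an example check that assumes a three-node cluster and a local mysql client that can log in without a password.

```shell
#!/bin/bash
# Sketch: time how long a condition takes to become true.
# wait_until and cluster_is_full are hypothetical helper names.

# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second. Prints the elapsed seconds and returns 0
# when COMMAND succeeds, or returns 1 if TIMEOUT_SECONDS pass first.
wait_until() {
    local timeout=$1
    shift
    local start=$SECONDS
    while ! "$@"; do
        if [ $(( SECONDS - start )) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    echo $(( SECONDS - start ))
}

# Example check: succeeds once this node reports a 3-node cluster.
cluster_is_full() {
    local size
    size=$(mysql -N -s -e "SHOW STATUS LIKE 'wsrep_cluster_size'" 2>/dev/null | awk '{print $2}')
    [ "$size" = "3" ]
}
```

Running wait_until 120 cluster_is_full on a surviving node right after triggering a failure prints the number of seconds until the cluster is back at full size, or fails if that takes longer than the 120-second target.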

Performance Monitoring and Optimization

Monitor auto-recovery impact on cluster performance:

# Key metrics to track
mysql -e "SHOW STATUS LIKE 'wsrep_%'" | grep -E "(local_commits|local_replays|flow_control_paused|cluster_size)"

# Recovery time tracking
grep "auto-recovery" /var/log/galera-recovery.log | awk '{print $1, $2, $NF}'

# Resource usage during recovery
sar -u -r 1 60 > /tmp/recovery-performance.log

Auto-recovery operations can consume additional CPU during the recovery window, typically on the order of 10-15%. Measure the overhead on your own workload and plan capacity accordingly.

Running production MariaDB Galera clusters requires solid infrastructure and monitoring. HostMyCode managed VPS hosting provides the high-performance environment and 24/7 support needed for mission-critical database clusters. HostMyCode dedicated servers offer the compute resources and network reliability essential for multi-master database deployments.

Frequently Asked Questions

How long should auto-recovery take in production?

Properly configured auto-recovery should complete within 60-120 seconds for single node failures. Multi-node failures may require 3-5 minutes depending on data synchronization requirements.

What happens if all three nodes fail simultaneously?

Total cluster failure requires manual intervention. The auto-recovery script will attempt to identify the node with the most recent data (highest seqno) and bootstrap from there, but manual verification is recommended for data integrity.

Can auto-recovery cause data loss?

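Properly configured sequence number checking reduces the risk of data loss, but it cannot rule it out. The scripts are designed to bootstrap only from a node that can be verified as having the most recent committed transactions, and after a crash that verification depends on recovering each node's real position first.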

How do I tune recovery sensitivity?

Adjust the FAILURE_THRESHOLD and MONITOR_INTERVAL variables in the monitoring script. Lower thresholds provide faster recovery but may cause unnecessary failovers during brief network issues.
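
As rough arithmetic, the monitor loop starts recovery on the FAILURE_THRESHOLD-th consecutive failed check, so the delay between MariaDB going down and recovery starting lies between (FAILURE_THRESHOLD - 1) and FAILURE_THRESHOLD times MONITOR_INTERVAL. The sketch below computes that window; detection_window is a hypothetical helper name, and it assumes each check itself takes negligible time.

```shell
#!/bin/bash
# Sketch: rough detection delay for the monitor loop's settings.
# detection_window is a hypothetical helper name.

# Prints "MIN MAX" seconds between MariaDB going down and recovery starting,
# for a threshold of $1 consecutive failures checked every $2 seconds.
detection_window() {
    local threshold=$1
    local interval=$2
    echo "$(( (threshold - 1) * interval )) $(( threshold * interval ))"
}

detection_window 3 10   # tutorial defaults, prints "20 30"
detection_window 2 5    # a more aggressive setting, prints "5 10"
```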

Should I use auto-recovery in development environments?

Yes. Development environments benefit from faster recovery, and you can use more aggressive thresholds there, since data consistency is less critical than rapid restoration of service.