Systemd watchdog on a VPS: self-healing services with health checks, automatic restarts, and safe rollbacks (2026)

Systemd watchdog on a VPS: add health checks, auto-restarts, and rollback steps for self-healing Linux services in 2026.

By Anurag Singh
Updated on Apr 14, 2026
Category: Blog

A service that “restarts on failure” still won’t save you from a process that’s alive but useless. In 2026, the outages that actually bite on a VPS are usually soft failures: an event loop deadlock, a downstream call that never returns, a DNS stall, or a worker pool that stops taking work while the PID keeps running.

This tutorial shows how to set up systemd watchdog on a VPS for a small internal API so the OS can detect a hung service and recover it automatically. You’ll add a real health check, wire in watchdog heartbeats, confirm behavior in logs, and keep a rollback path that doesn’t require heroics.

Scenario and what you’ll build

You’re running a simple “build status” API for a CI runner fleet. It listens on 127.0.0.1:9081, and Nginx (or Caddy) publishes it to the internet. Every so often, a rare dependency timeout triggers a hang: the process stays up, but it stops serving requests.

  • OS: Debian 12 (Bookworm) on a VPS
  • App: a tiny Python 3.11 HTTP service (no framework required)
  • Goal: detect hangs and restart automatically using WatchdogSec= + sd_notify()
  • Bonus: add a systemd health gate using ExecStartPre and a readiness notification

If you need the reverse proxy side, HostMyCode has a strong reference on safe multi-app routing and rollbacks: Nginx reverse proxy on a VPS (2026).

Prerequisites

  • A VPS with root access (1 vCPU / 1 GB RAM is enough for this lab)
  • Debian 12 installed, with systemd (default)
  • Python 3.11+ and pip available
  • A basic firewall rule set allowing SSH (and optionally HTTP/HTTPS if you expose the service)

If your baseline firewall isn’t set yet, apply an SSH-safe policy first: UFW Firewall Setup for a VPS in 2026.

Hosting note: if you want predictable CPU scheduling and clean systemd behavior (especially under load tests), use a dedicated VM rather than shared hosting. A HostMyCode VPS is the right fit for this pattern.

Step 1 — Create a locked-down service user and directories

Run these as root:

useradd --system --home /var/lib/buildstatus --create-home --shell /usr/sbin/nologin buildstatus
install -d -o buildstatus -g buildstatus /opt/buildstatus
install -d -o buildstatus -g buildstatus /var/log/buildstatus

Expected output: no output on success.

Step 2 — Install dependencies (python + systemd notify bindings)

We’ll use python3-systemd to send READY=1 and watchdog heartbeats via sd_notify.

apt-get update
apt-get install -y python3 python3-venv python3-systemd curl

Quick check:

python3 -c "from systemd import daemon; print('systemd bindings OK')"

Expected output:

systemd bindings OK

Step 3 — Write a minimal HTTP service with a real liveness condition

Create /opt/buildstatus/app.py:

cat > /opt/buildstatus/app.py <<'PY'
#!/usr/bin/env python3
import os
import time
import json
import socket
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from systemd import daemon

HOST = "127.0.0.1"
PORT = int(os.environ.get("BUILDSTATUS_PORT", "9081"))

# This simulates a dependency check that can hang if your code wedges.
# We'll keep it simple: a background thread updates a heartbeat timestamp.
_last_tick = time.monotonic()
_last_tick_lock = threading.Lock()

class Handler(BaseHTTPRequestHandler):
    def _send(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/healthz":
            with _last_tick_lock:
                age = time.monotonic() - _last_tick
            # If the background loop stops, age grows.
            if age > 15:
                self._send(503, {"ok": False, "reason": "tick_stale", "age_seconds": round(age, 3)})
                return
            self._send(200, {"ok": True, "age_seconds": round(age, 3)})
            return

        if self.path == "/":
            self._send(200, {"service": "buildstatus", "status": "running"})
            return

        self._send(404, {"error": "not_found"})

    def log_message(self, fmt, *args):
        # Quiet default logging; rely on journald.
        return

def tick_loop():
    global _last_tick
    while True:
        time.sleep(2)
        with _last_tick_lock:
            _last_tick = time.monotonic()


def watchdog_loop():
    # systemd sets WATCHDOG_USEC when WatchdogSec= is enabled.
    usec = int(os.environ.get("WATCHDOG_USEC", "0"))
    if usec <= 0:
        return

    # Send heartbeats at ~half interval.
    interval = (usec / 1_000_000) / 2
    while True:
        # Gate the heartbeat on our own health check.
        with _last_tick_lock:
            age = time.monotonic() - _last_tick
        if age <= 10:
            daemon.notify("WATCHDOG=1")
        time.sleep(max(1, interval))


def bind_check(host, port):
    # Fail fast if the port is already taken.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
    finally:
        s.close()


def main():
    bind_check(HOST, PORT)

    t1 = threading.Thread(target=tick_loop, daemon=True)
    t1.start()

    t2 = threading.Thread(target=watchdog_loop, daemon=True)
    t2.start()

    httpd = HTTPServer((HOST, PORT), Handler)

    # Tell systemd the service is ready.
    daemon.notify("READY=1")

    httpd.serve_forever()

if __name__ == "__main__":
    main()
PY
chmod 0755 /opt/buildstatus/app.py
chown -R buildstatus:buildstatus /opt/buildstatus

What matters here:

  • /healthz returns 503 if our internal tick stops updating (a stand-in for “app wedged”).
  • We only send WATCHDOG=1 when internal health is good. If the app hangs, heartbeats stop and systemd restarts it.
  • READY=1 makes startup behavior deterministic (useful in orchestration and during restarts).

Step 4 — Create the systemd unit with watchdog + hardening

Create /etc/systemd/system/buildstatus.service:

cat > /etc/systemd/system/buildstatus.service <<'UNIT'
[Unit]
Description=Build Status Internal API
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=buildstatus
Group=buildstatus
WorkingDirectory=/opt/buildstatus

Environment=BUILDSTATUS_PORT=9081

# Fail fast before starting if the port is in use.
ExecStartPre=/bin/sh -c 'if ss -lnt | grep -q ":9081 "; then echo "port 9081 already in use"; exit 1; fi'

ExecStart=/opt/buildstatus/app.py

# Restart policy: restart on non-zero exit or watchdog timeout.
Restart=on-failure
RestartSec=2

# Watchdog: if heartbeats stop, systemd will kill/restart.
WatchdogSec=20

# Reasonable limits for a tiny internal API.
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/buildstatus /var/lib/buildstatus
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
RestrictSUIDSGID=yes
RestrictRealtime=yes
SystemCallArchitectures=native

# Only needs basic networking.
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

# Logging
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
UNIT

Reload and start:

systemctl daemon-reload
systemctl enable --now buildstatus.service
systemctl status --no-pager buildstatus.service

Expected status indicators:

  • Active: active (running)
  • A "Status:" line only appears if the app sends STATUS=...; with Type=notify, systemd still waits for READY=1 before reporting the unit as started.

Step 5 — Verify the service locally (health, readiness, and port)

Check the listener:

ss -lntp | grep ':9081'

Expected output (PID will differ):

LISTEN 0 4096 127.0.0.1:9081 0.0.0.0:* users:(("python3",pid=12345,fd=3))

Hit the endpoints:

curl -sS http://127.0.0.1:9081/ | jq .
curl -sS -i http://127.0.0.1:9081/healthz

Expected output snippets:

{
  "service": "buildstatus",
  "status": "running"
}
HTTP/1.0 200 OK

If you don’t have jq, skip it; the real check is the HTTP status codes.

Step 6 — Prove the watchdog works (forced hang simulation)

Now you’ll test the failure mode you actually care about: “the process is there, but it isn’t making progress.” We’ll freeze the service. With WatchdogSec=20, systemd should restart it once the heartbeat stops long enough.

  1. Get the main PID:

    systemctl show -p MainPID --value buildstatus.service
  2. Freeze it with SIGSTOP (the process won’t run, so no watchdog heartbeats):

    kill -STOP $(systemctl show -p MainPID --value buildstatus.service)
  3. Watch systemd logs for the watchdog event:

    journalctl -u buildstatus.service -n 50 --no-pager

Within ~20–30 seconds you should see lines similar to:

buildstatus.service: Watchdog timeout (limit 20s)! 
buildstatus.service: Killing process 12345 (python3) with signal SIGABRT.
buildstatus.service: Main process exited, code=killed, status=6/ABRT
buildstatus.service: Scheduled restart job, restart counter is at 1.

Finally, confirm the service is back:

systemctl is-active buildstatus.service
curl -sS -i http://127.0.0.1:9081/healthz | head

Expected:

active
HTTP/1.0 200 OK

Step 7 — Add an external HTTP check (optional, but practical)

The watchdog heartbeat is internal by design. It’s great at catching hangs, but it can’t tell you if your reverse proxy is misrouted, your firewall blocks traffic, or TLS/DNS broke. That’s why you still publish a simple external check.

If you already run Nginx, proxy to the internal port and expose /healthz on your domain. A minimal server block snippet:

location /buildstatus/ {
  proxy_pass http://127.0.0.1:9081/;
  proxy_set_header Host $host;
  proxy_set_header X-Real-IP $remote_addr;
}

location = /buildstatus/healthz {
  proxy_pass http://127.0.0.1:9081/healthz;
}

Then verify from outside your VPS (or from a monitoring box):

curl -sS -i https://example.com/buildstatus/healthz | head

If you need safe rollback patterns for Nginx config changes, keep this nearby: safe rollbacks with Nginx on a VPS.
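
If your monitoring box should do more than a one-off curl, the external check can be a tiny script. A minimal sketch, assuming your published URL and a 5-second budget (both are placeholders; `check_health` is a hypothetical helper, not part of the service above):

```python
#!/usr/bin/env python3
"""Minimal external health probe (sketch): HTTP 200 plus {"ok": true} counts as healthy."""
import json
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (ok, detail). ok is True only for HTTP 200 with an "ok": true body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return (bool(body.get("ok")), f"status={resp.status}")
    except urllib.error.HTTPError as e:
        # /healthz returns 503 when the tick is stale; treat it as unhealthy.
        return (False, f"status={e.code}")
    except Exception as e:
        # Timeout, DNS failure, refused connection, TLS error, etc.
        return (False, f"error={e.__class__.__name__}")
```

Call it with something like `check_health("https://example.com/buildstatus/healthz")` from cron or your monitoring agent, and alert on a False result. Because it goes through the public URL, it also catches proxy, TLS, and firewall breakage that the watchdog never sees.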

Step 8 — Tuning watchdog intervals and restart behavior

Pick values that match how your service fails in real life. These guidelines hold up well on small VPS deployments:

  • WatchdogSec: set it comfortably above the longest legitimate pause your service can take (GC, slow dependencies, CPU contention). For internal APIs that normally respond within 200–500 ms, 20s is conservative and avoids false positives during CPU spikes.
  • RestartSec: 1–3 seconds works well for small services. If a dependency outage is likely (DB down), a longer backoff cuts log noise.
  • StartLimitIntervalSec + StartLimitBurst: keep a bad deploy from thrashing the VPS.
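
The heartbeat interval your code should use follows directly from WatchdogSec=: systemd exports it as WATCHDOG_USEC, and the usual convention is to pet the watchdog at half that interval. A sketch of the derivation (the WATCHDOG_PID check guards against a child process inheriting the variable):

```python
import os


def watchdog_interval_seconds():
    """Return the recommended heartbeat interval (half of WatchdogSec=),
    or None when the watchdog is not armed for this process."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    # systemd also sets WATCHDOG_PID; ignore an inherited value meant
    # for a different (parent) process.
    pid = os.environ.get("WATCHDOG_PID")
    if pid is not None and int(pid) != os.getpid():
        return None
    return (int(usec) / 1_000_000) / 2
```

With WatchdogSec=20 this yields 10 seconds, which matches the `interval` computed in app.py above.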

Add loop protection if you expect bad deploys:

systemctl edit buildstatus.service

Drop-in file (/etc/systemd/system/buildstatus.service.d/override.conf):

[Unit]
StartLimitIntervalSec=120
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=3

Apply:

systemctl daemon-reload
systemctl restart buildstatus.service

Common pitfalls (and how to avoid them)

  • Forgetting Type=notify. If you send READY=1 but keep Type=simple, systemd won’t track readiness correctly. The service may still run, but you lose the semantics that make restarts predictable.

  • Sending watchdog heartbeats unconditionally. If your code keeps sending WATCHDOG=1 while it’s stuck, systemd has nothing to act on. Tie heartbeats to a cheap internal signal: queue progress, loop ticks, thread heartbeat, or a dependency probe with a hard timeout.

  • Setting an aggressive watchdog interval. A 2–5 second watchdog looks strict on paper, then flaps during GC pauses or noisy-neighbor bursts. Start at 20–60 seconds and tighten only after you’ve measured behavior under load.

  • Restart loops that chew disk. Fast crash loops can flood journald and any app logs you write. Keep rotation sane; this pairs well with VPS log rotation best practices.

  • Binding to 0.0.0.0 accidentally. For internal services, bind to 127.0.0.1 and expose through a reverse proxy. If you need private admin access across hosts, a VPN is usually cleaner than opening ports; see Tailscale VPS VPN setup.
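
The "dependency probe with a hard timeout" idea from the list above can be sketched like this (host, port, and the 2-second budget are assumptions; adapt them to your real dependency):

```python
import socket


def dependency_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP-connect probe with a hard deadline, so it can never wedge
    the heartbeat loop itself."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, unreachable, and timed-out connections.
        return False


# In watchdog_loop(), you could gate the heartbeat on the probe in
# addition to the tick age, e.g.:
#   if age <= 10 and dependency_reachable("127.0.0.1", 5432):
#       daemon.notify("WATCHDOG=1")
```

Be deliberate about what you gate on: if restarting your service can't fix the dependency, a failed probe will just produce a restart loop, so reserve this pattern for dependencies a fresh process can actually re-establish (stale connection pools, wedged clients).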

Rollback plan (clean and fast)

If the watchdog causes unexpected restarts, you should be able to back out quickly and leave the rest of the unit intact.

  1. Disable watchdog without deleting the service:

    systemctl edit buildstatus.service
    [Service]
    WatchdogSec=0
    Type=simple

    Then reload and restart:

    systemctl daemon-reload
    systemctl restart buildstatus.service
  2. If your last code change is suspect, revert the app file from backup or git and restart:

    cp /opt/buildstatus/app.py /opt/buildstatus/app.py.bad
    # restore known-good version here
    systemctl restart buildstatus.service
  3. If you need to fully remove the service:

    systemctl disable --now buildstatus.service
    rm -f /etc/systemd/system/buildstatus.service
    rm -rf /etc/systemd/system/buildstatus.service.d
    systemctl daemon-reload

For broader rollback discipline (snapshots + staged updates), keep a patching runbook. HostMyCode covers it in depth here: VPS Patch Management in 2026.

Next steps (make it production-friendly)

  • Add metrics: export a counter for watchdog restarts and health failures. If you already run OpenTelemetry, you can centralize signals without lock-in; see VPS monitoring with OpenTelemetry Collector.
  • Store state outside the process: anything the service needs after restart (queue position, tokens) should live in Redis/Postgres, not in RAM.
  • Introduce a canary restart: restart during a maintenance window and verify health endpoints and logs before peak hours.
  • Backups: if this service writes anything important, add file-level backups plus periodic restore tests.

Summary

Systemd’s watchdog gives you a low-dependency self-healing loop: your service proves it’s healthy, and the init system enforces that claim. The key is discipline. Your heartbeat has to mean something, and it must stop when the app stops making progress.

If you run internal APIs or a small SaaS backend on one VM, watchdog-based self-healing is one of the highest-ROI reliability improvements you can add. For a clean Linux VM with predictable restart behavior, start with a HostMyCode VPS; and if you'd rather not maintain unit hardening, restarts, and OS updates yourself, managed VPS hosting takes the operational edge off while you keep full application control.

FAQ

Does systemd watchdog replace external monitoring?

No. Watchdog catches “process is alive but broken” cases locally. You still want an outside check for DNS, TLS, proxy routing, and upstream connectivity.

What’s a good WatchdogSec value for APIs?

For internal APIs, start at 20–60 seconds. Tighten only after you’ve observed worst-case latency under load and during deploys.

Will watchdog restart on high CPU or memory pressure?

Indirectly. If your service can’t run its heartbeat loop (stalled scheduler, deadlock, extreme GC), heartbeats stop and systemd restarts it. For OOM kills, you’ll typically see a regular crash/restart instead.

Can I use systemd watchdog with Node.js or Go?

Yes. Node can use a small native binding or a sidecar heartbeat script that calls systemd-notify. Go can call sd_notify via libraries like coreos/go-systemd or execute systemd-notify directly.
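
If bindings aren't available in your language, note that the sd_notify protocol itself is just a datagram sent to the socket named in NOTIFY_SOCKET, so any runtime can speak it. A dependency-free Python sketch (the "@" handling follows the documented abstract-socket convention):

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one sd_notify datagram (e.g. "READY=1" or "WATCHDOG=1").
    Returns False when no systemd notify socket is present."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # Leading "@" marks a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), addr)
    return True
```

The same few lines translate almost mechanically to Node or Go when you don't want the extra dependency.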

How do I confirm watchdog restarts happened?

Check journalctl -u buildstatus.service for “Watchdog timeout” and “Scheduled restart job” lines, and compare the MainPID before/after.
