
VPS Incident Response Checklist (2026): A Practical Linux Runbook for Fast Triage and Containment

VPS incident response checklist for 2026: triage, contain, preserve evidence, and recover safely on Linux—commands included.

By Anurag Singh
Updated on Apr 13, 2026
Category: Blog

Most VPS incidents don’t begin with a dramatic outage. They start as a small signal: a CPU spike at 03:12, a new listening port, a burst of outbound traffic, or a disk that suddenly hits 100%. In 2026, the difference between a minor scare and a real breach often comes down to what you do in the first 15 minutes.

This VPS incident response checklist is a practical, Linux-first runbook you can keep in your ops repo. It’s for sysadmins and developers running production services on a VPS—APIs, internal tools, WordPress, background workers—where you need fast triage, safe containment, and evidence you can trust later.

Scope: what this runbook covers (and what it doesn’t)

This checklist targets single-host and small-fleet VPS incidents: suspicious processes, unexpected network activity, credential compromise, file tampering, and sudden performance regressions. It assumes you have SSH access.

  • Covers: triage, containment, evidence capture, eradication, recovery, verification, and rollback decision points.
  • Doesn’t cover: complex multi-region forensics, kernel rootkits at scale, or legal chain-of-custody requirements (though the evidence steps still help).

Prerequisites (prepare before you need them)

If you wait until an incident to set up the basics, you’ll improvise under pressure. Put this in place while everything’s quiet:

  • Out-of-band access: provider console / rescue mode / snapshot capability.
  • Known-good baseline: documented open ports, system users, services, and critical file hashes.
  • Central logs or at least log retention: journald persistence and reasonable rotation.
  • Backups and restore test: file-level backups plus snapshots, with periodic restore drills.
  • Tooling installed: jq, curl, tcpdump, lsof, ripgrep, auditd (optional), and a safe editor.
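On Debian or Ubuntu, for example, the list above installs in one step (package names differ on other distros, and auditd is worth enabling deliberately rather than by default):

```shell
# Debian/Ubuntu example; swap in dnf/yum and the matching package names on RHEL-family.
apt-get update && apt-get install -y jq curl tcpdump lsof ripgrep
```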

If you’re running production workloads, use a VPS plan that makes snapshots and emergency access easy. A HostMyCode VPS is a solid option if you want predictable Linux performance plus the operational fundamentals (snapshots, console access, clean networking).

Phase 0: decide if this is an incident (2-minute triage)

Start by classifying what you’re seeing. Is this a routine regression, or does it look hostile? Use quick signals and move.

  1. Check uptime and immediate resource pressure

    uptime
    free -h
    df -hT
    

    Expected output: load average consistent with your baseline; memory not pinned; root filesystem not > 90% used.

  2. Confirm what’s listening and who opened it

    ss -lntup | head -n 50
    

    Look for unexpected listeners (example: 0.0.0.0:48721) or processes you can’t identify.
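A fast way to spot a new listener is to diff against a baseline captured while the host was known-good. A minimal sketch, with an illustrative baseline path:

```shell
mkdir -p /root/ir
BASELINE=/root/ir/ports.baseline

# Capture once while the host is clean...
ss -lntu | awk 'NR > 1 {print $1, $5}' | sort -u > "$BASELINE"

# ...then diff during triage: any output is a new or vanished listener.
ss -lntu | awk 'NR > 1 {print $1, $5}' | sort -u | diff "$BASELINE" - \
  && echo "listeners match baseline"
```

Store the baseline alongside your one-page baseline doc so the comparison is trivial at 3 a.m.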

  3. Scan for suspicious recent auth events

    journalctl -u ssh -u sshd --since "-2 hours" --no-pager | tail -n 80
    last -a | head
    

    Red flags: logins from new countries/ASNs, sudden bursts of failed logins, root login attempts, or successful auth outside normal windows.
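To quantify a brute-force burst, a quick tally of failed-password sources helps. The unit is `ssh` on Debian-family systems and `sshd` on RHEL-family, so this sketch queries both:

```shell
# Count failed SSH logins per source IP over the last 24 hours.
journalctl -u ssh -u sshd --since "24 hours ago" --no-pager 2>/dev/null \
  | awk '/Failed password/ {for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1)}' \
  | sort | uniq -c | sort -nr | head
```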

If you suspect active compromise, contain first. If it looks like a pure availability issue, follow the same structure and skip the heavier forensics steps.

Phase 1: containment without destroying evidence

This is where teams do damage with good intentions. Don’t “clean up” yet. Don’t reboot unless you have to. Stop the bleeding while keeping the host observable.

  1. Open a root shell and start a session log

    sudo -i
    mkdir -p /root/ir/2026-incident-$(date +%F)
    script -q /root/ir/2026-incident-$(date +%F)/terminal.log
    

    Why: You get a simple audit trail of what you ran and when.

  2. Take a snapshot (if your provider supports it)

    Do this from the provider panel. If snapshots are fast on your platform, take one before you make major changes.

    If you already run automated snapshots, confirm they’re enabled and recent. For a straightforward verification and retention setup, see this snapshot automation guide.

  3. Temporarily restrict inbound traffic to known admin IPs

    If it’s safe for your environment, lock down SSH to your current IP first. Example using UFW:

    ufw status verbose
    ufw allow from 203.0.113.10 to any port 22 proto tcp
    ufw deny 22/tcp
    ufw enable
    

    Warning: Ordering matters. Allow your IP before denying. If you want a full, SSH-safe hardening sequence, reference our SSH-safe UFW guide.

  4. Limit outbound traffic if exfiltration is suspected

    Outbound blocks are situational, and they can break production in surprising ways. If you see data leaving for unknown networks, a temporary default-deny egress policy can buy time. If that’s too risky, capture traffic (next section) and throttle at the provider level where possible.
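If you do go to default-deny egress, a minimal UFW sketch looks like the following. Every allow rule and the destination IP are illustrative; confirm you are not cutting off DNS, NTP, or your own backup path before applying it:

```shell
# Temporary containment posture, not a permanent policy.
ufw default deny outgoing
ufw allow out 53                 # DNS
ufw allow out 123/udp            # NTP
ufw allow out 443/tcp            # endpoints your app genuinely needs
ufw allow out to 203.0.113.50    # example: backup or log destination
ufw status verbose
```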

  5. Disable suspected credentials (do not delete yet)

    # List users with shells
    awk -F: '$7 ~ /(bash|zsh|sh)$/ {print $1":"$7}' /etc/passwd
    
    # Lock a suspicious user
    usermod -L suspicioususer
    
    # Expire the account immediately (blocks login; undo with chage -E -1)
    chage -E 0 suspicioususer
    

    If keys may be compromised, rotate SSH keys for privileged accounts and consider moving admin access behind a private overlay network. If you’re heading that direction, this Tailscale setup is a practical pattern.
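A minimal rotation sketch for one privileged account, with illustrative paths (generate the replacement key on a trusted workstation when you can, not on the suspect host):

```shell
# Example only: mint a replacement key and swap it in, keeping a dated backup.
mkdir -p /root/ir
NEW_KEY=/root/ir/rotate-admin-ed25519
ssh-keygen -q -t ed25519 -N "" -f "$NEW_KEY" -C "rotated-$(date +%F)"

AK=${AK:-/root/.ssh/authorized_keys}          # or the admin user's file
mkdir -p "$(dirname "$AK")" && touch "$AK"    # already exists on a real host
cp -a "$AK" "$AK.pre-rotation.$(date +%F)"    # rollback point and evidence
cat "$NEW_KEY.pub" > "$AK"                    # replace, don't append
chmod 600 "$AK"
```

Confirm a new session works with the new key before closing your current one.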

Phase 2: capture evidence (quick, targeted, and stored safely)

You don’t need perfect evidence collection. You need consistent, fast snapshots of what the host looked like before you changed it. Save outputs under your incident directory so you can diff later.
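Every capture below writes under the same incident directory; a tiny helper (names here are illustrative) keeps the paths consistent and the outputs diffable:

```shell
# Illustrative helper: run a command and keep a copy of its output for later diffing.
IR="/root/ir/2026-incident-$(date +%F)"
mkdir -p "$IR"

capture() {                # usage: capture <name> <command...>
  local name="$1"; shift
  "$@" 2>&1 | tee "$IR/$name.txt"
}

capture identity date -Is  # saves output to "$IR/identity.txt"
```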

  1. Record system identity and time

    date -Is
    hostnamectl
    uname -a
    who
    
  2. Process and parent tree snapshot

    ps auxfww --sort=-%cpu | head -n 40 | tee /root/ir/2026-incident-$(date +%F)/ps-topcpu.txt
    ps auxfww --sort=-%mem | head -n 40 | tee /root/ir/2026-incident-$(date +%F)/ps-topmem.txt
    
  3. Network connections and listening ports

    ss -plant | tee /root/ir/2026-incident-$(date +%F)/ss-plant.txt
    ss -s | tee /root/ir/2026-incident-$(date +%F)/ss-summary.txt
    ip a | tee /root/ir/2026-incident-$(date +%F)/ip-a.txt
    ip r | tee /root/ir/2026-incident-$(date +%F)/ip-r.txt
    
  4. Capture a short packet trace (60 seconds)

    timeout 60 tcpdump -i any -nn -s 0 -w /root/ir/2026-incident-$(date +%F)/capture-60s.pcap
    ls -lh /root/ir/2026-incident-$(date +%F)/capture-60s.pcap
    

    Expected output: a pcap file sized from a few KB to many MB, depending on traffic. If it’s hundreds of MB in 60 seconds, treat that as a real signal.

  5. Auth and sudo activity

    journalctl --since "-24 hours" _COMM=sudo --no-pager | tail -n 200 | tee /root/ir/2026-incident-$(date +%F)/sudo-last24h.txt
    journalctl -u ssh -u sshd --since "-24 hours" --no-pager | tee /root/ir/2026-incident-$(date +%F)/ssh-last24h.txt
    
  6. Changes under common persistence locations

    find /etc/cron.* /var/spool/cron -type f -mtime -7 -ls 2>/dev/null | tee /root/ir/2026-incident-$(date +%F)/cron-recent.txt
    find /etc/systemd/system /lib/systemd/system -type f -mtime -7 -ls 2>/dev/null | tee /root/ir/2026-incident-$(date +%F)/systemd-recent.txt
    

Phase 3: diagnose the most common incident patterns

This section is opinionated on purpose. It’s the quickest path to answers for the problems that show up most often on VPS deployments.

Pattern A: CPU spike + unknown process

  1. Identify the process, binary path, and open files

    top -o %CPU
    ps -p 12345 -o pid,ppid,user,lstart,cmd
    readlink -f /proc/12345/exe
    lsof -p 12345 | head -n 50
    
  2. Check if it’s a container or a system service

    systemctl status --no-pager --full | head
    systemctl status suspicious.service --no-pager --full || true
    
  3. Quick binary reputation check (offline-friendly)

    sha256sum $(readlink -f /proc/12345/exe) | tee /root/ir/2026-incident-$(date +%F)/sha256-suspect.txt
    strings -a $(readlink -f /proc/12345/exe) | head -n 50
    

    Don’t upload internal binaries to third-party scanners unless your policy allows it.
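Another cheap provenance check is whether the binary belongs to a distro package at all; a miner dropped into /tmp or /dev/shm will not. This sketch uses the current shell's binary as a stand-in for the suspect PID:

```shell
# Substitute the suspect PID for $$ during a real incident.
BIN="$(readlink -f /proc/$$/exe)"
dpkg -S "$BIN" 2>/dev/null \
  || rpm -qf "$BIN" 2>/dev/null \
  || echo "$BIN is not owned by any package - treat as suspect"
```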

Pattern B: outbound traffic surge (possible exfil or cryptominer control)

  1. Find top talkers quickly

    ss -tan | awk '$1=="ESTAB" {print $5}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -nr | head
    
  2. Map remote IPs to processes

    ss -tanp | rg -n "ESTAB" | head -n 30
    lsof -i -P -n | head -n 40
    

Pattern C: disk full + logs exploding

This often presents as an “incident,” but the root cause is usually a logging misconfiguration or a loop that won’t stop. Still: confirm it isn’t intentional log flooding.

  1. Identify what consumed the space

    df -h /
    du -xhd1 /var | sort -h | tail -n 15
    du -xhd1 /var/log | sort -h | tail -n 15
    
  2. Check journald size

    journalctl --disk-usage
    

If you need a safe cleanup approach and durable configuration, follow our log rotation best practices so you don’t “fix” disk pressure by deleting the very evidence you’ll want later.
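If the journal itself is consuming the space, cap it instead of deleting files by hand. Sizes here are examples; vacuuming discards archived entries, so export anything evidentiary first:

```shell
mkdir -p /root/ir
# Preserve recent history before trimming anything.
journalctl --since "-48 hours" --no-pager > /root/ir/journal-last48h.txt
# Then cap archived journals at roughly 500 MB and confirm.
journalctl --vacuum-size=500M
journalctl --disk-usage
```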

Pattern D: website/API defacement or unexpected code changes

  1. Check recent file modifications in app directories

    # Example path for a Node/Go/Python app
    APP_DIR=/srv/api-slate
    find "$APP_DIR" -type f -mtime -3 -printf '%TY-%Tm-%Td %TT %p\n' | sort | tail -n 40
    
  2. Validate deployed artifacts against your CI/CD expectations

    If you can’t reproduce what’s on disk from your pipeline, assume compromise until you can prove otherwise.
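One workable pattern, sketched here with throwaway demo paths: have CI write a hash manifest at deploy time, then re-verify it during triage. On a real host, APP_DIR is your deploy root and the manifest ships with the release.

```shell
# Demo setup standing in for a deployed tree.
APP_DIR=/tmp/demo-app
mkdir -p "$APP_DIR"
printf 'hello\n' > "$APP_DIR/app.txt"

# At deploy time (CI): record known-good hashes.
( cd "$APP_DIR" && find . -type f -exec sha256sum {} + | sort -k2 ) > "$APP_DIR.manifest"

# During the incident: silence means clean; any output names a changed file.
( cd "$APP_DIR" && sha256sum -c --quiet "$APP_DIR.manifest" ) && echo "artifacts match manifest"
```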

Phase 4: eradicate persistence (carefully)

Once you’ve captured enough to act, start removing persistence. Keep changes small, and keep them reversible.

  1. Audit systemd for suspicious services/timers

    systemctl list-unit-files --type=service --state=enabled
    systemctl list-timers --all | head -n 80
    

    Look for units that don’t match your stack. Example suspicious unit file: /etc/systemd/system/net-cache.service.

  2. Disable first, then inspect, then remove

    systemctl disable --now net-cache.service || true
    systemctl cat net-cache.service || true
    ls -l /etc/systemd/system/net-cache.service || true
    

    Save a copy before deletion:

    cp -a /etc/systemd/system/net-cache.service /root/ir/2026-incident-$(date +%F)/
    
  3. Check cron and user startup files

    crontab -l || true
    ls -la /etc/cron.d
    for u in $(cut -d: -f1 /etc/passwd); do crontab -u "$u" -l 2>/dev/null | sed "s/^/[user:$u] /"; done | head -n 200
    
  4. Hunt for new SSH keys and authorized_keys changes

    find /home -maxdepth 3 -name authorized_keys -type f -mtime -30 -ls 2>/dev/null
    find /root -maxdepth 2 -name authorized_keys -type f -mtime -30 -ls 2>/dev/null
    

Phase 5: recovery (restore service without reintroducing the problem)

Recovery should feel uneventful. If you’re improvising under pressure, stop and re-check what you think happened.

  1. Decide: clean in place vs rebuild

    • Clean in place fits clear, contained issues (misconfig, runaway logs, known bad deploy).
    • Rebuild is usually the right call if you suspect root compromise, unknown persistence, or credential theft.

    In 2026, rebuilding from a known-good image and restoring data is often faster than trying to prove a host is clean.

  2. Rotate secrets with a plan

    At minimum rotate:

    • SSH keys for privileged users
    • Application secrets (JWT signing keys, API keys)
    • Database passwords and replication credentials
    • Object storage credentials used for backups

    If your app uses pooled DB connections, plan rotation so you don’t trigger cascading failures. Keep connection pooling behavior in mind during failover and secret rotation.

  3. Restore from backups, then verify integrity

    If you want a structured restore workflow with test restores and clear rollback points, model it on this disaster recovery runbook.

Verification: prove the system is stable (and quiet)

Verification isn’t one command you run once. Use a short checklist immediately after changes, then run it again 30–60 minutes later.

  • Services healthy: systemctl --failed returns none.
  • Ports expected: ss -lntup shows only known listeners.
  • Auth quiet: no new suspicious SSH logins; no brute-force spikes.
  • Resource stable: load average drops, swap stops growing, disk usage trends flat.
    systemctl --failed
    ss -lntup
    journalctl -u ssh -u sshd --since "-30 min" --no-pager | tail -n 80
    uptime
    free -h

Common pitfalls (things that slow you down or make it worse)

  • Rebooting too early: you lose volatile evidence (process tree, network connections). Reboot only after capture.
  • Deleting logs to free space: rotate and compress instead; move evidence off-host if you can.
  • Locking yourself out with firewall rules: always allow your admin IP first, then deny broadly.
  • Changing too many variables at once: small steps are easier to verify and roll back.
  • Assuming the attacker used SSH: web shells, CI tokens, leaked API keys, and vulnerable plugins are common entry points.

Rollback strategy (how to undo containment and recovery safely)

Rollback doesn’t mean “pretend nothing happened.” It means restoring service without losing the safeguards and context you just gained.

  • Firewall rollback: keep a saved copy of rules and revert only after you confirm the incident is resolved.
  • Service rollback: if you disabled a unit, keep the unit file in /root/ir/… and re-enable only if you’re sure it’s legitimate.
  • Snapshot rollback: if a cleanup step breaks production, rolling back to a snapshot can restore service quickly, but it can also reintroduce compromise. Treat snapshot restore as a temporary bridge to a clean rebuild.

If you already use staging and rollback for updates, the mindset is the same here. The workflow in this patch management guide maps cleanly to incident rollback decisions.

Next steps: make the next incident smaller

  • Add monitoring that answers “what changed?” Capture logs + metrics with minimal lock-in. For a solid baseline, see our OpenTelemetry Collector monitoring setup.
  • Standardize server access: use a bastion or private VPN, and remove public SSH when possible.
  • Practice restore drills: do one quarterly. Time it. Document it.
  • Write a one-page baseline: known ports, services, deployment method, and where logs live.

Summary

Good incident response feels boring on purpose: contain, capture evidence, diagnose, eradicate persistence, recover, verify—then you can relax. This VPS incident response checklist gives you a repeatable flow that works whether the root cause is a noisy deploy or a real compromise.

If you want a VPS environment where snapshots, clean networking, and predictable Linux performance take some stress out of incident handling, run production on a HostMyCode VPS, or hand off baseline hardening, monitoring setup, and operational guardrails with managed VPS hosting.

FAQ

Should I reboot a compromised VPS?

Not first. Rebooting wipes volatile evidence like process trees and active network connections. Capture evidence, contain access, then decide whether a reboot belongs in recovery.

What’s the fastest safe containment step?

Restrict SSH to your known admin IP and confirm you still have access. If you can, also restrict web/admin panels while you investigate.

How do I know if an unknown process is malicious?

Start with provenance: binary path, parent process, systemd unit, and network connections. If it has an odd path (like /tmp/.x), suspicious persistence (a new systemd unit), and unusual outbound connections, treat it as hostile.

Is rebuilding always required after a breach?

If you suspect root compromise or can’t account for persistence, rebuilding from a trusted image is usually the most reliable option. Cleaning in place is only safe when you can fully explain the intrusion path and the changes made.

What logs should I ship off-host for future incidents?

At minimum: SSH auth logs, sudo events, web server access/error logs, and your application logs. Off-host log shipping makes tampering harder and investigations faster.