
Most VPS incidents aren’t “the server is down.” They’re quieter and nastier: a 300ms tail-latency jump, intermittent packet loss, one syscall stuck in mud, or CPU steal time you only notice after support tickets pile up. In 2026, Linux VPS monitoring with eBPF is one of the few ways to answer “what changed?” without having instrumented your app ahead of time.
This is a hands-on playbook, and it’s intentionally opinionated. You’ll run a small set of eBPF tools, capture evidence, and stop guessing. The working example is a small SaaS API on Debian 13 behind Nginx, with occasional 502s and latency spikes. You’ll trace TCP retransmits, disk I/O stalls, and CPU hotspots, then confirm each suspicion with repeatable commands.
What you’ll build: an eBPF-first incident workflow (without heavy agents)
You’ll assemble a minimal toolchain: bpftrace for quick one-off scripts, plus BCC (Debian packages it as bpfcc-tools) for proven utilities you can run under pressure. Keep it light—no full observability stack required—then decide what (if anything) deserves to stay installed.
- Fast diagnosis: catch packet drops/retransmits, syscall latency, and per-process CPU burn.
- Low overhead: eBPF runs in-kernel and filters events; you collect only what you ask for.
- Verifiable: every claim gets a “prove it” step, with expected output patterns.
If you later want a UI, pair this with a lightweight monitor. HostMyCode already has a practical guide to Beszel for low-resource monitoring, which complements eBPF nicely.
Prerequisites
- A Linux VPS with root or sudo. Examples below assume Debian 13 with kernel 6.6+ (many providers ship this by 2026).
- Basic shell comfort: SSH, editing files, reading logs.
- Optional but helpful: a test endpoint on your app you can curl repeatedly.
Hosting note: eBPF workflows work best with steady CPU and predictable I/O. If you’re running production APIs, a HostMyCode VPS gives you consistent resources and full kernel access, which matters for tracing.
Step 1: Confirm kernel + BTF support (don’t skip this)
Start by confirming your kernel supports modern eBPF features and BTF (BPF Type Format). This is how you avoid the classic failure mode: everything “installs,” then every tool errors out or shows nothing.
- Check the kernel version:

```shell
uname -r
```

Expected: something like `6.6.0-*` or newer.

- Check BTF availability:

```shell
ls -l /sys/kernel/btf/vmlinux
```

Expected: a readable file. If it’s missing, install your distro’s kernel debug/BTF package (Debian typically includes it with standard kernels in 2026, but hardened/minimal images vary).

- Check that the tracing filesystem is mounted:

```shell
mount | grep -E 'tracefs|debugfs' || true
```

Expected: `tracefs on /sys/kernel/tracing`. If not:

```shell
sudo mount -t tracefs tracefs /sys/kernel/tracing
```
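If you run these checks on more than one box, the three steps above can be bundled into a single preflight pass. Here’s a minimal sketch (the messages and structure are my own choosing); it warns instead of aborting, so one missing piece doesn’t hide the others:

```shell
#!/usr/bin/env bash
# Preflight sketch: bundle the Step 1 checks. Warn, don't abort, so a
# single missing piece doesn't hide the other problems.
set -u

echo "kernel: $(uname -r)"

if [ -r /sys/kernel/btf/vmlinux ]; then
  echo "btf: ok"
else
  echo "btf: MISSING (install your kernel's BTF/debug package)"
fi

if mount 2>/dev/null | grep -q tracefs; then
  echo "tracefs: mounted"
else
  echo "tracefs: NOT mounted (try: sudo mount -t tracefs tracefs /sys/kernel/tracing)"
fi
```

Run it before an incident, not during one; a failed preflight means Step 2 onward will waste your time.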
Step 2: Install bpftrace + baseline tools on Debian 13
Use distro packages first. They’re boring—in the best way—and they tend to match your kernel better than random binaries.
- Install tools (`sysstat` is included because `iostat` and `mpstat` are used in later steps; note that Debian installs the BCC utilities with a `-bpfcc` suffix, e.g. `/usr/sbin/tcpretrans-bpfcc`):

```shell
sudo apt update
sudo apt install -y bpftrace linux-perf bpfcc-tools sysstat jq
```

- Sanity check bpftrace:

```shell
sudo bpftrace -e 'BEGIN { printf("bpftrace ok\n"); exit(); }'
```

Expected: prints `bpftrace ok` and exits quickly.
If you’re on Ubuntu 24.04 instead, the commands are similar. If your app stack is FastAPI, you may also like HostMyCode’s systemd socket activation guide—useful when restarts are part of your mitigation plan.
Step 3: Create a “case folder” and capture the baseline (2 minutes, saves hours)
Incidents scramble memory. A case folder gives you a clean trail: what you saw, what you ran, and when you ran it.
- Create a folder:

```shell
sudo mkdir -p /root/ebpf-cases/api-latency-01
cd /root/ebpf-cases/api-latency-01
```

- Capture baseline system state:

```shell
date -Is | tee collected_at.txt
uname -a | tee uname.txt
uptime | tee uptime.txt
free -h | tee mem.txt
df -hT | tee disk.txt
ss -s | tee sockets_summary.txt
sudo journalctl -p warning..alert -n 200 --no-pager | tee journal_warn.txt
```

- If Nginx is involved, snapshot error logs:

```shell
sudo tail -n 200 /var/log/nginx/error.log | tee nginx_error_tail.txt
```
If you’re seeing 502s, keep a troubleshooting link nearby: fixing 502 in Nginx often comes down to upstream timeouts, but eBPF helps you prove what actually timed out.
Step 4: Reproduce the symptom with a controlled load (gentle, not a stress test)
You need a repeatable trigger. Don’t “stress test” production—just tap one endpoint on a steady cadence and log latency.
- Create a quick probe script (`/root/ebpf-cases/api-latency-01/probe.sh`):

```shell
cat > probe.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
URL="${1:-http://127.0.0.1:8088/health}"
for i in $(seq 1 60); do
  ts=$(date -Is)
  out=$(curl -s -o /dev/null -w 'code=%{http_code} total=%{time_total} connect=%{time_connect} starttransfer=%{time_starttransfer}\n' "$URL" || true)
  printf '%s %s\n' "$ts" "$out"
  sleep 1
done
EOF
chmod +x probe.sh
```

- Run it:

```shell
./probe.sh http://127.0.0.1:8088/health | tee curl_latency.txt
```

Expected: lines like `code=200 total=0.012 ...`. Leave it running while you trace.
This loop turns eBPF output into something you can line up with reality. Spikes in the probe should have matching evidence in retransmits, I/O latency, or slow syscalls.
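Once `curl_latency.txt` has data, you can pull the spike timestamps out mechanically instead of eyeballing 60 lines. A small post-processing sketch (the 0.25s threshold and the filename are assumptions from this walkthrough, not fixed values):

```shell
#!/usr/bin/env bash
# Sketch: list probe samples slower than a threshold so you know exactly
# which timestamps to match against your eBPF captures.
set -euo pipefail
f="${1:-curl_latency.txt}"
[ -r "$f" ] || { echo "no $f yet; run probe.sh first" >&2; exit 0; }

awk -v limit=0.25 '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^total=/) {
      t = substr($i, 7) + 0          # strip "total=", force numeric
      if (t > limit) print $1, $i    # timestamp plus the total=... field
    }
}' "$f"
```

Each printed line is a time window worth hunting for in the tracer output below.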
Linux VPS monitoring with eBPF for network retransmits (TCP pain in numbers)
“Random slowness” often maps to TCP retransmits. Causes vary: a noisy route, MTU trouble, queue pressure, or a box that can’t keep up with interrupts. Don’t argue with guesses—measure.
- Count TCP retransmits with BCC’s tcpretrans (`-c` prints per-flow counts; `timeout` bounds the capture to 10 seconds):

```shell
sudo timeout 10 /usr/sbin/tcpretrans-bpfcc -c | tee tcpretrans_10s.txt
```

Expected: a count per flow. Near zero under normal conditions is typical. If it spikes during your curl probe, you’ve got a real lead.

- Get per-flow details (short run):

```shell
sudo timeout 15 /usr/sbin/tcpretrans-bpfcc | tee tcpretrans_detail_15s.txt
```

Expected: lines showing `IP:port` pairs. Watch for your Nginx listener (e.g., 443/80) or app port (here, 8088).

- Verification step: check interface stats for drops/errors:

```shell
ip -s link | tee ip_link_stats.txt
```

Expected: low `dropped` and `errors`. If drops climb, you may be hitting queue limits or bursty traffic.
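For a cheap cross-check that needs no eBPF at all, the kernel already keeps a cumulative retransmit counter in /proc/net/snmp (the RetransSegs column of the Tcp rows). A minimal delta sketch, with an arbitrary 2-second window:

```shell
#!/usr/bin/env bash
# Read the cumulative TCP RetransSegs counter twice and print the delta.
# The first Tcp: line in /proc/net/snmp holds column names, the second
# the matching values.
set -euo pipefail

snap() {
  awk '/^Tcp:/ { if (hdr == "") hdr = $0; else val = $0 }
       END {
         n = split(hdr, h); split(val, v)
         for (i = 1; i <= n; i++) if (h[i] == "RetransSegs") print v[i]
       }' /proc/net/snmp
}

a=$(snap); sleep 2; b=$(snap)
echo "retransmitted segments in 2s: $((b - a))"
```

If this system-wide counter is flat while tcpretrans shows activity, the retransmits are concentrated in a flow you filtered out; if both climb together during probe spikes, the network lead is solid.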
If you need to reach a private server through a bastion, keep your access method safe. HostMyCode’s reverse SSH tunnel playbook is a solid pattern when inbound SSH isn’t an option.
Step 5: Find CPU hotspots per process (catch the noisy neighbor inside your own box)
CPU trouble rarely shows up as neat “100% user.” Under load you’ll see kernel time, softirqs, and syscall overhead eat the budget.
- Start with a simple snapshot:

```shell
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 15 | tee top_procs_cpu.txt
```

- Then use bpftrace’s `profile` probe to sample on-CPU command names at 99 Hz (10 seconds):

```shell
sudo timeout 10 bpftrace -e 'profile:hz:99 { @[comm] = count(); }' | tee cpu_profile_comm.txt
```

Expected: a command-name breakdown with counts. If `nginx`, your app, or `postgres` dominates during a spike, you know where to focus.

- Verification step: check for CPU steal time (virtualization pressure):

```shell
mpstat -P ALL 1 5 | tee mpstat_steal.txt
```

Expected: `%steal` near 0 on a healthy VPS. Consistent steal during slow periods points to host contention; mitigation may include resizing or moving tiers.
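mpstat averages can smooth over short stalls, so it helps to know where the number comes from: the eighth numeric field on the aggregate `cpu` line of /proc/stat is cumulative steal jiffies. A minimal sketch computing %steal over a 2-second window (integer math is plenty for a triage signal):

```shell
#!/usr/bin/env bash
# Compute %steal over a short window straight from /proc/stat.
# Numeric fields on the "cpu" line: user nice system idle iowait irq
# softirq steal (then guest counters, which we ignore here).
set -euo pipefail

read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
echo "steal: $(( (st2 - st1) * 100 / total ))%"
```

Anything consistently above a few percent during your probe spikes is worth escalating to the host provider rather than tuning around.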
For production workloads where you don’t want to babysit noisy neighbors, consider managed VPS hosting—not because “someone else runs Linux,” but because escalation is faster when the problem is below your VM.
Step 6: Trace disk I/O latency (the hidden source of 502s and slow DB calls)
Even on NVMe-backed plans, brief stalls can blow up request latency if they land on a hot path. Common culprits: logging, session writes, SQLite, or a busy database.
- Quick view of I/O wait:

```shell
iostat -xz 1 5 | tee iostat_xz.txt
```

Expected: low `%util` and reasonable `await`. If `await` jumps into tens or hundreds of ms during curl spikes, keep going.

- Use biolatency to see block I/O latency histograms, one per disk, over 10 seconds:

```shell
sudo timeout 10 /usr/sbin/biolatency-bpfcc -D 1 | tee biolatency_10s.txt
```

Expected: a histogram. Healthy boxes cluster in low ms. A long tail (e.g., 50–500ms buckets) that lines up with request spikes is a problem.

- Verification: identify which processes are opening files:

```shell
sudo timeout 10 /usr/sbin/opensnoop-bpfcc -T | tee opensnoop_10s.txt
```

Expected: file opens with process names. If you see constant opens of big logs, temp directories, or DB files, you can often fix it with buffering, log tuning, or moving a DB off-box.
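If biolatency shows a long tail, a tiny synchronous-write probe tells you whether fsync-style writes on this filesystem actually pay that tail. A sketch using 4 KiB O_DSYNC writes to a temp file (the counts and sizes are arbitrary; the numbers only matter relative to each other):

```shell
#!/usr/bin/env bash
# Time five small O_DSYNC writes; each dd forces the data to the device,
# so per-write times approximate your worst-case durable-write cost.
set -euo pipefail
f=$(mktemp /tmp/fsync-probe.XXXXXX)
trap 'rm -f "$f"' EXIT

for i in 1 2 3 4 5; do
  start=$(date +%s%N)
  dd if=/dev/zero of="$f" bs=4k count=1 oflag=dsync conv=notrunc status=none
  end=$(date +%s%N)
  echo "write+sync $i: $(( (end - start) / 1000000 ))ms"
done
```

Single-digit milliseconds is normal for NVMe-backed plans; repeated tens of milliseconds here, during a latency spike, is the same stall biolatency is bucketing.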
If your database is the real bottleneck, separate it early. HostMyCode’s database hosting options can remove noisy disk contention from your API node.
Step 7: Measure syscall latency (pinpoint “slow requests” that are really slow syscalls)
Some spikes come down to one syscall dragging: fsync(), connect(), accept(), or a slow resolver path in libc. eBPF lets you time those calls without touching your code.
- Trace slow syscalls by duration (bpftrace one-liner). This example logs any syscall taking more than 50ms, with a wall-clock timestamp so you can correlate it later:

```shell
sudo timeout 20 bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  $d = (nsecs - @start[tid]) / 1000000;
  if ($d > 50) {
    time("%H:%M:%S ");
    printf("%s tid=%d syscall=%d %dms\n", comm, tid, args->id, $d);
  }
  delete(@start[tid]);
}' | tee slow_syscalls_20s.txt
```

Expected: ideally nothing, or only occasional entries. If you see repeated slow exits during your curl spikes, that’s a concrete lead.
- Verification: correlate time windows. Find timestamps in `curl_latency.txt` where `total` spiked, then compare with slow-syscall lines captured in the same period.
If the syscall ID isn’t obvious, map it quickly:

```shell
ausyscall --dump | head
```

(Install the auditd utilities if missing.) For deeper audit-driven detection, see HostMyCode’s guide to auditd log monitoring and alerting.
Step 8: Trace Nginx upstream latency without changing Nginx config
If you’re hitting 502/504, the upstream may be slow, wedged, or simply not accepting connections quickly enough. You can instrument the app, but you can also watch the system calls and TCP connects that make the request path work.
- Watch TCP connects to your upstream port (example: upstream on 127.0.0.1:8088) with tcpconnect. Note the capital `-P`: it filters by destination port, while lowercase `-p` filters by PID:

```shell
sudo timeout 15 /usr/sbin/tcpconnect-bpfcc -P 8088 | tee tcpconnect_upstream_15s.txt
```

Expected: connect events showing PID/command. If you see lots of short-lived connects, consider upstream keepalives or Unix sockets. If connects stall, that’s a different problem (backlog, SYN retries, local firewall rules).

- Verification step: check backlog and listen queues:

```shell
ss -lntp | grep -E ':8088|:80|:443' | tee listen_queues.txt
```

Expected: `Recv-Q` should not sit high on listening sockets. A persistently high `Recv-Q` suggests backlog pressure (the app isn’t accepting connections fast enough).
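Because backlog pressure is bursty, a single ss snapshot can miss it. Here’s a small watcher sketch (ports 80/443/8088 are this walkthrough’s examples) that samples the listen queues once a second so a transient spike still shows up:

```shell
#!/usr/bin/env bash
# Sample listener queues a few times; a Recv-Q that stays high across
# samples is backlog pressure, a one-off blip usually isn't.
set -euo pipefail
command -v ss >/dev/null || { echo "ss not found (install iproute2)" >&2; exit 0; }

for i in 1 2 3; do
  date -Is
  # Column 4 of `ss -lnt` is the local address:port; keep the header
  # line plus the listeners we care about.
  ss -lnt | awk 'NR == 1 || $4 ~ /:(80|443|8088)$/'
  sleep 1
done
```

Stretch the loop count during a real incident; three iterations is just enough to demo the pattern.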
If the symptom is a 504 (upstream timeout), keep a separate checklist handy. The 504 troubleshooting patterns still apply even if you’re not using PHP-FPM: upstream timeouts usually mean capacity limits or a stalling dependency.
Step 9: Turn your findings into a small, repeatable “incident script”
Once you know which commands pay off, package them. You want a single “run this now” script that collects the same bundle every time, for clean comparisons.
Below is a minimal collector that runs for 20 seconds and writes to a timestamped directory.
- Create `/usr/local/sbin/ebpf-quicktriage`:

```shell
sudo tee /usr/local/sbin/ebpf-quicktriage > /dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
D="/root/ebpf-cases/quicktriage-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$D"
cd "$D"
uname -a > uname.txt
date -Is > collected_at.txt
ss -s > sockets_summary.txt
ip -s link > ip_link_stats.txt
# 20s captures
(timeout 20 /usr/sbin/tcpretrans-bpfcc -c || true) > tcpretrans.txt
(timeout 20 /usr/sbin/biolatency-bpfcc -D 1 || true) > biolatency.txt
(timeout 20 bpftrace -e 'profile:hz:99 { @[comm] = count(); }' || true) > cpu_profile_comm.txt
echo "Wrote triage bundle to $D"
EOF
sudo chmod +x /usr/local/sbin/ebpf-quicktriage
```

- Run it during a spike:

```shell
sudo /usr/local/sbin/ebpf-quicktriage
```

Expected: prints a directory path and creates files you can compare between “good” and “bad” periods.
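Once you have two bundles, the payoff is the comparison. A sketch of a diff helper (the file names match the collector above; pass a “good” directory and a “bad” one):

```shell
#!/usr/bin/env bash
# Diff the small text captures between two triage bundles so the change
# between a good period and a bad period jumps out.
set -euo pipefail
good="${1:-}"; bad="${2:-}"
[ -d "$good" ] && [ -d "$bad" ] || { echo "usage: $0 GOOD_DIR BAD_DIR" >&2; exit 0; }

for f in sockets_summary.txt ip_link_stats.txt cpu_profile_comm.txt; do
  echo "== $f =="
  # diff exits 1 when files differ; that's the interesting case, not an error.
  diff -u "$good/$f" "$bad/$f" || true
done
```

A socket-count jump, new interface drops, or a different process dominating the CPU profile are exactly the deltas you want in the postmortem.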
Common pitfalls (the stuff that makes people give up on eBPF)
- Kernel mismatch: old kernels or missing BTF produce confusing failures. Step 1 avoids most of that.
- Running everything at once: eBPF is efficient, but piling on 10 tracers in production is still a bad idea. Use short timeouts and stay focused.
- Blaming the first scary metric: retransmits can be a symptom of CPU starvation, not a “bad network.” Check steal time and softirq load before pointing fingers.
- No reproduction loop: without a curl probe or known trigger, you’re staring at noise. Keep Step 4 ready.
- Permissions: most tools require root. If you must delegate, use tightly scoped sudo rules for specific binaries.
Rollback plan (how to back out cleanly)
This workflow is designed to be low-risk. Still, you should be able to unwind everything quickly.
- Stop any running tracers:

```shell
sudo pkill -f bpftrace || true
sudo pkill -f tcpretrans || true
sudo pkill -f biolatency || true
```

- Remove tools if you no longer want them installed:

```shell
sudo apt remove -y bpftrace bpfcc-tools
```

- Remove your helper script (optional):

```shell
sudo rm -f /usr/local/sbin/ebpf-quicktriage
```

- Keep the case folders for postmortems. If disk is tight, archive them:

```shell
sudo tar -C /root -czf /root/ebpf-cases-archive.tgz ebpf-cases
```
If rollback is part of a bigger incident response, pair it with a real DR routine. HostMyCode’s VPS disaster recovery runbook is a good checklist for restore tests and fast reversions.
Next steps (make this useful after the incident is over)
- Turn the best two tracers into runbooks: for many teams, that’s tcpretrans and biolatency. Keep a “good period” capture for comparison.
- Add lightweight dashboards: eBPF isn’t a time-series database. Use a small monitoring tool for CPU, memory, disk, and basic latency; save eBPF for deep dives.
- Fix the proven bottleneck: if disk stalls show up, buffer logs, reduce sync writes, or split DB/API nodes.
- Stabilize production: if steal time is consistent, resize or change plan. No amount of tuning inside the guest fixes host contention.
If you plan to use eBPF regularly, pick a VPS where you can rely on kernel features and predictable performance. A HostMyCode VPS is a solid baseline, and managed VPS hosting helps when the incident turns out to be host-level contention or tricky networking.
FAQ
Is Linux VPS monitoring with eBPF safe to run in production?
Yes, if you keep sessions short and targeted. Use timeout, collect for 10–30 seconds, and avoid stacking multiple high-frequency tracers at once.
Do I need to recompile my app or enable special flags?
No. The point of eBPF here is observing kernel events and syscalls without changing application code. You’ll get better results if your binaries have symbols, but it’s not required.
Why do my eBPF tools fail even though they installed?
The usual causes are missing BTF (/sys/kernel/btf/vmlinux), an old kernel, or tracefs not mounted. Recheck Step 1 and confirm kernel compatibility.
What’s the quickest “first command” when users report random latency spikes?
Run tcpretrans under a 10-second timeout (`sudo timeout 10 tcpretrans-bpfcc -c`) while you probe the endpoint. If retransmits spike exactly when latency spikes, you’ve narrowed the problem dramatically.
Should I replace my existing monitoring stack with eBPF?
No. Use your normal metrics/logs for trends and alerting, and use eBPF as the surgical tool for root cause analysis when the graphs say “something is wrong” but not “what.”
Summary
eBPF gives you evidence you can act on: retransmits instead of “the network feels slow,” block latency instead of “disk might be busy,” and syscall timing instead of “the app hung.” Keep captures short, keep the probe loop running, and you’ll resolve more VPS incidents with fewer panicked changes.
If you want a stable home for this workflow, run it on a VPS with consistent performance and full Linux control, like a HostMyCode VPS from HostMyCode (Affordable & Reliable Hosting).