VPS hosting troubleshooting checklist: fix slow sites, 502 errors, and random downtime in 2026

A “slow site” report is rarely one clean issue. It’s a symptom. Common causes include CPU steal, disk I/O wait, PHP-FPM saturation, noisy MySQL queries, or a DNS/SSL mistake that affects only some visitors.

The quickest path back to stable service is a VPS hosting troubleshooting checklist. Start with user impact. Then move down the stack in a fixed order.

This is a runbook-style flow for 2026. Use it at 2 a.m. when a WordPress store is throwing 502s, Nginx access logs look normal, and uptime monitors won’t stop flapping.

Start with a 3-minute triage: what’s broken, for whom, and since when?

Before you touch the VPS, lock in three facts. They keep you from debugging the wrong layer.

Scope: one URL, one domain, or everything on the VPS?
Failure mode: slow (TTFB high), error (502/504/500), or hard-down (connection refused/timeouts)?
Timeline: did it start after a deploy, plugin update, SSL renewal, or traffic spike?

Quick external checks from your laptop:

curl -I https://example.com (look at the status line and headers; -I prints no timing, so add -w '%{time_total}\n' when you want response time)
curl -Iv https://example.com (TLS handshake details; useful for cert chain/SNI problems)
dig +short A example.com and dig +short AAAA example.com (confirm DNS returns what you expect)

If failures are intermittent, test from two networks (mobile + office).

A bad IPv6 (AAAA) record can look “random,” because only some clients prefer IPv6.
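One way to make the IPv6 question concrete is to probe the same URL over each address family. The sketch below is a minimal example, not a standard tool: check_families is a hypothetical helper name, and the 10-second cap is an arbitrary choice.

```shell
# check_families URL: hypothetical helper that probes a URL over IPv4 and
# IPv6 separately, so a broken AAAA record shows up as an IPv6-only failure.
check_families() (
  url="$1"
  for fam in 4 6; do
    # -sS: quiet but keep errors, -o /dev/null: drop the body,
    # --max-time: cap each probe, -w: print only the HTTP status code.
    if code=$(curl "-$fam" -sS -o /dev/null --max-time 10 \
                   -w '%{http_code}' "$url" 2>/dev/null); then
      echo "IPv$fam: HTTP $code"
    else
      echo "IPv$fam: FAILED (curl exit $?)"
    fi
  done
)

# Usage: check_families https://example.com/
```

If the network you test from has no IPv6 connectivity at all, the IPv6 probe fails regardless of your DNS, so run it from a dual-stack network before blaming the AAAA record.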

Confirm the VPS itself is healthy (CPU, RAM, disk, and network)

If the host is struggling, everything above it becomes misleading. Start with these basics.

Load + CPU saturation: uptime, top or htop
Memory pressure: free -h, and look for swap churn
Disk space: df -h (a full root partition breaks logging, MySQL temp files, and PHP sessions)
Disk I/O wait: iostat -xz 1 (install via sysstat)
Network errors: ip -s link and ss -s

Two patterns show up constantly on production VPSes:

High load with high wa (I/O wait): the CPU isn’t “busy.” It’s waiting on disk. Think slow queries, logging storms, backups running at peak time, or a saturated volume.
Normal load but users still time out: you’re likely hitting a connection bottleneck (PHP-FPM workers, MySQL max connections, or a proxy timeout). This is different from raw resource exhaustion.
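To put a number on "waiting on disk" (and on CPU steal) without installing anything, you can diff two samples of the cpu line in /proc/stat. This is a rough sketch: cpu_wait_steal is a hypothetical helper name, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# cpu_wait_steal LINE1 LINE2: takes two "cpu ..." lines sampled from
# /proc/stat and prints the iowait and steal share of the interval.
cpu_wait_steal() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total[NR] = 0
      for (i = 2; i <= 9; i++) total[NR] += $i   # user through steal
      wa[NR] = $6; st[NR] = $9 }                 # iowait, steal
    END {
      dt = total[2] - total[1]
      if (dt <= 0) { print "no-delta"; exit }
      printf "iowait=%.1f%% steal=%.1f%%\n",
             100 * (wa[2] - wa[1]) / dt, 100 * (st[2] - st[1]) / dt
    }'
}

# Usage on a live host (one-second sample):
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_wait_steal "$a" "$b"
```

A high iowait share points at storage. A high steal share points at the hypervisor, which no amount of tuning inside the VPS will fix.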

If you’re running into host-level limits and need consistent headroom, a HostMyCode VPS plan with dedicated resources can help.

It reduces “noisy neighbor” variables and gives you a cleaner baseline while troubleshooting.

Diagnose “hard down”: is the service listening and reachable?

A true outage usually comes down to one of three things. The web server stopped. The firewall blocked it. Or the process is running but not listening where you think it is.

Is the port open locally? ss -lntp | grep -E ':(80|443)[[:space:]]' (the trailing whitespace match keeps :8080 from counting as :80)
Is the service running? systemctl status nginx or systemctl status apache2
Are you blocked at the firewall? check UFW (ufw status) or nftables rules (nft list ruleset)
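If you want the port check to be scriptable, you can parse the listener table instead of eyeballing it. A minimal sketch, assuming the default column layout of ss -lntH (local address in the fourth column); has_listener is a hypothetical helper name.

```shell
# has_listener PORT: reads `ss -lntH` output on stdin and succeeds only if
# something is bound to exactly that port (so 8080 never matches 80).
has_listener() {
  awk -v port="$1" '
    { n = split($4, part, ":")          # "0.0.0.0:80" or "[::]:443"
      if (part[n] == port) found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Usage:
#   for p in 80 443; do
#     ss -lntH | has_listener "$p" && echo "port $p: listening" \
#                                  || echo "port $p: NOT listening"
#   done
```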

If Nginx/Apache won’t start, don’t guess. Ask the daemon what it’s refusing to load.

Nginx: nginx -t (syntax + include errors)
Apache: apachectl -t and journalctl -u apache2 -n 200 --no-pager

A classic “it worked yesterday” cause is a failed log write because the disk filled up.

If df -h shows 100% usage on / or /var, free space first. Then restart services.

Fix 502/504 errors by separating web, PHP, and upstream timeouts

Most VPS stacks in 2026 look like: Nginx (or LiteSpeed) → PHP-FPM → MySQL/MariaDB.

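A 502 usually means the proxy got no valid response from an upstream (connection refused, reset, or a crashed worker). A 504 means the upstream accepted the request but didn’t answer before the proxy’s timeout.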

That upstream is often PHP-FPM, not the web server itself.

First, pinpoint where the timeout happens:

Nginx error log: typically /var/log/nginx/error.log
PHP-FPM log: varies by distro/version, commonly /var/log/php8.3-fpm.log or journalctl -u php8.3-fpm
App log: WordPress often writes to wp-content/debug.log if enabled
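A quick tally of the Nginx error log often tells you which kind of upstream failure dominates. The sketch below counts a few message fragments that Nginx commonly logs. upstream_error_summary is a hypothetical helper, and the labels are interpretations rather than official categories.

```shell
# upstream_error_summary: reads Nginx error log lines on stdin and counts
# common upstream failure messages.
upstream_error_summary() {
  awk '
    /upstream timed out/               { timeout++ }  # backend too slow
    /Connection refused/               { refused++ }  # backend not listening
    /Resource temporarily unavailable/ { busy++ }     # socket backlog, often a saturated pool
    /prematurely closed/               { closed++ }   # backend died mid-request
    END {
      printf "timed_out=%d refused=%d socket_busy=%d closed_early=%d\n",
             timeout, refused, busy, closed
    }'
}

# Usage: tail -n 2000 /var/log/nginx/error.log | upstream_error_summary
```

Mostly timed_out suggests slow PHP or database work. Mostly refused or closed_early suggests PHP-FPM is down or crashing.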

Then look for saturation and queuing:

PHP-FPM pool status: check pm.max_children and whether you’re hitting it. Pool config is typically under /etc/php/8.3/fpm/pool.d/www.conf (Debian/Ubuntu).
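Active connections: ss -ant | wc -l for a rough total (it counts the header line and sockets in every state, including TIME-WAIT), and ss -ant state established '( sport = :443 )' | head to sample live HTTPS connections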

Mid-incident mitigation is fine, but be careful.

You can temporarily raise pm.max_children only if you have free RAM.

Otherwise you’ll swap or trigger OOM kills. That’s worse than a clean 502.

If the kernel starts killing processes, you’ll see it in dmesg -T | tail -n 50.
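The RAM caveat can be turned into rough arithmetic before you touch pm.max_children. This is a common sizing heuristic, not an official PHP-FPM formula: divide the memory you can spare by the average worker size. fpm_children_estimate is a hypothetical helper, and the 512 MB reserve is an illustrative default.

```shell
# fpm_children_estimate AVAILABLE_MB AVG_WORKER_MB [RESERVE_MB]
# Prints how many PHP-FPM workers fit after keeping RESERVE_MB free
# for the OS, database, and page cache.
fpm_children_estimate() (
  available_mb="$1"; worker_mb="$2"; reserve_mb="${3:-512}"
  usable=$(( available_mb - reserve_mb ))
  if [ "$worker_mb" -le 0 ] || [ "$usable" -le 0 ]; then
    echo 0
    exit 0
  fi
  echo $(( usable / worker_mb ))
)

# Gathering inputs (the process name php-fpm8.3 is an assumption, adjust it):
#   free -m | awk '/^Mem:/ {print $7}'      # "available" column, in MB
#   ps -C php-fpm8.3 -o rss= | awk '{s+=$1} END {if (NR) print int(s/NR/1024)}'
```

RSS overcounts memory shared between workers, so the estimate leans conservative. Treat the result as a ceiling to approach gradually, not a target.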

If you run a reverse proxy in front of an app service, treat timeouts as part of the design.

A too-low proxy_read_timeout (or fastcgi_read_timeout, when Nginx talks to PHP-FPM directly) converts real backend slowness into noisy 504s.

A too-high timeout can tie up connections long enough to starve workers.

If you want a solid Nginx + TLS proxying baseline to compare against, this guide is a good reference: Nginx SSL reverse proxy configuration guide.

Track down “slow site” complaints with one metric: TTFB

Users say “slow.” You need to decide what “slow” means in your stack.

It could be network latency, TLS handshake time, static delivery, or backend generation time.

Time to first byte (TTFB) is the quickest divider.

Low TTFB, slow overall: large pages, heavy images, missing compression, or browser-side work.
High TTFB: backend delay (PHP, database, external APIs, or overloaded workers).

Quick check with curl:

curl -o /dev/null -s -w 'namelookup:%{time_namelookup} connect:%{time_connect} tls:%{time_appconnect} ttfb:%{time_starttransfer} total:%{time_total}\n' https://example.com/

If time_namelookup spikes, you’re likely dealing with DNS (or resolver settings on the VPS).

If TLS time spikes, inspect the certificate chain and OCSP stapling. Also confirm you aren’t serving an outdated intermediate.
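The curl timers are cumulative, so each phase is the difference between neighbouring values. A small sketch that does the subtraction (ttfb_phases is a hypothetical helper; it assumes an HTTPS URL, since time_appconnect is 0 for plain HTTP):

```shell
# ttfb_phases NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
# Converts curl's cumulative timers (seconds) into per-phase milliseconds.
ttfb_phases() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v first="$4" -v total="$5" 'BEGIN {
    printf "dns=%.0fms tcp=%.0fms tls=%.0fms wait=%.0fms download=%.0fms\n",
      dns * 1000,              # resolver time
      (conn - dns) * 1000,     # TCP handshake
      (tls - conn) * 1000,     # TLS handshake
      (first - tls) * 1000,    # request sent until first byte (backend time)
      (total - first) * 1000   # body transfer
  }'
}

# Usage:
#   ttfb_phases $(curl -o /dev/null -s -w \
#     '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#     https://example.com/)
```

The wait figure still includes one network round trip, so compare it against the tcp figure before blaming PHP or the database.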

Database bottlenecks: the quiet cause behind PHP timeouts

On WordPress and many PHP apps, the database is where “random slowness” tends to hide.

One missing index can turn a 20 ms query into a 2-second problem under load.

PHP-FPM backs up, Nginx hits its upstream timeout, and you end up staring at 504s.

Two quick signals:

MySQL/MariaDB CPU: top shows mysqld burning CPU, often with load climbing.
Thread/connection pressure: slow queries cause connection pileups and “too many connections.”

If you don’t already have the slow query log enabled, enable it.

It’s one of the highest-ROI switches you can flip on a VPS.

HostMyCode has a hands-on walkthrough here: MySQL slow query log tutorial.
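For a mid-incident switch-on without editing config files, the slow query log can be enabled at runtime. The sketch below only prints the statements so you can review them before piping to the client. slow_log_sql is a hypothetical helper, and the 1-second default threshold is illustrative.

```shell
# slow_log_sql [SECONDS]: prints SQL that enables the slow query log at
# runtime. Nothing is executed until you pipe the output to a client.
slow_log_sql() (
  threshold="${1:-1}"   # log queries slower than this many seconds
  printf "SET GLOBAL slow_query_log = 'ON';\n"
  printf "SET GLOBAL long_query_time = %s;\n" "$threshold"
  printf "SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';\n"
)

# Usage (assumes client credentials in ~/.my.cnf or socket auth;
# on MariaDB the client binary may be `mariadb`):
#   slow_log_sql 0.5 | mysql
```

Runtime changes made with SET GLOBAL are lost on restart, and the new long_query_time applies only to connections opened afterwards, so persist the settings in the server config once the incident is over.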

Also watch disk behavior while the database is under stress.

If the database volume is near capacity, InnoDB can struggle during temp table creation and write bursts.

For database-heavy sites where you want cleaner boundaries, consider HostMyCode database hosting.

It lets the web tier scale independently, while the database sits on storage tuned for endurance and latency.

DNS, SSL, and redirect loops: outages that look like “server problems”

A lot of “server is down” tickets aren’t CPU or RAM issues.

They’re edge misconfigurations that show up as timeouts or “site can’t be reached.”

DNS points to the wrong IP: common after migrations or when an old AAAA record lingers.
SSL mismatch: wrong certificate served for a hostname, usually because no server block (or vhost) matches that name and the default certificate is returned.
Redirect loops: HTTP→HTTPS rules fighting with WordPress “site URL” settings or proxy headers.

Practical checks:

dig A example.com +short and compare to your VPS public IP.
openssl s_client -connect example.com:443 -servername example.com -showcerts to confirm the served cert chain.
curl -I http://example.com and curl -I https://example.com to see redirect paths.
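The "compare to your VPS public IP" step is easy to script for cutovers. A minimal sketch, where dns_points_to is a hypothetical helper and the addresses are documentation placeholders:

```shell
# dns_points_to TYPE HOST EXPECTED: succeeds if one of the returned records
# is exactly EXPECTED (whole-line match, so 203.0.113.1 != 203.0.113.10).
dns_points_to() {
  dig +short "$1" "$2" | grep -Fxq -- "$3"
}

# Usage:
#   dns_points_to A    example.com 203.0.113.10 || echo "A record mismatch"
#   dns_points_to AAAA example.com 2001:db8::10 || echo "AAAA record mismatch"
```

If the site sits behind a CDN or proxy, the records will legitimately point at the provider rather than the VPS, so a mismatch there is expected.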

If you move projects between servers often, centralizing DNS helps during cutovers.

HostMyCode domains and DNS can reduce “which provider holds this record?” confusion while you’re mid-migration.

Disk-full incidents: the most preventable downtime on a VPS

Disk-full failures cascade fast. Logs stop writing. Databases can’t create temp files. Services refuse to restart.

The upside is that most disk incidents are preventable once you treat log growth and backups as real operational risks.

When space is tight, find the biggest offenders quickly:

du -xhd1 /var | sort -h
journalctl --disk-usage (systemd journal can grow quietly)
find /var/log -type f -size +200M -printf '%s %p\n' | sort -n | tail (size is printed first so the numeric sort actually orders by size)
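Before hunting for large files, it helps to know which filesystem is actually full. A small sketch that filters df -P output against a threshold (full_filesystems is a hypothetical helper; it assumes mount points without spaces):

```shell
# full_filesystems [PERCENT]: reads `df -P` output on stdin and prints the
# mount point and usage of every filesystem at or above PERCENT (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {      # skip the header row
    pct = $5; sub(/%/, "", pct)          # "92%" -> "92"
    if (pct + 0 >= limit + 0) print $6, pct "%"
  }'
}

# Usage: df -P | full_filesystems 85
```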

Then add guardrails.

If logrotate exists but isn’t catching custom app logs, tune it and validate with a dry run.

This guide covers a reliable baseline: log rotation setup with logrotate and systemd.

Email alerts, contact forms, and “my site can’t send mail” incidents

Website mail is deceptively messy. A working PHP mail() call doesn’t mean messages land in inboxes.

In 2026, providers rate-limit aggressively. They also junk mail that lacks aligned DNS and a clean sending identity.

If WordPress password resets and WooCommerce receipts stopped arriving, check:

DNS auth: SPF, DKIM, DMARC
Reverse DNS: the PTR record for your sending IP resolves to the hostname your mail server announces
Queue growth: Postfix queue filling indicates upstream blocks or auth failures
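The DNS side of that list can be checked in one pass. The sketch below only tests that the records exist, not that they are correct or aligned. mail_dns_report is a hypothetical helper, and the DKIM selector is something you have to supply, since it depends on your mail setup.

```shell
# mail_dns_report DOMAIN DKIM_SELECTOR: reports whether SPF, DMARC, and DKIM
# TXT records are published. Presence only, no policy validation.
mail_dns_report() (
  domain="$1"; selector="$2"
  check() {  # check LABEL NAME PATTERN
    if dig +short TXT "$2" | grep -q "$3"; then
      echo "$1: found"
    else
      echo "$1: missing"
    fi
  }
  check SPF   "$domain"                      'v=spf1'
  check DMARC "_dmarc.$domain"               'v=DMARC1'
  check DKIM  "$selector._domainkey.$domain" 'p='
)

# Usage: mail_dns_report example.com default
# Queue depth on a Postfix host: postqueue -p | tail -n 1
```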

HostMyCode’s deliverability checklist is a solid reference: VPS email deliverability checklist for 2026.

If email is business-critical, treat sending as its own system with monitoring and clear DNS ownership.

“It worked on staging” doesn’t survive real-world reputation and rate limits.

Incidents after a migration: the 6 things that break most often

Migrations usually don’t fail because rsync missed files.

They fail because one small dependency stayed pointed at the old environment.

DNS TTL wasn’t lowered: users keep hitting the old IP for hours.
Firewall rules differ: new VPS blocks SMTP, Redis, or SSH from your office IP.
Missing PHP extensions: the app loads but breaks under specific paths.
Different default PHP settings: upload_max_filesize, memory_limit, max_execution_time.
Wrong file ownership/permissions: cache and upload directories become unwritable.
SSL automation not installed: cert renewals fail or the wrong vhost serves traffic.
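The missing-extension case is cheap to catch before cutover by diffing php -m output from both servers. A minimal sketch, assuming you have saved each list to a file; missing_extensions is a hypothetical helper.

```shell
# missing_extensions OLD_LIST NEW_LIST: prints modules present in OLD_LIST
# but absent from NEW_LIST (one module name per line, as `php -m` prints).
missing_extensions() (
  old_sorted=$(mktemp); new_sorted=$(mktemp)
  sort -u "$1" > "$old_sorted"
  sort -u "$2" > "$new_sorted"
  comm -23 "$old_sorted" "$new_sorted"   # lines unique to the old server
  rm -f "$old_sorted" "$new_sorted"
)

# Usage:
#   ssh old-server php -m > old.txt      # old-server is a placeholder host
#   php -m > new.txt
#   missing_extensions old.txt new.txt
```

The same pattern works for php -i output if you want to compare settings such as memory_limit, though that diff is noisier.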

If you want a battle-tested sequence for near-zero downtime moves, use this as a runbook: VPS migration checklist.

Make your troubleshooting repeatable: lightweight monitoring beats guesswork

In the middle of an incident, “what changed?” matters more than “what do I see right now?”

Baseline monitoring belongs in the troubleshooting toolkit because it answers that question quickly.

At minimum, collect:

Host metrics: CPU, load, RAM, swap, disk space, disk I/O wait
Service checks: HTTP 200 on key endpoints, TLS expiry, DNS resolution
Logs: Nginx/Apache errors, PHP-FPM errors, database errors

This HostMyCode post lays out a useful metric set and alert thresholds that map to real outages: Linux VPS performance monitoring in 2026.

You don’t need an elaborate stack to get value.

Alerts for “disk 90%,” “load 2× baseline,” and “HTTP 5xx spike” cover most of the failure modes in this checklist and usually fire before users start reporting problems.
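As an example of how little is needed, a 5xx spike check can be a few lines run from cron. The sketch assumes the default combined access log format, where the status code is the ninth whitespace-separated field; pct_5xx is a hypothetical helper.

```shell
# pct_5xx: reads access log lines on stdin and prints the share of 5xx
# responses among lines that carry a valid HTTP status in field 9.
pct_5xx() {
  awk '
    $9 ~ /^[1-5][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      if (total == 0) { print "no-requests"; exit }
      printf "5xx=%d/%d (%.1f%%)\n", errors, total, 100 * errors / total
    }'
}

# Usage: tail -n 5000 /var/log/nginx/access.log | pct_5xx
```

What counts as a spike depends on your baseline, so record the normal value for a few days before picking an alert threshold.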

Summary: the order that saves time (and prevents repeat incidents)

A good VPS hosting troubleshooting checklist feels boring because it’s consistent.

Confirm VPS health. Verify services are reachable. Then isolate 502/504s across web, PHP, and database.

Only after that should you start tweaking configs.

If every incident feels ambiguous because the underlying platform is unpredictable, consider moving to infrastructure with clearer baselines and reachable support.

Start with a managed VPS hosting plan for production sites that need fast response, or deploy on a HostMyCode VPS when you want full control without sacrificing reliability.


FAQ

What’s the fastest way to tell if a 502 is Nginx or PHP-FPM?

Check /var/log/nginx/error.log for “upstream” errors. Then check journalctl -u php8.3-fpm (or your PHP-FPM unit) for worker exhaustion, crashes, or slow script warnings.

My VPS load is high, but CPU usage is low. Why?

Look for high I/O wait (wa) and slow storage. Use iostat -xz 1. Database writes, backups, or log spikes can inflate load while the CPU mostly waits on disk.

How do I confirm DNS isn’t causing “random” outages?

Run dig A and dig AAAA from two networks. Confirm both point to valid, reachable IPs. A stale AAAA record is a common cause of intermittent failures for IPv6-preferred clients.

What should I do first if the disk is full?

Free space safely (large logs, old backups, runaway journals), then restart affected services. After recovery, fix the cause with logrotate rules, retention policies, and monitoring alerts.

When is it time to move from a VPS to a dedicated server?

If you consistently hit CPU/RAM limits, need higher sustained I/O, or want strict isolation for multiple high-traffic sites, a dedicated server removes contention and simplifies capacity planning.