
VPS disk space troubleshooting: find what’s filling your Linux server and fix it safely (2026)

VPS disk space troubleshooting for Linux: pinpoint what’s filling disk, clean safely, and prevent repeats with checks and rollback.

By Anurag Singh
Updated on Apr 13, 2026
Category: Blog

Disks don’t “mysteriously” fill up. On a Linux VPS, it’s usually one of five culprits: logs, container layers, package caches, runaway app data, or orphaned files sitting under a mounted path you assumed was somewhere else. VPS disk space troubleshooting comes down to proving which one it is, then removing the right data without taking your service down.

This is a repeatable, low-risk playbook for developers and sysadmins. You’ll follow a simple flow: confirm what’s actually failing, identify the biggest consumers with hard evidence, clean with guardrails, verify, and then prevent the same incident from returning.

What you’ll accomplish (and what you won’t)

By the end, you’ll be able to:

  • Confirm whether you’re out of disk space or out of inodes (they fail differently).
  • Identify the top 3 space consumers with evidence (paths, sizes, timestamps).
  • Clean safely with a rollback mindset (move/rotate first, delete second).
  • Prove you fixed it (before/after snapshots, service checks).
  • Prevent repeats using log policy, container hygiene, and alerting.

You won’t get a one-command “clean everything” script. Those are how you delete the wrong thing at 2 AM.

Prerequisites and a safe starting posture

You need SSH access with sudo. Examples assume Debian 12 or Ubuntu 24.04 LTS, but the workflow holds on most modern distros.

  • Ability to run commands as root: sudo -i
  • Basic familiarity with systemd and journals
  • If this VPS is production, create a snapshot first (5 minutes now saves hours later)

If you want the simplest rollback lever for server-level mistakes, use infrastructure that supports snapshots and fast restores. A HostMyCode VPS is a solid base for this kind of ops work: predictable disk performance, full root access, and the option to scale storage when cleanup isn’t enough.

Step 1: Confirm the failure mode (space vs inodes)

“No space left on device” can mean two different things. Check both disk usage and inode usage so you don’t chase the wrong problem.

df -hT

df -i

What you’re looking for:

  • Space problem: a filesystem at 95–100% in df -hT.
  • Inode problem: IUse% near 100% in df -i (millions of tiny files).

Expected output (example):

Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/vda1      ext4    80G   79G  200M 100% /

Filesystem     Inodes  IUsed   IFree IUse% Mounted on
/dev/vda1      5.2M   420K    4.8M    9%  /

That’s a pure space issue. If inodes are at 100%, jump ahead to the inode pitfalls section and focus on file counts, not file sizes.

Step 2: Identify which mount is filling (and catch “hidden under mount” mistakes)

A common failure pattern: you believe the app is writing to /data (a big volume), but the mount didn’t come up and everything is landing on /. Confirm mounts first, then look at usage.

lsblk -f

findmnt -rno TARGET,SOURCE,FSTYPE,OPTIONS | column -t

df -hT | sed -n '1p;/\s\/$/p;/\s\/var\b/p;/\s\/home\b/p'

Red flags:

  • A directory like /srv or /data exists but isn’t listed in findmnt.
  • /var sits on the root filesystem and is nearly full (classic logs + containers situation).

Step 3: Get a fast “top offenders” view without freezing the box

Start wide, then narrow. On a busy VPS, don’t blindly crawl everything and hope for the best.

3.1. Top-level directory sizes (quick)

sudo du -xhd1 / 2>/dev/null | sort -h

-x keeps you on the same filesystem, so other mounts don’t distort the result.

3.2. Drill into the biggest directory (repeat)

sudo du -xhd1 /var 2>/dev/null | sort -h

Common outcomes:

  • /var/log huge → log growth or journald retention.
  • /var/lib/docker huge → images, layers, build cache.
  • /var/lib/postgresql huge → DB growth, WAL retention, backups.
  • /home huge → user dumps, artifact caches, CI workspaces.

Step 4: Find large files and recent growth (evidence before cleanup)

Once you’ve narrowed the blast radius (for example, /var), pull a short list of what’s big and what’s growing.

4.1. Largest files (top 30)

sudo find /var -xdev -type f -size +200M -printf '%s %TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null \
  | sort -nr \
  | head -n 30 \
  | numfmt --to=iec --field=1

4.2. Recently modified big files (last 48h)

sudo find /var -xdev -type f -mtime -2 -size +100M -printf '%TY-%Tm-%Td %TH:%TM %s %p\n' 2>/dev/null \
  | sort -r \
  | head -n 40

Your goal is simple: answer which path is growing right now. That answer drives both cleanup and prevention.
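One way to answer that directly is to sample per-directory usage twice and print only what changed. A minimal sketch, assuming GNU coreutils (`du`, `sort`, `join`, `awk`); the `du_delta` helper name and defaults are illustrative, and it breaks on paths containing spaces:

```shell
# du_delta: snapshot per-directory usage of a path twice, INTERVAL seconds
# apart, and print only entries whose size changed (delta in KB).
du_delta() {
    target="$1"; interval="${2:-60}"
    before=$(mktemp); after=$(mktemp)
    du -xkd1 "$target" 2>/dev/null | sort -k2 > "$before"
    sleep "$interval"
    du -xkd1 "$target" 2>/dev/null | sort -k2 > "$after"
    # join the two snapshots on the path column, print non-zero deltas
    join -j 2 "$before" "$after" | awk '$3 != $2 {printf "%+d KB  %s\n", $3 - $2, $1}'
    rm -f "$before" "$after"
}
# Example: watch /var for one minute
# du_delta /var 60
```

A path that shows up here with a large positive delta is your live offender, and a much stronger lead than a static size listing.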

Step 5: Clean the common culprits safely (logs, journals, caches, containers)

Deleting data is trivial. Deleting the right data, in the right order, is what keeps you out of a second outage. Confirm what you’re touching, reduce first where you can, and only then delete.

5A) journald: cap retention instead of nuking logs

Systemd journals can grow quietly on noisy hosts. Check the current footprint:

journalctl --disk-usage

Example output:

Archived and active journals take up 3.8G in the file system.

To reclaim space without wiping everything, vacuum down to a cap (example: keep 500 MB):

sudo journalctl --vacuum-size=500M

Then make the limit persistent by editing /etc/systemd/journald.conf:

sudo nano /etc/systemd/journald.conf

Add or adjust these lines under the [Journal] section:

[Journal]
SystemMaxUse=800M
SystemKeepFree=2G
MaxRetentionSec=7day

Apply the change and confirm:

sudo systemctl restart systemd-journald
journalctl --disk-usage

If you already manage logs deliberately, align this with your broader retention rules in VPS log rotation best practices in 2026.

5B) /var/log: rotate, compress, and remove truly stale archives

Start by finding the worst offenders:

sudo du -sh /var/log/* 2>/dev/null | sort -h | tail -n 20

If Nginx/Apache/app logs are growing fast, force a rotation:

sudo logrotate -f /etc/logrotate.conf

Then confirm what changed:

sudo ls -lh /var/log | head
sudo du -sh /var/log

If you’re sitting on multi-GB .gz archives from months ago, remove old compressed logs carefully (example: older than 30 days). List first:

sudo find /var/log -type f \( -name '*.gz' -o -name '*.1' -o -name '*.old' \) -mtime +30 -print

Review the list, then delete:

sudo find /var/log -type f \( -name '*.gz' -o -name '*.1' -o -name '*.old' \) -mtime +30 -delete

5C) Package caches: low risk, quick win

On Debian/Ubuntu:

sudo apt-get clean
sudo apt-get autoremove --purge -y

On RHEL-family (Rocky/Alma):

sudo dnf clean all
sudo dnf autoremove -y

On long-lived servers, this often buys back hundreds of MB to a few GB with minimal risk.

5D) Docker/Podman: prune with intent (and avoid deleting what’s running)

If you run containers, /var/lib/docker can swell from old images and build cache. Start with a quick inventory:

sudo docker system df

Expected output looks like:

TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          18        4         22.4GB    17.1GB (76%)
Containers      6         3         1.2GB     700MB (58%)
Local Volumes   12        5         9.6GB     5.2GB (54%)
Build Cache     148       0         6.8GB     6.8GB

Safer cleanup order:

  1. Prune build cache (usually low risk):
sudo docker builder prune -af
  2. Remove unused images (not referenced by any container):
sudo docker image prune -af
  3. Remove stopped containers and unused networks:
sudo docker container prune -f
sudo docker network prune -f

Be careful with volumes. “Dangling” doesn’t always mean “safe.” List candidates first:

sudo docker volume ls -qf dangling=true

If you recognize them as safe to delete:

sudo docker volume prune -f

Step 6: The sneaky case — space is “used” but you can’t find it

If df says the disk is full but du can’t account for it, suspect a deleted file that’s still open. The filename is gone, but the process is still holding the blocks.

Find open-but-deleted files:

sudo lsof +L1 | head -n 50

You might see something like:

nginx   1234  www-data  10w  REG  253,0  2147483648  0 /var/log/nginx/access.log (deleted)

Fix options:

  • Restart the service to close the file (often easiest): sudo systemctl restart nginx
  • Truncate via /proc if you can’t restart (more careful):
sudo sh -c ': > /proc/1234/fd/10'
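Before choosing, it helps to know how much space those deleted files are actually pinning. A sketch that totals them from lsof's field output (`-F sn` emits size as `s…` lines and name as `n…` lines; the `sum_deleted` helper is hypothetical):

```shell
# sum_deleted: read `lsof +L1 -F sn` output on stdin and total the bytes
# still pinned by deleted-but-open files.
sum_deleted() {
    awk '
        /^s/ { sz = substr($0, 2) }            # s<bytes>: remember this file size
        /^n/ && /\(deleted\)/ { total += sz }  # n<name> ... (deleted): count it
        END { printf "%.2f GB pinned by deleted open files\n", total / (1024 * 1024 * 1024) }
    '
}
# Usage (needs root to see every process):
# sudo lsof +L1 -F sn | sum_deleted
```

If that total roughly matches the gap between df and du, you've found your answer and can pick a restart order accordingly.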

Then re-check:

df -h /
journalctl --disk-usage 2>/dev/null || true

Step 7: Verification checklist (prove you’re safe again)

Disk cleanup isn’t done until you’ve confirmed two things: the filesystem has headroom, and the workload is healthy.

  • Disk headroom:
df -hT /
  • Critical services running:
sudo systemctl --failed
sudo systemctl status nginx 2>/dev/null | sed -n '1,8p' || true
sudo systemctl status docker 2>/dev/null | sed -n '1,8p' || true
  • Application-level check (example on port 8088):
curl -fsS http://127.0.0.1:8088/health || echo "health check failed"

If you run databases, do a quick “can I still read/write?” sanity check. For PostgreSQL:

sudo -u postgres psql -tAc "SELECT now();"

Common pitfalls (the stuff that bites experienced people)

  • Deleting logs without fixing rotation: you get back 10 GB today, then lose it again next week.
  • Pruning container volumes blindly: “unused” is not the same as “unimportant.” Volumes may hold your only copy of uploaded files.
  • Filling root because a mount failed: your app writes to /data/uploads, but /data isn’t mounted after reboot.
  • Inodes exhausted: millions of tiny cache files can block writes even if df -h looks fine.
  • Large deleted files still open: you delete a 5 GB log and disk usage doesn’t drop until a restart.

If you want a broader safety net for incidents like this, keep a runbook close. The workflow in VPS Incident Response Checklist (2026) fits disk incidents well: isolate, observe, then change.

Rollback plan: how to undo an over-aggressive cleanup

Rollback depends on what you touched. These options come up often in real incidents:

  • If you removed packages: reinstall from your package manager history (Debian/Ubuntu: /var/log/apt/history.log).
  • If you rotated logs and an app expected a path: restart the service so it recreates the file, then confirm ownership and permissions.
  • If you pruned images and deploys fail: pull/build again; pin versions in your deploy pipeline to avoid “latest” surprises.
  • If you deleted the wrong directory: restore from backup or snapshot.

This is where automated backups earn their keep. If you don’t already practice restores, the approach in VPS Backup Strategy 2026 gets you to “I can actually recover this” quickly.

Prevention: keep disk incidents from coming back

After you’ve recovered, spend 20 minutes making sure the next alert doesn’t turn into the same fire drill.

Set a disk alert at 80/90% (not 99%)

Alerts at 99% show up after services have started failing. If you want vendor-neutral monitoring patterns, see VPS monitoring with OpenTelemetry Collector and add a filesystem usage rule.
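If you don't have a monitoring stack yet, even a cron-driven check beats nothing. A minimal sketch assuming GNU df; the `disk_alert` name, thresholds, and cron entry are illustrative:

```shell
# disk_alert: print a WARN/CRIT line for any real filesystem at or above
# the given space-usage thresholds, parsed from POSIX `df -P` output.
disk_alert() {
    warn="${1:-80}" crit="${2:-90}"
    df -P -x tmpfs -x devtmpfs 2>/dev/null | awk -v w="$warn" -v c="$crit" '
        NR > 1 {
            gsub(/%/, "", $5)                      # Capacity column, e.g. "83%"
            if ($5 + 0 >= c)      print "CRIT", $5 "%", $6
            else if ($5 + 0 >= w) print "WARN", $5 "%", $6
        }'
}
# Hypothetical /etc/cron.d entry: check every 15 minutes, cron mails any output
# */15 * * * * root /usr/local/bin/disk_alert 80 90
```

Because cron emails non-empty output by default, a silent run means healthy disks and any line means someone gets paged before 99%.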

Put hard limits where logs can explode

  • Cap journald size (as shown earlier).
  • Ensure logrotate compresses and keeps a sane retention window per service.
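As a concrete example, a per-service logrotate drop-in might look like this (the app name, path, and retention counts are illustrative, not a recommendation for every service):

```
# /etc/logrotate.d/myapp  (hypothetical app; adjust path and retention)
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

`copytruncate` suits apps that never reopen their log file; if the app handles a signal to reopen logs, prefer `create` plus a `postrotate` script instead.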

Containers: schedule hygiene for build cache and unused images

On CI/build-heavy hosts, build cache is a frequent offender. A weekly systemd timer that runs docker builder prune -af is often enough. Write it down, and keep an eye on reclaimable space so you can adjust.
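Sketched as systemd units (the unit names and schedule are illustrative; assumes Docker is installed at /usr/bin/docker):

```
# /etc/systemd/system/docker-prune.service
[Unit]
Description=Prune Docker build cache

[Service]
Type=oneshot
ExecStart=/usr/bin/docker builder prune -af

# /etc/systemd/system/docker-prune.timer
[Unit]
Description=Weekly Docker build cache prune

[Timer]
OnCalendar=Sun 03:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with sudo systemctl enable --now docker-prune.timer, and verify the schedule with systemctl list-timers docker-prune.timer.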

Backups and snapshots: treat them as a disk consumer

Local backups stored on the same disk they protect will fill it eventually. If you must stage locally, enforce retention (count-based or age-based) and verify restores. Pair weekly images with daily file backups; the workflow in VPS snapshot backup automation is a solid baseline.

Next steps (a short, realistic plan)

  1. Write down what filled the disk (path + process) and keep it in your ops notes.
  2. Add a threshold alert at 80% and 90% for / and any data volumes.
  3. Implement one retention fix (journald cap or logrotate rule) the same day, while the incident is fresh.
  4. Schedule a restore test if you had to delete anything risky to recover.

Summary: keep cleanup boring, evidence-driven, and reversible

Good disk recovery is calm work. Confirm whether you’re dealing with space or inodes, identify the real consumers, clean in a controlled order, and verify the workload. Then lock in retention and alerting so you don’t repeat the cycle.

If you’re running production services and want fewer late-night surprises, consider putting them on managed VPS hosting or a right-sized HostMyCode VPS. Either way, you get a stable platform where expanding disk, taking snapshots, and rolling back changes isn’t a fight.

If your server regularly runs tight on storage—containers, logs, CI artifacts, or databases—start with a VPS that’s sized correctly and easy to snapshot. Use a HostMyCode VPS, or offload the operational risk with managed VPS hosting for production workloads.

FAQ

Why does df show 100% but du can’t find the usage?

Usually because a large file was deleted but a running process still holds it open. Use lsof +L1, then restart the service or truncate via /proc/<pid>/fd/<fd>.

Is it safe to run docker system prune -a on a production VPS?

It’s safe only if you fully understand what will be removed. It can delete images you rely on for quick rollbacks. Prefer targeted pruning: build cache first, then unused images, then stopped containers. Treat volumes with extra caution.

What disk usage percentage should trigger an alert in 2026?

Alert at 80% (early warning) and 90% (urgent). Waiting until 95–99% often means services fail before you can respond.

How do I handle inode exhaustion?

Use df -i to confirm, then locate directories with huge file counts (often caches or mail queues). Deleting millions of tiny files can take time; stop the generating service first, then remove in batches and consider moving that workload to a separate filesystem.
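A sketch for the "locate" step: rank immediate subdirectories by file count rather than size. The `count_files` helper is hypothetical, and a full `find` over millions of files can itself be slow, so start one level at a time:

```shell
# count_files: rank the immediate subdirectories of a path by number of files,
# highest first -- inode exhaustion usually traces to one of the top entries.
count_files() {
    root="${1:?usage: count_files /path}"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        printf '%s %s\n' "$(find "$d" -xdev -type f 2>/dev/null | wc -l)" "$d"
    done | sort -nr | head -n 15
}
# Example (run as root so no subtree is skipped):
# count_files /var
```

Repeat it on the top result, the same narrowing loop as Step 3, but counting files instead of summing bytes.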

Should I store backups on the same VPS disk?

Only as a temporary staging step with strict retention. Long-term backups belong off-host (object storage, another VPS, or a managed backup target) so disk incidents don’t wipe both the server and its backups.
