Monitor Linux Server Health Using Built-in Tools

When “it’s fine” quietly turns into an incident

Most Linux servers do not fail loudly. They drift. A little more traffic this month. A new background job next quarter. A database that grows just enough to change I/O patterns. A kernel update that subtly shifts scheduling behavior. And then one day, a routine deploy takes longer than usual, SSH feels sluggish, and the team starts asking the question we all recognize: “Is the server healthy, or are we just getting lucky?”

Linux already ships with the tools we need to answer that question—without agents, without heavyweight stacks, and without introducing new moving parts. In this guide, we will build a practical, production-grade approach to Linux performance monitoring using built-in CLI tools and lightweight, free utilities. We will focus on repeatability, security, and persistence across reboots, so the same approach works at home, in a professional environment, and at enterprise scale.

Prerequisites and system assumptions

Before we touch commands, we will align on assumptions. This matters because performance tooling is only useful when it is consistently available, consistently executed, and consistently interpreted.

  • Platform: Linux. Commands are written for modern distributions using systemd (common on Ubuntu, Debian, RHEL, Rocky, AlmaLinux, SUSE).
  • Access: We need a shell with sudo privileges. We will avoid running interactive tools as root unless required.
  • State: This works on both clean installs and long-lived servers. If the server is extremely minimal, we will install a small set of packages (still lightweight & free).
  • Security posture: We will not open any inbound ports. Everything here is local CLI. Firewall changes are not required, and that is intentional.
  • Operational goal: We will cover two modes:
    • Interactive triage: When something feels off and we need answers now.
    • Baseline + persistence: When we want recurring snapshots and logs that survive reboots and help us compare “today vs last week”.

We will also keep a simple rule: we will measure first, change second. Performance work becomes risky when we “optimize” without evidence.

Step 1: Establish a clean baseline of the host

Before we look at CPU graphs or disk queues, we will capture identity and context: kernel, uptime, load, memory pressure, and what the system thinks it is doing. This baseline is what we will compare against later.

Capture OS, kernel, uptime, and load

We are going to print OS release details, kernel version, uptime, and load averages. This gives us immediate context: whether we are dealing with a recent kernel, how long the host has been running, and whether load is persistently high or just a spike.

set -eu

echo "=== OS Release ==="
if [ -r /etc/os-release ]; then
  cat /etc/os-release
else
  echo "/etc/os-release not found"
fi

echo
echo "=== Kernel ==="
uname -a

echo
echo "=== Uptime and Load ==="
uptime

We now have a baseline snapshot. The key output to keep in mind is the load average from uptime. Load is not “CPU usage”; it counts runnable tasks plus tasks in uninterruptible sleep (typically waiting on I/O). High load with low CPU usage often points to I/O wait or blocked tasks.
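
As a quick sanity check, we can compare the load averages directly against the core count; a sustained 1-minute load well above the number of cores usually means work is queuing. This is a minimal sketch that reads /proc/loadavg in POSIX shell.

cores="$(nproc)"
read -r one five fifteen rest < /proc/loadavg
echo "cores=${cores} load1=${one} load5=${five} load15=${fifteen}"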

Confirm CPU, memory, and swap posture

Next, we will check CPU topology, memory availability, and swap usage. This helps us quickly spot common failure modes: memory pressure, swap thrashing, or a host that is simply undersized for its workload.

echo "=== CPU Summary ==="
LC_ALL=C lscpu | sed -n '1,30p'

echo
echo "=== Memory Summary ==="
free -h

echo
echo "=== Swap Devices ==="
swapon --show || true

We now know how many CPUs we have, how memory is being used, and whether swap is active. If swap is heavily used and memory is tight, we should expect latency and unpredictable performance under load.
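
As a rough heuristic, we can express MemAvailable as a percentage of MemTotal; a persistently low value (for example under 10–15%) usually precedes swapping and latency problems. This assumes a kernel that exposes MemAvailable in /proc/meminfo, which any modern distribution does.

awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {if (t>0) printf "MemAvailable: %.1f%% of MemTotal\n", (a/t)*100}' /proc/meminfo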

Step 2: Install lightweight, standard performance tools

Many servers already have the basics, but production environments often run minimal images. We will install a small, widely accepted set of packages that remain lightweight & free: procps (often already present), sysstat (for sar, iostat, pidstat), and iotop (optional but useful for I/O attribution). We will also include lsof for quick “what is holding this file/port” checks.

We are going to detect the package manager and install the tools safely. This does not open ports, does not add remote dependencies, and does not change firewall rules.

set -eu

if command -v apt-get >/dev/null 2>&1; then
  sudo apt-get update
  sudo apt-get install -y procps sysstat iotop lsof
elif command -v dnf >/dev/null 2>&1; then
  sudo dnf install -y procps-ng sysstat iotop lsof
elif command -v yum >/dev/null 2>&1; then
  sudo yum install -y procps-ng sysstat iotop lsof
elif command -v zypper >/dev/null 2>&1; then
  sudo zypper install -y procps sysstat iotop lsof
else
  echo "No supported package manager found (apt/dnf/yum/zypper)." 1>&2
  exit 1
fi

At this point, the host has the standard CLI tooling we will use for CPU, memory, disk, and process-level attribution. The most important addition is sysstat, because it enables historical performance snapshots via sar.

Verify the tools are available

We are going to confirm that the expected binaries exist. This avoids confusion later when a command is missing mid-incident.

command -v top
command -v vmstat
command -v iostat
command -v pidstat
command -v sar
command -v iotop
command -v lsof

If any command is missing, we should re-check the package installation step and confirm the distribution’s package names.
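
A small loop makes this check repeatable and copy/paste friendly; it prints only the tools that are missing.

for bin in top vmstat iostat pidstat sar iotop lsof; do
  command -v "$bin" >/dev/null 2>&1 || echo "MISSING: $bin"
done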

Step 3: Turn on persistent performance history with sysstat

Interactive tools are great, but they only show “now”. In production, the real question is often “what happened at 02:13?” or “did this start after the deploy?” For that, we want lightweight history that survives reboots. sysstat provides this through scheduled data collection and the sar command.

Enable and start sysstat collection

We are going to enable the sysstat service so it starts on boot, and start it immediately. This is a low-risk change: it collects metrics locally and writes to local files. No inbound network exposure is introduced.

set -eu

if systemctl list-unit-files | grep -q '^sysstat.service'; then
  sudo systemctl enable --now sysstat
  sudo systemctl status sysstat --no-pager
else
  echo "sysstat.service not found. On some distros, sysstat is driven by timers/cron. Continuing." 1>&2
fi

If the service exists, it is now enabled across reboots and running. If it does not exist, the distribution may use a timer unit or cron jobs for sysstat; we will verify data collection next.
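
Where no sysstat.service exists, we can quickly check whether a timer unit or a cron job is doing the scheduling instead. The exact unit and file names vary by distribution, so treat the paths below as examples rather than guarantees.

systemctl list-timers --all 2>/dev/null | grep -i sysstat || echo "no sysstat timers found"
ls -l /etc/cron.d/sysstat 2>/dev/null || echo "no /etc/cron.d/sysstat"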

Verify sysstat is collecting data

We are going to check whether sar can read today’s data. This is the simplest proof that collection is working.

set -eu

# Show CPU usage summary for today (may be sparse right after enabling)
sar -u | tail -n 20 || true

# Show where sysstat stores its daily files
if [ -d /var/log/sysstat ]; then
  ls -lah /var/log/sysstat | tail -n 20
elif [ -d /var/log/sa ]; then
  ls -lah /var/log/sa | tail -n 20
else
  echo "No sysstat log directory found yet. Wait a few minutes and check again." 1>&2
fi

If we see sar output and daily files under /var/log/sysstat or /var/log/sa, we have persistent history. If not, we should wait a few minutes and re-check; collection is periodic.

Step 4: Real-time triage flow that works under pressure

When a server feels slow, we need a calm sequence that narrows the problem without guesswork. We will follow a consistent order: CPU and run queue, memory pressure, disk latency, then process attribution. This order prevents us from chasing symptoms.

Check run queue, CPU saturation, and context switching

We are going to use vmstat because it is fast, always available, and tells a story: run queue (r), blocked tasks (b), swap activity, I/O, and CPU breakdown including I/O wait (wa).

vmstat 1 10

We now have a 10-second sample. What matters:

  • r consistently higher than CPU cores suggests CPU contention.
  • b above zero suggests tasks blocked on I/O.
  • si/so non-zero suggests swapping, often a performance killer.
  • wa high suggests I/O wait, not “CPU is busy doing work”.

Identify the top CPU and memory consumers

Next, we will use top in batch mode so the output is copy/paste friendly for incident notes. This avoids interactive key presses and gives us a stable snapshot.

top -b -n 1 | sed -n '1,80p'

We now see which processes are consuming CPU and memory at this moment. If CPU is high but the process list is not obvious, we will move to per-process statistics with pidstat.

Attribute CPU usage with pidstat

We are going to sample per-process CPU usage over a short interval. This is especially useful when top is noisy or when short-lived processes spike and disappear.

pidstat -u 1 10

This shows which PIDs are consuming CPU over the sample window. If we see high system CPU (%system) rather than user CPU, we may be dealing with kernel overhead, networking, or storage drivers.
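
When %system dominates, a per-CPU view helps confirm whether the overhead is spread evenly or pinned to a few cores (for example, interrupt or softirq handling concentrated on one CPU). mpstat ships with sysstat, which we installed earlier.

mpstat -P ALL 1 5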

Check memory pressure and OOM risk

We are going to inspect memory and the kernel’s view of pressure. free gives a human-friendly view, while /proc/meminfo and dmesg help confirm whether the kernel has been under stress or killing processes.

free -h
echo
grep -E 'MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Writeback' /proc/meminfo
echo
dmesg -T | grep -Ei 'oom|out of memory|killed process' | tail -n 30 || true

We now know whether memory is genuinely scarce (MemAvailable low), whether swap is being consumed, and whether the kernel has invoked the OOM killer. If OOM events exist, performance issues are often secondary to memory sizing or runaway processes.
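
On systemd hosts, the journal often retains kernel messages longer than the dmesg ring buffer, so it is worth checking there as well if dmesg looks clean but we still suspect past OOM events.

sudo journalctl -k --no-pager | grep -Ei 'out of memory|oom|killed process' | tail -n 30 || true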

Measure disk latency and saturation

Disk problems often masquerade as “CPU is high” because tasks pile up waiting for I/O. We are going to use iostat to measure per-device utilization and latency indicators.

iostat -xz 1 10

We now have extended disk stats. Key fields:

  • %util near 100% suggests the device is saturated.
  • await (and r_await/w_await, where reported) indicates request latency; high values correlate with slow applications.
  • r/s, w/s show IOPS pressure.

Find which processes are driving I/O

When disks are busy, we need attribution. We are going to use iotop to identify which processes are reading/writing the most. This requires elevated privileges to see all processes.

sudo iotop -o -b -n 5

This prints the top I/O consumers over a few iterations. If a backup job, log rotation, or database compaction is dominating I/O, we will see it here and can decide whether to reschedule, throttle, or tune.
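
If iotop is unavailable or restricted on a host, pidstat can provide similar per-process I/O attribution (it relies on the kernel's per-task I/O accounting, so numbers may be partial on unusual kernels).

pidstat -d 1 5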

Confirm filesystem capacity and inode health

Performance incidents sometimes start as “disk full” or “no inodes left”, which then cascades into application failures and high load. We are going to check both capacity and inode usage.

df -hT
echo
df -ihT

We now know whether any filesystem is near capacity or inode exhaustion. If either is close to 100%, we should treat it as an urgent reliability issue, not just performance.
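
When a filesystem is close to full, a quick way to see where the space went is to list the largest top-level directories on that mount. This sketch assumes GNU coreutils; adjust the path to the affected mount point.

sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -n 15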

Check network sockets and retransmission hints

Even without deep packet analysis, we can quickly validate whether the server is overwhelmed by connections or if key services are listening as expected. We are going to list listening sockets and summarize TCP stats.

ss -lntup
echo
ss -s

We now see what is listening, which processes own the ports, and a high-level socket summary. If established connections are unexpectedly high, we may be dealing with traffic spikes, connection leaks, or upstream retries.
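
To see how connections break down by TCP state (useful for spotting piles of TIME-WAIT or CLOSE-WAIT, which often point to connection churn or leaks), we can summarize the ss output.

ss -ant | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn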

Step 5: Build a repeatable “health snapshot” script for consistent evidence

In real operations, the hardest part is not running commands—it is running the same commands every time and keeping the output somewhere safe. We will create a small local script that captures a timestamped snapshot. This stays lightweight, requires no inbound access, and is easy to audit.

Create a root-owned snapshot script with safe permissions

We are going to write a script to /usr/local/sbin, owned by root, and readable/executable by root. This reduces tampering risk and keeps operational tooling in a standard location.

set -eu

# /usr/local/sbin is a standard system path; keep the default 0755 mode and restrict only our log directory
sudo install -d -m 0755 /usr/local/sbin
sudo install -d -m 0750 /var/log/healthsnap

sudo tee /usr/local/sbin/healthsnap.sh >/dev/null <<'EOF'
#!/bin/sh
set -eu

TS="$(date -u +%Y%m%dT%H%M%SZ)"
OUT="/var/log/healthsnap/health-${TS}.log"

{
  echo "=== Timestamp (UTC) ==="
  date -u
  echo

  echo "=== Host ==="
  hostnamectl 2>/dev/null || hostname
  echo

  echo "=== Uptime / Load ==="
  uptime
  echo

  echo "=== CPU / Memory ==="
  free -h
  echo
  vmstat 1 5
  echo

  echo "=== Top (snapshot) ==="
  top -b -n 1 | sed -n '1,80p'
  echo

  echo "=== Disk (df) ==="
  df -hT
  echo
  df -ihT
  echo

  echo "=== Disk (iostat sample) ==="
  if command -v iostat >/dev/null 2>&1; then
    iostat -xz 1 3
  else
    echo "iostat not installed"
  fi
  echo

  echo "=== Network (ss) ==="
  ss -s
  echo
  ss -lntup
  echo

  echo "=== Recent kernel warnings (dmesg tail) ==="
  dmesg -T | tail -n 80 || true
  echo

} > "${OUT}"

chmod 0640 "${OUT}"
echo "Wrote ${OUT}"
EOF

sudo chmod 0750 /usr/local/sbin/healthsnap.sh
sudo chown root:root /usr/local/sbin/healthsnap.sh
sudo chown root:root /var/log/healthsnap

We now have a consistent snapshot tool that writes logs to /var/log/healthsnap. The script is root-owned, and logs are not world-readable, which is important because process lists and socket output can contain sensitive operational details.

Verify the snapshot script works

We are going to run the script once and confirm that a log file is created and readable by privileged operators.

sudo /usr/local/sbin/healthsnap.sh
sudo ls -lah /var/log/healthsnap | tail -n 5
sudo sh -c 'tail -n 40 "$(ls -1t /var/log/healthsnap/health-*.log | head -n 1)"'

We now have a timestamped health log. This becomes our “evidence artifact” during incidents and a baseline reference during calm periods.
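
A rough way to eyeball drift is to diff the two most recent snapshots. The output is noisy because timestamps and counters change on every run, but sudden jumps in memory, disk, or socket counts stand out quickly. This needs at least two snapshot files to exist.

sudo sh -c 'diff $(ls -1 /var/log/healthsnap/health-*.log | tail -n 2) | head -n 60' || true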

Step 6: Schedule snapshots with systemd for persistence across reboots

Manual snapshots are helpful, but production reliability improves when evidence is collected automatically. We will schedule the snapshot script using systemd timers. This is lightweight, local-only, and survives reboots cleanly.

Create a systemd service unit

We are going to define a oneshot service that runs the snapshot script. We will also apply basic hardening so the service has fewer privileges than a general-purpose shell.

set -eu

sudo tee /etc/systemd/system/healthsnap.service >/dev/null <<'EOF'
[Unit]
Description=NIILAA Linux health snapshot (local)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/healthsnap.sh
User=root
Group=root

# Basic hardening (keep compatible with common distros)
NoNewPrivileges=true
PrivateTmp=true
ProtectHome=true
ProtectControlGroups=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectSystem=full
ReadWritePaths=/var/log/healthsnap
EOF

We now have a service definition that can run on demand or via a timer. The hardening settings reduce the blast radius if the script is ever modified incorrectly.

Create a systemd timer unit

We are going to schedule the service to run every 5 minutes. This cadence is frequent enough for incident reconstruction without generating excessive logs on most systems.

set -eu

sudo tee /etc/systemd/system/healthsnap.timer >/dev/null <<'EOF'
[Unit]
Description=Run NIILAA Linux health snapshot every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=30s
Persistent=true

[Install]
WantedBy=timers.target
EOF

We now have a timer that will trigger snapshots regularly and catch up after reboots (Persistent=true), which is exactly what we want for production continuity.

Enable and verify the timer

We are going to reload systemd, enable the timer, and verify it is scheduled and running.

set -eu

sudo systemctl daemon-reload
sudo systemctl enable --now healthsnap.timer

sudo systemctl status healthsnap.timer --no-pager
sudo systemctl list-timers --all | grep -E 'healthsnap.timer|NEXT|LEFT' || true

The timer is now enabled across reboots and actively scheduling runs. If we wait a few minutes, we should see new logs appear in /var/log/healthsnap.

Verify logs are being generated over time

We are going to confirm that multiple snapshot files exist and that timestamps are increasing.

sudo ls -lah /var/log/healthsnap | tail -n 10

If new files appear every few minutes, the system is collecting evidence continuously without any external dependencies.

Step 7: Use sar for “what changed” questions

Now that sysstat is enabled, we can answer the questions that usually matter most: “When did CPU spike?”, “Was disk wait high overnight?”, “Did network traffic jump after a change?” This is where built-in history becomes operationally powerful.

Review CPU history

We are going to query CPU usage history for today. This helps us see trends rather than a single moment.

sar -u | tail -n 30

We now have a time series of CPU usage. If %iowait is elevated during the slow period, we should focus on storage and blocked tasks rather than CPU tuning.
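
To zoom into a specific window of today's data (for example the “what happened at 02:13?” question from earlier), sar accepts start and end times.

sar -u -s 02:00:00 -e 03:00:00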

Review memory and swap history

We are going to check memory and swap activity over time. This helps confirm whether the host is under sustained memory pressure or experiencing periodic spikes.

sar -r | tail -n 30
echo
sar -S | tail -n 30

We now see memory utilization and swap usage trends. Sustained swap usage is a strong indicator that we should address memory sizing, workload behavior, or caching strategy.
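
sar -S tells us how much swap is occupied; sar -W tells us whether pages are actively moving in and out, which is the difference between “swap is merely used” and “swap is thrashing”.

sar -W | tail -n 30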

Review disk and I/O wait history

We are going to check block device activity and I/O wait trends. This is often the missing link in “load is high but CPU is not” incidents.

sar -b | tail -n 30
echo
sar -d | tail -n 30 || true

We now have historical I/O activity. If sar -d is not available on a given distro configuration, we can rely on iostat and the recurring health snapshots for disk visibility.
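
The same history also answers the “did network traffic jump after a change?” question. Per-interface throughput is usually recorded by default; if the output is empty, the collector on that distribution may be configured without network statistics.

sar -n DEV | tail -n 40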

Security and operational considerations

  • No inbound exposure: This approach does not require opening ports. Firewall changes are not applicable, and that is a feature.
  • Least privilege: Interactive triage can be done as a sudo-capable operator. The scheduled snapshot runs as root because it needs full visibility into processes and kernel logs; we mitigate risk with root-owned scripts, restrictive permissions, and systemd hardening.
  • Data sensitivity: Process lists, sockets, and kernel logs can reveal service names, ports, and operational details. We store logs under /var/log/healthsnap with 0640 permissions and root ownership.
  • Disk usage management: Frequent snapshots create files. In environments with strict disk budgets, we should add log rotation. If the host already uses logrotate, we can integrate cleanly.

Optional: Add logrotate for snapshot logs

We are going to configure log rotation so snapshot logs do not grow indefinitely. This is a production hygiene step that prevents “monitoring filled the disk” incidents.

set -eu

if command -v logrotate >/dev/null 2>&1; then
  sudo tee /etc/logrotate.d/healthsnap >/dev/null <<'EOF'
/var/log/healthsnap/health-*.log {
  daily
  rotate 14
  missingok
  notifempty
  compress
  delaycompress
  su root root
}
EOF

  sudo logrotate -d /etc/logrotate.d/healthsnap 2>&1 | sed -n '1,120p'
else
  echo "logrotate not installed; skipping log rotation configuration." 1>&2
fi

If logrotate is present, we now have a 14-day retention policy with compression. The debug run shows what logrotate would do without making changes.
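
If logrotate is not available and we prefer not to install it, a simple find-based cleanup achieves the same retention goal. This is a minimal sketch: preview first, then uncomment the delete once the list looks right.

# Preview snapshot logs older than 14 days
sudo find /var/log/healthsnap -name 'health-*.log*' -mtime +14 -print
# Delete them once the preview is confirmed
# sudo find /var/log/healthsnap -name 'health-*.log*' -mtime +14 -delete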

Troubleshooting

sysstat shows no data in sar

  • Symptom: sar -u returns “Cannot open /var/log/sa/saXX” or shows no entries.
  • Likely causes: sysstat collection not enabled, timer/cron not running, or not enough time has passed to collect the first sample.
  • Fix: Enable and start sysstat (where supported), then wait a few minutes and re-check.
sudo systemctl enable --now sysstat 2>/dev/null || true
sudo systemctl status sysstat --no-pager 2>/dev/null || true
sar -u | tail -n 20 || true
ls -lah /var/log/sysstat 2>/dev/null || true
ls -lah /var/log/sa 2>/dev/null || true

We have now re-validated service state and log directories. If the distro uses cron instead of a systemd service, we should confirm sysstat’s scheduled jobs exist in /etc/cron.d or equivalent.

healthsnap timer is enabled but no logs appear

  • Symptom: systemctl status healthsnap.timer is active, but /var/log/healthsnap stays empty.
  • Likely causes: service failing, script not executable, or systemd hardening preventing writes.
  • Fix: Inspect the last run status and journal logs, then run the service manually.
sudo systemctl status healthsnap.service --no-pager || true
sudo systemctl start healthsnap.service
sudo systemctl status healthsnap.service --no-pager
sudo journalctl -u healthsnap.service --no-pager -n 80
sudo ls -lah /var/log/healthsnap | tail -n 10

We now see whether the service ran successfully and whether logs were written. If the journal shows permission errors, we should confirm /var/log/healthsnap exists and is writable by root, and that ReadWritePaths includes it (it does in our unit).
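
If the journal does not make the failure obvious, running the script directly with shell tracing usually does; every command is printed as it executes, so the failing line is easy to spot.

sudo sh -x /usr/local/sbin/healthsnap.sh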

iotop shows nothing useful

  • Symptom: iotop runs but shows zero I/O or only partial processes.
  • Likely causes: not running with sudo, kernel accounting limitations, or the workload is not I/O-bound at that moment.
  • Fix: Run with sudo and correlate with iostat to confirm whether the system is actually doing disk I/O.
sudo iotop -o -b -n 5
iostat -xz 1 5

We have now confirmed whether disk activity exists and whether iotop can attribute it. If iostat shows low activity, the bottleneck is likely elsewhere.

High load average but CPU is low

  • Symptom: uptime shows high load, but top shows low CPU usage.
  • Likely causes: tasks blocked on disk or network I/O, or filesystem issues.
  • Fix: Check blocked tasks and I/O wait with vmstat and iostat, then attribute with iotop.
vmstat 1 10
iostat -xz 1 10
sudo iotop -o -b -n 5

We have now validated whether the load is driven by blocked I/O. If b and wa are elevated and disk latency is high, storage is the likely bottleneck.
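
A useful complement is to list tasks currently stuck in uninterruptible sleep (state D); these are the tasks that inflate the load average while CPU stays idle, and the WCHAN column hints at what they are waiting on.

ps -eo state,pid,comm,wchan | awk 'NR==1 || $1 ~ /^D/'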

Common mistakes

Running only interactive tools and losing evidence

  • Symptom: We “saw it in top” but cannot prove what happened later.
  • Fix: Use batch outputs (top -b, pidstat) and enable persistent collection (sysstat + systemd timer snapshots).

Confusing load average with CPU usage

  • Symptom: Load is high, so we assume CPU is the problem, but scaling CPU does not help.
  • Fix: Use vmstat and iostat to confirm whether tasks are blocked and whether I/O wait is driving the load.

Ignoring inode exhaustion

  • Symptom: Applications fail to create files, logs stop, services behave unpredictably, but disk “space” looks fine.
  • Fix: Always check df -ihT alongside df -hT, then clean up small-file directories or redesign log retention.

Collecting logs with world-readable permissions

  • Symptom: Sensitive process names, ports, or kernel messages are readable by non-privileged users.
  • Fix: Keep snapshot logs under a restricted directory and enforce 0640 permissions with root ownership, as we did.

How we at NIILAA look at this

This setup is not impressive because it is complex. It is impressive because it is controlled. Every component is intentional. Every configuration has a reason. This is how infrastructure should scale — quietly, predictably, and without drama.

At NIILAA, we help organizations design, deploy, secure, and maintain production-grade Linux operational baselines like this—so performance monitoring is not a scramble during incidents, but a calm, repeatable practice backed by evidence. We standardize host telemetry, harden operational tooling, align retention with compliance needs, and build runbooks that teams can actually execute under pressure.

Website: https://www.niilaa.com
Email: [email protected]
LinkedIn: https://www.linkedin.com/company/niilaa
Facebook: https://www.facebook.com/niilaa.llc
