High Load - Low IO - Low CPU
Finding processes whose state (STAT) in `ps axu` is D (uninterruptible sleep)

Why?

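I once ran into a server with high load but low CPU and low IO usage. It turned out to be caused by processes stuck in the D (uninterruptible sleep) state. On a 4C8G server the load average could climb above 100, yet other programs kept running unaffected.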

man ps
...
PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D uninterruptible sleep (usually IO)
R running or runnable (on run queue)
S interruptible sleep (waiting for an event to complete)
T stopped by job control signal
t stopped by debugger during the tracing
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z defunct ("zombie") process, terminated but not reaped by its parent
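
On Linux, the load average counts tasks in uninterruptible sleep (D) as well as runnable ones (R), which is how load can climb while the CPUs sit idle. Below is a minimal sketch for comparing the two numbers. The helper name `count_states` is made up for illustration, and `/proc/loadavg` is assumed to exist (Linux only):

```shell
# Tally processes by the first character of their state column.
# Reads `ps -eo stat` style output on stdin (header on line 1).
count_states() {
    awk 'NR > 1 {
             s = substr($1, 1, 1)
             if (s == "R") r++
             else if (s == "D") d++
         }
         END { printf "R=%d D=%d\n", r, d }'
}

# 1/5/15-minute load averages, if available (Linux only).
if [ -r /proc/loadavg ]; then
    cat /proc/loadavg
fi

# Snapshot of runnable vs. uninterruptible processes right now.
ps -eo stat | count_states
```

`ps` lists processes here rather than individual threads, so the counts are only a rough snapshot and will not match the load average exactly.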

command

The command below prints the processes whose STAT/S/stat column contains D.

ps aux | awk 'NR==1{for (i=1;i<=NF;i++){ if ($i=="STAT" || $i=="S" || $i=="stat") { k=i;} }} $k ~ /D/ {print $0}'
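
A variant that may be easier to read selects the columns explicitly instead of searching the header for the state column. This is only a sketch, assuming the procps `ps` found on Linux. `filter_d` is a hypothetical helper name, and the `wchan` column (the kernel function the process is sleeping in) is an addition that is not part of the command above:

```shell
# Keep the header line plus rows whose STAT (column 2) starts with D.
filter_d() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# pid, state, kernel wait channel (32 chars wide), full command line.
ps -eo pid,stat,wchan:32,args | filter_d
```

The wait channel often hints at what the process is blocked on, which helps when the process cannot be inspected any other way.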

And why?

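In short, the process is inside a syscall, performing an operation that cannot be interrupted, such as mkdir. This can happen when working on an NFS file system. A process in this state also cannot be straced… In my case, the processes stuck in D were node-exporter, ls and bash.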
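
Since NFS is the suspect named above, one quick check is whether the host has any NFS mounts at all. This is a rough sketch, assuming Linux's `/proc/mounts`. `nfs_mounts` is a hypothetical helper name:

```shell
# Print mount point and type for NFS file systems.
# Expects /proc/mounts format: device mountpoint fstype options ...
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2, $3 }'
}

if [ -r /proc/mounts ]; then
    nfs_mounts < /proc/mounts
fi
```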

Reference

High Load - Low IO - Low CPU usage
man ps
uninterruptible-sleep