Description of problem:
* The vmstat manpage describes the 'b' field as:
    b: The number of processes in uninterruptible sleep.
* Synthetically generating a number of D-state processes does not change the vmstat 'b' field, nor the 'blk' field in dstat -ap.

Version-Release number of selected component (if applicable):
* procps-3.2.8-36.el6

How reproducible:
* 100% / Always

Steps to Reproduce:
1. Freeze a filesystem (not your root!), for example with: fsfreeze -f /test
2. Monitor the system by running vmstat 1 and/or dstat -ap
3. Run the following script:
   $ for i in {1..30} ; do touch /test/$i.txt & done

Actual results:
* vmstat and dstat show no change in the 'b' / 'blk' fields

Expected results:
* vmstat and dstat should report the hung D-state processes.

Additional info:
---
vmstat takes its data from /proc/stat, where the value really is 0 for some reason (although /proc/PID/stat reports the D state correctly). This looks like a kernel matter.
https://www.kernel.org/doc/Documentation/filesystems/proc.txt states:

> The "procs_blocked" line gives the number of processes currently blocked,
> waiting for I/O to complete.

However, the reproducer described in Comment #0 does not change procs_blocked.
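For anyone who wants to watch the two numbers side by side while running the reproducer, here is a minimal sketch. The function names and the optional path argument are made up for illustration, and the usage lines assume the procps ps (for the "state" output column).

```shell
#!/bin/sh
# Print the procs_blocked value from a /proc/stat-format file.
# The optional path argument is an illustrative addition so the function
# can also be pointed at a saved copy of /proc/stat.
procs_blocked() {
    awk '$1 == "procs_blocked" { print $2 }' "${1:-/proc/stat}"
}

# Count D-state entries on stdin, where stdin holds one state letter
# per line (the format printed by "ps -eo state=").
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Usage while the filesystem is frozen:
#   echo "procs_blocked: $(procs_blocked)"
#   echo "D-state tasks: $(ps -eo state= | count_d_state)"
```

With the reproducer from Comment #0 running, the first number is expected to stay at 0 while the second follows the number of hung touch processes.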
According to the kernel implementation of fs/proc/stat.c :: show_stat(), what the kernel reports as "procs_blocked" is the sum of all per-cpu "nr_iowait" variables. nr_iowait is *only* updated when a thread calls io_schedule_timeout() -- incremented before scheduling and decremented after waking up.

Note: it is entirely possible for a kernel thread to be TASK_UNINTERRUPTIBLE (i.e., D-state) *without* waiting on I/O. For example, any call to msleep will set the current task to TASK_UNINTERRUPTIBLE. In this case, instead of waiting until a disk I/O completes, the task waits for a timer to expire.

What about the example in Comment #0?

% dd if=/dev/zero of=/tmp/temp bs=1M count=500
% losetup /dev/loop1 /tmp/temp
% mkfs.ext4 /dev/loop1
% mkdir /mnt/temp
% mount /dev/loop1 /mnt/temp
% fsfreeze -f /mnt/temp/
% touch /mnt/temp/foo &

% grep procs_blocked /proc/stat
procs_blocked 0

% cat /proc/$(pgrep touch)/stat
3014 (touch) D 2963 3014 2963 34816 3059 4202496 272 0 0 0 0 0 0 0 20 0 1 0 1088036 110514176 87 18446744073709551615 4194304 4244516 140734986742128 140734986741528 139644875737616 0 0 0 0 18446744071580943694 0 0 17 12 0 0 0 0 0 6345544 6349600 34729984 140734986748923 140734986748943 140734986748943 140734986751977 0

% cat /proc/$(pgrep touch)/stack
[<ffffffff8120054e>] __sb_start_write+0xde/0x110
[<ffffffff8121e5a4>] mnt_want_write+0x24/0x50
[<ffffffff8120cdff>] do_last+0xc1f/0x12a0
[<ffffffff8120d542>] path_openat+0xc2/0x490
[<ffffffff8120f6bb>] do_filp_open+0x4b/0xb0
[<ffffffff811fcbd3>] do_sys_open+0xf3/0x1f0
[<ffffffff811fccee>] SyS_open+0x1e/0x20
[<ffffffff81693a09>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

crash> dis -l __sb_start_write+0xde
/usr/src/debug/kernel-3.10.0-501.el7/linux-3.10.0-501.el7.x86_64/fs/super.c: 1147
0xffffffff8120054e <__sb_start_write+222>:      lea    -0x58(%rbp),%rsi

1141 int __sb_start_write(struct super_block *sb, int level, bool wait)
1142 {
1143 retry:
1144         if (unlikely(sb->s_writers.frozen >= level)) {
1145                 if (!wait)
1146                         return 0;
1147                 wait_event(sb->s_writers.wait_unfrozen,
1148                            sb->s_writers.frozen < level);
1149         }

In this case, the filesystem was frozen, so we haven't even gotten as far as pushing any I/O out to the device. The implementation of include/linux/wait.h :: wait_event() sets TASK_UNINTERRUPTIBLE, checks a condition, and keeps scheduling until the condition is met.

The kernel documentation would probably be a little clearer if it read,

    "The "procs_blocked" line gives the number of processes currently
     blocked *ON* waiting for I/O to complete."
             ^^^^

It would also be clearer if the kernel /proc/stat interface had printed "nr_iowait" (as it's referred to in the source code) rather than "procs_blocked". However, that ship has sailed, and renaming this field would break all manner of "awk '/procs_blocked/{print $NF}' /proc/stat" type scripts.

I took a peek at the source for vmstat, and it's interesting that in the absence of a /proc/stat "procs_blocked" field (Linux 2.5.46 (approximately) and below), it will iterate through all /proc/<PID>/stat files looking for 'R' or 'D' fields. In 2002, code was added to use /proc/stat "procs_blocked" if it was available. Nothing in the commit message for this change makes reference to this field or any change in reporting semantics.

IMHO, this was a bug introduced into vmstat long ago... since nobody has complained in the interim, I would suggest changing the vmstat documentation to match its current implementation:

FIELD DESCRIPTION FOR VM MODE
   Procs
       r: The number of runnable processes (running or waiting for run time).
       b: The number of processes blocked on IO (in uninterruptible sleep.)
                                  ^^^^^^^^^^^^^

or something to the effect of explaining that the value is only the subset of TASK_UNINTERRUPTIBLE tasks that are waiting on I/O completion.
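The older per-process fallback described above can be approximated from the shell. This is only a sketch of the idea, not vmstat's actual code: the function names are hypothetical, the optional directory argument exists only so the function can be run against a copy of /proc, and it assumes each stat file is a single line (a process name containing a newline would confuse it). The state letter is taken as the first field after the last ") ", because the process name in parentheses may itself contain spaces or parentheses.

```shell
#!/bin/sh
# Print the state letter from one /proc/<PID>/stat line read on stdin.
# Everything up to the last ") " is dropped first, since the comm field
# may contain spaces or parentheses.
stat_state() {
    awk '{ sub(/^.*\) /, ""); print $1 }'
}

# Tally R and D states over <dir>/[0-9]*/stat (default dir: /proc).
# Processes that exit mid-scan just produce a suppressed cat error.
count_r_and_d() {
    cat "${1:-/proc}"/[0-9]*/stat 2>/dev/null |
        awk '{ sub(/^.*\) /, "") }
             $1 == "R" { r++ }
             $1 == "D" { d++ }
             END { printf "r=%d b=%d\n", r, d }'
}

# Usage:
#   count_r_and_d
#   cat /proc/$(pgrep touch)/stat | stat_state
```

Counted this way, the frozen-filesystem reproducer should show up in "b", which is the behaviour the current manpage text describes.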
(In reply to Joe Lawrence from comment #4)
> nr_iowait is *only* updated when a thread calls io_schedule_timeout() --
> incremented before scheduling and decremented after waking up.

Exactly! And in this case /usr/bin/touch waits for the semaphore.

I was very sure I had already closed this bug as NOTABUG... probably it was another one with the same description.

I think this one should be closed too.
(In reply to Oleg Nesterov from comment #6)
> I was very sure I had already closed this bug as NOTABUG... probably it
> was another one with the same description.
>
> I think this one should be closed too.

https://www.youtube.com/watch?v=L0MK7qz13bU&feature=youtu.be&t=65