Bug 1374332 - /proc/stat reports zero procs_blocked even with D-state processes
Summary: /proc/stat reports zero procs_blocked even with D-state processes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.8
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
: ---
Assignee: Oleg Nesterov
QA Contact: Chunyu Hu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-08 13:05 UTC by Rodrigo A B Freire
Modified: 2017-04-28 16:56 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1374397 (view as bug list)
Environment:
Last Closed: 2017-04-03 15:14:58 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2608861 0 None None None 2016-09-08 14:58:28 UTC

Internal Links: 1796043

Description Rodrigo A B Freire 2016-09-08 13:05:50 UTC
Description of problem:
 * vmstat 'b' field, as per its manpage:
       b: The number of processes in uninterruptible sleep.
 * Generating a synthetic amount of D-state processes will not cause any change in vmstat 'b' field, nor in dstat -ap.

Version-Release number of selected component (if applicable):
* procps-3.2.8-36.el6

How reproducible:
* 100% / Always

Steps to Reproduce:
1. Freeze a filesystem (not your root!) using fsfreeze -f /test for example
2. Monitor the system running vmstat 1 and/or dstat -ap
3. Run the following script:
   $ for i in {1..30} ; do touch /test/$i.txt & done

Actual results:
* vmstat or dstat will not show any change in 'b' / 'blk' fields

Expected results:
* vmstat or dstat should report the hung D-state processes.

Additional info:
---

Comment 1 Jan Rybar 2016-09-08 13:19:23 UTC
vmstat takes its data from /proc/stat, where the value really is 0 for some reason (although /proc/PID/stat correctly reports the D state). This looks like a kernel matter.

Comment 2 Rodrigo A B Freire 2016-09-08 15:01:13 UTC
https://www.kernel.org/doc/Documentation/filesystems/proc.txt states:

> The "procs_blocked" line gives the number of processes currently blocked,
> waiting for I/O to complete.

However, the described reproducer in Comment #0 does not change procs_blocked.

Comment 4 Joe Lawrence 2016-09-12 20:00:21 UTC
According to the kernel implementation of fs/proc/stat.c :: show_stat(),
what the kernel is reporting as "procs_blocked" is the sum of all
per-cpu "nr_iowait" variables.

nr_iowait is *only* updated when a thread calls io_schedule_timeout() --
incremented before scheduling and decremented after waking up.

Note: it is very possible for a kernel thread to be TASK_UNINTERRUPTIBLE
(i.e., D-state) *without* waiting on I/O.  For example, any call to msleep()
will set the current task to TASK_UNINTERRUPTIBLE.  In this case,
instead of waiting until a disk I/O completes, the task waits for a
timer to expire.
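
The accounting described above can be pictured with a small toy model. This is only an illustrative sketch in Python, not kernel code: the names ToyScheduler, io_sleep, io_wake and plain_sleep are made up for this example, and the model assumes nothing beyond what is stated above (nr_iowait is per-cpu and is only touched around io_schedule_timeout()).

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model of the accounting; names are hypothetical, not kernel APIs."""
    nr_iowait: list                              # one counter per CPU
    d_state: set = field(default_factory=set)    # tasks in uninterruptible sleep

    def io_sleep(self, task, cpu):
        # Stands in for io_schedule_timeout(): the task is in D state
        # and the per-cpu nr_iowait counter is incremented.
        self.nr_iowait[cpu] += 1
        self.d_state.add(task)

    def io_wake(self, task, cpu):
        # The counter is decremented again once the task wakes up.
        self.nr_iowait[cpu] -= 1
        self.d_state.discard(task)

    def plain_sleep(self, task):
        # Stands in for msleep(): D state, but nr_iowait is not touched.
        self.d_state.add(task)

    def procs_blocked(self):
        # What /proc/stat would print in this model: the sum of nr_iowait.
        return sum(self.nr_iowait)


sched = ToyScheduler(nr_iowait=[0, 0])
sched.plain_sleep("sleeper")      # D state without I/O wait
sched.io_sleep("reader", cpu=1)   # D state with I/O wait
print(len(sched.d_state), sched.procs_blocked())  # prints: 2 1
```

In this model two tasks are in D state while the procs_blocked figure is 1, which is the same kind of gap reported in Comment #0.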

What about the example in Comment #0?

  % dd if=/dev/zero of=/tmp/temp bs=1M count=500
  % losetup /dev/loop1 /tmp/temp
  % mkfs.ext4 /dev/loop1
  % mkdir /mnt/temp
  % mount /dev/loop1 /mnt/temp
  % fsfreeze -f /mnt/temp/
  % touch /mnt/temp/foo &

  % grep procs_blocked /proc/stat
  procs_blocked 0

  % cat /proc/$(pgrep touch)/stat
  3014 (touch) D 2963 3014 2963 34816 3059 4202496 272 0 0 0 0 0 0 0 20 0 1 0 1088036 110514176 87 18446744073709551615 4194304 4244516 140734986742128 140734986741528 139644875737616 0 0 0 0 18446744071580943694 0 0 17 12 0 0 0 0 0 6345544 6349600 34729984 140734986748923 140734986748943 140734986748943 140734986751977 0

  % cat /proc/$(pgrep touch)/stack
  [<ffffffff8120054e>] __sb_start_write+0xde/0x110
  [<ffffffff8121e5a4>] mnt_want_write+0x24/0x50
  [<ffffffff8120cdff>] do_last+0xc1f/0x12a0
  [<ffffffff8120d542>] path_openat+0xc2/0x490
  [<ffffffff8120f6bb>] do_filp_open+0x4b/0xb0
  [<ffffffff811fcbd3>] do_sys_open+0xf3/0x1f0
  [<ffffffff811fccee>] SyS_open+0x1e/0x20
  [<ffffffff81693a09>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff

  crash> dis -l __sb_start_write+0xde
  /usr/src/debug/kernel-3.10.0-501.el7/linux-3.10.0-501.el7.x86_64/fs/super.c: 1147
  0xffffffff8120054e <__sb_start_write+222>:      lea    -0x58(%rbp),%rsi

  1141 int __sb_start_write(struct super_block *sb, int level, bool wait)
  1142 {
  1143 retry:
  1144         if (unlikely(sb->s_writers.frozen >= level)) {
  1145                 if (!wait)
  1146                         return 0;
  1147                 wait_event(sb->s_writers.wait_unfrozen,
  1148                            sb->s_writers.frozen < level);
  1149         }

In this case, the filesystem was frozen, so we haven't even gotten as
far as pushing any I/O out to the device.  The implementation of
include/linux/wait.h :: wait_event() sets TASK_UNINTERRUPTIBLE, checks
on a condition and schedules continuously until the condition is met.
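
As a rough user-space analogy of that wait loop (a sketch only, assuming a single boolean "frozen" flag instead of the kernel's freeze levels; FreezeGate and its methods are hypothetical names, not kernel or library APIs), the blocked writer never issues I/O, it just sleeps until the condition changes:

```python
import threading


class FreezeGate:
    """User-space analogy of the frozen check; hypothetical names throughout."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frozen = False

    def freeze(self):
        with self._cond:
            self._frozen = True

    def thaw(self):
        with self._cond:
            self._frozen = False
            self._cond.notify_all()   # wake every waiting writer

    def start_write(self, wait=True):
        with self._cond:
            if self._frozen and not wait:
                return False          # caller asked not to block
            while self._frozen:       # re-check the condition after each wakeup
                self._cond.wait()     # sleep; no I/O is issued while waiting
            return True


gate = FreezeGate()
gate.freeze()
results = []
writer = threading.Thread(target=lambda: results.append(gate.start_write()))
writer.start()                        # waits while the gate is frozen, like touch
gate.thaw()                           # comparable to fsfreeze -u
writer.join()
print(results)                        # prints: [True]
```

The writer thread only returns once the gate is thawed, and nothing in the waiting path resembles an I/O request, which is why nr_iowait has no reason to change.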

The kernel documentation would probably be a little clearer if it read,
"The "procs_blocked" line gives the number of processes currently
blocked *ON* waiting for I/O to complete."
        ^^^^

It would also be clearer if the kernel /proc/stat interface had printed
"nr_iowait" (as it's referred to in the source code) rather than
"procs_blocked".  However, that ship has sailed and renaming this field
will break all manner of "awk '/procs_blocked/{print $NF}' /proc/stat"
type scripts.

I took a peek at the source for vmstat and it's interesting that in the
absence of a /proc/stat "procs_blocked" field (roughly Linux 2.5.46
and below), it will iterate through all /proc/<PID>/stat
files looking for 'R' or 'D' fields.  In 2002 code was added to use
/proc/stat "procs_blocked" if it was available.  Nothing in the commit
message for this change makes reference to this field or any change in
reporting semantics.  IMHO, this was a bug introduced into vmstat long
ago... since nobody has complained in the interim, I would suggest
changing the vmstat documentation to match its current implementation:

FIELD DESCRIPTION FOR VM MODE
   Procs
       r: The number of runnable processes (running or waiting for run time).
       b: The number of processes blocked on IO (in uninterruptible sleep.)
                                  ^^^^^^^^^^^^^
or something to the effect of explaining that the value is only the
subset of TASK_UNINTERRUPTIBLE waiting on I/O completion.
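
To compare the two ways of counting on a live system, something like the following could be used. It is a minimal sketch in Python, not the vmstat implementation: the function names are made up for this example, it assumes the usual procfs layout (a "procs_blocked N" line in /proc/stat, and the state letter as the first field after the parenthesised command name in /proc/<PID>/stat), and it scans processes only, not individual threads.

```python
import os
from collections import Counter


def parse_procs_blocked(stat_text):
    """Return the procs_blocked value from /proc/stat contents, or None."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "procs_blocked":
            return int(fields[1])
    return None


def parse_task_state(pid_stat_text):
    """Return the one-letter state from /proc/<PID>/stat contents."""
    # The command name may contain spaces or ')', so split on the last ')'.
    after_comm = pid_stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_task_states(proc_root="/proc"):
    """Count processes per state letter by scanning <proc_root>/<PID>/stat."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_task_state(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
    return counts


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        print("procs_blocked (from /proc/stat):", parse_procs_blocked(f.read()))
    states = count_task_states()
    print("D-state processes (per-PID scan):", states["D"])
    print("R-state processes (per-PID scan):", states["R"])
```

With the reproducer from Comment #0 running, the per-PID scan would be expected to show the hung touch processes in D state while procs_blocked stays at 0.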

Comment 6 Oleg Nesterov 2017-03-23 22:50:25 UTC
(In reply to Joe Lawrence from comment #4)
>
> nr_iowait is *only* updated when a thread calls io_schedule_timeout() --
> incremented before scheduling and decremented after waking up.

exactly!

and in this case /usr/bin/touch waits for the semaphore.

I was very sure I had already closed this bug as NOTABUG... probably it
was another one with the same description.

I think this one should be closed too.

Comment 7 Rodrigo A B Freire 2017-03-24 00:43:57 UTC
(In reply to Oleg Nesterov from comment #6)

> I was very sure I had already closed this bug as NOTABUG... probably it
> was another one with the same description.
> 
> I think this one should be closed too.

https://www.youtube.com/watch?v=L0MK7qz13bU&feature=youtu.be&t=65

