Bug 1017349 - race condition in mm_for_maps
Status: CLOSED DUPLICATE of bug 844450
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Seth Jennings
QA Contact: Red Hat Kernel QE team
Docs Contact:
Depends On:
Blocks:
Reported: 2013-10-09 13:11 EDT by Andy Grimm
Modified: 2016-11-07 22:47 EST
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-30 12:44:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
sysrq -t data (472.48 KB, text/plain)
2013-10-15 15:48 EDT, Andy Grimm

Description Andy Grimm 2013-10-09 13:11:33 EDT
Description of problem:

A condition occurs frequently in OpenShift Online where lsof hangs in mm_for_maps. PIDs are recycled quickly (usually within a 5- to 10-minute interval), and I believe that the process whose maps entry lsof is trying to read is actually gone. It appears that the _next_ process to get the same PID hangs and must be sent a SIGKILL before any future lsof call will work properly.

Example:
# ps -eo user,pid,ppid,stat,pcpu,wchan:32,etime,cmd | /bin/awk 'NR==1; $4 ~ /D/'
USER       PID  PPID STAT %CPU WCHAN                                ELAPSED CMD
root      7542 31943 DN    0.0 mm_for_maps                         02:30:53 /usr/sbin/lsof -wnlP +c 0

# ls -l /proc/7542/fd
total 0
lr-x------. 1 root root 64 Oct  9 12:06 0 -> pipe:[29986258]
l-wx------. 1 root root 64 Oct  9 12:06 1 -> pipe:[30459868]
l-wx------. 1 root root 64 Oct  9 12:06 2 -> pipe:[30459868]
lr-x------. 1 root root 64 Oct  9 12:06 3 -> /proc
lr-x------. 1 root root 64 Oct  9 12:06 4 -> /proc/21085/maps
l-wx------. 1 root root 64 Oct  9 12:06 5 -> pipe:[30459893]
lr-x------. 1 root root 64 Oct  9 12:06 6 -> pipe:[30459894]

# ls -l /proc/21085
total 0
dr-xr-xr-x. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:45 attr
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 autogroup
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 auxv
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 cgroup
--w-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 clear_refs
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:39 cmdline
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 coredump_filter
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 cpuset
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:34 cwd -> /var/lib/openshift/51e6a8b24382ec97f200009a/app-root/runtime/repo
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 environ
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:39 exe -> /usr/bin/node
dr-x------. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 fd
dr-x------. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 fdinfo
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 io
-rw-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 limits
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 loginuid
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:34 maps
-rw-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 mem
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 mountinfo
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 mounts
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 mountstats
dr-xr-xr-x. 6 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 net
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 numa_maps
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 oom_adj
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 oom_score
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 oom_score_adj
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 pagemap
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 personality
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:34 root -> /
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 sched
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 schedstat
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 sessionid
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 smaps
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 stack
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:39 stat
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 statm
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 09:39 status
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 syscall
dr-xr-xr-x. 3 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 task
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct  9 12:06 wchan

(Note the unusual timestamp on /proc/21085/maps -- typically it's the same as other files in the directory)
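
(For reference, a quick way to spot that mismatch is to compare the mtimes directly; a minimal sketch:)

# stat -c '%y  %n' /proc/21085/maps /proc/21085/status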

# /usr/sbin/lsof -wnlP +c 0
(hangs. hit ctrl-c.  may need to kill -9)

# ps -p 21085 -o user,pid,ppid,stat,pcpu,wchan:32,etime,comm 
USER       PID  PPID STAT %CPU WCHAN                                ELAPSED COMMAND
2383     21085  8328 R     0.0 -                                 1-02:56:44 node

Next I confirm that the app served by this process is not actually working properly; the steps for that aren't relevant here.  Getting rid of the problematic process sometimes just requires a "gentle" kill, and sometimes only works with something like "killall -9 -u <uid>".  After process 21085 is killed by whatever means was necessary, the hung lsof command completes.
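
(For reference, a minimal detection sketch based on the commands above; it only finds the stuck readers and the maps file each one is blocked on, and the awk field positions are assumptions that may need adjusting:)

# list lsof processes blocked in mm_for_maps, then show the maps fd each one is stuck on
for lsofpid in $(ps -eo pid,wchan:32,comm | awk '$2 == "mm_for_maps" && $3 == "lsof" {print $1}'); do
    echo "lsof pid $lsofpid:"
    ls -l /proc/$lsofpid/fd | grep '/maps$'
done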

Version-Release number of selected component (if applicable):
kernel-2.6.32-358.18.1.el6.x86_64
lsof-4.82-4.el6.x86_64

How reproducible:

I haven't been able to reproduce this on-demand, but it happens several times per day on OpenShift Online nodes.

Additional info:

In addition to PIDs being recycled, another characteristic of these systems is that most tasks are run in cgroups with memory limits, and it is common for tasks to be OOM killed.

We are also using cgred, and while the following message appears frequently in general, it always occurs for the problematic PID:

CGRE[1271]: Warning: cannot write tid 21085 to /cgroup/all/openshift/51e6a8b24382ec97f200009a//tasks:No such process

I'm not sure whether that's related.
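
(It might also be worth correlating these incidents with recent OOM kills; a rough check, assuming the OOM-killer messages end up in the usual places:)

# dmesg | grep -i 'killed process' | tail
# grep -i 'out of memory' /var/log/messages | tail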
Comment 2 Larry Woodman 2013-10-10 13:43:28 EDT
Andy, can you get the kernel stack trace of the process that's hanging in mm_for_maps()?

Thanks!  Larry
Comment 3 Andy Grimm 2013-10-13 10:40:32 EDT
I finally got two of these today.  They are identical:

[<ffffffff811eae1b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7dbd>] m_start+0x6d/0x160
[<ffffffff811a52c6>] seq_read+0x96/0x400
[<ffffffff81181a95>] vfs_read+0xb5/0x1a0
[<ffffffff81181bd1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I have now found, though, that this has the same underlying cause as the call_rwsem_down_read_failed issue (rhbz844450).  It is the result of a process holding a lock in a gear/cgroup that is receiving no CPU cycles.  I can unstick things by running:

echo -1 > /cgroup/all/openshift/<UUID>/cpu.cfs_quota_us

where <UUID> is the "stuck" gear which is holding the lock.
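
(A minimal sketch of that workaround, with the caveat that -1 removes the quota entirely, so the old value probably needs to be restored afterwards:)

GEAR=<UUID>                                        # the stuck gear holding the lock
QUOTA=/cgroup/all/openshift/$GEAR/cpu.cfs_quota_us
OLD=$(cat $QUOTA)
echo -1 > $QUOTA       # lift the CPU quota so the lock holder finally runs
sleep 1
echo $OLD > $QUOTA     # put the original quota back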
Comment 4 Larry Woodman 2013-10-14 10:24:47 EDT
Andy, sorry for my confusion here, but is this all that's in the messages file?  It doesn't tell me what mm_for_maps() is doing; it can block and/or spin on a lock.  Since you say it's receiving no CPU cycles, I assume it's blocked down in the bowels of __mutex_lock_common(), but I can't prove it.  Can I either get access to a system in this state, or can you get me an AltSysrq-T output so I can see exactly what it's doing?  Also, you said this is the same problem as BZ844450, where a process hangs on the mmap_sem semaphore.  Where does mm_for_maps() get called with mmap_sem held?

Larry
Comment 5 Andy Grimm 2013-10-14 16:15:04 EDT
Larry, the confusion is my fault.  I got a little excited about the workaround to BZ844450, which also works around this bug, but the bug itself here is different, and I've now conflated them.

So, what I provided above was simply the contents of /proc/<PID>/stack for the stuck lsof process.  There's nothing in /var/log/messages for these, as far as I could tell.  (I know sometimes there's a "task hung for 120 seconds" message or something for hung processes, but I did not notice that for these.)  I will grab AltSysrq-T output next time I see the issue, or I'll ping you to get you access if the system is in a state where we can spend time debugging.

I have to say that, while I find this issue intriguing, the priority is pretty low for us compared to BZ844450, which, while we've got a workaround, requires some pretty "smart" scripts to auto-recover from.
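
(For the record, the plan for capturing that is roughly the following, assuming magic SysRq is enabled on these nodes:)

echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
echo t > /proc/sysrq-trigger        # dump every task's state and stack to the kernel ring buffer
dmesg > /tmp/sysrq-t.txt            # or pull the same output from /var/log/messages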
Comment 6 Andy Grimm 2013-10-15 15:47:23 EDT
Ok, I got another instance of this.  I will try again to walk through how I debugged and worked around it.  First, the symptom is this:

[root@ex-std-node260.prod ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER       PID  PPID S WCHAN                            CMD
root     23679  3716 D mm_for_maps                      /usr/sbin/lsof -wnlP +c 0
root     23683 23679 S pipe_wait                        /usr/sbin/lsof -wnlP +c 0
[root@ex-std-node260.prod ~]# ls -l /proc/23679/fd
total 0
lr-x------. 1 root root 64 Oct 15 14:23 0 -> pipe:[385978154]
l-wx------. 1 root root 64 Oct 15 14:23 1 -> pipe:[386422190]
l-wx------. 1 root root 64 Oct 15 14:23 2 -> pipe:[386422190]
lr-x------. 1 root root 64 Oct 15 14:23 3 -> /proc
lr-x------. 1 root root 64 Oct 15 14:23 4 -> /proc/22283/maps
l-wx------. 1 root root 64 Oct 15 14:23 5 -> pipe:[386422205]
lr-x------. 1 root root 64 Oct 15 14:23 6 -> pipe:[386422206]
[root@ex-std-node260.prod ~]# cat /proc/23679/stack 
[<ffffffff811eac2b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7bcd>] m_start+0x6d/0x160
[<ffffffff811a5096>] seq_read+0x96/0x400
[<ffffffff81181845>] vfs_read+0xb5/0x1a0
[<ffffffff81181981>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I should note that any attempt to run another instance of lsof at this point will hang at the same point.  Now, let's look at the PID that it's stuck on:

[root@ex-std-node260.prod ~]# ls -la /proc/22283/
total 0
dr-xr-xr-x.   7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 .
dr-xr-xr-x. 714 root                     root                     0 Sep  9 12:47 ..
dr-xr-xr-x.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:45 attr
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 autogroup
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 auxv
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cgroup
--w-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 clear_refs
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 cmdline
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 coredump_filter
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cpuset
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 environ
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 fd
dr-x------.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 fdinfo
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 io
-rw-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 limits
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 loginuid
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 09:24 maps
-rw-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mem
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountinfo
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mounts
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountstats
dr-xr-xr-x.   6 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 net
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 numa_maps
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_adj
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score_adj
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 pagemap
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 personality
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 root -> /
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sched
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 schedstat
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sessionid
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 smaps
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 stack
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:35 stat
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 statm
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 status
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 syscall
dr-xr-xr-x.   3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 task
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 wchan
[root@ex-std-node260.prod ~]# ls -la /proc/22283/task
total 0
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 ..
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 22283
[root@ex-std-node260.prod ~]# ls -la /proc/22283/task/22283
total 0
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 ..
dr-xr-xr-x. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 attr
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 auxv
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cgroup
--w-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 clear_refs
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cmdline
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cpuset
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 environ
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fd
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fdinfo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 io
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 limits
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 loginuid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 maps
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mem
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mountinfo
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mounts
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 numa_maps
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 pagemap
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 personality
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 root -> /
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sched
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 schedstat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sessionid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 smaps
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stack
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 statm
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 status
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 syscall
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 wchan

and then look at the process list for the gear that owns this task:

[root@ex-std-node260.prod ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
  PID  PPID S WCHAN                                     PENDING COMMAND
21884     1 S pipe_wait                        0000000000000000 node
22283 21884 R -                                0000000000000000 node

So, two processes, and the problematic one is RUNNABLE.  That's the important part.  Now let's look at the cgroup's CPU usage:

[root@ex-std-node260.prod ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113
[root@ex-std-node260.prod ~]# sleep 10; cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113

Here we've gone ten seconds without a single nanosecond of CPU consumed.   But if we bump the scheduler, like so:

[root@ex-std-node260.prod ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us 
30000
[root@ex-std-node260.prod ~]# echo 35000 > /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us

suddenly we get to use the CPU:

[root@ex-std-node260.prod ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17113176341247

and the process exits cleanly:

[root@ex-std-node260.prod ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
  PID  PPID S WCHAN                                     PENDING COMMAND
21884     1 S ep_poll                          0000000000000000 node
32301 21884 R -                                0000000000000000 node

and our hung lsof processes are gone:

[root@ex-std-node260.prod ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER       PID  PPID S WCHAN                            CMD

So, that demonstrates that this is indeed related to BZ844450.  It's actually not clear to me whether this needs to be addressed separately.  I thought there were cases where we hit this without hitting BZ844450, but I'm no longer finding instances of that.
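
(The cpuacct check above generalizes into a simple starvation test; a sketch, assuming the same cgroup layout, and keeping in mind that an idle gear also shows no delta, so it should be paired with the runnable-task check:)

GEAR=<UUID>
USAGE=/cgroup/all/openshift/$GEAR/cpuacct.usage
before=$(cat $USAGE); sleep 10; after=$(cat $USAGE)
[ "$before" = "$after" ] && echo "gear $GEAR consumed no CPU in 10s -- possibly starved"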
Comment 7 Andy Grimm 2013-10-15 15:48:38 EDT
Created attachment 812661 [details]
sysrq -t data
Comment 8 Larry Woodman 2013-10-15 16:09:04 EDT
Andy, wait a minute.  Are you overriding the default cpu cgroup values, and if so, what are their values now?

Larry

BTW, there is no mm_for_maps in that sysrq -t output at all.  Is the system burning a CPU on the process stuck there?  If so, echo L > /proc/sysrq-trigger.
Comment 9 Andy Grimm 2013-10-15 16:31:43 EDT
Yes, our cgroup configuration for each small gear sets the following:

cpu.cfs_quota_us = 100000
cpu.shares = 128
memory.limit_in_bytes = 536870912
memory.move_charge_at_immigrate = 1
memory.memsw.limit_in_bytes = 641728512

Additionally, there is a "throttler" which changes cpu.cfs_quota_us to 30000 when a cgroup consumes too much CPU over a period of time.  I would speculate that this is the cause of the problem, except that we saw this issue well before the throttler was implemented.
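
(In cgroup-file terms, that configuration amounts to roughly the following per gear; this is a sketch, not our actual cgconfig/cgrules setup:)

G=/cgroup/all/openshift/<UUID>
echo 100000    > $G/cpu.cfs_quota_us
echo 128       > $G/cpu.shares
echo 536870912 > $G/memory.limit_in_bytes
echo 1         > $G/memory.move_charge_at_immigrate
echo 641728512 > $G/memory.memsw.limit_in_bytes
# the throttler later drops cpu.cfs_quota_us to 30000 for gears that overuse the CPU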

Perhaps I'm confused now; I'm not sure why you expect to see mm_for_maps in the sysrq -t output.  The process stuck in mm_for_maps is not on the CPU; it was waiting for a lock on /proc/22283/maps, which is held by process 22283, which was runnable, but not on the CPU.  From the sysrq output:

 node 22283   7260750.330346         3   120   7260750.330346        16.559928         0.000000 /openshift/525bf22a500446fc4b0000fe

The problem is that the scheduler was never putting this task on the CPU until I wrote a new value to the cgroup's cpu.cfs_quota_us.  And it doesn't seem to matter what the value is, just that something is written to cause that cgroup's scheduler priority to be reassessed.
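
(Given that, the least invasive nudge may be to write the current value back to itself; a sketch, assuming the write alone is enough to get the cgroup reassessed:)

Q=/cgroup/all/openshift/<UUID>/cpu.cfs_quota_us
echo $(cat $Q) > $Q     # rewrite the same quota value; the write itself kicks the scheduler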
Comment 10 RHEL Product and Program Management 2013-10-18 16:35:36 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 11 Larry Woodman 2013-10-25 16:53:35 EDT
Andy, from Comment #9: what's cfs_period_us set to?

Larry
Comment 12 Seth Jennings 2013-10-30 12:44:54 EDT

*** This bug has been marked as a duplicate of bug 844450 ***
