Description of problem:

There's a condition frequently occurring in OpenShift Online where lsof is hanging in mm_for_maps. PIDs are recycled frequently (usually within a 5- to 10-minute interval), and I believe that the process whose maps entry lsof is trying to read is actually gone. It appears that the _next_ process to get the same PID hangs and must be sent a SIGKILL before any future lsof call will work properly.

Example:

# ps -eo user,pid,ppid,stat,pcpu,wchan:32,etime,cmd | /bin/awk 'NR==1; $4 ~ /D/'
USER PID PPID STAT %CPU WCHAN ELAPSED CMD
root 7542 31943 DN 0.0 mm_for_maps 02:30:53 /usr/sbin/lsof -wnlP +c 0

# ls -l /proc/7542/fd
total 0
lr-x------. 1 root root 64 Oct 9 12:06 0 -> pipe:[29986258]
l-wx------. 1 root root 64 Oct 9 12:06 1 -> pipe:[30459868]
l-wx------. 1 root root 64 Oct 9 12:06 2 -> pipe:[30459868]
lr-x------. 1 root root 64 Oct 9 12:06 3 -> /proc
lr-x------. 1 root root 64 Oct 9 12:06 4 -> /proc/21085/maps
l-wx------. 1 root root 64 Oct 9 12:06 5 -> pipe:[30459893]
lr-x------. 1 root root 64 Oct 9 12:06 6 -> pipe:[30459894]

# ls -l /proc/21085
total 0
dr-xr-xr-x. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:45 attr
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 autogroup
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 auxv
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 cgroup
--w-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 clear_refs
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:39 cmdline
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 coredump_filter
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 cpuset
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:34 cwd -> /var/lib/openshift/51e6a8b24382ec97f200009a/app-root/runtime/repo
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 environ
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:39 exe -> /usr/bin/node
dr-x------. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 fd
dr-x------. 2 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 fdinfo
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 io
-rw-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 limits
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 loginuid
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:34 maps
-rw-------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 mem
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 mountinfo
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 mounts
-r--------. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 mountstats
dr-xr-xr-x. 6 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 net
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 numa_maps
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 oom_adj
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 oom_score
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 oom_score_adj
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 pagemap
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 personality
lrwxrwxrwx. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:34 root -> /
-rw-r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 sched
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 schedstat
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 sessionid
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 smaps
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 stack
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:39 stat
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 statm
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 09:39 status
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 syscall
dr-xr-xr-x. 3 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 task
-r--r--r--. 1 51e6a8b24382ec97f200009a 51e6a8b24382ec97f200009a 0 Oct 9 12:06 wchan

(Note the unusual timestamp on /proc/21085/maps -- typically it's the same as on the other files in the directory.)

# /usr/sbin/lsof -wnlP +c 0
(hangs; hit ctrl-c, may need kill -9)

# ps -p 21085 -o user,pid,ppid,stat,pcpu,wchan:32,etime,comm
USER PID PPID STAT %CPU WCHAN ELAPSED COMMAND
2383 21085 8328 R 0.0 - 1-02:56:44 node

Next I confirm that the app served by this process is not actually working properly; the steps for that aren't relevant here. Getting rid of the problematic process sometimes just requires a "gentle" kill, and sometimes only works with something like "killall -9 -u <uid>". After process 21085 is killed by whatever means was necessary, the hung lsof command completes.

Version-Release number of selected component (if applicable):
kernel-2.6.32-358.18.1.el6.x86_64
lsof-4.82-4.el6.x86_64

How reproducible:
I haven't been able to reproduce this on demand, but it happens several times per day on OpenShift Online nodes.

Additional info:
In addition to PIDs being recycled, another characteristic of these systems is that most tasks run in cgroups with memory limits, and it is common for tasks to be OOM-killed. We are also using cgred, and while we see the following message frequently, it always occurs for the problematic PID:

CGRE[1271]: Warning: cannot write tid 21085 to /cgroup/all/openshift/51e6a8b24382ec97f200009a//tasks:No such process

I'm not sure whether that's related.
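For triage, the manual check above can be rolled into a one-off script. A minimal sketch, using nothing beyond the ps and ls output already shown (the awk field position is an assumption tied to this exact ps invocation):

# find processes stuck in mm_for_maps and show which maps file each holds open
ps -eo pid,wchan:32,cmd | awk '$2 == "mm_for_maps" { print $1 }' |
while read p; do
    echo "stuck lsof: pid $p"
    ls -l "/proc/$p/fd" 2>/dev/null | grep '/maps$'    # the fd pinning the victim PID
done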
Andy, can you get the kernel stack trace of the process that's hanging in mm_for_maps()? Thanks! Larry
I finally got two of these today. They are identical:

[<ffffffff811eae1b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7dbd>] m_start+0x6d/0x160
[<ffffffff811a52c6>] seq_read+0x96/0x400
[<ffffffff81181a95>] vfs_read+0xb5/0x1a0
[<ffffffff81181bd1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I have now found, though, that this has the same underlying cause as the call_rwsem_down_read_failed issue (rhbz844450): it is the result of a process holding a lock in a gear/cgroup which is receiving no CPU cycles. I can unstick things by running:

echo -1 > /cgroup/all/openshift/<UUID>/cpu.cfs_quota_us

where <UUID> is the "stuck" gear which is holding the lock.
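Rolled together, the workaround could look something like the following. This is a minimal sketch, assuming the /cgroup/all/openshift/<UUID> layout above and that the gear UUID is the user owning the victim process (both taken from the listings in this bug):

LSOF_PID=7542    # the lsof stuck in mm_for_maps
for fd in /proc/$LSOF_PID/fd/*; do
    tgt=$(readlink "$fd")
    case "$tgt" in
    /proc/*/maps)
        pid=${tgt#/proc/}; pid=${pid%/maps}    # PID whose maps is being read
        uuid=$(stat -c %U "/proc/$pid")        # gear UUID == owning user
        echo -1 > "/cgroup/all/openshift/$uuid/cpu.cfs_quota_us"
        ;;
    esac
done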
Andy, sorry for my confusion here, but is this all that's in the messages file? It doesn't tell me what mm_for_maps() is doing; it can block and/or spin on a lock. Since you say it's receiving no CPU cycles, I assume it's blocked down in the bowels of __mutex_lock_common(), but I can't prove it. Can I either get access to a system in this state, or can you get me an AltSysrq-T output so I can see exactly what it's doing? Also, you said this is the same problem as BZ844450, where a process hangs on the mmap_sem semaphore. Where does mm_for_maps() get called with the mmap_sem held? Larry
Larry, the confusion is my fault. I got a little excited about the workaround to BZ844450, which also works around this bug, but the bug itself here is different, and I've now conflated them. So, what I provided above was simply the contents of /proc/<PID>/stack for the stuck lsof process. There's nothing in /var/log/messages for these, as far as I could tell. (I know sometimes there's a "task hung for 120 seconds" message or something for hung processes, but I did not notice that for these.) I will grab AltSysrq-T output next time I see the issue, or I'll ping you to get you access if the system is in a state where we can spend time debugging. I have to say that, while I find this issue intriguing, its priority is pretty low for us compared to BZ844450; we have a workaround for that one, but auto-recovering with it requires some pretty "smart" scripts.
Ok, I got another instance of this. I will try again to walk through how I debugged and worked around it.

First, the symptom is this:

[root ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER PID PPID S WCHAN CMD
root 23679 3716 D mm_for_maps /usr/sbin/lsof -wnlP +c 0
root 23683 23679 S pipe_wait /usr/sbin/lsof -wnlP +c 0

[root ~]# ls -l /proc/23679/fd
total 0
lr-x------. 1 root root 64 Oct 15 14:23 0 -> pipe:[385978154]
l-wx------. 1 root root 64 Oct 15 14:23 1 -> pipe:[386422190]
l-wx------. 1 root root 64 Oct 15 14:23 2 -> pipe:[386422190]
lr-x------. 1 root root 64 Oct 15 14:23 3 -> /proc
lr-x------. 1 root root 64 Oct 15 14:23 4 -> /proc/22283/maps
l-wx------. 1 root root 64 Oct 15 14:23 5 -> pipe:[386422205]
lr-x------. 1 root root 64 Oct 15 14:23 6 -> pipe:[386422206]

[root ~]# cat /proc/23679/stack
[<ffffffff811eac2b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7bcd>] m_start+0x6d/0x160
[<ffffffff811a5096>] seq_read+0x96/0x400
[<ffffffff81181845>] vfs_read+0xb5/0x1a0
[<ffffffff81181981>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I should note that any attempt to run another instance of lsof at this point will hang at the same point. Now, let's look at the PID that it's stuck on:

[root ~]# ls -la /proc/22283/
total 0
dr-xr-xr-x. 7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 .
dr-xr-xr-x. 714 root root 0 Sep 9 12:47 ..
dr-xr-xr-x. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:45 attr
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 autogroup
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 auxv
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cgroup
--w-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 clear_refs
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 cmdline
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 coredump_filter
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cpuset
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 environ
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 fd
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 fdinfo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 io
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 limits
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 loginuid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 09:24 maps
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mem
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountinfo
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mounts
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountstats
dr-xr-xr-x. 6 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 net
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 numa_maps
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 pagemap
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 personality
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 root -> /
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sched
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 schedstat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sessionid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 smaps
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 stack
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:35 stat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 statm
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 status
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 syscall
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 task
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 wchan

[root ~]# ls -la /proc/22283/task
total 0
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 ..
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 22283

[root ~]# ls -la /proc/22283/task/22283
total 0
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 ..
dr-xr-xr-x. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 attr
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 auxv
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cgroup
--w-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 clear_refs
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cmdline
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cpuset
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 environ
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fd
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fdinfo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 io
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 limits
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 loginuid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 maps
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mem
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mountinfo
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mounts
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 numa_maps
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 pagemap
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 personality
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 root -> /
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sched
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 schedstat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sessionid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 smaps
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stack
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 statm
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 status
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 syscall
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 wchan

Then look at the process list for the gear that owns this task:

[root ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
PID PPID S WCHAN PENDING COMMAND
21884 1 S pipe_wait 0000000000000000 node
22283 21884 R - 0000000000000000 node

So, two processes, and the problematic one is RUNNABLE. That's the important part. Now let's look at the cgroup's CPU usage:

[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113
[root ~]# sleep 10; cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113

Here we've gone ten seconds without a single nanosecond of CPU consumed. But if we bump the scheduler, like so:

[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us
30000
[root ~]# echo 35000 > /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us

suddenly we get to use the CPU:

[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17113176341247

and the process exits cleanly:

[root ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
PID PPID S WCHAN PENDING COMMAND
21884 1 S ep_poll 0000000000000000 node
32301 21884 R - 0000000000000000 node

and our hung lsof processes are gone:

[root ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER PID PPID S WCHAN CMD

So, that demonstrates that this is indeed related to BZ844450. It's actually not clear to me whether this needs to be addressed separately. I thought there were cases where we hit this without hitting BZ844450, but I'm no longer finding instances of that.
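The zero-usage signature above also suggests a way to detect stuck gears proactively. A minimal sketch, assuming the /cgroup/all/openshift/<UUID> layout shown here, and naively splitting /proc/<tid>/stat on whitespace (fine for a comm like "node", wrong for comms containing spaces):

declare -A before
for g in /cgroup/all/openshift/*/; do
    before[$g]=$(cat "${g}cpuacct.usage")
done
sleep 10
for g in /cgroup/all/openshift/*/; do
    [ "$(cat "${g}cpuacct.usage")" != "${before[$g]}" ] && continue
    # zero CPU over 10s; flag the gear if any of its tasks is runnable
    for t in $(cat "${g}tasks"); do
        if [ "$(awk '{print $3}' "/proc/$t/stat" 2>/dev/null)" = "R" ]; then
            echo "possible stall: ${g%/} (runnable tid $t, no CPU in 10s)"
        fi
    done
done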
Created attachment 812661 [details] sysrq -t data
Andy, wait a minute. Are you overriding the default cpu cgroup values, and if so, what are their values now? Larry

BTW, there is no mm_for_maps in that sysrq -t output at all. Is the system burning a CPU on the process stuck there? If so, echo l > /proc/sysrq-trigger.
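That is, something like the following (assuming sysrq is enabled; both dumps land in the kernel ring buffer and /var/log/messages):

echo 1 > /proc/sys/kernel/sysrq     # enable all sysrq functions
echo t > /proc/sysrq-trigger        # AltSysrq-T: every task's state and stack
echo l > /proc/sysrq-trigger        # backtraces of all active CPUs
dmesg > /tmp/sysrq.out              # collect the output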
Yes, our cgroup configuration for each small gear sets the following:

cpu.cfs_quota_us = 100000
cpu.shares = 128
memory.limit_in_bytes = 536870912
memory.move_charge_at_immigrate = 1
memory.memsw.limit_in_bytes = 641728512

Additionally, there is a "throttler" which changes cpu.cfs_quota_us to 30000 when a cgroup consumes too much CPU over a period of time. I would speculate that this is the cause of the problem, except that we saw this issue well before the throttler was implemented.

Perhaps I'm confused now; I'm not sure why you expect to see mm_for_maps in the sysrq -t output. The process stuck in mm_for_maps is not on the CPU; it was waiting for a lock on /proc/22283/maps, which is held by process 22283, which was runnable but not on the CPU. From the sysrq output:

node 22283 7260750.330346 3 120 7260750.330346 16.559928 0.000000 /openshift/525bf22a500446fc4b0000fe

The problem is that the scheduler was never putting this task on the CPU until I wrote a new value to the cgroup's cpu.cfs_quota_us. And it doesn't seem to matter what the value is, just that something is written to cause that cgroup's scheduler priority to be reassessed.
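If that's right, even writing the current value back should be enough to unstick a gear. A minimal sketch of the "nudge", with the path layout as above (the claim that the value itself is irrelevant is my observation, not something I've confirmed in the scheduler code):

uuid=525bf22a500446fc4b0000fe                  # the stuck gear
f=/cgroup/all/openshift/$uuid/cpu.cfs_quota_us
q=$(cat "$f")                                  # current quota, e.g. 30000
echo "$q" > "$f"                               # the write alone forces reassessment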
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Andy, from Comment #9: What's cfs_period_us set to? Larry
*** This bug has been marked as a duplicate of bug 844450 ***