| Summary: | race condition in mm_for_maps |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | kernel |
| Version: | 6.4 |
| Status: | CLOSED DUPLICATE |
| Reporter: | Andy Grimm <agrimm> |
| Assignee: | Seth Jennings <sjenning> |
| QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| CC: | jeder, jgoulding, lwoodman |
| Severity: | unspecified |
| Priority: | unspecified |
| Target Milestone: | rc |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Last Closed: | 2013-10-30 16:44:54 UTC |
Description

Andy Grimm 2013-10-09 17:11:33 UTC

**Larry:** Andy, can you get the kernel stack trace of the process that's hanging in mm_for_maps()? Thanks!

**Andy:** I finally got two of these today. They are identical:

```
[<ffffffff811eae1b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7dbd>] m_start+0x6d/0x160
[<ffffffff811a52c6>] seq_read+0x96/0x400
[<ffffffff81181a95>] vfs_read+0xb5/0x1a0
[<ffffffff81181bd1>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
```

I have now found, though, that this has the same underlying cause as the call_rwsem_down_read_failed issue (rhbz844450). It is the result of a process holding a lock in a gear/cgroup which is receiving no CPU cycles. I can unstick things by running:

```
echo -1 > /cgroup/all/openshift/<UUID>/cpu.cfs_quota_us
```

where <UUID> is the "stuck" gear which is holding the lock.

**Larry:** Andy, sorry for my confusion here, but is this all that's in the messages file? It doesn't tell me what mm_for_maps() is doing; it can block and/or spin on a lock. Since you say it's receiving no CPU cycles, I assume it's blocked down in the bowels of __mutex_lock_common(), but I can't prove it. Can I either get access to a system in this state, or can you get me AltSysrq-T output so I can see exactly what it's doing? Also, you said this is the same problem as BZ844450, where a process hangs on the mmap_sem semaphore. Where does mm_for_maps() get called with mmap_sem held?

**Andy:** Larry, the confusion is my fault. I got a little excited about the workaround to BZ844450, which also works around this bug, but the bug itself here is different, and I've now conflated them. So, what I provided above was simply the contents of /proc/<PID>/stack for the stuck lsof process. There's nothing in /var/log/messages for these, as far as I could tell. (I know sometimes there's a "task hung for 120 seconds" message or something for hung processes, but I did not notice that for these.)
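The unsticking step described above (writing -1 to the gear's cpu.cfs_quota_us) can be wrapped in a small helper. This is a sketch only, assuming a cgroup v1 hierarchy mounted at /cgroup/all/openshift as on the affected hosts; the `unstick_gear` function name and `CGROOT` variable are invented for illustration.

```shell
#!/bin/sh
# unstick_gear: clear the CFS quota for one gear's cgroup so the
# scheduler re-evaluates the group and the lock-holding task can run.
# Sketch only -- CGROOT and the function name are illustrative.
CGROOT="${CGROOT:-/cgroup/all/openshift}"

unstick_gear() {
    uuid="$1"
    quota="$CGROOT/$uuid/cpu.cfs_quota_us"
    if [ ! -w "$quota" ]; then
        echo "no writable quota file for gear $uuid" >&2
        return 1
    fi
    # In cgroup v1, -1 means "no CFS quota" for cpu.cfs_quota_us.
    echo -1 > "$quota"
}
```

As the thread later shows, any write to cpu.cfs_quota_us seems to unwedge the group; -1 simply removes the quota entirely.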
**Andy:** I will grab AltSysrq-T output next time I see the issue, or I'll ping you to get you access if the system is in a state where we can spend time debugging. I have to say that, while I find this issue intriguing, the priority is pretty low for us compared to BZ844450, which, although we have a workaround, requires some pretty "smart" scripts to auto-recover from.

**Andy:** OK, I got another instance of this. I will try again to walk through how I debugged and worked around it. First, the symptom is this:

```
[root ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER       PID  PPID S WCHAN                            CMD
root     23679  3716 D mm_for_maps                      /usr/sbin/lsof -wnlP +c 0
root     23683 23679 S pipe_wait                        /usr/sbin/lsof -wnlP +c 0

[root ~]# ls -l /proc/23679/fd
total 0
lr-x------. 1 root root 64 Oct 15 14:23 0 -> pipe:[385978154]
l-wx------. 1 root root 64 Oct 15 14:23 1 -> pipe:[386422190]
l-wx------. 1 root root 64 Oct 15 14:23 2 -> pipe:[386422190]
lr-x------. 1 root root 64 Oct 15 14:23 3 -> /proc
lr-x------. 1 root root 64 Oct 15 14:23 4 -> /proc/22283/maps
l-wx------. 1 root root 64 Oct 15 14:23 5 -> pipe:[386422205]
lr-x------. 1 root root 64 Oct 15 14:23 6 -> pipe:[386422206]

[root ~]# cat /proc/23679/stack
[<ffffffff811eac2b>] mm_for_maps+0x2b/0xa0
[<ffffffff811e7bcd>] m_start+0x6d/0x160
[<ffffffff811a5096>] seq_read+0x96/0x400
[<ffffffff81181845>] vfs_read+0xb5/0x1a0
[<ffffffff81181981>] sys_read+0x51/0x90
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
```

I should note that any attempt to run another instance of lsof at this point will hang at the same point. Now, let's look at the PID that it's stuck on:

```
[root ~]# ls -la /proc/22283/
total 0
dr-xr-xr-x.   7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 .
dr-xr-xr-x. 714 root                     root                     0 Sep  9 12:47 ..
dr-xr-xr-x.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:45 attr
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 autogroup
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 auxv
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cgroup
--w-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 clear_refs
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 cmdline
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 coredump_filter
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cpuset
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 environ
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 fd
dr-x------.   2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 fdinfo
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 io
-rw-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 limits
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 loginuid
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 09:24 maps
-rw-------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mem
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountinfo
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mounts
-r--------.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 mountstats
dr-xr-xr-x.   6 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 net
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 numa_maps
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_adj
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 oom_score_adj
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 pagemap
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 personality
lrwxrwxrwx.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 root -> /
-rw-r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sched
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 schedstat
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 sessionid
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 smaps
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 stack
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:35 stat
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 statm
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:32 status
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 syscall
dr-xr-xr-x.   3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 task
-r--r--r--.   1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:01 wchan

[root ~]# ls -la /proc/22283/task
total 0
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 7 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 07:40 ..
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 22283

[root ~]# ls -la /proc/22283/task/22283
total 0
dr-xr-xr-x. 5 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 .
dr-xr-xr-x. 3 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 14:23 ..
dr-xr-xr-x. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 attr
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 auxv
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cgroup
--w-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 clear_refs
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cmdline
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cpuset
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 cwd -> /var/lib/openshift/525bf22a500446fc4b0000fe/app-root/runtime/repo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 environ
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 exe -> /opt/rh/nodejs010/root/usr/bin/node
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fd
dr-x------. 2 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 fdinfo
-r--------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 io
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 limits
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 loginuid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 maps
-rw-------. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mem
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mountinfo
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 mounts
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 numa_maps
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 oom_score_adj
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 pagemap
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 personality
lrwxrwxrwx. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 root -> /
-rw-r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sched
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 schedstat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 sessionid
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 smaps
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stack
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 stat
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 statm
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 status
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 syscall
-r--r--r--. 1 525bf22a500446fc4b0000fe 525bf22a500446fc4b0000fe 0 Oct 15 15:02 wchan
```

and then at the process list for the gear that owns this task:

```
[root ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
  PID  PPID S WCHAN                                     PENDING COMMAND
21884     1 S pipe_wait                        0000000000000000 node
22283 21884 R -                                0000000000000000 node
```

So, two processes, and the problematic one is RUNNABLE. That's the important part. Now let's look at the cgroup's CPU usage:

```
[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113
[root ~]# sleep 10; cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17111873633113
```

Here we've gone ten seconds without a single nanosecond of CPU consumed.
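The ten-second cpuacct.usage check above can be automated. A sketch, assuming the cgroup v1 layout shown in this report (/cgroup/all/openshift/<UUID>/cpuacct.usage); `gear_is_starved`, `CGROOT`, and `INTERVAL` are invented names for illustration.

```shell
#!/bin/sh
# gear_is_starved: report whether a gear's cgroup consumed zero CPU
# time over a sampling interval. cpuacct.usage is a cumulative count
# in nanoseconds, so two equal samples mean nothing in the group ran.
CGROOT="${CGROOT:-/cgroup/all/openshift}"
INTERVAL="${INTERVAL:-10}"

gear_is_starved() {
    usage="$CGROOT/$1/cpuacct.usage"
    before=$(cat "$usage") || return 2
    sleep "$INTERVAL"
    after=$(cat "$usage") || return 2
    [ "$before" = "$after" ]   # equal samples => no CPU time consumed
}
```

A monitoring script could run this over each gear UUID and flag any group that is starved while still containing runnable tasks.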
But if we bump the scheduler, like so:

```
[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us
30000
[root ~]# echo 35000 > /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpu.cfs_quota_us
```

suddenly we get to use the CPU:

```
[root ~]# cat /cgroup/all/openshift/525bf22a500446fc4b0000fe/cpuacct.usage
17113176341247
```

and the process exits cleanly:

```
[root ~]# ps -u 525bf22a500446fc4b0000fe -o pid,ppid,state,wchan:32,pending,comm
  PID  PPID S WCHAN                                     PENDING COMMAND
21884     1 S ep_poll                          0000000000000000 node
32301 21884 R -                                0000000000000000 node
```

and our hung lsof processes are gone:

```
[root ~]# ps -o user,pid,ppid,state,wchan:32,cmd -p 23679,23683
USER       PID  PPID S WCHAN                            CMD
```

So, that demonstrates that this is indeed related to BZ844450. It's actually not clear to me whether it needs to be addressed separately. I thought there were cases where we hit this without hitting BZ844450, but I'm no longer finding instances of that.

Created attachment 812661 [details]: sysrq -t data
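As an aside, readers wedged in mm_for_maps, like the lsof processes in the ps output earlier in this report, can be spotted by scanning wait channels. A sketch; the field handling is approximate (a command name containing spaces would shift the awk fields).

```shell
# List D-state (uninterruptible sleep) tasks whose wait channel is
# mm_for_maps, printing "PID COMMAND" for each. Sketch only.
ps -eo pid,state,wchan:32,comm --no-headers |
    awk '$2 == "D" && $3 == "mm_for_maps" { print $1, $4 }'
```

On a healthy system this prints nothing; on an affected host it would list each hung /proc/<PID>/maps reader.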
**Larry:** Andy, wait a minute. Are you overriding the default cpu cgroup values, and if so, what are their values now?

BTW, there is no mm_for_maps in that sysrq -t output at all. Is the system burning a CPU on the process stuck there? If so, echo L > /proc/sysrq-trigger.

**Andy:** Yes, our cgroup configuration for each small gear sets the following:

```
cpu.cfs_quota_us = 100000
cpu.shares = 128
memory.limit_in_bytes = 536870912
memory.move_charge_at_immigrate = 1
memory.memsw.limit_in_bytes = 641728512
```

Additionally, there is a "throttler" which changes cpu.cfs_quota_us to 30000 when a cgroup consumes too much CPU over a period of time. I would speculate that this is the cause of the problem, except that we saw this issue well before the throttler was implemented.

Perhaps I'm confused now; I'm not sure why you expect to see mm_for_maps in the sysrq -t output. The process stuck in mm_for_maps is not on the CPU; it was waiting for a lock on /proc/22283/maps, which is held by process 22283, which was runnable but not on the CPU. From the sysrq output:

```
node 22283 7260750.330346 3 120 7260750.330346 16.559928 0.000000 /openshift/525bf22a500446fc4b0000fe
```

The problem is that the scheduler was never putting this task on the CPU until I wrote a new value to the cgroup's cpu.cfs_quota_us. And it doesn't seem to matter what the value is, just that something is written to cause that cgroup's scheduler priority to be reassessed.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

**Larry:** Andy, from Comment #9: What's cfs_period_us set to?

*** This bug has been marked as a duplicate of bug 844450 ***
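Editor's footnote: the per-gear limits quoted in the thread above could be applied with a small helper like the following. This is a sketch only, assuming cgroup v1 cpu and memory controllers co-mounted under /cgroup/all/openshift as elsewhere in this report; `apply_gear_limits` is an invented name.

```shell
#!/bin/sh
# apply_gear_limits: write the small-gear cgroup settings quoted above
# into one gear's cgroup directory. Sketch only; assumes the cpu and
# memory controllers are co-mounted at CGROOT.
CGROOT="${CGROOT:-/cgroup/all/openshift}"

apply_gear_limits() {
    dir="$CGROOT/$1"
    echo 100000    > "$dir/cpu.cfs_quota_us"
    echo 128       > "$dir/cpu.shares"
    echo 536870912 > "$dir/memory.limit_in_bytes"          # 512 MiB
    echo 1         > "$dir/memory.move_charge_at_immigrate"
    echo 641728512 > "$dir/memory.memsw.limit_in_bytes"
}
```

Note that the quota here (100000 us) is relative to cpu.cfs_period_us, which is exactly what Larry asks about in his final comment.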