Created attachment 562245 [details]
earlgrey-1 magic sysrq blocked states 2011-12-09

Description of problem:

We have seen a couple strange incidents on two different machines running the following kernels:

2.6.32-220.4.1.el6.x86_64
2.6.32-220.el6.x86_64

The symptom is that many processes run normally, but processes that attempt to get process info (ps, top, catting process info from /proc) just hang.

On the system running 2.6.32-220.4.1 there were several messages like this in the system log:

Feb 8 11:35:00 moxie-2 kernel: INFO: task khugepaged:181 blocked for more than 120 seconds.
Feb 8 11:35:00 moxie-2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 8 11:35:00 moxie-2 kernel: khugepaged D 0000000000000000 0 181 2 0x00000000
Feb 8 11:35:00 moxie-2 kernel: ffff88062591dc90 0000000000000046 ffff88062591dc58 ffff88062591dc54
Feb 8 11:35:00 moxie-2 kernel: 0000000000015f80 ffff88062fc28400 ffff88033ac95f80 0000000000000400
Feb 8 11:35:00 moxie-2 kernel: ffff88062591bb38 ffff88062591dfd8 000000000000f4e8 ffff88062591bb38
Feb 8 11:35:00 moxie-2 kernel: Call Trace:
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814eef25>] rwsem_down_failed_common+0x95/0x1d0
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814ef083>] rwsem_down_write_failed+0x23/0x30
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff81276d83>] call_rwsem_down_write_failed+0x13/0x20
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814ee582>] ? down_write+0x32/0x40
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff8116f140>] khugepaged+0x790/0x12c0
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff8116e9b0>] ? khugepaged+0x0/0x12c0
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff81090726>] kthread+0x96/0xa0
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff81090690>] ? kthread+0x0/0xa0
Feb 8 11:35:00 moxie-2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

I also captured some info using magic sysrq on the other system (attached). A number of tasks seem to be blocked in rwsem_down_failed_common.

From my investigation so far I wonder if it might be related to either or both of these issues:

https://bugzilla.redhat.com/show_bug.cgi?id=669418
https://lkml.org/lkml/2011/6/14/163

Both these systems are Dell PowerEdge R610s, one (moxie-2) has dual Xeon X5647s with 24GB of RAM and the other (earlgrey-1) has dual Xeon X5690s with 96GB of RAM.

Version-Release number of selected component (if applicable):

2.6.32-220.4.1.el6.x86_64
2.6.32-220.el6.x86_64
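For anyone comparing logs from several affected machines, a small script can summarize which tasks the hung-task detector flagged and where each one is blocked. The following is only a minimal sketch, assuming syslog lines in the same format as the excerpt above; the function name parse_hung_tasks and the SAMPLE list are made up for illustration and are not part of any existing tool.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <N> seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# "[<address>] symbol+0xoff/0xlen"; a leading "? " marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(lines):
    """Group call-trace frames under the hung-task report that precedes them."""
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "timeout": int(hung.group("secs")),
                "frames": [],  # list of (symbol, is_reliable)
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            reliable = frame.group("unreliable") is None
            current["frames"].append((frame.group("sym"), reliable))
    return reports


# A few lines copied from the log excerpt in this report.
SAMPLE = [
    "Feb 8 11:35:00 moxie-2 kernel: INFO: task khugepaged:181 blocked for more than 120 seconds.",
    "Feb 8 11:35:00 moxie-2 kernel: Call Trace:",
    "Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814eef25>] rwsem_down_failed_common+0x95/0x1d0",
    "Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814ef083>] rwsem_down_write_failed+0x23/0x30",
    "Feb 8 11:35:00 moxie-2 kernel: [<ffffffff814ee582>] ? down_write+0x32/0x40",
    "Feb 8 11:35:00 moxie-2 kernel: [<ffffffff8116f140>] khugepaged+0x790/0x12c0",
]

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        # The first reliable frame is where the task is actually sleeping.
        top = next((sym for sym, ok in report["frames"] if ok), None)
        print(report["comm"], report["pid"], "blocked in", top)
```

Feeding it the full /var/log/messages contents (one line per list element) should make it easier to see whether every reported task is stuck in the same rwsem path.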
Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Kernel 2.6.32-279.9.1.el6.x86_64. Seeing this a lot on FhGFS over IB, and on NFS version 3 (TCP) over DIS (PCIe network). Deadlock in rwsem.c?
*** This bug has been marked as a duplicate of bug 669418 ***