Description of problem:
On systems with GFS-mounted filesystems, cman, and fencing, we had a node fence. Immediately before the fence, in /var/log/messages on the node, we see the following assert message:

sleeping function called from invalid context at mm/rmap.c:85
in_atomic():0[expected: 0], irqs_disabled():1
 <ffffffff8012f95c> {__might_sleep+173}
 <ffffffff80169cbd> {anon_vma_prepare+37}

Version-Release number of selected component (if applicable):
2.6.9-34

How reproducible:
Seen one time so far.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Can we get the logs from each of the nodes in the cluster? Since the node was fenced, I assume that there is no other information available for that machine, like process listings, memory usage, etc.
Created attachment 133108 [details]
messages file at time of assert

The messages.1 file shows the console messages at the time of the assert. Getting the other messages files as well.
Looking at these logs, I've seen this rmap assert occur on some nodes with different thread stacks. Can you give me any insight into this from your perspective? Thanks.
See bug #172944 and the /var/log/messages files (gz/tar) that I attached for each node in the cluster. Node2 fenced node1. Right before the fence event, we had this rmap assert on node1; the date was Jul 13, 19:35. Note: I just noticed that this rmap.c assert also shows up at other times in the logs on this cluster, with different thread stacks leading to the assert. Those stacks have our crosswalk module in them, so I am investigating those here. Thanks.
I just noticed that the rmap assert that occurred did not have the complete stack in the messages file because the fence occurred immediately afterward; I only saw the two frames when I filed the bug. Also, this assert is a debug message only and does not panic the system. We have seen this message in the logs before without the node being fenced, so the assert needs to be debugged but may have nothing to do with the cause of the node fencing. The complete stack comes from a system call, an ioctl, into our crosswalk kernel module. I will look at it as well. Any insights you Red Hat folks have are welcome. Thanks.

Full stack appended:

Jul  3 05:30:20 igrid03 kernel: in_atomic():0[expected: 0], irqs_disabled():1
Jul  3 05:30:20 igrid03 kernel:
Jul  3 05:30:20 igrid03 kernel: Call Trace:
 <ffffffff8012f95c> {__might_sleep+173}
 <ffffffff80169cbd> {anon_vma_prepare+37}
 <ffffffff80164498> {do_wp_page+321}
 <ffffffff80165483> {handle_mm_fault+1107}
 <ffffffff801de551> {__up_read+16}
 <ffffffff80120dbe> {do_page_fault+518}
 <ffffffff80130ab4> {default_wake_function+0}
 <ffffffff8015974f> {__pagevec_free+39}
 <ffffffff802e88ae> {fn_hash_lookup+224}
 <ffffffff8010fc35> {error_exit+0}
 <ffffffff801e0062> {copy_user_generic_c+8}
 <ffffffffa01c32f3> {:cwalk_igrid:cwalk_ioctl+182}
 <ffffffff80185ed5> {sys_ioctl+853}
 <ffffffff8010f19a> {system_call+126}
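For what it's worth, the stack above has the usual shape for this warning: the ioctl handler touches a user buffer (copy_user_generic_c) while interrupts are disabled, the access write-faults (do_page_fault -> handle_mm_fault -> do_wp_page), and anon_vma_prepare()'s might_sleep() check prints the message because irqs_disabled() is 1. Below is a minimal sketch of the kind of code that can produce it, assuming the module copies data to a user buffer while holding an irq-disabling lock; the function, lock, and struct names are hypothetical and not taken from the crosswalk module.

/* Hypothetical sketch only -- not the actual crosswalk code.  Shows how an
 * ioctl handler can trigger "sleeping function called from invalid context"
 * by writing to user space with interrupts disabled (2.6.9-era ioctl API). */
#include <linux/fs.h>
#include <linux/spinlock.h>
#include <asm/uaccess.h>

static spinlock_t example_lock = SPIN_LOCK_UNLOCKED;

struct example_status {
        int value;
};

static int example_ioctl(struct inode *inode, struct file *filp,
                         unsigned int cmd, unsigned long arg)
{
        struct example_status st;
        unsigned long flags;

        /* Interrupts are now disabled on this CPU.  On a non-preemptible
         * kernel this does not raise preempt_count, so in_atomic() stays 0,
         * matching the "in_atomic():0 ... irqs_disabled():1" in the log. */
        spin_lock_irqsave(&example_lock, flags);

        st.value = 42;          /* gather status under the lock */

        /* Problem pattern: copy_to_user() writes into the user buffer and may
         * page-fault (e.g. on a copy-on-write page).  The fault path
         * (do_page_fault -> handle_mm_fault -> do_wp_page -> anon_vma_prepare)
         * contains a might_sleep() check that fires because irqs are disabled.
         * The copy should be done after dropping the irq-disabling lock. */
        if (copy_to_user((void __user *)arg, &st, sizeof(st))) {
                spin_unlock_irqrestore(&example_lock, flags);
                return -EFAULT;
        }

        spin_unlock_irqrestore(&example_lock, flags);
        return 0;
}

If the crosswalk ioctl does something along these lines, moving the user-space copy outside the irqs-off region (or copying into a kernel buffer first) should make the warning go away; whether that has anything to do with the missed heartbeats is a separate question.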
I didn't see any information in the logs to indicate there was a problem with Cluster Suite. The messages in comment #5 may indicate that irqs were disabled when the sleep happened, which might be why the heartbeat messages were not passed from the fenced node to the other nodes, thus causing it to be fenced. That's just a theory. Obviously, one of the other nodes did not see the heartbeat messages, which could also be caused by hardware problems. I don't think we can determine the cause with the information provided. Perhaps you can recreate the problem with fencing disabled (i.e., temporarily use manual fencing), and once the system stops responding, use the sysrq key or echo "t" > /proc/sysrq-trigger on that node, then add the output here as an attachment.
I have found no information that leads me to believe this problem is related to Cluster Suite or GFS. If we get more information, feel free to reopen the bug.