Description of problem:
On systems with GFS-mounted filesystems, cman, and fencing, we had a node fence.
Immediately before the fence, we see the following assert message in
/var/log/messages on the node:
sleeping function called from invalid context at mm/rmap.c:85
Version-Release number of selected component (if applicable):
Seen one time so far.
Steps to Reproduce:
Can we get the logs from each of the nodes in the cluster? Since the node was
fenced, I assume that there is no other information available for that machine,
like process listings, memory usage, etc.
Created attachment 133108 [details]
messages file at time of assert
The messages.1 file shows the console messages at the time of the assert. I am
getting the other messages files as well.
Looking at these logs, I've seen this rmap assert occur on some nodes with
different thread stacks. Can you give me any insight into this from your
perspective? Thanks.
See bug #172944 and /var/log/messages files (gz/tar) that I appended for each
node in cluster.
Node2 fenced node1. Right before the fence event, we had this rmap assert
on node1. The date was Jul 13, 19:35.
Note: I just now noticed that we see this rmap.c assert at other times in the
logs on this cluster, with different thread stacks leading to the assert. Those
stacks have our crosswalk module in them, so I am investigating that here. Thanks.
I just noticed that the rmap assert that occurred did not have a complete
stack in the messages file because the fence occurred immediately afterward.
I only saw the two frames when I filed the bug.
Also, this assert is a debug message only and does not panic the system.
We have seen this message before in the logs without the node being fenced,
so the assert message needs to be debugged but perhaps has nothing to do with
the cause of the node fencing.
The complete stack comes from a system call (an ioctl) into our crosswalk
kernel module. I will look at it as well. Any insights you Red Hat folks have
are welcome.
Full stack appended:
Jul 3 05:30:20 igrid03 kernel: in_atomic():0[expected: 0], irqs_disabled():1
Jul 3 05:30:20 igrid03 kernel:
Jul 3 05:30:20 igrid03 kernel: Call Trace:<ffffffff8012f95c
I didn't see any information in the logs to indicate there was a
problem with Cluster Suite. The messages in comment #5 may indicate
that irqs were disabled when the sleep happened, which might be why
the heartbeat messages were not passed from the fenced node to the other
nodes, thus causing it to be fenced. That's just a theory.
Obviously, one of the other nodes did not see the heartbeat messages,
which could also be caused by hardware problems. I don't think we can
determine the cause with the information provided.
Perhaps you can recreate the problem with fencing disabled (i.e.
temporarily use manual fencing) and, once the system stops responding,
use the sysrq key or echo "t" > /proc/sysrq-trigger on the fenced
node, then attach the resulting output here.
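If it helps, the task dump suggested above could be captured roughly like this
(a sketch, run as root on the affected node; the output filename is just an
example, not a required path):

```shell
# Make sure the sysrq interface is enabled (1 enables all sysrq functions):
echo 1 > /proc/sys/kernel/sysrq
# Ask the kernel to dump the state of every task to the kernel ring buffer:
echo t > /proc/sysrq-trigger
# Save the ring buffer so the task dump can be attached to this bug:
dmesg > /tmp/sysrq-task-dump.txt
```

Since the node hangs before being fenced, running these from a console or
serial session on that node (rather than over the cluster network) is more
likely to succeed.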
I have found no information that leads me to believe this problem
is related to cluster suite or GFS. If we get more information,
feel free to reopen the bug.