Red Hat Bugzilla – Bug 240381
cman fails with: "process_callback invalid recover event id 90"
Last modified: 2009-04-16 16:31:38 EDT
Description of problem:
I have a problem with CMAN on a 3-node cluster. The error message that I'm seeing is:
flsrv02 kernel: CMAN: Missed a heartbeat!
flsrv02 last message repeated 3 times
flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed
too many heartbeats
flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90
So node flsrv01 has missed heartbeats and is being fenced, but
process_callback finds no recovery registration for this event.
It is worth noting that on node 1 there were no CMAN kernel threads running. So
it makes me think that node 1 is leaving the cluster at the same time that node
2 and node 3 notice that heartbeats have been missed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
It's quite likely there will be no cman threads running on node1 if it has been
removed from the cluster - if the remainder of the cluster is quorate (as it
will be in a 3 node cluster) then a KILL message will be sent, just in case.
I'm not sure about the SM message. Did the remaining nodes carry on OK? If not,
what happened?
Also, I spy a hacked kernel source: "Missed a heartbeat!" is not a standard cman
message.
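To illustrate the quorum point above, here is a minimal sketch of why the two remaining nodes of a 3-node cluster stay quorate. It assumes one vote per node and a simple-majority threshold; the function names are made up for illustration and are not part of cman.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the surviving members hold enough votes to keep quorum."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Assumption: 3 nodes, one vote each, flsrv01 removed from the cluster.
    expected = 3
    remaining = 2
    print("threshold:", quorum_threshold(expected))
    print("remaining nodes quorate:", is_quorate(remaining, expected))
```

With these assumptions the threshold is 2, so flsrv02 and the third node keep quorum, while a node left on its own would not.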
Yes, I did notice that all the cman kernel threads on node 1 were gone.
Yes, the remaining nodes did carry on OK.
Yep, the kernel message was something I added to help determine the cause.
group id 03000011 (level 3) is for rgmanager, which can behave somewhat
out of step with what cman/sm expect for the "normal" fence/dlm/gfs groups
(rgmanager came along well after sm was written). This is probably a
harmless message that can be ignored; it likely came up because of the
way in which rgmanager happened to process and acknowledge overlapping,
asynchronous events. Unless the cluster or other groups were stuck/hung,
I'll close this one as not-a-bug.
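For anyone triaging similar messages, the following is a rough sketch of pulling the group id and event id out of the SM log line quoted in the description. Reading the first two hex digits of the group id as the level is an assumption based only on the "03000011 (level 3)" remark above, not on the sm source; the helper name and regular expression are hypothetical.

```python
import re

# Matches lines such as:
#   flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90
SM_LINE = re.compile(
    r"SM: (?P<group_id>[0-9a-fA-F]{8}) process_callback "
    r"invalid recover event id (?P<event_id>\d+)"
)


def parse_sm_line(line):
    """Return the group id, assumed level, and event id, or None if no match."""
    match = SM_LINE.search(line)
    if match is None:
        return None
    group_id = match.group("group_id")
    return {
        "group_id": group_id,
        # Assumption: the leading byte of the group id encodes the level.
        "level": int(group_id[:2], 16),
        "event_id": int(match.group("event_id")),
    }


if __name__ == "__main__":
    sample = ("flsrv02 kernel: SM: 03000011 process_callback "
              "invalid recover event id 90")
    print(parse_sm_line(sample))
```

Under that assumption, a level of 3 would point at rgmanager's group rather than the fence/dlm/gfs groups, which matches the reading given above.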