Description of problem:
I have a problem with CMAN on a 3 node cluster. The error messages that I'm getting are:

flsrv02 kernel: CMAN: Missed a heartbeat!
flsrv02 last message repeated 3 times
flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed too many heartbeats
flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90

So node flsrv01 has missed heartbeats and is being fenced. However, process_callback reports that there is no recovery registration for this event. It is worth noting that on node 1 there were no CMAN kernel threads running. This makes me think that node 1 is leaving the cluster at the same time that node 2 and node 3 notice that heartbeats have been missed.

Version-Release number of selected component (if applicable):
1.0.4

How reproducible:
Not sure.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
It's quite likely there will be no cman threads running on node1 if it has been removed from the cluster. If the remainder of the cluster is quorate (as it will be in a 3 node cluster), then a KILL message will be sent to the removed node, just in case. I'm not sure about the SM message. Did the remaining nodes carry on OK? If not, what happened? Also, I spy a hacked kernel source: "Missed a heartbeat!" is not a standard cman message :)
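To spell out why the two remaining nodes stay quorate, here is a minimal sketch of the arithmetic. It assumes one vote per node and the usual simple-majority rule (more than half of the expected votes); the function names are hypothetical and this is not cman's actual code.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority of expected_votes.

    Assumption: simple-majority quorum, i.e. floor(expected / 2) + 1.
    """
    return expected_votes // 2 + 1


def is_quorate(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving nodes hold enough votes to keep quorum."""
    return live_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    expected = 3   # 3 node cluster, assuming one vote per node
    remaining = 2  # flsrv01 has been removed
    print("quorum threshold:", quorum_threshold(expected))
    print("remaining nodes quorate:", is_quorate(remaining, expected))
```

Under these assumptions the threshold for 3 votes is 2, so losing one node leaves the rest of the cluster quorate, while losing two would not.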
Yes, I did notice that all the cman kernel threads on node 1 were gone. Yes, the remaining nodes did carry on OK. And yes, the "Missed a heartbeat!" kernel message was something I added to help determine the cause.
group id 03000011 (level 3) is for rgmanager, which can behave somewhat out of step with what cman/sm expect for the "normal" fence/dlm/gfs groups (rgmanager came along well after sm was written). This is probably a harmless message that can be ignored; it likely came up because of the way in which rgmanager happened to process and acknowledge overlapping, asynchronous events. Unless the cluster or other groups were stuck/hung, I'll close this one as not-a-bug.
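For anyone triaging similar log lines, here is a rough sketch of pulling the group id and level out of an SM message. It assumes the id is printed as 8 hex digits with the level in the top byte, which is inferred from "03000011 (level 3)" above rather than checked against the sm source. The regex, function name, and field names are hypothetical.

```python
import re

# Matches lines such as:
#   flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90
SM_LINE = re.compile(r"SM: (?P<gid>[0-9a-fA-F]{8}) (?P<func>\w+) (?P<msg>.*)")


def parse_sm_line(line: str):
    """Return the group id, level, function and message of an SM log line.

    Assumption: the top byte of the 32-bit group id is the group level.
    Returns None if the line is not an SM message in the expected shape.
    """
    match = SM_LINE.search(line)
    if match is None:
        return None
    gid = int(match.group("gid"), 16)
    return {
        "group_id": match.group("gid"),
        "level": gid >> 24,            # assumed: level lives in the top byte
        "function": match.group("func"),
        "message": match.group("msg"),
    }


if __name__ == "__main__":
    line = ("flsrv02 kernel: SM: 03000011 process_callback "
            "invalid recover event id 90")
    print(parse_sm_line(line))
```

A level of 3 from this parse would point at rgmanager's group, as described above, rather than at the fence, dlm or gfs groups.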