Bug 240381 - cman fails with: "process_callback invalid recover event id 90"
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2007-05-16 17:33 EDT by Brad Walker
Modified: 2009-04-16 16:31 EDT (History)
Doc Type: Bug Fix
Last Closed: 2007-05-30 14:05:28 EDT

Description Brad Walker 2007-05-16 17:33:10 EDT
Description of problem:

I have a problem with CMAN on a 3 node cluster. The error message that I'm
getting is:

flsrv02 kernel: CMAN: Missed a heartbeat!
flsrv02 last message repeated 3 times
flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed too many heartbeats
flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90

So node flsrv01 has missed heartbeats and is being fenced, but
process_callback then finds no recovery registration for this event.

It is worth noting that no CMAN kernel threads were running on node 1. That
makes me think node 1 was leaving the cluster at the same time that nodes 2
and 3 noticed the missed heartbeats.
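
To check the ordering suggested above on each node, the two kernel messages can be pulled out of the logs in sequence. The following is only a rough sketch: the regular expressions are copied from the messages quoted in this report, while the function name `scan` and the assumption that each message sits on a single log line are mine.

```python
import re

# Patterns taken from the kernel messages quoted in this report.
# Assumption: each message appears on one (unwrapped) log line.
REMOVED = re.compile(r"CMAN: node (\S+) has been removed from the cluster")
SM_INVALID = re.compile(
    r"SM: ([0-9a-fA-F]{8}) process_callback invalid recover event id (\d+)"
)

def scan(lines):
    """Return (kind, detail) tuples in the order the events appear."""
    events = []
    for line in lines:
        m = REMOVED.search(line)
        if m:
            events.append(("removed", m.group(1)))
            continue
        m = SM_INVALID.search(line)
        if m:
            events.append(("sm_invalid_recover", (m.group(1), int(m.group(2)))))
    return events

sample = [
    "flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed too many heartbeats",
    "flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90",
]
for event in scan(sample):
    print(event)
```

Running the same scan over the logs of nodes 2 and 3 would show whether the SM message always follows the removal of flsrv01.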

Version-Release number of selected component (if applicable):


How reproducible:

Not sure.

Steps to Reproduce:
Actual results:

Expected results:

Additional info:
Comment 1 Christine Caulfield 2007-05-17 12:10:22 EDT
It's quite likely there will be no cman threads running on node1 if it has been
removed from the cluster - if the remainder of the cluster is quorate (as it
will be in a 3 node cluster) then a KILL message will be sent, just in case.

I'm not sure about the SM message. Did the remaining nodes carry on OK? If not,
what happened?

Also, I spy a hacked kernel source: "Missed a heartbeat!" is not a standard cman
message :)
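
The quorum point can be checked with a little arithmetic. This is a minimal sketch that assumes one vote per node and a simple-majority rule; the helper names `quorum` and `is_quorate` are illustrative, not cman functions.

```python
def quorum(expected_votes: int) -> int:
    """Votes needed for a simple majority (assumed rule)."""
    return expected_votes // 2 + 1

def is_quorate(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving nodes hold at least a majority of votes."""
    return live_votes >= quorum(expected_votes)

# Assumption: 3 nodes with one vote each; node 1 is removed, 2 remain.
print(quorum(3))         # votes needed
print(is_quorate(2, 3))  # the two remaining nodes
print(is_quorate(1, 3))  # a single isolated node
```

Under these assumptions the two remaining nodes still hold a majority, which is why the rest of the cluster stays quorate after one node is removed.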
Comment 2 Brad Walker 2007-05-17 16:25:49 EDT
Yes, I did notice that all the cman kernel threads on node 1 were gone.

Yes, the remaining nodes did carry on OK.

Yep, the kernel message was something I added to help determine the cause.
Comment 4 David Teigland 2007-05-23 14:15:56 EDT
group id 03000011 (level 3) is for rgmanager, which can behave somewhat
out of step with what cman/sm expect for the "normal" fence/dlm/gfs groups
(rgmanager came along well after sm was written).  This is probably a
harmless message that can be ignored; it likely came up because of the
way in which rgmanager happened to process and acknowledge overlapping,
asynchronous events.  Unless the cluster or other groups were stuck/hung,
I'll close this one as not-a-bug.
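
For anyone reading similar SM messages, the level can be read off the group id. The sketch below assumes the level is stored in the top byte of the 32-bit id, which is inferred only from "03000011 (level 3)" above; the helper name `split_group_id` is hypothetical, and the meaning of the remaining bits is not stated in this report.

```python
def split_group_id(hex_id: str):
    """Split an 8-digit hex group id into (level, remainder).

    Assumption: the top byte holds the level; the layout of the
    lower 24 bits is not described in this report.
    """
    value = int(hex_id, 16)
    level = value >> 24
    remainder = value & 0x00FFFFFF
    return level, remainder

level, remainder = split_group_id("03000011")
print(level, remainder)
```

With that reading, an id starting with 03 belongs to a level 3 group such as rgmanager.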
