Bug 240381 - cman fails with: "process_callback invalid recover event id 90"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-05-16 21:33 UTC by Brad Walker
Modified: 2009-04-16 20:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-30 18:05:28 UTC
Embargoed:


Description Brad Walker 2007-05-16 21:33:10 UTC
Description of problem:

I have a problem with CMAN on a 3 node cluster. The error message that I'm
getting is:

flsrv02 kernel: CMAN: Missed a heartbeat!
flsrv02 last message repeated 3 times
flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed too many heartbeats
flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90

So node flsrv01 has missed too many heartbeats and is being fenced. However,
process_callback then reports that no recovery is registered for this event.

It is worth noting that no CMAN kernel threads were running on node 1. That
makes me think node 1 was leaving the cluster at the same time that nodes 2
and 3 noticed the missed heartbeats.

Version-Release number of selected component (if applicable):

1.0.4

How reproducible:

Not sure.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Christine Caulfield 2007-05-17 16:10:22 UTC
It's quite likely there will be no cman threads running on node1 if it has been
removed from the cluster - if the remainder of the cluster is quorate (as it
will be in a 3 node cluster) then a KILL message will be sent, just in case.
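
The quorum arithmetic behind that remark can be sketched as follows. This is a minimal illustration, not cman's code: it assumes one vote per node and a simple-majority rule, and the names quorum_threshold and is_quorate are made up for the example.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving votes reach the majority threshold."""
    return live_votes >= quorum_threshold(expected_votes)


# Assumption: three nodes with one vote each; flsrv01 has dropped out.
EXPECTED_VOTES = 3
remaining_votes = EXPECTED_VOTES - 1

print(quorum_threshold(EXPECTED_VOTES))             # threshold for 3 votes
print(is_quorate(remaining_votes, EXPECTED_VOTES))  # two survivors
print(is_quorate(1, EXPECTED_VOTES))                # a lone node
```

Under these assumptions the two surviving nodes still hold a majority, while a single isolated node would not.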

I'm not sure about the SM message. Did the remaining nodes carry on OK? If not,
what happened?

Also, I spy a hacked kernel source: "Missed a heartbeat!" is not a standard cman
message :)

Comment 2 Brad Walker 2007-05-17 20:25:49 UTC
Yes, I did notice that all the cman kernel threads on node 1 were gone.

Yes, the remaining nodes did carry on OK.

Yep, the kernel message was something I added to help determine the cause.

Comment 4 David Teigland 2007-05-23 18:15:56 UTC
group id 03000011 (level 3) is for rgmanager, which can behave somewhat
out of step with what cman/sm expect for the "normal" fence/dlm/gfs groups
(rgmanager came along well after sm was written).  This is probably a
harmless message that can be ignored; it likely came up because of the
way in which rgmanager happened to process and acknowledge overlapping,
asynchronous events.  Unless the cluster or other groups were stuck/hung,
I'll close this one as not-a-bug.
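
For anyone triaging the same message, the reading of the group id above can be sketched as a small log filter. This is a hedged illustration only: it assumes the level is the top byte of the 8-digit hex group id (inferred from "03000011 (level 3)"), and the level-to-name table is an assumption based on the fence/dlm/gfs ordering mentioned above, not something taken from the sm source. The helper name decode_sm_line is hypothetical.

```python
import re

# Assumed mapping; only "level 3 is rgmanager" comes from the comment above.
LEVEL_NAMES = {0: "fence", 1: "dlm", 2: "gfs", 3: "user (e.g. rgmanager)"}

# Matches the SM line quoted in the bug description.
SM_LINE = re.compile(
    r"SM: (?P<gid>[0-9a-fA-F]{8}) process_callback "
    r"invalid recover event id (?P<event>\d+)"
)


def decode_sm_line(line):
    """Return group id, level and event id from an SM log line, or None."""
    match = SM_LINE.search(line)
    if match is None:
        return None
    group_id = int(match.group("gid"), 16)
    level = group_id >> 24  # assumption: the level lives in the top byte
    return {
        "group_id": match.group("gid"),
        "level": level,
        "level_name": LEVEL_NAMES.get(level, "unknown"),
        "event_id": int(match.group("event")),
    }


line = ("flsrv02 kernel: SM: 03000011 process_callback "
        "invalid recover event id 90")
print(decode_sm_line(line))
```

If the reported level is 3 and the other groups are not stuck, the message matches the harmless case described above.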


