240381 – cman fails with: "process_callback invalid recover event id 90"

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	cman
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-16 21:33 UTC by Brad Walker
Modified:	2009-04-16 20:31 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-05-30 18:05:28 UTC
Embargoed:

Description Brad Walker 2007-05-16 21:33:10 UTC

Description of problem:

I have a problem with CMAN on a 3-node cluster. The error message I'm
getting is:

flsrv02 kernel: CMAN: Missed a heartbeat!
flsrv02 last message repeated 3 times
flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : Missed too many heartbeats
flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90

So node flsrv01 has missed heartbeats and is being fenced. But process_callback
then finds that there is no recovery registration for this event.

It is worth noting that on node 1 there were no CMAN kernel threads running.
That makes me think that node 1 was leaving the cluster at the same time that
nodes 2 and 3 noticed that heartbeats had been missed.
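
For anyone triaging similar logs, here is a minimal sketch of how the two
relevant messages could be pulled out of syslog. The regular expressions are
based only on the lines quoted in this report, the function name scan is
made up for illustration, and the sample assumes each message sits on a single
unwrapped line.

```python
import re

# Patterns derived from the messages quoted in this report (an assumption:
# other cman versions may word these lines differently).
REMOVED_RE = re.compile(
    r"CMAN: node (?P<node>\S+) has been removed from the cluster : (?P<reason>.+)$"
)
SM_RE = re.compile(
    r"SM: (?P<group>[0-9a-fA-F]{8}) process_callback "
    r"invalid recover event id (?P<event>\d+)"
)


def scan(lines):
    """Return (removed_nodes, sm_warnings) found in an iterable of log lines."""
    removed, warnings = [], []
    for line in lines:
        m = REMOVED_RE.search(line)
        if m:
            removed.append((m.group("node"), m.group("reason").strip()))
            continue
        m = SM_RE.search(line)
        if m:
            warnings.append((m.group("group"), int(m.group("event"))))
    return removed, warnings


sample = [
    "flsrv02 kernel: CMAN: node flsrv01 has been removed from the cluster : "
    "Missed too many heartbeats",
    "flsrv02 kernel: SM: 03000011 process_callback invalid recover event id 90",
]
removed, warnings = scan(sample)
print(removed)
print(warnings)
```

Pairing each SM warning with the nearest preceding node removal shows whether
the warning only ever appears during recovery from a lost node.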

Version-Release number of selected component (if applicable):

1.0.4

How reproducible:

Not sure.

Comment 1 Christine Caulfield 2007-05-17 16:10:22 UTC

It's quite likely there will be no cman threads running on node1 if it has been
removed from the cluster - if the remainder of the cluster is quorate (as it
will be in a 3 node cluster) then a KILL message will be sent, just in case.
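
To spell out the quorum arithmetic behind that remark, here is a small sketch.
It assumes one vote per node and a simple-majority threshold; a real cluster
may assign different votes per node, and the function names are made up for
illustration.

```python
def quorum_threshold(expected_votes):
    """Votes needed for a simple majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True if the votes still present reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Three nodes with one vote each (assumed): losing one node leaves two votes.
print(quorum_threshold(3))
print(is_quorate(2, 3))
print(is_quorate(1, 3))
```

With three single-vote nodes the threshold is 2, so the two surviving nodes
stay quorate while the removed node on its own does not.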

I'm not sure about the SM message. Did the remaining nodes carry on OK? If not,
what happened?

Also, I spy a hacked kernel source: "Missed a heartbeat!" is not a standard cman
message :)

Comment 2 Brad Walker 2007-05-17 20:25:49 UTC

Yes, I did notice that all the cman kernel threads on node 1 were gone.

Yes, the remaining nodes did carry on OK.

Yep, the kernel message was something I added to help determine the cause.

Comment 4 David Teigland 2007-05-23 18:15:56 UTC

group id 03000011 (level 3) is for rgmanager, which can behave somewhat
out of step with what cman/sm expect for the "normal" fence/dlm/gfs groups
(rgmanager came along well after sm was written).  This is probably a
harmless message that can be ignored; it likely came up because of the
way in which rgmanager happened to process and acknowledge overlapping,
asynchronous events.  Unless the cluster or other groups were stuck/hung,
I'll close this one as not-a-bug.
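
As a rough illustration of reading the group id in the log line, the sketch
below splits the 8-digit hex value into a level and a remainder. The layout
(level in the top byte) is an assumption inferred from the "03000011 (level 3)"
reading above, not something checked against the sm source, and
split_group_id is a made-up helper name.

```python
def split_group_id(group_id_hex):
    """Split an 8-hex-digit SM group id into (level, remainder).

    Assumes the top byte holds the level, as the "03000011 (level 3)"
    reading in this bug suggests; treat this layout as a guess.
    """
    value = int(group_id_hex, 16)
    level = (value >> 24) & 0xFF
    remainder = value & 0x00FFFFFF
    return level, remainder


level, remainder = split_group_id("03000011")
print(level, hex(remainder))
```

Under that assumption, a message whose id starts with 03 would point at a
level 3 group such as rgmanager rather than at the fence/dlm/gfs groups.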
