Bug 173633 - cman/sm nodeid lookup fails
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2005-11-18 14:14 EST by Scott Cannata
Modified: 2009-04-16 16:30 EDT
Doc Type: Bug Fix
Last Closed: 2006-05-04 12:52:31 EDT


Attachments
ascii file output from kdb, see above description (130.93 KB, text/plain)
2005-11-18 14:14 EST, Scott Cannata
Description Scott Cannata 2005-11-18 14:14:07 EST
Description of problem:

An idle system running the 2.6.9-22 kernel dropped into kdb with:
Assertion failed on line 52 of file cluster/cman/sm_misc.c

Attached typescript file has kdb: bt, sr -t, ps, dmesg output.

Version-Release number of selected component (if applicable):

kernel = 2.6.9-22
cman-1.0.2-0
cman-devel-1.0.2-0






How reproducible:

Not sure. We've only seen it once to date, and the system
was not doing anything. Testers came in and found it in kdb.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Scott Cannata 2005-11-18 14:14:09 EST
Created attachment 121241
ascii file output from kdb, see above description
Comment 2 David Teigland 2005-11-22 13:43:22 EST
Could you verify that cluster.conf was the same on all nodes?
A copy of it would be helpful to see, along with the output of
'cman_tool nodes' from one of the other nodes if they are
still running.

This assertion failure may indicate an internal consistency
problem within cman: the sm portion is looking up a nodeid that
the cnxman portion doesn't know about, which shouldn't be
possible.  If the assignment of nodeids to nodes was changing
while the cluster was running (possibly because of different
versions of cluster.conf on the nodes), that might lead to this
kind of error.  If unusually large nodeids are being used, that
may point toward the cnxman code that dynamically grows the
standard node arrays.
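
To make the failure mode concrete, here is a minimal userspace sketch of a nodeid-indexed table that grows on demand. It is not the cman/cnxman code: `struct node`, `struct node_table`, `table_grow`, `table_add`, and `table_lookup` are hypothetical names chosen for illustration. It only shows how a lookup can come back empty when one component asks for a nodeid that the table owner never registered, and how growth must preserve existing entries.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cluster node entry; not the real cman struct. */
struct node {
    int nodeid;
    char name[32];
};

/* Table indexed by nodeid; a slot stays NULL until a node is registered. */
struct node_table {
    struct node **slots;
    size_t size;
};

/* Grow the table so that index `nodeid` is valid. Returns 0 on success. */
int table_grow(struct node_table *t, size_t nodeid)
{
    if (nodeid < t->size)
        return 0;

    size_t new_size = nodeid + 1;
    struct node **p = realloc(t->slots, new_size * sizeof(*p));
    if (p == NULL)
        return -1;

    /* Clear only the newly added slots so existing entries survive. */
    memset(p + t->size, 0, (new_size - t->size) * sizeof(*p));
    t->slots = p;
    t->size = new_size;
    return 0;
}

/* Register a node under `nodeid`, growing the table if needed. */
int table_add(struct node_table *t, int nodeid, const char *name)
{
    if (nodeid < 0 || table_grow(t, (size_t)nodeid) != 0)
        return -1;

    struct node *n = calloc(1, sizeof(*n));
    if (n == NULL)
        return -1;

    n->nodeid = nodeid;
    snprintf(n->name, sizeof(n->name), "%s", name);
    free(t->slots[nodeid]);          /* replace any previous entry */
    t->slots[nodeid] = n;
    return 0;
}

/* Return the node for `nodeid`, or NULL if it was never registered. */
struct node *table_lookup(const struct node_table *t, int nodeid)
{
    if (nodeid < 0 || (size_t)nodeid >= t->size)
        return NULL;
    return t->slots[nodeid];
}
```

In this sketch a caller that asserts on a non-NULL `table_lookup` result would trip as soon as it is handed a nodeid that was never passed to `table_add`, for example because two nodes disagree about which nodeid belongs to which host.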

This assertion failure was reported to me once before, in an
email from Dan Phung in May, and he had been updating
cluster.conf on some nodes.

I'm adding a printk to the code to provide a little more
information in the assertion message if this happens again.
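
As an illustration of the idea, the sketch below logs context before an assertion would fire. This report does not say what the added printk actually prints, so the fields (`nodeid`, `max_known_nodeid`) and the helper names `format_lookup_failure` and `report_lookup_failure` are assumptions, and `fprintf(stderr, ...)` stands in for the kernel's printk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the context string that a bare
 * "Assertion failed on line N" message would omit.
 * Returns the number of characters snprintf would have written.
 */
int format_lookup_failure(char *buf, size_t len,
                          int nodeid, int max_known_nodeid)
{
    return snprintf(buf, len,
                    "nodeid lookup failed: nodeid=%d max_known_nodeid=%d",
                    nodeid, max_known_nodeid);
}

/*
 * Print the context to stderr (userspace stand-in for printk).
 * The caller decides whether to assert or panic afterwards.
 */
void report_lookup_failure(int nodeid, int max_known_nodeid)
{
    char buf[128];

    format_lookup_failure(buf, sizeof(buf), nodeid, max_known_nodeid);
    fprintf(stderr, "%s\n", buf);
}
```

Calling `report_lookup_failure(7, 3)` immediately before the assertion would leave the offending nodeid in the log, which is the kind of detail that distinguishes a stale cluster.conf from an array-growth problem.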
Comment 3 David Teigland 2006-01-04 11:19:02 EST
Waiting for someone to see this again and report with the
additional info from the panic.
