303981 – clurgmgr segfaults upon startup after cluster is stopped

Bug 303981 - clurgmgr segfaults upon startup after cluster is stopped

Summary: clurgmgr segfaults upon startup after cluster is stopped

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.1
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-09-24 20:39 UTC by Chris Harms
Modified:	2009-04-16 22:55 UTC
CC List:	1 user
Fixed In Version:	RHBA-2008-0353
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 14:30:36 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
core dump of rgmanager (76.91 KB, application/octet-stream) 2007-09-24 21:37 UTC, Chris Harms	no flags
Patch (880 bytes, patch) 2007-09-28 19:33 UTC, Lon Hohberger	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0353	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2008-05-20 12:46:24 UTC

Description Chris Harms 2007-09-24 20:39:51 UTC

Description of problem:


Version-Release number of selected component (if applicable):
rgmanager-2.0.28-1.el5

How reproducible:
consistent

Steps to Reproduce:
1. Stop cluster via Luci
2. Start cluster via Luci
  
Actual results:
clurgmgr segfaults on one of my two nodes (the same node each time).

Expected results:
normal startup

Additional info:
Running 5.1 Beta 1 of all software.  The node that crashes appears differently
when the cluster is viewed in Luci.  It is grayed out, and the only operations
listed for it are fencing or forced deletion of the node, whereas the other
node has all available options listed in the drop-down.

Comment 1 Lon Hohberger 2007-09-24 21:11:20 UTC

Which node (node ID 1 or 2) ?

Comment 2 Lon Hohberger 2007-09-24 21:14:24 UTC

Actually - the easiest thing to do is create /etc/sysconfig/cluster with the
following contents:

DAEMON_COREFILE_LIMIT="unlimited"
RGMGR_OPTS="-w"

This will cause clurgmgrd to produce a core file in the root directory -- could
you attach the core and your cluster configuration?
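
For context, a core-file limit of "unlimited" corresponds to the RLIMIT_CORE
resource limit of the process. The following is a minimal sketch in C of what
raising that limit looks like from inside a program, assuming the hard limit
already permits it. The helper name raise_core_limit() is hypothetical and is
not part of rgmanager or its init script, which apply the setting on their own.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit and report the
 * resulting soft limit through *new_soft.  Returns 0 on success, -1 on error.
 * raise_core_limit() is a hypothetical helper, not part of rgmanager.
 */
int raise_core_limit(rlim_t *new_soft)
{
    struct rlimit rl;

    assert(new_soft != NULL);

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* The soft limit may go up to the hard limit without extra privileges. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }

    *new_soft = rl.rlim_cur;
    return 0;
}
```

If the hard limit is RLIM_INFINITY, the soft limit ends up unlimited as well;
otherwise it is capped at whatever the hard limit allows.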

Comment 3 Lon Hohberger 2007-09-24 21:15:58 UTC

Fixing product

Comment 4 Chris Harms 2007-09-24 21:37:38 UTC

Created attachment 204601
core dump of rgmanager

core dump of clurgmgr on cluster startup

Comment 5 Chris Harms 2007-09-24 21:39:13 UTC

(In reply to comment #1)
> Which node (node ID 1 or 2) ?

Node 2

Comment 6 Lon Hohberger 2007-09-28 18:32:48 UTC

Wow... thanks for the core. :)

Comment 7 Lon Hohberger 2007-09-28 19:27:35 UTC

Ok, so...

We received a VF_VIEW_FORMED message for a transaction we did not have
recorded.  The transaction was allegedly from node 1, transaction ID 1, and came
immediately after node 2 had received the PORTOPENED status from node 1.

Normally, nodes request the current state of distributed data when they
access it.  This means that it's safe to just throw away messages for
pieces of data we don't have.

This bug is restricted to RHEL5 because RHEL4 doesn't use CMAN's excellent
multicast capabilities.  This means that in the same situation on RHEL4, the
socket with the unwanted data would not have been opened at this point.

This is rather easy to fix.
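
The idea of dropping messages for transactions we have no record of can be
sketched as follows. This is an illustration only, not the attached patch: the
structures and the functions find_txn() and handle_view_formed() are
hypothetical, simplified stand-ins, and the sketch assumes the crash came from
using a failed lookup result without checking it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not rgmanager's real structures. */
struct pending_txn {
    uint32_t node_id;        /* node that started the transaction */
    uint32_t txn_id;         /* transaction ID chosen by that node */
    int view_formed;         /* set once the view-formed message arrives */
    struct pending_txn *next;
};

struct view_formed_msg {
    uint32_t node_id;
    uint32_t txn_id;
};

/* Find a recorded transaction, or return NULL if we never saw it start. */
static struct pending_txn *
find_txn(struct pending_txn *head, uint32_t node_id, uint32_t txn_id)
{
    for (; head != NULL; head = head->next) {
        if (head->node_id == node_id && head->txn_id == txn_id)
            return head;
    }
    return NULL;
}

/*
 * Handle a view-formed message.
 * Returns 1 if it matched a recorded transaction, 0 if it was dropped.
 */
int handle_view_formed(struct pending_txn *head,
                       const struct view_formed_msg *msg)
{
    struct pending_txn *txn;

    assert(msg != NULL);

    txn = find_txn(head, msg->node_id, msg->txn_id);
    if (txn == NULL)
        return 0;   /* unknown transaction: ignore it instead of crashing */

    txn->view_formed = 1;
    return 1;
}
```

In this sketch, a message claiming to be from node 1 with transaction ID 1
that arrives before any such transaction was recorded is simply discarded;
the node would pick up the current state later, when it requests it.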

Comment 8 Lon Hohberger 2007-09-28 19:33:23 UTC

Created attachment 210861
Patch

Comment 9 Lon Hohberger 2007-09-28 19:45:59 UTC

All the other parts of vf_process_msg() seem to correctly ignore messages for
which there is no key node associated.

Comment 10 Lon Hohberger 2007-11-14 16:58:14 UTC

Patch in CVS:

http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/clulib/vft.c.diff?cvsroot=cluster&only_with_tag=RHEL5&r1=1.17.2.3&r2=1.17.2.4

Comment 12 RHEL Program Management 2007-11-14 17:04:26 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 errata-xmlrpc 2008-05-21 14:30:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0353.html
