63176 – clustat and cluadmin segfault when cluster stopped.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	clumanager
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-04-10 20:44 UTC by John Flanagan
Modified:	2008-05-01 15:38 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2002-05-01 18:27:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2002:226	0	normal	SHIPPED_LIVE	Fixes for clumanager addressing starvation and service hangs	2002-10-08 04:00:00 UTC

Description John Flanagan 2002-04-10 20:44:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
When the cluster is stopped, clustat and cluadmin both segfault, and errors
appear in /var/log/messages.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. service cluster stop
2. clustat or cluadmin

Actual Results:  A 3-second delay, and then:

Segmentation fault

Expected Results:  I would have expected a message indicating that the cluster
was not running, or something similar.

Additional info:

Here are the errors that appeared in /var/log/messages:

Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> readNetBlock: bad ret -1 from
diskRawReadShadow
Apr 10 16:30:58 clue clustat[11103]: <err> getNetBlockData: IO error reading
quorum partition.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_svc_init: Unable to read session_id.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_open: unable to initialize msg
subsystem.
Apr 10 16:30:58 clue clustat[11103]: <crit> _clu_write_lock: bad return from
lockWrite, ret = -1

Comment 1 Mike McLean 2002-04-11 13:27:19 UTC

What version of clumanager is this?  I'm not seeing this behavior with
clumanager-1.0.9-1.

Comment 2 John Flanagan 2002-04-11 14:15:50 UTC

rpm -q clumanager
clumanager-1.0.9-1

I also have crucial new information about this incident.  After speaking with
the user of this cluster, it turns out that they had unloaded the qlogic driver
module from the kernel [rmmod qla2x00], which essentially removed access to the
shared storage [QUORUM DEVICE!!!].

So this is certainly a very rare occurrence.  However, we should probably
emit an additional error message rather than segfaulting.  That may not be
reasonable given the unusual circumstances.

So, to reproduce this:

service cluster stop
rmmod qla2x00 [or whatever shared storage driver module you have]
clustat

I'm bumping the severity down to low given the fringe case of this bug.

John

Comment 3 Tim Burke 2002-04-23 12:24:17 UTC

This is a Winchell-ism in the clulib error path (or lack thereof).

Comment 4 Lon Hohberger 2002-05-01 18:27:52 UTC

Implemented error paths from the lock code back up to the callers.  They now
produce errors instead of segfaulting, which is probably the more expected
behavior.

Comment 5 Lon Hohberger 2002-05-07 20:41:22 UTC

This same thing happens on machines which are not cluster members at all (and
hence have no shared storage).  This was because clu_lock() could never return
an error condition; on failure it would effectively raise(SIGSEGV) instead.

Fixed in current pool.
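
For illustration only, here is a minimal sketch of the pattern described in this
comment: a lock routine that returns an error when the shared device is
unavailable, and a caller that turns that error into a message.  This is not
the clumanager source.  The names shared_state, shared_open, sketch_lock and
sketch_status are hypothetical, and the "lock" is just a flag standing in for
whatever clu_lock() really does.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the shared-storage handle; not a clumanager type. */
struct shared_state {
    int fd;      /* descriptor for the quorum device, or -1 if not open */
    int locked;  /* non-zero while the illustrative lock is held */
};

/* Open the device read/write. Returns 0 on success, -1 with errno set. */
int shared_open(struct shared_state *st, const char *device)
{
    st->locked = 0;
    st->fd = open(device, O_RDWR);
    return st->fd < 0 ? -1 : 0;
}

/*
 * Take the illustrative lock. If the device was never opened, report the
 * failure to the caller instead of touching invalid state.
 */
int sketch_lock(struct shared_state *st)
{
    if (st == NULL) {
        errno = EINVAL;
        return -1;
    }
    if (st->fd < 0) {
        errno = EBADF;
        return -1;
    }
    st->locked = 1;
    return 0;
}

/*
 * Caller in the style of a status tool. Like the failing tools, it carries on
 * after a failed open, but the lock error now reaches it. Returns an exit
 * status: 0 on success, 1 on any error.
 */
int sketch_status(const char *device)
{
    struct shared_state st;

    if (shared_open(&st, device) != 0)
        fprintf(stderr, "Unable to open %s read/write: %s\n",
                device, strerror(errno));

    if (sketch_lock(&st) != 0) {
        fprintf(stderr, "Unable to lock shared state: %s. "
                "Is the cluster running?\n", strerror(errno));
        return 1;
    }

    printf("lock acquired on %s\n", device);
    st.locked = 0;
    close(st.fd);
    return 0;
}
```

A main() that simply returns sketch_status("/dev/raw/raw1") would then exit
with status 1 and a readable message when the device cannot be opened, rather
than crashing.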
