Bug 63176

Summary: clustat and cluadmin segfault when cluster stopped.
Product: Red Hat Enterprise Linux 2.1
Component: clumanager
Version: 2.1
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: medium
Reporter: John Flanagan <flanagan>
Assignee: Lon Hohberger <lhh>
Doc Type: Bug Fix
Last Closed: 2002-05-01 18:27:57 UTC

Description John Flanagan 2002-04-10 20:44:40 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
When the cluster is stopped, clustat and cluadmin both segfault, and errors
appear in /var/log/messages.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. service cluster stop
2. clustat or cluadmin

Actual Results:  A 3-second delay, and then:

Segmentation fault

Expected Results:  I would have expected a message indicating that the cluster
was not operational or running or somesuch.

Additional info:

Here are the errors that appeared in /var/log/messages:

Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate
partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> readNetBlock: bad ret -1 from
diskRawReadShadow
Apr 10 16:30:58 clue clustat[11103]: <err> getNetBlockData: IO error reading
quorum partition.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_svc_init: Unable to read session_id.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_open: unable to initialize msg
subsystem.
Apr 10 16:30:58 clue clustat[11103]: <crit> _clu_write_lock: bad return from
lockWrite, ret = -1
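
The first log entry above shows the failure starting with the open of /dev/raw/raw1. As a rough illustration of the expected result (a message instead of a crash), a status tool could check up front that the shared device opens read/write. This is only a sketch: sketch_check_shared_device and its message text are hypothetical and are not taken from clumanager.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical pre-flight check (not clumanager code).
 * Tries to open the shared device read/write.
 * Returns 0 if the open succeeds; otherwise prints a diagnostic
 * to `out` and returns 1 so the caller can exit cleanly.
 */
int sketch_check_shared_device(const char *path, FILE *out)
{
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        fprintf(out,
                "Unable to open %s read/write (%s); "
                "is the cluster running and shared storage attached?\n",
                path, strerror(errno));
        return 1;
    }

    close(fd);
    return 0;
}
```

A caller would run this check before touching any lock or quorum state and use a nonzero return as its exit status.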

Comment 1 Mike McLean 2002-04-11 13:27:19 UTC
What version of clumanager is this?  I'm not seeing this behavior with
clumanager-1.0.9-1


Comment 2 John Flanagan 2002-04-11 14:15:50 UTC
rpm -q clumanager
clumanager-1.0.9-1

And I have crucial updated information on this incident.  After speaking with
the user of this cluster, it turns out that they had unloaded the qlogic
driver module from the kernel [rmmod qla2x00], which essentially removed
access to the shared storage [QUORUM DEVICE!!!].

So this is certainly a very rare occurrence.  However, we should probably
emit an error message rather than segfaulting.  That may not be
reasonable given the unusual circumstances.

So, to reproduce this:

service cluster stop
rmmod qla2x00 [or whatever shared storage driver module you have]
clustat

I'm bumping the severity down to low given the fringe case of this bug.

John


Comment 3 Tim Burke 2002-04-23 12:24:17 UTC
This is a Winchell-ism in the clulib error path (or lack thereof).


Comment 4 Lon Hohberger 2002-05-01 18:27:52 UTC
Implemented error paths back up through the lock code.  The tools now produce
errors, but don't segfault.  This is probably more expected behavior.

Comment 5 Lon Hohberger 2002-05-07 20:41:22 UTC
This same thing happens on machines which are not cluster members at all (and
hence have no shared storage).  The cause was that clu_lock() could never
return an error condition; on failure it would effectively raise(SIGSEGV)
instead.

Fixed in current pool.
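
For illustration, here is a minimal sketch of a lock entry point that hands an error back to its caller instead of crashing. The report does not say exactly how the crash occurred, so this assumes the lock code was using state that was never initialized. The names sketch_lock_state and sketch_clu_lock are hypothetical, and the real clu_lock() signature is not shown in this report.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical lock-subsystem state (an assumption, not clulib's layout). */
struct sketch_lock_state {
    int fd;    /* descriptor for the shared partition, < 0 if unavailable */
    int held;  /* nonzero while the lock is held */
};

/*
 * Hypothetical lock call with an error path.
 * Returns 0 on success, or -1 with errno set:
 *   ENODEV  no usable shared storage (state missing or fd invalid)
 *   EBUSY   lock already held
 */
int sketch_clu_lock(struct sketch_lock_state *st)
{
    if (st == NULL || st->fd < 0) {
        errno = ENODEV;
        return -1;
    }
    if (st->held) {
        errno = EBUSY;
        return -1;
    }
    st->held = 1;
    return 0;
}
```

With a return value to test, a tool such as clustat can print a message and exit when the call fails, which matches the behavior described in comment 4.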