From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901

Description of problem:
When the cluster is stopped, clustat and cluadmin both segfault and errors appear in /var/log/messages.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. service cluster stop
2. clustat or cluadmin

Actual Results:
3 second delay and then: Segmentation fault

Expected Results:
I would have expected a message indicating that the cluster was not operational or running or somesuch.

Additional info:
Here are the errors that appeared in /var/log/messages:

Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> Unable to open /dev/raw/raw1 read/write.
Apr 10 16:30:58 clue clustat[11103]: <crit> initSharedFD: unable to validate partition /dev/raw/raw1. Configuration error?
Apr 10 16:30:58 clue clustat[11103]: <err> readNetBlock: bad ret -1 from diskRawReadShadow
Apr 10 16:30:58 clue clustat[11103]: <err> getNetBlockData: IO error reading quorum partition.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_svc_init: Unable to read session_id.
Apr 10 16:30:58 clue clustat[11103]: <err> msg_open: unable to initialize msg subsystem.
Apr 10 16:30:58 clue clustat[11103]: <crit> _clu_write_lock: bad return from lockWrite, ret = -1
What version of clumanager is this? I'm not seeing this behavior with clumanager-1.0.9-1
rpm -q clumanager
clumanager-1.0.9-1

And I have crucial updated information on this incident. After speaking with the user of this cluster, it turns out that they had unloaded the qlogic driver module from the kernel [rmmod qla2x00], which essentially removed access to the shared storage [QUORUM DEVICE!!!]. So this is certainly a very rare occurrence. However, we should probably emit an additional error message rather than segfaulting. That may not be reasonable given the unusual circumstances.

So, to reproduce this:

service cluster stop
rmmod qla2x00 [or whatever shared storage driver module you have]
clustat

I'm bumping the severity down to low given the fringe case of this bug.

John
This is a Winchell-ism in the clulib error path (or lack thereof).
Implemented error paths that propagate lock failures back up to the callers. The tools now report errors instead of segfaulting, which is probably closer to the expected behavior.
The same thing happens on machines that are not cluster members at all (and hence have no shared storage). The cause was that clu_lock() could never return an error condition; on failure it would effectively raise(SIGSEGV) instead of reporting the error to its caller. Fixed in current pool.
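
For anyone reading this later, here is a rough sketch of the general pattern the fix describes: a lock routine that returns an error code when the shared device cannot be opened, and a caller that prints a message and exits cleanly. This is not the clulib code. The names demo_lock_acquire, demo_lock_release and demo_status are hypothetical, and opening the device path merely stands in for whatever clu_lock() actually does.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a cluster lock: opening the shared partition
 * represents "taking the lock". Returns a descriptor on success, or -1
 * with errno set on failure. It never crashes on a missing device. */
int demo_lock_acquire(const char *quorum_path)
{
    if (quorum_path == NULL) {
        errno = EINVAL;
        return -1;
    }
    return open(quorum_path, O_RDWR); /* -1 and errno on failure */
}

/* Hypothetical release: closes the descriptor if it is valid. */
void demo_lock_release(int fd)
{
    if (fd >= 0)
        close(fd);
}

/* Hypothetical status command: returns 0 on success, 1 if the lock
 * could not be taken. The failure is reported on stderr. */
int demo_status(const char *quorum_path)
{
    int fd = demo_lock_acquire(quorum_path);
    if (fd < 0) {
        int saved = errno; /* keep errno before calling stdio */
        fprintf(stderr,
                "Unable to lock %s: %s. Is the cluster running?\n",
                quorum_path ? quorum_path : "(null)", strerror(saved));
        return 1;
    }
    /* A real tool would read the cluster state here. */
    demo_lock_release(fd);
    return 0;
}
```

With the storage driver unloaded, or on a machine with no shared storage, a call such as demo_status("/dev/raw/raw1") would fail the open, print the message, and return 1, which a main() could use as the exit status.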