Description of problem:
Sometimes during cluster_cleanup, shutting down cmirror fails, stopping our regression tests in their tracks.

The failing call to `service cmirror stop` looks like this when run through sh -x:

+ echo -n 'Stopping clustered mirror log server:'
Stopping clustered mirror log server:+ killall clogd
+ ps -C clogd
+ failure shutdown
+ local rc=0

The init script looks like so:

stop()
{
        echo -n "Stopping clustered mirror log server:"
        killall clogd >& /dev/null
        if ps -C clogd >& /dev/null; then
                failure "shutdown"
                echo
                return 1
        else
                success "shutdown"
        fi
...

If clogd doesn't exit immediately after the signal handler returns, ps could still find clogd running, even though a few moments later it will exit. This should probably be replaced by a call to killproc which waits for the daemon to exit.

Version-Release number of selected component (if applicable):
cmirror-1.1.39-2.el5

How reproducible:
10% of the time, more frequently on ppc and ia64 or smp systems

Steps to Reproduce:
1. /etc/init.d/cmirror stop

Actual results:
See above

Expected results:
cmirror initscript should wait for clogd to exit before continuing.
Also check the lines regarding the kernel module removal. I can see this failing during our tests quite frequently because the module is not loaded. The script should check whether the module is loaded and only then try to remove it. This also makes it possible to call the script twice with the stop parameter without failing.
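A minimal sketch of such a check, for illustration only. The function name unload_if_loaded is hypothetical, and the module name is passed as an argument because this report does not name it:

```shell
# Hypothetical helper: unload kernel module $1 only if it is currently loaded.
# Returns 0 when the module is already absent, so a second "stop" does not fail.
unload_if_loaded() {
    # /proc/modules always shows underscores, even if the module file name
    # uses hyphens, so normalise the name before matching.
    local mod
    mod=$(echo "$1" | tr '-' '_')

    # Each line of /proc/modules starts with the module name and a space.
    if grep -q "^${mod} " /proc/modules 2>/dev/null; then
        modprobe -r "$mod"
    else
        return 0
    fi
}
```

With a helper like this, stop() would fail only if a loaded module could not be removed, not when there was nothing to remove.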
Ok, I'll try to fix these. However, reading the killproc man page, I'm not sure that it is appropriate. clogd will refuse to honor SIGTERM if there are still active cluster mirrors. Failing the SIGTERM, killproc will issue a SIGKILL - which we do not want... So, we want to wait for the SIGTERM to take effect, but not to issue the subsequent SIGKILL... rather, it should fail in that event.
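A rough sketch of that behaviour: send SIGTERM, poll until the process is gone, and fail rather than escalate. The function name term_and_wait and the 10 second default timeout are assumptions made for illustration and are not taken from the actual fix:

```shell
# Hypothetical helper: send SIGTERM to every process named $1, then wait up
# to $2 seconds (default 10) for them to exit. Never sends SIGKILL.
term_and_wait() {
    local name=$1 timeout=${2:-10} waited=0

    # Nothing running under that name: treat as an already-successful stop.
    ps -C "$name" >/dev/null 2>&1 || return 0

    killall -TERM "$name" >/dev/null 2>&1

    # Poll once per second instead of checking only once right after the signal.
    while ps -C "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            # Still running, e.g. the daemon refused SIGTERM. Report failure
            # and leave the process alone.
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In stop(), the result of something like `term_and_wait clogd` would then decide between success "shutdown" and failure "shutdown", which gives a slow but exiting clogd time to finish while still reporting an error when active cluster mirrors keep it running.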
commit 6a5f12eea4e117b3ed50016e737fc4847ccbd634
Author: Jonathan Brassow <jbrassow>
Date:   Thu Dec 3 15:48:04 2009 -0600

    cmirror: Fix-up init script behaviour (bug 520915)

    init script was throwing errors on 'stop' when it shouldn't have been.
I have not run into this during recent RHEL 5.5 regression runs.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0307.html