Description of problem:
rgmanager shutdown hangs and never completes.

Version-Release number of selected component (if applicable):
rgmanager-2.99.12-2.fc10.i386

How reproducible:
Very.

Steps to Reproduce:
1. define a cluster
2. start one machine in the non-quorate cluster
3. run "service rgmanager stop" or "reboot". Or "halt".

Actual results:
It sits there until interrupted.

Expected results:
System to stop clurgmgrd, either politely, or via kill -9 after a few hours.

I see two things that should change:
1. clurgmgrd should terminate cleanly when requested.
2. the initscript should time out and kill it if it fails to exit properly.

Additional info:
Looking at the /etc/init.d/rgmanager script, I see the inner while loop, which will just wait forever. That is what it does when (if?) clurgmgrd doesn't exit on -TERM. Also, clurgmgrd ignores -TERM signals.

    #
    # Bring down the cluster on a node.
    #
    stop_cluster()
    {
        kill -TERM `pidof $RGMGRD`

        while [ 0 ]; do

            if [ -n "`pidof $RGMGRD`" ]; then
                echo -n $"Waiting for services to stop: "
                while [ -n "`pidof $RGMGRD`" ]; do
                    sleep 1
                done
                echo_success
                echo
            else
                echo $"Services are stopped."
            fi

            # Ensure all NFS rmtab daemons are dead.
            killall $RMTABD &> /dev/null

            rm -f /var/run/$RGMGRD.pid

            return 0
        done
    }

If I look deeper, into cluster-2.99.12/rgmanager/src/daemons/main.c, I see that on a TERM signal it exits event_loop(...) (lines 776-778), but that falls back to a while() loop in main() (lines 1095-1106); nothing sets any flags, such as running to 0 or shutdown_pending to 1.

There is only one place shutdown_pending is set: in main.c, flag_shutdown() (lines 790-794), which is bound to the -TERM and -INT handlers.

shutdown_pending is only checked once the main loop is reached. If the signal is received before the cluster is quorate, nothing checks the shutdown_pending flag.
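For point 2 above (the initscript timing out), here is a minimal sketch of a bounded wait. It is not the shipped initscript and not the eventual fix: the helper name stop_with_timeout, its arguments, and the 30-second default are assumptions made purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of /etc/init.d/rgmanager.
# Usage: stop_with_timeout PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited on its own (or was already gone),
# 1 if it had to be killed with SIGKILL.
stop_with_timeout() {
    local pid=$1
    local timeout=${2:-30}   # assumed default, tune to taste
    local waited=0

    # Ask politely first; if the signal cannot be sent, the process is gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Still alive after the grace period: force it.
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In stop_cluster() this could stand in for the unbounded inner loop, for example by calling it once for each PID that pidof $RGMGRD reports. Whether a forced kill is safe for a daemon that may still be managing services is a separate question this sketch does not answer.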
clu_initialize() should be modified like this:

 void
 clu_initialize(cman_handle_t *ch)
 {
     if (!ch)
         exit(1);

     *ch = cman_init(NULL);
     if (!(*ch)) {
         log_printf(LOG_NOTICE, "Waiting for CMAN to start\n");

-        while (!(*ch = cman_init(NULL))) {
+        while (!(*ch = cman_init(NULL)) && shutdown_pending == 0 ) {
             sleep(1);
         }
     }

+    if (shutdown_pending > 0 ) {
+        return;
+    }

     if (!cman_is_quorate(*ch)) {
         /*
            There are two ways to do this; this happens to be the simpler
            of the two. The other method is to join with a NULL group
            and log in -- this will cause the plugin to not select any
            node group (if any exist).
          */
         log_printf(LOG_NOTICE, "Waiting for quorum to form\n");

-        while (cman_is_quorate(*ch) == 0 ) {
+        while (cman_is_quorate(*ch) == 0 && shutdown_pending == 0 ) {
             sleep(1);
         }
+        if (shutdown_pending > 0 ) {
+            return;
+        }
         log_printf(LOG_NOTICE, "Quorum formed\n");
     }
 }
Oh, and this in main():

     clu_initialize(&clu);
+    if (shutdown_pending > 0 ) {
+        log_printf(LOG_NOTICE, "shutdown during clu_initialize\n");
+        return(-1);
+    }
     if (cman_init_subsys(clu) < 0) {
         perror("cman_init_subsys");
         return -1;
     }
This was fixed in STABLE3 several weeks ago and will appear in rawhide after the next spin: http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=e5cc0c0302e744694ef981290fb33b97ae635371