497450 – rgmanager shutdown hangs if it hasn't formed a quorum

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rgmanager
Sub Component:
Version:	10
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-04-23 23:00 UTC by P Jones
Modified:	2009-07-22 20:38 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-07-22 20:38:28 UTC
Type:	---
Embargoed:
Dependent Products:

Description P Jones 2009-04-23 23:00:20 UTC

Description of problem:
rgmanager shutdown hangs and never completes if the cluster has not formed a quorum.

Version-Release number of selected component (if applicable):
rgmanager-2.99.12-2.fc10.i386

How reproducible:
Very.

Steps to Reproduce:
1. Define a cluster.
2. Start one machine in the cluster, so that the cluster is not quorate.
3. Run "service rgmanager stop", "reboot", or "halt".

Actual results:
The command sits there until interrupted.

Expected results:
The system should stop clurgmgrd, either politely or, failing that, via kill -9 after a few hours.
I see two things that should change:
 1. clurgmgrd should terminate cleanly when requested.
 2. The initscript should time out and kill clurgmgrd if it fails to exit properly.


Additional info:
Looking at the /etc/init.d/rgmanager script, I see that the inner while loop waits forever if clurgmgrd doesn't exit on -TERM. And in this state, clurgmgrd effectively ignores -TERM signals.

#
# Bring down the cluster on a node.
#
stop_cluster()
{
        kill -TERM `pidof $RGMGRD`

        while [ 0 ]; do

                if [ -n "`pidof $RGMGRD`" ]; then
                        echo -n $"Waiting for services to stop: " 
                        while [ -n "`pidof $RGMGRD`" ]; do
                                sleep 1
                        done
                        echo_success
                        echo
                else
                        echo $"Services are stopped."
                fi

                # Ensure all NFS rmtab daemons are dead.
                killall $RMTABD &> /dev/null
                
                rm -f /var/run/$RGMGRD.pid

                return 0
        done
}
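
For the second change requested above (an initscript that times out), a minimal sketch of a bounded wait might look like the following. The function name stop_with_timeout and the 30-second default are hypothetical, chosen only for illustration; this is not taken from the rgmanager initscript and is not the fix that went upstream.

```shell
#!/bin/sh
# Hypothetical helper (not part of the rgmanager initscript):
# send SIGTERM to a process, poll once per second, and escalate to
# SIGKILL if the process is still alive after the timeout.
# Usage: stop_with_timeout PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited on its own, 1 if it had to be killed.
stop_with_timeout()
{
        pid=$1
        timeout=${2:-30}        # illustrative default, an assumption

        # Nothing to do if the process is already gone.
        kill -0 "$pid" 2>/dev/null || return 0

        # Ask politely first.
        kill -TERM "$pid" 2>/dev/null

        waited=0
        while kill -0 "$pid" 2>/dev/null; do
                if [ "$waited" -ge "$timeout" ]; then
                        # The polite request was not honored; force it.
                        kill -KILL "$pid" 2>/dev/null
                        return 1
                fi
                sleep 1
                waited=$((waited + 1))
        done
        return 0
}
```

In stop_cluster() this could presumably replace the unbounded inner loop, called with the pid reported by pidof $RGMGRD. A real initscript would also need to pick a timeout long enough for services to stop cleanly on a quorate node.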

Looking deeper, in cluster-2.99.12/rgmanager/src/daemons/main.c, I see that on a TERM signal it exits event_loop(...) (lines 776-778), but that falls back to a while() loop in main() (lines 1095-1106); nothing there sets any flags, such as setting running to 0 or shutdown_pending to 1.

There is only one place where shutdown_pending is set: flag_shutdown() in main.c (lines 790-794). This is bound to the -TERM and -INT handlers.

shutdown_pending is only checked once the main loop is reached.
If the signal is received before the cluster is quorate, nothing checks the shutdown_pending flag. clu_initialize() should be modified like this:

void
clu_initialize(cman_handle_t *ch)
{
	if (!ch)
		exit(1);

	*ch = cman_init(NULL);
	if (!(*ch)) {
		log_printf(LOG_NOTICE, "Waiting for CMAN to start\n");

-		while (!(*ch = cman_init(NULL))) {
+		while (!(*ch = cman_init(NULL)) && shutdown_pending == 0 ) {
			sleep(1);
		}
	}

+        if (shutdown_pending > 0 ) { 
+            return;
+        }
        if (!cman_is_quorate(*ch)) {
		/*
		   There are two ways to do this; this happens to be the simpler
		   of the two.  The other method is to join with a NULL group 
		   and log in -- this will cause the plugin to not select any
		   node group (if any exist).
		 */
		log_printf(LOG_NOTICE, "Waiting for quorum to form\n");

-		while (cman_is_quorate(*ch) == 0 ) {
+		while (cman_is_quorate(*ch) == 0 && shutdown_pending == 0 ) {
			sleep(1);
		}
+                if (shutdown_pending > 0 ) { 
+                    return;
+                }
		log_printf(LOG_NOTICE, "Quorum formed\n");
	}

}

Comment 1 P Jones 2009-04-24 04:03:21 UTC

Oh, and this in main():

	clu_initialize(&clu);
+        if (shutdown_pending > 0 ) { 
+	    log_printf(LOG_NOTICE, "shutdown during clu_initialize\n");
+            return(-1);
+        }

	if (cman_init_subsys(clu) < 0) {
		perror("cman_init_subsys");
		return -1;
	}

Comment 2 Lon Hohberger 2009-04-28 18:05:01 UTC

This was fixed in STABLE3 several weeks ago and will appear in rawhide after the next spin:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=e5cc0c0302e744694ef981290fb33b97ae635371