Created attachment 332822 [details]
patch

Description of problem:
The expected behaviour of "clusvcadm -e <service> -F" is to follow the failover domain rules: try to start the service on the node with the lowest priority value first, then, if unsuccessful, try the node with the next-higher priority value, and so on. Instead, if the first node is unable to start the service, clusvcadm falsely reports that the service was started on some other node in the failover domain, but the service is not actually started. This is caused by the first node putting the service into the "recovering" state, which the subsequent nodes refuse to act on.

Version-Release number of selected component (if applicable):
The bug was initially found in rgmanager-2.0.38 and is still present in 2.0.46.

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster of at least 3 nodes (node01, node02, node03).
2. Define a restricted failover domain test_domain (node01 - priority 1, node02 - priority 2).
3. Define a service test_service in this failover domain.
4. Make the service unable to start on node01 (unmount its filesystem, uninstall the service, etc.).
5. Run "clusvcadm -e service:test_service -F" on node01.

Actual results:
clusvcadm reports that the service was started on node02, but it was not.
Expected results:
The service is started on node02.

Additional info:

On node01:
<error> Starting Service test_service > Failed
[30964] notice: start on service "test_service" returned 1 (generic error)
[30964] warning: #68: Failed to start service:test_service; return value: 1
[30964] notice: Stopping service service:test_service
<debug> Verifying Configuration Of service:test_service
<info> Stopping Service service:test_service
<error> Monitoring Service service:test_service > Service Is Not Running
<info> Stopping Service service:test_service > Succeed
[30964] notice: Service service:test_service is recovering
[30964] debug: Sent remote-start request to 2
[31038] debug: 1 events processed

On node02:
[29205] debug: Not starting service:test_service: recovery state
[29204] debug: 1 events processed

The attached patch makes clurgmgrd behave as expected.

diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c	2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c	2009-02-19 02:24:00.000000000 +0100
@@ -2061,7 +2061,11 @@
 		ret = RG_EFAIL;
 		goto out;
 	} else {
-		ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		if (request == RG_ENABLE) {
+			ret = svc_start_remote(svcName, RG_START_RECOVER, target);
+		} else {
+			ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		}
 	}
 
 	switch(ret) {
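For reference, steps 2 and 3 of the reproducer correspond to a cluster.conf fragment along these lines (a sketch only; the node and service names come from this report, but the exact attribute names should be checked against the cluster.conf schema):

```xml
<rm>
  <failoverdomains>
    <!-- restricted + ordered: only listed nodes, preferred by priority;
         lower priority value = more preferred -->
    <failoverdomain name="test_domain" ordered="1" restricted="1">
      <failoverdomainnode name="node01" priority="1"/>
      <failoverdomainnode name="node02" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="test_service" domain="test_domain"/>
</rm>
```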
The proposed patch still doesn't work under certain circumstances. Here is a revised one:

diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c	2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c	2009-02-23 15:04:00.000000000 +0100
@@ -2054,14 +2054,14 @@
 	target = best_target_node(allowed_nodes, 0, svcName, 1);
 	if (target == me) {
-		ret = handle_start_remote_req(svcName, request);
+		ret = handle_start_remote_req(svcName, (request==RG_ENABLE?RG_START_RECOVER:request));
 		if (ret == RG_EAGAIN)
 			goto out;
 
 	} else if (target < 0) {
 		ret = RG_EFAIL;
 		goto out;
 	} else {
-		ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		ret = svc_start_remote(svcName, (request==RG_ENABLE?RG_START_RECOVER:RG_START_REMOTE), target);
 	}
 
 	switch(ret) {
Created attachment 332934 [details] patch.
*** Bug 486711 has been marked as a duplicate of this bug. ***
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=c0de9cfb54b5e3a8e0de4b95ae80d2ce5dae4aae Pushed to RHEL5 / master / STABLE2 / STABLE3
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html