Bug 486717 - clusvcadm -e <service> -F handling bugs
Summary: clusvcadm -e <service> -F handling bugs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 486711
Depends On:
Blocks:
 
Reported: 2009-02-21 12:15 UTC by Yevheniy Demchenko
Modified: 2009-09-02 11:04 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 11:04:53 UTC
Target Upstream Version:
Embargoed:


Attachments
patch. (610 bytes, patch) - 2009-02-21 12:15 UTC, Yevheniy Demchenko
patch. (882 bytes, patch) - 2009-02-23 14:29 UTC, Yevheniy Demchenko


Links
Red Hat Product Errata RHSA-2009:1339 (SHIPPED_LIVE): Low: rgmanager security, bug fix, and enhancement update - last updated 2009-09-01 10:42:29 UTC

Description Yevheniy Demchenko 2009-02-21 12:15:33 UTC
Created attachment 332822 [details]
patch.

Description of problem:
The expected behaviour of clusvcadm -e <service> -F is to follow the failover domain rules, i.e. try to start the service on the node with the lowest priority value first and, if that fails, try the node with the next higher priority value, and so on. Instead, if the first node is unable to start the service, clusvcadm falsely reports that the service was started on some other node in the failover domain, but the service is not actually started.
This is caused by the first node putting the service into the "recovering" state, which the subsequent nodes do not handle.
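
For illustration only, here is a minimal, self-contained C sketch of that interaction, based on the behaviour visible in the log excerpts under "Additional info" below. It is not rgmanager source; the helper is hypothetical and the enum values are placeholders for the real RG_* constants in rgmanager's headers.

/* Hypothetical sketch, not rgmanager source: enum values are
 * placeholders for constants defined in rgmanager's headers. */
enum { RG_STATE_RECOVER = 1 };
enum { RG_START_REMOTE = 1, RG_START_RECOVER = 2 };

/* A node receiving a relayed start request refuses it when the service
 * was left in the "recovering" state by the previous node, unless the
 * request explicitly asks for recovery; this refusal is what produces
 * the "Not starting service:test_service: recovery state" message. */
static int accept_relayed_start(int svc_state, int request)
{
	if (svc_state == RG_STATE_RECOVER && request != RG_START_RECOVER)
		return 0;	/* refuse: service is recovering */
	return 1;		/* ok to start */
}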

Version-Release number of selected component (if applicable):
The bug was initially found in rgmanager-2.0.38 and is still present in 2.0.46.


How reproducible:

Always
Steps to Reproduce:
1. Install a cluster of at least 3 nodes (node01, node02, node03)
2. Define a restricted failover domain test_domain (node01 - priority 1, node02 - priority 2); see the example cluster.conf fragment after these steps
3. Define a service test_service for this failover domain
4. Make the service unable to start on node01 (unmount its fs, uninstall the service, etc.)
5. Run "clusvcadm -e service:test_service -F" on node01
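
For reference, a minimal cluster.conf fragment matching steps 2 and 3. This is a sketch only, assuming the rest of the cluster configuration is already in place; the service's resource children are omitted.

<rm>
	<failoverdomains>
		<failoverdomain name="test_domain" ordered="1" restricted="1">
			<failoverdomainnode name="node01" priority="1"/>
			<failoverdomainnode name="node02" priority="2"/>
		</failoverdomain>
	</failoverdomains>
	<!-- resource children omitted for brevity -->
	<service name="test_service" domain="test_domain"/>
</rm>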
  
Actual results:
clusvcadm reports that the service was started on node02, but the service is not actually started.

Expected results:
service is started on node02

Additional info:
On node01:
<error>  Starting Service test_service > Failed
[30964] notice: start on service "test_service" returned 1 (generic error)
[30964] warning: #68: Failed to start service:test_service; return value: 1
[30964] notice: Stopping service service:test_service
<debug>  Verifying Configuration Of service:test_service
<info>   Stopping Service service:test_service
<error>  Monitoring Service service:test_service > Service Is Not Running
<info>   Stopping Service service:test_service > Succeed
[30964] notice: Service service:test_service is recovering
[30964] debug: Sent remote-start request to 2
[31038] debug: 1 events processed

On node02:
[29205] debug: Not starting service:test_service: recovery state
[29204] debug: 1 events processed

The attached patch makes clurgmgrd behave as expected:
diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c	2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c	2009-02-19 02:24:00.000000000 +0100
@@ -2061,7 +2061,11 @@
 			ret = RG_EFAIL;
 			goto out;
 		} else {
-			ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+			if (request == RG_ENABLE) {
+			    ret = svc_start_remote(svcName, RG_START_RECOVER, target);
+			} else {
+			    ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+			}
 		}
 
 		switch(ret) {

Comment 1 Yevheniy Demchenko 2009-02-23 14:26:17 UTC
The proposed patch still does not work under certain circumstances. Here is a revised one:
diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c      2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c   2009-02-23 15:04:00.000000000 +0100
@@ -2054,14 +2054,14 @@
                target = best_target_node(allowed_nodes, 0,
                                          svcName, 1);
                if (target == me) {
-                       ret = handle_start_remote_req(svcName, request);
+                       ret = handle_start_remote_req(svcName, (request==RG_ENABLE?RG_START_RECOVER:request));
                        if (ret == RG_EAGAIN)
                                goto out;
               } else if (target < 0) {
                        ret = RG_EFAIL;
                        goto out;
                } else {
-                       ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+                           ret = svc_start_remote(svcName, (request==RG_ENABLE?RG_START_RECOVER:RG_START_REMOTE), target);
                }

                switch(ret) {
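
For clarity, the revised diff differs from the first patch in that it also maps RG_ENABLE to RG_START_RECOVER on the local path (the handle_start_remote_req call), not only on the remote svc_start_remote path. The mapping at both call sites is equivalent to a small hypothetical helper like the one below (placeholder enum values; the real constants come from rgmanager's headers).

/* Hypothetical helper, equivalent to the two inline conditionals in
 * the revised patch; enum values are placeholders. */
enum { RG_ENABLE = 1, RG_START_REMOTE = 2, RG_START_RECOVER = 3 };

/* Forward an enable request as a recovery start so the next node will
 * pick up a service left in the "recovering" state; any other request
 * keeps the caller's usual request type. */
static int map_enable_request(int request, int fallback)
{
	return (request == RG_ENABLE) ? RG_START_RECOVER : fallback;
}

/* Corresponding to the patched call sites:
 *   handle_start_remote_req(svcName, map_enable_request(request, request));
 *   svc_start_remote(svcName, map_enable_request(request, RG_START_REMOTE), target);
 */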

Comment 2 Yevheniy Demchenko 2009-02-23 14:29:10 UTC
Created attachment 332934 [details]
patch.

Comment 3 Lon Hohberger 2009-02-26 20:16:26 UTC
*** Bug 486711 has been marked as a duplicate of this bug. ***

Comment 4 Lon Hohberger 2009-02-27 15:10:19 UTC
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=c0de9cfb54b5e3a8e0de4b95ae80d2ce5dae4aae

Pushed to RHEL5 / master / STABLE2 / STABLE3

Comment 7 errata-xmlrpc 2009-09-02 11:04:53 UTC
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of
ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html

