Created attachment 332822 [details]
patch

Description of problem:
The expected behaviour of "clusvcadm -e <service> -F" is to follow the failover domain rules: try to start the service on the node with the lowest priority value first, then, if unsuccessful, try the node with the next-higher priority value, and so on. Instead, if the first node is unable to start the service, clusvcadm falsely reports that the service was started on some other node in the failover domain, but the service is not actually started. This is caused by the first node putting the service into the "recovering" state, which the subsequent nodes refuse to act on.

Version-Release number of selected component (if applicable):
The bug was initially found in rgmanager-2.0.38 and is still present in 2.0.46.

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster of at least 3 nodes (node01, node02, node03).
2. Define a restricted failover domain test_domain (node01 - priority 1, node02 - priority 2).
3. Define a service test_service in this failover domain.
4. Make the service unable to start on node01 (unmount its filesystem, uninstall the service, etc.).
5. Run "clusvcadm -e service:test_service -F" on node01.

Actual results:
clusvcadm reports that the service was started on node02, but it was not.
Expected results:
The service is started on node02.

Additional info:

On node01:
<error> Starting Service test_service > Failed
[30964] notice: start on service "test_service" returned 1 (generic error)
[30964] warning: #68: Failed to start service:test_service; return value: 1
[30964] notice: Stopping service service:test_service
<debug> Verifying Configuration Of service:test_service
<info> Stopping Service service:test_service
<error> Monitoring Service service:test_service > Service Is Not Running
<info> Stopping Service service:test_service > Succeed
[30964] notice: Service service:test_service is recovering
[30964] debug: Sent remote-start request to 2
[31038] debug: 1 events processed

On node02:
[29205] debug: Not starting service:test_service: recovery state
[29204] debug: 1 events processed

The attached patch makes clurgmgrd behave as expected.

diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c	2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c	2009-02-19 02:24:00.000000000 +0100
@@ -2061,7 +2061,11 @@
 		ret = RG_EFAIL;
 		goto out;
 	} else {
-		ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		if (request == RG_ENABLE) {
+			ret = svc_start_remote(svcName, RG_START_RECOVER, target);
+		} else {
+			ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		}
 	}
 
 	switch(ret) {
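For reference, steps 2 and 3 of the reproducer correspond to a cluster.conf fragment along these lines (a sketch only; the node and service names come from this report, but the exact attribute names should be checked against the cluster.conf schema):

```xml
<rm>
  <failoverdomains>
    <!-- restricted + ordered: only listed nodes, preferred by priority;
         lower priority value = more preferred -->
    <failoverdomain name="test_domain" ordered="1" restricted="1">
      <failoverdomainnode name="node01" priority="1"/>
      <failoverdomainnode name="node02" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="test_service" domain="test_domain"/>
</rm>
```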
The proposed patch still doesn't work under certain circumstances. Here is a revised one:

diff -U 3 -r ./rgmanager-2.0.38.orig/src/daemons/rg_state.c ./rgmanager-2.0.38/src/daemons/rg_state.c
--- ./rgmanager-2.0.38.orig/src/daemons/rg_state.c	2008-03-27 21:12:36.000000000 +0100
+++ ./rgmanager-2.0.38/src/daemons/rg_state.c	2009-02-23 15:04:00.000000000 +0100
@@ -2054,14 +2054,14 @@
 	target = best_target_node(allowed_nodes, 0, svcName, 1);
 	if (target == me) {
-		ret = handle_start_remote_req(svcName, request);
+		ret = handle_start_remote_req(svcName, (request==RG_ENABLE?RG_START_RECOVER:request));
 		if (ret == RG_EAGAIN)
 			goto out;
 
 	} else if (target < 0) {
 		ret = RG_EFAIL;
 		goto out;
 	} else {
-		ret = svc_start_remote(svcName, RG_START_REMOTE, target);
+		ret = svc_start_remote(svcName, (request==RG_ENABLE?RG_START_RECOVER:RG_START_REMOTE), target);
 	}
 
 	switch(ret) {
Created attachment 332934 [details] patch.
*** Bug 486711 has been marked as a duplicate of this bug. ***
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=c0de9cfb54b5e3a8e0de4b95ae80d2ce5dae4aae Pushed to RHEL5 / master / STABLE2 / STABLE3
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html