+++ This bug was initially created as a clone of Bug #449394 +++

Take a simple service group using the default recovery policy (restart):

    <service name="service01">
        <ip address="192.168.40.110" monitor_link="1"/>
        <apache name="apache01"/>
    </service>

If the "ip" resource fails, the service should first be restarted on the same node (and, if that restart fails, started on one of the other available nodes).

But this doesn't happen: if, during the recovery on the same node, ip cannot start (e.g. failed link), the service is stopped, but the stop fails because ra-skelet.sh:stop_generic() returns an error when the pid file is not present, and the service goes into a failed state.

The pid file (correctly) doesn't exist, because rgmanager never started the resource "apache01": ip failed to start, and ip starts before apache01.

This isn't the expected behavior:

*) Fail link on eth0 of node01: rgmanager notices that the ip resource has failed

    <debug> Checking 192.168.40.110, Level 0
    <debug> 192.168.40.110 present on eth0
    <warn> Link for eth0: Not detected
    <warn> No link on eth0...
    [14647] notice: status on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager stops service01 on node01

    [14647] notice: Stopping service service:service01
    <debug> Verifying Configuration Of apache:apache01
    <debug> Checking Syntax Of The File /etc/httpd/conf/httpd.conf
    <debug> Checking Syntax Of The File > Succeed
    <info> Stopping Service apache:apache01
    <info> Stopping Service apache:apache01 > Succeed
    <info> Removing IPv4 address 192.168.40.110/24 from eth0

*) rgmanager tries to restart service01 on node01

    [14647] notice: Service service:service01 is recovering
    [14647] notice: Recovering failed service service:service01
    <warn> Link for eth0: Not detected
    [14647] notice: start on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager cannot restart service01 because the eth0 link is down and the start of the ip resource fails, so it stops service01 before trying to relocate it to another node

    [14647] warning: #68: Failed to start service:service01; return value: 1
    [14647] debug: Stopping failed service service:service01
    [14647] notice: Stopping service service:service01
    <debug> Verifying Configuration Of apache:apache01
    <debug> Checking Syntax Of The File /etc/httpd/conf/httpd.conf
    <debug> Checking Syntax Of The File > Succeed
    <info> Stopping Service apache:apache01
    <error> Checking Existence Of File /var/run/cluster/apache/apache:apache01.pid [apache:apache01] > Failed - File Doesn't Exist
    <error> Stopping Service apache:apache01 > Failed
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*) The stop fails because the pid file doesn't exist (it correctly doesn't exist, as apache01 was never started)

    [14647] notice: stop on apache "apache01" returned 1 (generic error)
    <debug> 192.168.40.110 is not configured
    [14647] crit: #12: RG service:service01 failed to stop; intervention required
    [14647] notice: Service service:service01 is failed
    [14647] crit: #13: Service service:service01 failed to stop cleanly
    [14647] debug: Handling failure request for RG service:service01

I can see two different possible "solutions":

1) Don't treat the absence of a pid file as an error (simply assume that the service is not running if the pid file isn't present).

--- /usr/share/cluster/utils/ra-skelet.sh.orig 2008-06-03 02:07:04.000000000 +0200
+++ /usr/share/cluster/utils/ra-skelet.sh 2008-06-03 04:10:41.000000000 +0200
@@ -52,7 +52,7 @@
     if [ ! -e "$pid_file" ]; then
         clog_check_file_exist $CLOG_FAILED_NOT_FOUND "$pid_file"
-        return $OCF_ERR_GENERIC
+        return 0
     fi
     if [ -z "$kill_timeout" ]; then

Maybe there's a real reason to return an error if the pid file doesn't exist. For now, the only problem I can think of with this change is if someone removes the pid file by hand and then the stop is launched (but anyone who removes the file by hand is asking for trouble).

If the change above is not possible, then another solution would be:

2) Change rgmanager to stop only resources that it "correctly" started before. In the above example, as the start of "ip01" failed, it should not stop anything (not even ip01), as nothing was started correctly.

Now, I don't know if this behavior would be correct. Is the stop of a resource always launched (even if its start fails) with the intent of cleaning up a dirty resource state that its start may have created? If that is the intent, then this solution isn't possible (for example, if "ip01" starts correctly but "apache01" fails to start, then the stop of apache01 will also fail, as the pid file probably doesn't exist). In this case the only working solution will be 1).

--- Additional comment from lhh on 2008-06-09 11:05:43 EDT ---

Stop-after-stop and stop-when-not-running should always work per LSB specification.
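
As a rough illustration of the stop-when-not-running behavior discussed in this report, a stop helper can treat a missing or stale pid file as "already stopped". This is only a sketch, not the actual ra-skelet.sh code or the fix shipped in the erratum: the function name stop_by_pidfile, the 20-second default timeout, and the locally defined OCF_* values are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch of an idempotent stop; NOT the real ra-skelet.sh code.
# OCF_* values are defined locally here only to keep the example self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# stop_by_pidfile PID_FILE [KILL_TIMEOUT_SECONDS]   (hypothetical helper)
stop_by_pidfile() {
    local pid_file="$1"
    local kill_timeout="${2:-20}"   # assumed default, in seconds
    local pid=""

    # No pid file: the service was never started or is already stopped.
    # Stop-when-not-running must succeed.
    if [ ! -e "$pid_file" ]; then
        return $OCF_SUCCESS
    fi

    read -r pid < "$pid_file" || true

    # Empty, non-numeric, or zero pid: treat the file as stale and clean up.
    case "$pid" in
        ''|*[!0-9]*|0)
            rm -f "$pid_file"
            return $OCF_SUCCESS
            ;;
    esac

    # Pid file exists but the process is gone: also "already stopped".
    if ! kill -0 "$pid" 2>/dev/null; then
        rm -f "$pid_file"
        return $OCF_SUCCESS
    fi

    # Process is alive: ask it to terminate and wait up to kill_timeout seconds.
    kill -TERM "$pid" 2>/dev/null
    while [ "$kill_timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        kill_timeout=$((kill_timeout - 1))
    done

    # Still running after the timeout: this is the genuine failure case.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_ERR_GENERIC
    fi

    rm -f "$pid_file"
    return $OCF_SUCCESS
}
```

With a helper along these lines, the second stop of apache01 in the log above would return success because the pid file was never created, and rgmanager could go on to relocate the service instead of marking it failed. An error is returned only when a live process refuses to exit.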
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1048.html