Bug 480831 - Recovery policy of type restart doesn't work with a service using a resource based on ra-skelet.sh
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Marek Grac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 449394
Blocks:
 
Reported: 2009-01-20 19:43 UTC by Marek Grac
Modified: 2009-05-18 21:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 449394
Environment:
Last Closed: 2009-05-18 21:13:39 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:1048 0 normal SHIPPED_LIVE rgmanager bug-fix and enhancement update 2009-05-18 21:12:29 UTC

Description Marek Grac 2009-01-20 19:43:35 UTC
+++ This bug was initially created as a clone of Bug #449394 +++

Take a simple service group using the default recovery policy (restart):

<service name="service01">
	<ip address="192.168.40.110" monitor_link="1"/>
	<apache name="apache01"/>
</service>

If the "ip" resource fails, the service should first be restarted on the same
node (and if that restart fails, it should be started on one of the other available nodes).

But this doesn't happen:

If, during recovery on the same node, ip cannot start (e.g. because of a failed link), then
the service is stopped. The stop fails because ra-skelet.sh:stop_generic() returns
an error when the pid file is not present, and the service goes into a failed state.
The pid file (correctly) does not exist because rgmanager never started the
resource "apache01": it failed to start ip first (ip starts before apache01).
This isn't the expected behavior:

*) Fail link on eth0 of node01: rgmanager notices that the ip resource has failed

<debug>  Checking 192.168.40.110, Level 0
<debug>  192.168.40.110 present on eth0
<warn>   Link for eth0: Not detected
<warn>   No link on eth0...
[14647] notice: status on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager stops service01 on node01

[14647] notice: Stopping service service:service01
<debug>  Verifying Configuration Of apache:apache01
<debug>  Checking Syntax Of The File /etc/httpd/conf/httpd.conf
<debug>  Checking Syntax Of The File  > Succeed
<info>   Stopping Service apache:apache01
<info>   Stopping Service apache:apache01 > Succeed
<info>   Removing IPv4 address 192.168.40.110/24 from eth0

*) rgmanager tries to restart service01 on node01

[14647] notice: Service service:service01 is recovering
[14647] notice: Recovering failed service service:service01
<warn>   Link for eth0: Not detected
[14647] notice: start on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager cannot restart service01 because the eth0 link is down and the start of the ip
resource fails, so it stops service01 before trying to relocate it to another node

[14647] warning: #68: Failed to start service:service01; return value: 1
[14647] debug: Stopping failed service service:service01
[14647] notice: Stopping service service:service01
<debug>  Verifying Configuration Of apache:apache01
<debug>  Checking Syntax Of The File /etc/httpd/conf/httpd.conf
<debug>  Checking Syntax Of The File  > Succeed
<info>   Stopping Service apache:apache01
<error>  Checking Existence Of File /var/run/cluster/apache/apache:apache01.pid
[apache:apache01] > Failed - File Doesn't Exist
<error>  Stopping Service apache:apache01 > Failed

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*) The stop fails because the pid file doesn't exist (the pid file correctly doesn't
exist, as apache01 was never started)

[14647] notice: stop on apache "apache01" returned 1 (generic error)
<debug>  192.168.40.110 is not configured
[14647] crit: #12: RG service:service01 failed to stop; intervention required
[14647] notice: Service service:service01 is failed
[14647] crit: #13: Service service:service01 failed to stop cleanly
[14647] debug: Handling failure request for RG service:service01

 

I can see two different possible "solutions":

1) Do not treat a missing pid file as an error (simply assume
that the service is not running if the pid file isn't present).

--- /usr/share/cluster/utils/ra-skelet.sh.orig  2008-06-03 02:07:04.000000000 +0200
+++ /usr/share/cluster/utils/ra-skelet.sh       2008-06-03 04:10:41.000000000 +0200
@@ -52,7 +52,7 @@

        if [ ! -e "$pid_file" ]; then
                clog_check_file_exist $CLOG_FAILED_NOT_FOUND "$pid_file"
-               return $OCF_ERR_GENERIC
+               return 0
        fi

        if [ -z "$kill_timeout" ]; then


Maybe there's a real reason to return an error if the pid file does not exist. For
now, the only problem I can think of with this change is if someone removes the pid file by
hand and then the stop is launched (but anyone who removes the file
by hand is looking for trouble).
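
A minimal sketch of the behavior proposed in 1), for illustration only. This is
not the actual ra-skelet.sh code: the function name stop_if_running, its
arguments, and the polling loop are assumptions. Only the idea (a missing pid
file means "already stopped", so stop returns success) comes from the patch
above.

```shell
#!/bin/bash
# Hypothetical stand-in for a stop routine; NOT the real stop_generic().
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# stop_if_running PID_FILE [TIMEOUT_SECONDS]
stop_if_running() {
    local pid_file="$1"
    local timeout="${2:-10}"   # assumed default, in seconds
    local pid waited=0

    # No pid file: the daemon was never started or is already stopped.
    # Report success instead of an error.
    if [ ! -e "$pid_file" ]; then
        return $OCF_SUCCESS
    fi

    pid=$(cat "$pid_file")

    # Empty or stale pid file: nothing is running, just clean up.
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        rm -f "$pid_file"
        return $OCF_SUCCESS
    fi

    # The process is alive: ask it to terminate, then poll until it is gone.
    kill -TERM "$pid" 2>/dev/null || return $OCF_ERR_GENERIC
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return $OCF_ERR_GENERIC
        fi
        sleep 1
        waited=$((waited + 1))
    done

    rm -f "$pid_file"
    return $OCF_SUCCESS
}
```

With this shape, calling stop twice in a row, or calling stop on a resource
that was never started, returns 0 both times.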

If the change above is not possible, then another solution would be:

2) Change rgmanager to stop only resources that it "correctly" started before.

In the above example, since the start of "ip01" failed, it should not stop anything
(not even ip01), as nothing was started correctly.

Now, I don't know whether this behavior would be correct.
Is the stop of a resource always launched (even if its start fails) with the
intent of cleaning up any dirty state that its start may have created?

If this is the intent, then this solution isn't possible (for example, if "ip01"
starts correctly but "apache01" fails to start, then the stop of apache01 will
also fail, as the pid file probably doesn't exist). In this case the only
working solution will be 1).
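
For comparison, a rough sketch of what 2) could look like. This is not
rgmanager code: start_resource, stop_resource, start_service, stop_started and
the STARTED list are all hypothetical names, and the placeholder start function
simply fails for "ip01" to imitate the dead link. It also shows the limitation
discussed above: a resource whose start failed halfway is never cleaned up.

```shell
#!/bin/bash
# Hypothetical sketch: stop only what was started successfully.

STARTED=""   # space-separated list, most recently started first

# Placeholder: pretend that starting "ip01" fails (e.g. no link).
start_resource() {
    [ "$1" != "ip01" ]
}

# Placeholder: a real agent would stop the resource here.
stop_resource() {
    echo "stopping $1"
}

# Start resources in order; remember each one that started successfully.
start_service() {
    local res
    STARTED=""
    for res in "$@"; do
        if ! start_resource "$res"; then
            return 1
        fi
        STARTED="$res $STARTED"
    done
    return 0
}

# Stop, in reverse start order, only the resources recorded in STARTED.
stop_started() {
    local res rc=0
    for res in $STARTED; do
        stop_resource "$res" || rc=1
    done
    STARTED=""
    return $rc
}
```

Running `start_service ip01 apache01 || stop_started` stops nothing, because
nothing was recorded as started, so apache01's missing pid file is never
checked.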

--- Additional comment from lhh on 2008-06-09 11:05:43 EDT ---

Stop-after-stop and stop-when-not-running should always work per the LSB specification.

Comment 3 errata-xmlrpc 2009-05-18 21:13:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1048.html

