Bug 449394

Summary:	Recovery policy of type restart doesn't work with a service using a resource based on ra-skelet.sh
Product:	Red Hat Enterprise Linux 5	Reporter:	Simone Gotti <simone.gotti>
Component:	rgmanager	Assignee:	Marek Grac <mgrac>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	low	Docs Contact:
Priority:	low
Version:	5.2	CC:	cluster-maint, cward, edamato, hklein, sbradley, tao, yamato
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	480831		Environment:
Last Closed:	2009-09-02 11:05:12 UTC	Type:	---
Bug Blocks:	480831

Description Simone Gotti 2008-06-02 16:02:25 UTC

Take a simple service group using the default recovery policy (restart):

<service name="service01">
	<ip address="192.168.40.110" monitor_link="1"/>
	<apache name="apache01"/>
</service>

If the "ip" resource fails, the service should first be restarted on the same
node (and, if that restart fails, started on one of the other available nodes).

But this is not what happens:

If, during recovery on the same node, the ip resource cannot start (e.g. because
of a failed link), the service is stopped. That stop fails because
ra-skelet.sh:stop_generic() returns an error when the pid file is not present,
and the service goes into the failed state. The pid file is (correctly) absent
because rgmanager never started the "apache01" resource: starting ip failed,
and ip starts before apache01. This isn't the expected behavior:

*) Fail link on eth0 of node01: rgmanager notices that the ip resource has failed

<debug>  Checking 192.168.40.110, Level 0
<debug>  192.168.40.110 present on eth0
<warn>   Link for eth0: Not detected
<warn>   No link on eth0...
[14647] notice: status on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager stops service01 on node01

[14647] notice: Stopping service service:service01
<debug>  Verifying Configuration Of apache:apache01
<debug>  Checking Syntax Of The File /etc/httpd/conf/httpd.conf
<debug>  Checking Syntax Of The File  > Succeed
<info>   Stopping Service apache:apache01
<info>   Stopping Service apache:apache01 > Succeed
<info>   Removing IPv4 address 192.168.40.110/24 from eth0

*) rgmanager tries to restart service01 on node01

[14647] notice: Service service:service01 is recovering
[14647] notice: Recovering failed service service:service01
<warn>   Link for eth0: Not detected
[14647] notice: start on ip "192.168.40.110" returned 1 (generic error)

*) rgmanager cannot restart service01 because the eth0 link is down and the
start of the ip resource fails, so it stops service01 before trying to relocate
it to another node

[14647] warning: #68: Failed to start service:service01; return value: 1
[14647] debug: Stopping failed service service:service01
[14647] notice: Stopping service service:service01
<debug>  Verifying Configuration Of apache:apache01
<debug>  Checking Syntax Of The File /etc/httpd/conf/httpd.conf
<debug>  Checking Syntax Of The File  > Succeed
<info>   Stopping Service apache:apache01
<error>  Checking Existence Of File /var/run/cluster/apache/apache:apache01.pid
[apache:apache01] > Failed - File Doesn't Exist
<error>  Stopping Service apache:apache01 > Failed

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*) The stop fails because the pid file doesn't exist (it is correctly absent,
as apache01 was never started)

[14647] notice: stop on apache "apache01" returned 1 (generic error)
<debug>  192.168.40.110 is not configured
[14647] crit: #12: RG service:service01 failed to stop; intervention required
[14647] notice: Service service:service01 is failed
[14647] crit: #13: Service service:service01 failed to stop cleanly
[14647] debug: Handling failure request for RG service:service01

 

I can see two different possible "solutions":

1) Do not treat a missing pid file as an error (simply assume that the service
is not running if the pid file isn't present).

--- /usr/share/cluster/utils/ra-skelet.sh.orig  2008-06-03 02:07:04.000000000 +0200
+++ /usr/share/cluster/utils/ra-skelet.sh       2008-06-03 04:10:41.000000000 +0200
@@ -52,7 +52,7 @@

        if [ ! -e "$pid_file" ]; then
                clog_check_file_exist $CLOG_FAILED_NOT_FOUND "$pid_file"
-               return $OCF_ERR_GENERIC
+               return 0
        fi

        if [ -z "$kill_timeout" ]; then
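
For illustration only, the effect of the one-line change above can be sketched as a standalone function. This is a minimal sketch, not the code shipped in ra-skelet.sh: the function name stop_by_pidfile and its single-argument interface are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical stand-in for a pid-file based stop; not the shipped stop_generic().
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

stop_by_pidfile() {
    local pid_file="$1" pid

    # No pid file: the resource was never started (or is already stopped),
    # so there is nothing to do and the stop is reported as successful.
    if [ ! -e "$pid_file" ]; then
        return $OCF_SUCCESS
    fi

    pid=$(cat "$pid_file")

    # Signal the process only if it is still alive.
    if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid" || return $OCF_ERR_GENERIC
    fi

    rm -f "$pid_file"
    return $OCF_SUCCESS
}
```

With this shape, calling the function on a path that does not exist returns 0, and calling it twice in a row on the same pid file also returns 0 both times.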


Maybe there is a real reason to return an error if the pid file does not exist.
For now, the only problem I can think of with this change is someone removing
the pid file by hand before the stop is launched (but anyone who removes the
file by hand is asking for trouble).

If the change above is not possible, then another solution would be:

2) Change rgmanager to stop only resources that it "correctly" started before.

In the above example, as the start of "ip01" failed, it should not stop anything
(not even ip01), as nothing was started correctly.

Now, I don't know whether this behavior would be correct.
Is the stop of a resource always launched (even if its start fails) with the
intent of cleaning up a dirty resource state that its start may have created?

If this is the intent, then this solution isn't possible (for example, if "ip01"
starts correctly but "apache01" fails to start, then the stop of apache01 will
also fail, as the pid file probably doesn't exist). In this case the only
working solution would be 1).
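
To make the trade-off in 2) concrete, here is a rough sketch of "stop only what was started". It is a simulation under stated assumptions, not rgmanager code: STARTED, FAIL_ON, start_resource, stop_resource, start_service and stop_started are all hypothetical names invented for the example.

```shell
#!/bin/bash
# Sketch of "solution 2": remember which resources started, stop only those.
# All names here are hypothetical; none of them are rgmanager functions.

STARTED=()   # resources that started successfully, in start order
FAIL_ON=""   # name of the resource whose start should fail (simulation only)

start_resource() {
    # Simulated start: fails only for the resource named in FAIL_ON.
    [ "$1" != "$FAIL_ON" ]
}

stop_resource() {
    echo "stopping $1"
}

start_service() {
    local res
    STARTED=()
    for res in "$@"; do
        # Abort at the first failed start; later resources are never tried.
        start_resource "$res" || return 1
        STARTED+=("$res")
    done
    return 0
}

stop_started() {
    local i
    # Stop in reverse start order, touching only resources that started.
    for ((i = ${#STARTED[@]} - 1; i >= 0; i--)); do
        stop_resource "${STARTED[i]}" || return 1
    done
    STARTED=()
    return 0
}

# Usage: ip fails first, so nothing is recorded and nothing is stopped.
FAIL_ON=ip
start_service ip apache01 || stop_started
```

If FAIL_ON is set to apache01 instead, only ip is stopped and apache01's stop is never run, which is exactly the case where any partial state left behind by a failed start would not be cleaned up.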

Comment 1 Lon Hohberger 2008-06-09 15:05:43 UTC

Stop-after-stop and stop-when-not-running should always work per LSB specification.
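
A quick way to check that property is to call stop twice on something that is not running and require success both times. The helper below is only a sketch: check_stop_idempotent is a hypothetical name, and it assumes the agent can be invoked as "<agent> stop" without any extra environment, which real rgmanager agents may not satisfy.

```shell
#!/bin/bash
# Hypothetical harness: runs "<agent> stop" twice and expects exit status 0
# both times. The agent (a script path or command name) is given by the caller.
check_stop_idempotent() {
    local agent="$1" attempt
    for attempt in 1 2; do
        if ! "$agent" stop; then
            echo "stop attempt $attempt failed"
            return 1
        fi
    done
    echo "stop is idempotent"
    return 0
}
```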

Comment 2 Marek Grac 2009-01-20 19:48:27 UTC

I believe that the second solution is better, but in the resource agent script
we don't know the previous state of the service (it is already in 'starting').
So the first solution, with the patch, was used.

Thanks for the bug report and the proposed patch.

Comment 5 Lon Hohberger 2009-08-21 20:58:12 UTC

*** Bug 517861 has been marked as a duplicate of this bug. ***

Comment 6 Lon Hohberger 2009-08-21 21:05:19 UTC

*** Bug 512055 has been marked as a duplicate of this bug. ***

Comment 7 Lon Hohberger 2009-08-21 21:06:09 UTC

*** Bug 514039 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2009-09-02 11:05:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html