768146 – Make rgmanager ignore OCF_RA_NOT_INSTALLED on script resources during stop-on-stopped


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	5.9
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	772956 853251

Reported:	2011-12-15 21:20 UTC by Adam Drew
Modified:	2018-11-26 18:14 UTC
CC List:	5 users
Fixed In Version:	rgmanager-2.0.52-25.el5
Doc Type:	Bug Fix
Doc Text:	Cause: rgmanager called 'stop' on a script residing on a file system that had not been mounted, so the call failed. Consequence: A resource that fails to stop puts the whole service into the 'failed' state, preventing the cluster from recovering it further. Fix: rgmanager now treats a 'missing' script during the 'stop' phase as non-fatal when it knows there was no previous 'start'. Result: The service no longer enters the 'failed' state as a result of the script being missing.
Clone Of:
Clones:	853251
Environment:
Last Closed:	2012-02-21 06:19:51 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Upstream patch that resolves the issue (1.66 KB, application/octet-stream) 2011-12-15 21:25 UTC, Adam Drew	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2012:0163	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2012-02-20 15:07:03 UTC

Description Adam Drew 2011-12-15 21:20:29 UTC

Description of problem:
If a script is a child of the file system it resides on and we do a stop-on-stopped on the tree, the file system is not mounted, so the script file is missing. We end up with OCF_RA_NOT_INSTALLED from the script and disable the service.
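
A minimal Python sketch of this failure mode follows. It is an illustration only, not rgmanager's actual script resource agent: the function name `script_agent_stop` is hypothetical, and the check is an assumption about how a missing file is detected. The only value taken from this report is the return code 5, which the logs print as "program not installed".

```python
import os
import subprocess
import tempfile

OCF_SUCCESS = 0
OCF_RA_NOT_INSTALLED = 5  # shown in syslog as "program not installed"


def script_agent_stop(script_path):
    """Hypothetical stand-in for the 'stop' action of a script resource."""
    # If the parent file system is not mounted, the script file does not
    # exist at its configured path, so there is nothing to execute.
    if not os.path.isfile(script_path):
        return OCF_RA_NOT_INSTALLED
    # Otherwise run the script's own stop action and pass its status through.
    return subprocess.call([script_path, "stop"])


if __name__ == "__main__":
    # An empty temporary directory plays the role of the unmounted mount point.
    with tempfile.TemporaryDirectory() as mountpoint:
        print(script_agent_stop(os.path.join(mountpoint, "test.sh")))
```

Under these assumptions, a stop issued while the mount point is empty yields 5 even though nothing was ever running, which is the stop-on-stopped case described above.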

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-21.el5.x86_64

Comment 2 Adam Drew 2011-12-15 21:25:58 UTC

Created attachment 547484
Upstream patch that resolves the issue

Comment 3 Lon Hohberger 2011-12-16 14:53:36 UTC

Note: Not actually upstream yet.

Comment 6 Adam Drew 2011-12-16 15:46:11 UTC

Verified:

                <service name="script-test">
                        <ip address="192.168.122.233" monitor_link="1">
                                <fs name="fs" type="ext3" device="/dev/mapper/mpath1p1" mountpoint="/mnt/ext3" force_unmount="1">
                                        <script name="script" file="/mnt/ext3/test.sh"/>
                                </fs>
                        </ip>
                </service>

rgmanager-2.0.52-21.el5:
Dec 16 10:39:47 node1 clurgmgrd[29921]: <notice> status on ip "192.168.122.233" returned 1 (generic error) 
Dec 16 10:39:47 node1 clurgmgrd[29921]: <notice> Stopping service service:script-test 
Dec 16 10:39:47 node1 clurgmgrd: [29921]: <info> Executing /mnt/ext3/test.sh stop 
Dec 16 10:39:47 node1 logger: ended
Dec 16 10:39:47 node1 clurgmgrd: [29921]: <info> unmounting /mnt/ext3 
Dec 16 10:39:47 node1 multipathd: dm-4: umount map (uevent) 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> Service service:script-test is recovering 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> Recovering failed service service:script-test 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> start on ip "192.168.122.233" returned 1 (generic error) 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <warning> #68: Failed to start service:script-test; return value: 1 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> Stopping service service:script-test 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> stop on script "script" returned 5 (program not installed) 
Dec 16 10:39:48 node1 clurgmgrd: [29921]: <info> /dev/mapper/mpath1p1 is not mounted 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <crit> #12: RG service:script-test failed to stop; intervention required 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <notice> Service service:script-test is failed 
Dec 16 10:39:48 node1 clurgmgrd[29921]: <crit> #13: Service service:script-test failed to stop cleanly 

With test package:
Dec 16 10:45:07 node1 clurgmgrd[2502]: <notice> status on ip "192.168.122.233" returned 1 (generic error) 
Dec 16 10:45:07 node1 clurgmgrd[2502]: <notice> Stopping service service:script-test 
Dec 16 10:45:07 node1 clurgmgrd: [2502]: <info> Executing /mnt/ext3/test.sh stop 
Dec 16 10:45:07 node1 logger: ended
Dec 16 10:45:08 node1 clurgmgrd: [2502]: <info> unmounting /mnt/ext3 
Dec 16 10:45:08 node1 multipathd: dm-4: umount map (uevent) 
Dec 16 10:45:08 node1 clurgmgrd[2502]: <notice> Service service:script-test is recovering 
Dec 16 10:45:08 node1 clurgmgrd[2502]: <notice> Recovering failed service service:script-test 
Dec 16 10:45:08 node1 clurgmgrd[2502]: <notice> start on ip "192.168.122.233" returned 1 (generic error) 
Dec 16 10:45:08 node1 clurgmgrd[2502]: <warning> #68: Failed to start service:script-test; return value: 1 
Dec 16 10:45:08 node1 clurgmgrd[2502]: <notice> Stopping service service:script-test 
Dec 16 10:45:09 node1 clurgmgrd: [2502]: <info> /dev/mapper/mpath1p1 is not mounted 
Dec 16 10:45:09 node1 clurgmgrd[2502]: <notice> Service service:script-test is recovering 
Dec 16 10:45:09 node1 clurgmgrd[2502]: <warning> #71: Relocating failed service service:script-test 
Dec 16 10:45:13 node1 clurgmgrd[2502]: <notice> Service service:script-test is now running on member 2
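
The service definition in this comment nests the script under the file system that holds it, which is the layout that triggers the problem. As a rough aid for spotting the same layout elsewhere, here is a small Python sketch that walks a service fragment and reports script resources whose `file` lies under the `mountpoint` of an enclosing `fs` resource. The function name `scripts_on_parent_fs` is hypothetical, and the sketch assumes only the `fs` and `script` attributes that appear in the fragment above; it is not part of rgmanager.

```python
import os
import xml.etree.ElementTree as ET

# The service fragment from this comment, reindented.
SERVICE_XML = """
<service name="script-test">
  <ip address="192.168.122.233" monitor_link="1">
    <fs name="fs" type="ext3" device="/dev/mapper/mpath1p1"
        mountpoint="/mnt/ext3" force_unmount="1">
      <script name="script" file="/mnt/ext3/test.sh"/>
    </fs>
  </ip>
</service>
"""


def scripts_on_parent_fs(service_xml):
    """Return (script name, fs name) pairs where the script lives on an ancestor fs."""
    hits = []

    def walk(node, mounts):
        # Remember every fs resource we descend through.
        if node.tag == "fs" and node.get("mountpoint"):
            mounts = mounts + [(node.get("name"), node.get("mountpoint"))]
        if node.tag == "script":
            path = os.path.normpath(node.get("file", ""))
            for fs_name, mnt in mounts:
                prefix = os.path.normpath(mnt).rstrip(os.sep) + os.sep
                if path.startswith(prefix):
                    hits.append((node.get("name"), fs_name))
        for child in node:
            walk(child, mounts)

    walk(ET.fromstring(service_xml), [])
    return hits


print(scripts_on_parent_fs(SERVICE_XML))  # [('script', 'fs')]
```

A script stored outside any enclosing mount point (for example on the root file system) would not be reported, since it stays reachable when the `fs` resource is unmounted.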

Comment 9 Lon Hohberger 2012-01-19 19:29:51 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: rgmanager called 'stop' on a script residing on a file system that had not been mounted, so the call failed.

Consequence: A resource that fails to stop puts the whole service into the 'failed' state, preventing the cluster from recovering it further.

Fix: rgmanager now treats a 'missing' script during the 'stop' phase as non-fatal when it knows there was no previous 'start'.

Result: The service no longer enters the 'failed' state as a result of the script being missing.
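
The rule in the Fix paragraph can be sketched in a few lines of Python. This is not the attached patch (rgmanager itself is written in C), and the names `effective_stop_result` and `was_started` are hypothetical; the sketch only restates the decision: a "not installed" result from a stop is ignored when the resource was never started, and every other result is passed through unchanged.

```python
OCF_SUCCESS = 0
OCF_RA_NOT_INSTALLED = 5  # shown in syslog as "program not installed"


def effective_stop_result(agent_rc, was_started):
    """Map an agent's 'stop' return code to the result the service should see.

    agent_rc:    return code from the resource agent's stop action
    was_started: True if this resource had a previous 'start' (assumed to be
                 tracked by the caller; hypothetical parameter)
    """
    # Stop-on-stopped: the script is missing only because its file system
    # was never mounted, so there is nothing to stop and no reason to fail.
    if agent_rc == OCF_RA_NOT_INSTALLED and not was_started:
        return OCF_SUCCESS
    # A resource that was started and is now missing, or any other error,
    # is still reported as-is.
    return agent_rc


print(effective_stop_result(OCF_RA_NOT_INSTALLED, was_started=False))  # 0
print(effective_stop_result(OCF_RA_NOT_INSTALLED, was_started=True))   # 5
```

Restricting the exception to the never-started case keeps a genuinely missing script on a running service visible as an error.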

Comment 10 John Ruemker 2012-01-30 22:21:42 UTC

The customer has verified that the test package resolves their issue. The IP resource fails as a result of a link failure; the service stops successfully, goes into recovery and fails again, stops successfully again, and relocates to another node.

I've also verified this behavior with rgmanager-2.0.52-25.el5 in my test environment.

-John

Comment 11 errata-xmlrpc 2012-02-21 06:19:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0163.html
