Bug 711521 - Dependencies in independent_tree resources do not work as expected
Summary: Dependencies in independent_tree resources do not work as expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.6
Hardware: All
OS: All
Severity: high
Priority: high
Target Milestone: rc
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-06-07 17:59 UTC by Alfredo Moralejo
Modified: 2018-11-14 12:15 UTC
CC: 6 users

Fixed In Version: rgmanager-2.0.52-21.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 716231
Environment:
Last Closed: 2011-07-21 10:44:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Fix (1.10 KB, patch)
2011-06-23 14:37 UTC, Lon Hohberger
test1.sh from referenced service configurations. Place in /tmp. (74 bytes, application/x-shellscript)
2011-06-23 14:46 UTC, Lon Hohberger
test2.sh from referenced service configurations. Place in /tmp. (83 bytes, application/x-shellscript)
2011-06-23 14:46 UTC, Lon Hohberger


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1000 0 normal SHIPPED_LIVE Low: rgmanager security, bug fix, and enhancement update 2011-07-21 10:43:18 UTC

Description Alfredo Moralejo 2011-06-07 17:59:15 UTC
Description of problem:

Two script resources (parent and child) are included in a service with __independent_subtree="1".

With rgmanager version 2.0.52-9.el5, when the child resource is detected as failed, both resources are restarted instead of only the child.

With rgmanager version 2.0.52-6.el5_5.8 it works as expected.

Extract of cluster.conf:

		<service nfslock="1" autostart="1" domain="node1-first" exclusive="0" max_restarts="3" name="test" recovery="relocate" restart_expire_time="900">
			<script file="/root/test1" name="script1" __independent_subtree="1">
				<script file="/root/test2" name="script2" __independent_subtree="1"/>
			</script>
		</service>



Version-Release number of selected component (if applicable):

2.0.52-9.el5

How reproducible:

Always

Steps to Reproduce:
1. Create a service with a parent and a child resource and set __independent_subtree="1" on both.

2. Make the child resource fail.

  
Actual results:

Both the parent and the child (script1 and script2 in the example) are restarted:

Jun  7 19:57:50 node1 clurgmgrd[14575]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun  7 19:57:51 node1 logger: stop test2
Jun  7 19:57:51 node1 logger: stop test1
Jun  7 19:57:51 node1 logger: start test1
Jun  7 19:57:51 node1 logger: start test2
Jun  7 19:57:51 node1 clurgmgrd[14575]: <notice> Inline recovery of service:test complete 


Expected results:

Only the child resource (script2) is restarted. Output with version 2.0.52-6.el5_5.8:


Jun  7 19:52:20 node1 clurgmgrd[11160]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun  7 19:52:20 node1 logger: stop test2
Jun  7 19:52:20 node1 logger: start test2
Jun  7 19:52:20 node1 clurgmgrd[11160]: <notice> Inline recovery of service:test succeeded 

Additional info:

Comment 2 Lon Hohberger 2011-06-23 14:18:52 UTC
Reproduced.

Comment 3 Lon Hohberger 2011-06-23 14:37:53 UTC
Created attachment 506331 [details]
Fix

Comment 5 Lon Hohberger 2011-06-23 14:44:47 UTC
Example service configuration:

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                                <script name="b" file="/tmp/test2.sh" __independent_subtree="2"/>
                        </script>
                </service>

Comment 6 Lon Hohberger 2011-06-23 14:45:38 UTC
Oops, that's for regression testing against the non-critical services.  Here's the reproducer I used:

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                                <script name="b" file="/tmp/test2.sh" __independent_subtree="1"/>
                        </script>
                </service>

Comment 7 Lon Hohberger 2011-06-23 14:46:14 UTC
Created attachment 506568 [details]
test1.sh from referenced service configurations.  Place in /tmp.

Comment 8 Lon Hohberger 2011-06-23 14:46:43 UTC
Created attachment 506583 [details]
test2.sh from referenced service configurations.  Place in /tmp.
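The attached scripts themselves are not reproduced here. A minimal sketch of what such a script might look like, assuming it follows the usual start/stop/status convention and uses a flag file to force a status failure on demand (the flag-file path and the echo-based logging are assumptions, not taken from the actual attachments; the syslog excerpts above suggest the real scripts log via logger(1)):

```shell
#!/bin/sh
# Hypothetical reconstruction of a resource script like test2.sh.
# The "status" action is forced to fail whenever the flag file
# exists, so a tester can trigger inline recovery on demand.
FLAG=/tmp/test2-fail

test2_action() {
    case "$1" in
        start)  echo "start test2"; return 0 ;;
        stop)   echo "stop test2";  return 0 ;;
        status)
            # Succeed only while the flag file is absent.
            [ ! -e "$FLAG" ]
            ;;
        *)      return 0 ;;
    esac
}

test2_action "${1:-status}"
```

Touching /tmp/test2-fail before the next status poll would then make rgmanager attempt inline recovery; removing it lets the restarted resource pass status again.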

Comment 9 Lon Hohberger 2011-06-23 14:48:49 UTC
Unit test result before patch:

Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test2.sh stop
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test1.sh stop
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test1.sh start
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test2.sh start
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <notice> Inline recovery of service:test complete

Comment 11 Lon Hohberger 2011-06-23 14:54:21 UTC
Unit test (comment #6) after patch:

Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 

[root@rhel5-1 ~]# rpm -q rgmanager
rgmanager-2.0.52-21.el5

Comment 12 Lon Hohberger 2011-06-23 15:00:02 UTC
Problem introduced here:

http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/restree.c;h=ea458d696362e3605c6253731aa579cd3ccc3a4d;hp=3a03f913959eaac798563fa7dd0af0163bb918b5;hb=06993e7d6253dbb9a0e83c8edeba4d7a99f61954;hpb=f17eaaf6827237cd13d9086e7b1fbd6eaf702db1

I now need to perform a full retest of bug 605733 to ensure that changing the line back to its previous state does not cause a regression in the non-critical resource functionality.

Comment 13 Lon Hohberger 2011-06-23 15:03:02 UTC
Unit test 1 (605733):

1) Setting test2 to __independent_subtree="2" in cluster.conf should cause the test2 script to be disabled, and the service to add the partial flag:

Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery. 
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation. 
Jun 23 11:02:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 

PASS

Comment 14 Lon Hohberger 2011-06-23 15:05:50 UTC
Unit test 2 (605733):

Adding __max_restarts="1" __restart_expire_time="3600" should cause a recovery of just test2, followed by a quiesce of test2:
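Comment 14 does not show the updated configuration; presumably the restart counters were added to script "b" of the comment 13 configuration, i.e. something like the following sketch (an assumption, not taken from the actual cluster.conf used):

```xml
<service name="test">
        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                <script name="b" file="/tmp/test2.sh" __independent_subtree="2"
                        __max_restarts="1" __restart_expire_time="3600"/>
        </script>
</service>
```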

Jun 23 11:04:54 rhel5-1 clurgmgrd[20911]: <info> Starting changed resources. 
Jun 23 11:04:57 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test 
Jun 23 11:04:57 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful 
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:05:01 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 11:05:01 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery. 
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation. 

PASS

Comment 15 Lon Hohberger 2011-06-23 15:08:33 UTC
Unit test 3 (605733):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                                <script name="b" file="/tmp/test2.sh" __independent_subtree="2">
                                        <script name="truth" file="/bin/true"/>
                                </script>
                        </script>
                </service>


Adding a child script (in this case, /bin/true) should result in both the test2 script and the new child script being stopped on failure.

Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery. 
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation. 
Jun 23 11:07:21 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status

3.b: convalesce should restore both to operation:

Jun 23 11:07:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:08:14 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test 
Jun 23 11:08:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 11:08:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 11:08:14 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful 

PASS

Comment 16 Lon Hohberger 2011-06-23 15:11:26 UTC
Unit test 2 (this bug):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                                <script name="b" file="/tmp/test2.sh" __independent_subtree="1">
                                        <script name="truth" file="/bin/true"/>
                                </script>
                        </script>
                </service>

The independent subtree at and below b should be restarted if b fails.

Jun 23 11:10:41 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:11:01 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 11:11:01 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 11:11:02 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 

PASS

Comment 17 Lon Hohberger 2011-06-23 15:13:52 UTC
Unit test 3 (this bug):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
                                <script name="truth" file="/bin/true">
                                        <script name="b" file="/tmp/test2.sh"/>
                                </script>
                        </script>
                </service>

test2.sh's failure should be propagated up to test1.sh and all three should be restarted.

Jun 23 11:13:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 

PASS

Comment 19 Lon Hohberger 2011-06-23 16:55:24 UTC
Unit test 4 (this bug): 

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1" >
                                <script name="b" file="/tmp/test2.sh"/>
                        </script>
                        <script name="truth" file="/bin/true"/>
                </service>

test2's failure should cause a restart of test1 and test2, but not affect /bin/true.

Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 12:54:13 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 
Jun 23 12:54:13 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 12:54:14 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 

PASS

Comment 20 Lon Hohberger 2011-06-23 16:57:28 UTC
Unit test 4 (605733):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="2">
                                <script name="b" file="/tmp/test2.sh"/>
                        </script>
                        <script name="truth" file="/bin/true"/>
                </service>

After test2 fails, test1 and test2 should be quiesced, and /bin/true should remain operational.

Jun 23 12:55:23 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 12:55:23 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery. 
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation. 
Jun 23 12:56:03 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 

4.b: clusvcadm -c should restore test2 and test1 to operation

Jun 23 12:57:09 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test 
Jun 23 12:57:10 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 12:57:10 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 12:57:10 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful 

PASS

Comment 21 Lon Hohberger 2011-06-23 17:01:00 UTC
Unit test 5 (this bug):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh" __independent_subtree="1" __max_restarts="1" __restart_expire_time="3600">
                                <script name="b" file="/tmp/test2.sh"/>
                        </script>
                        <script name="truth" file="/bin/true"/>
                </service>

The first failure of test2 should cause a restart of just test1 and test2.  The second should cause a restart of the entire service.

Jun 23 12:59:03 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 
Jun 23 12:59:04 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 12:59:04 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 12:59:43 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering 
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started 

PASS

Comment 22 Lon Hohberger 2011-06-23 17:03:19 UTC
Unit test 6:

                <service name="test">
                        <script name="a" file="/tmp/test1.sh">
                                <script name="b" file="/tmp/test2.sh" __independent_subtree="1" __max_restarts="1" __restart_expire_time="3600"/>
                        </script>
                        <script name="truth" file="/bin/true"/>
                </service>

The first failure of test2 should restart just test2, the second should restart the whole service.

Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status 
Jun 23 13:02:34 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery 
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 13:02:34 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering 
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test 
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 13:03:05 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 13:03:05 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 13:03:05 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started

PASS

Comment 23 Lon Hohberger 2011-06-23 17:24:54 UTC
Unit test 7 (this bug):

                <service name="test">
                        <script name="a" file="/tmp/test1.sh"/>
                        <script name="b" file="/tmp/test2.sh"/>
                        <script name="truth" file="/bin/true"/>
                </service>

When test2.sh fails, the whole service must be restarted.

Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status 
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1) 
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error) 
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test 
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop 
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop 
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop 
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering 
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test 
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start 
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start 
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start 
Jun 23 13:24:37 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started 

PASS

Comment 28 Jaroslav Kortus 2011-06-27 16:12:19 UTC
Thanks Lon for this excellent testing coverage.
I've tried all the tests you described and everything worked as expected.

The only minor issue I found was that when the service is updated so that a resource no longer uses __independent_subtree="2", the partial status remains until the service is restarted or relocated. In other words, the automated restart did not clear the flag. This has no real effect on the service behaviour; it just makes the clustat output confusing.

Marking as verified, thank you again :).

rgmanager-2.0.52-21.el5 @x86_64

Comment 29 Lon Hohberger 2011-06-28 13:53:24 UTC
No problem there -- not clearing the partial flag is a known issue which will not be fixed; it's noted here:

https://bugzilla.redhat.com/show_bug.cgi?id=605733#c14

Known Issues, item I.A.

Comment 30 errata-xmlrpc 2011-07-21 10:44:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html

