Description of problem:
Two script resources (parent and child) are included in a service, both with __independent_subtree="1". With rgmanager version 2.0.52-9.el5, when the child resource is detected as failed, both resources are restarted instead of only the child. With rgmanager version 2.0.52-6.el5_5.8 it works as expected.

Extract of cluster.conf:

<service nfslock="1" autostart="1" domain="node1-first" exclusive="0" max_restarts="3" name="test" recovery="relocate" restart_expire_time="900">
  <script file="/root/test1" name="script1" __independent_subtree="1">
    <script file="/root/test2" name="script2" __independent_subtree="1"/>
  </script>
</service>

Version-Release number of selected component (if applicable):
2.0.52-9.el5

How reproducible:
Always

Steps to Reproduce:
1. Create a service with a parent and a child resource and set __independent_subtree="1" on both.
2. Make the child resource fail.

Actual results:
Both parent and child (script1 and script2 in the example) are restarted:

Jun 7 19:57:50 node1 clurgmgrd[14575]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 7 19:57:51 node1 logger: stop test2
Jun 7 19:57:51 node1 logger: stop test1
Jun 7 19:57:51 node1 logger: start test1
Jun 7 19:57:51 node1 logger: start test2
Jun 7 19:57:51 node1 clurgmgrd[14575]: <notice> Inline recovery of service:test complete

Expected results:
Only the child resource (script2) is restarted. Output with version 2.0.52-6.el5_5.8:

Jun 7 19:52:20 node1 clurgmgrd[11160]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 7 19:52:20 node1 logger: stop test2
Jun 7 19:52:20 node1 logger: start test2
Jun 7 19:52:20 node1 clurgmgrd[11160]: <notice> Inline recovery of service:test succeeded

Additional info:
Reproduced.
Created attachment 506331 [details] Fix
Example service configuration:

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="b" file="/tmp/test2.sh" __independent_subtree="2"/>
  </script>
</service>
Oops, that's for regression testing against the non-critical services. Here's the reproducer I used:

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="b" file="/tmp/test2.sh" __independent_subtree="1"/>
  </script>
</service>
Created attachment 506568 [details] test1.sh from referenced service configurations. Place in /tmp.
Created attachment 506583 [details] test2.sh from referenced service configurations. Place in /tmp.
Unit test result before patch:

Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <notice> status on script "b" returned 1 (generic error)
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test2.sh stop
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test1.sh stop
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test1.sh start
Jun 23 10:25:38 rhel5-1 clurgmgrd: [16856]: <info> Executing /tmp/test2.sh start
Jun 23 10:25:38 rhel5-1 clurgmgrd[16856]: <notice> Inline recovery of service:test complete
Unit test (comment #6) after patch:

Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 10:53:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete

[root@rhel5-1 ~]# rpm -q rgmanager
rgmanager-2.0.52-21.el5
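For anyone re-running these tests, a small helper can pull the stop/start sequence out of the syslog excerpt instead of reading it by eye. This is only a sketch: the function names and the regular expression are mine, and it assumes the "<info> Executing <script> <action>" message format seen in the logs in this report.

```python
import re

# Assumed message format, taken from the excerpts in this report:
#   "... clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop"
EXEC_RE = re.compile(r"<info> Executing (\S+) (start|stop|status)\b")


def executed_actions(log_text, actions=("stop", "start")):
    """Return (script, action) pairs in log order; status checks are skipped by default."""
    return [(m.group(1), m.group(2))
            for m in EXEC_RE.finditer(log_text)
            if m.group(2) in actions]


def restarted_scripts(log_text):
    """Return the scripts that were stopped and later started again, in start order."""
    stopped = set()
    restarted = []
    for script, action in executed_actions(log_text):
        if action == "stop":
            stopped.add(script)
        elif script in stopped and script not in restarted:
            restarted.append(script)
    return restarted


# Trimmed excerpt of the "after patch" log (only the Executing lines).
AFTER_PATCH = """\
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 10:53:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
"""

print(restarted_scripts(AFTER_PATCH))  # ['/tmp/test2.sh']
```

Run against the "before patch" excerpt, the same helper would report both /tmp/test1.sh and /tmp/test2.sh, which is the regression described in this bug.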
Problem introduced here: http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/restree.c;h=ea458d696362e3605c6253731aa579cd3ccc3a4d;hp=3a03f913959eaac798563fa7dd0af0163bb918b5;hb=06993e7d6253dbb9a0e83c8edeba4d7a99f61954;hpb=f17eaaf6827237cd13d9086e7b1fbd6eaf702db1 I now must perform a full retest of 605733 to ensure changing the line back to what it was prior does not cause a regression in the Non-Critical functionality.
Unit test 1 (605733):

1) Setting test2 to __independent_subtree="2" in cluster.conf should cause the test2 script to be disabled, and the service to add the partial flag:

Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:01:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery.
Jun 23 11:01:31 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation.
Jun 23 11:02:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status

PASS
Unit test 2 (605733):

Adding __max_restarts="1" __restart_expire_time="3600" should cause a recovery of just test2, followed by a quiesce of test2:

Jun 23 11:04:54 rhel5-1 clurgmgrd[20911]: <info> Starting changed resources.
Jun 23 11:04:57 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test
Jun 23 11:04:57 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:05:01 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:05:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 11:05:01 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:05:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery.
Jun 23 11:05:31 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation.

PASS
Unit test 3 (605733):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="b" file="/tmp/test2.sh" __independent_subtree="2">
      <script name="truth" file="/bin/true"/>
    </script>
  </script>
</service>

Adding a child script (in this case, /bin/true) should result in both the test2 script and the new child script being stopped on failure.

Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 11:06:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery.
Jun 23 11:06:51 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation.
Jun 23 11:07:21 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status

3.b: convalesce should restore both to operation:

Jun 23 11:07:51 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:08:14 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test
Jun 23 11:08:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 11:08:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 11:08:14 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful

PASS
Unit test 2 (this bug):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="b" file="/tmp/test2.sh" __independent_subtree="1">
      <script name="truth" file="/bin/true"/>
    </script>
  </script>
</service>

Independent subtree below and including b should be restarted if b fails.

Jun 23 11:10:41 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:11:01 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 11:11:01 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:11:01 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 11:11:02 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 11:11:02 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete

PASS
Unit test 3 (this bug):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="truth" file="/bin/true">
      <script name="b" file="/tmp/test2.sh"/>
    </script>
  </script>
</service>

test2.sh's failure should be propagated up to test1.sh and all three should be restarted.

Jun 23 11:13:31 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 11:13:32 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 11:13:32 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete

PASS
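The propagation rule exercised by these tests can be written down as a small model, which may help when predicting the expected log for a new configuration. This is a sketch inferred from the test results in this report, not rgmanager's actual code: recovery_plan and its return values are hypothetical names, the model ignores __max_restarts, and it assumes same-type resources start in document order and stop in the reverse order, as the logs here show.

```python
import xml.etree.ElementTree as ET


def _preorder(node):
    """Yield a resource and its descendants, parents before children."""
    yield node
    for child in node:
        yield from _preorder(child)


def recovery_plan(service_xml, failed_name):
    """Return (action, stop_order, start_order) for a failed resource.

    Walk up from the failed resource to the nearest resource (itself
    included) that has __independent_subtree set:
      "1"  -> restart that subtree only
      "2"  -> stop that subtree only (non-critical)
      none -> restart the whole service
    """
    service = ET.fromstring(service_xml)
    parent_of = {child: parent for parent in service.iter() for child in parent}
    failed = next(n for n in service.iter()
                  if n is not service and n.get("name") == failed_name)

    node = failed
    while node is not service and node.get("__independent_subtree") not in ("1", "2"):
        node = parent_of[node]

    roots = list(service) if node is service else [node]
    start = [n.get("file") for root in roots for n in _preorder(root)]
    stop = list(reversed(start))

    if node is service:
        return "restart-service", stop, start
    if node.get("__independent_subtree") == "2":
        return "stop-subtree", stop, []
    return "restart-subtree", stop, start


# Configuration from unit test 3 (this bug).
CONF = """
<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="truth" file="/bin/true">
      <script name="b" file="/tmp/test2.sh"/>
    </script>
  </script>
</service>
"""

action, stop, start = recovery_plan(CONF, "b")
print(action)  # restart-subtree
print(stop)    # ['/tmp/test2.sh', '/bin/true', '/tmp/test1.sh']
print(start)   # ['/tmp/test1.sh', '/bin/true', '/tmp/test2.sh']
```

The stop and start lists match the order of the Executing lines in the log for this test. Before the patch, the reproducer from this bug behaved as if the walk had skipped the child's own __independent_subtree="1" and stopped at the parent instead.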
Unit test 4 (this bug):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1">
    <script name="b" file="/tmp/test2.sh"/>
  </script>
  <script name="truth" file="/bin/true"/>
</service>

test2's failure should cause a restart of test1 and test2, but not affect /bin/true.

Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 12:54:13 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 12:54:13 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status
Jun 23 12:54:13 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 12:54:14 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 12:54:14 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete

PASS
Unit test 4 (605733):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="2">
    <script name="b" file="/tmp/test2.sh"/>
  </script>
  <script name="truth" file="/bin/true"/>
</service>

After test2 fails, test1 and test2 should be quiesced, and /bin/true should remain operational.

Jun 23 12:55:23 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 12:55:23 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 12:55:24 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Note: Some non-critical resources were stopped during recovery.
Jun 23 12:55:24 rhel5-1 clurgmgrd[20911]: <notice> Run 'clusvcadm -c service:test' to restore them to operation.
Jun 23 12:56:03 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status

4.b: clusvcadm -c should restore test2 and test1 to operation

Jun 23 12:57:09 rhel5-1 clurgmgrd[20911]: <info> Repairing service:test
Jun 23 12:57:10 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 12:57:10 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 12:57:10 rhel5-1 clurgmgrd[20911]: <info> Repair of service:test was successful

PASS
Unit test 5 (this bug):

<service name="test">
  <script name="a" file="/tmp/test1.sh" __independent_subtree="1" __max_restarts="1" __restart_expire_time="3600">
    <script name="b" file="/tmp/test2.sh"/>
  </script>
  <script name="truth" file="/bin/true"/>
</service>

The first failure of test2 should cause a restart of just test1 and test2. The second should cause a restart of the entire service.

Jun 23 12:59:03 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status
Jun 23 12:59:04 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 12:59:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 12:59:04 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 12:59:43 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 12:59:44 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 12:59:44 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started

PASS
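The interaction of __max_restarts and __restart_expire_time in the test above can be pictured as a sliding-window counter. The sketch below is my own illustration of the behaviour observed in the log (first failure recovered inline, second failure 40 seconds later escalated), not rgmanager's implementation; the class name RestartBudget and the "inline"/"escalate" labels are hypothetical, and the sketch only covers __max_restarts of 1 or more.

```python
from collections import deque


class RestartBudget:
    """Sliding-window counter for inline restarts of one independent subtree."""

    def __init__(self, max_restarts, expire_time):
        self.max_restarts = max_restarts   # models __max_restarts (>= 1 here)
        self.expire_time = expire_time     # models __restart_expire_time, in seconds
        self._restarts = deque()           # timestamps of recent inline restarts

    def on_failure(self, now):
        """Return "inline" if the subtree may be recovered in place, else "escalate"."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.expire_time:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return "escalate"
        self._restarts.append(now)
        return "inline"


budget = RestartBudget(max_restarts=1, expire_time=3600)
print(budget.on_failure(0))     # inline   (12:59:04 in the log above)
print(budget.on_failure(40))    # escalate (12:59:44 in the log above)
print(budget.on_failure(3700))  # inline   (the earlier restart has expired)
```

In this test "escalate" corresponds to stopping and recovering the whole service. For a subtree marked __independent_subtree="2", unit test 2 (605733) shows that exhausting the budget leads to the subtree being quiesced instead.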
Unit test 6:

<service name="test">
  <script name="a" file="/tmp/test1.sh">
    <script name="b" file="/tmp/test2.sh" __independent_subtree="1" __max_restarts="1" __restart_expire_time="3600"/>
  </script>
  <script name="truth" file="/bin/true"/>
</service>

The first failure of test2 should restart just test2, the second should restart the whole service.

Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true status
Jun 23 13:02:34 rhel5-1 clurgmgrd[20911]: <warning> Some independent resources in service:test failed; Attempting inline recovery
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 13:02:34 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 13:02:34 rhel5-1 clurgmgrd[20911]: <notice> Inline recovery of service:test complete
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh status
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering
Jun 23 13:03:04 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test
Jun 23 13:03:04 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 13:03:05 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 13:03:05 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 13:03:05 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started

PASS
Unit test 7 (this bug):

<service name="test">
  <script name="a" file="/tmp/test1.sh"/>
  <script name="b" file="/tmp/test2.sh"/>
  <script name="truth" file="/bin/true"/>
</service>

When test2.sh fails, the whole service must be restarted.

Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh status
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <err> script:b: status of /tmp/test2.sh failed (returned 1)
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> status on script "b" returned 1 (generic error)
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Stopping service service:test
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true stop
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh stop
Jun 23 13:24:36 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh stop
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Service service:test is recovering
Jun 23 13:24:36 rhel5-1 clurgmgrd[20911]: <notice> Recovering failed service service:test
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test1.sh start
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /tmp/test2.sh start
Jun 23 13:24:37 rhel5-1 clurgmgrd: [20911]: <info> Executing /bin/true start
Jun 23 13:24:37 rhel5-1 clurgmgrd[20911]: <notice> Service service:test started

PASS
Thanks Lon for this excellent test coverage. I've tried all the tests you described and everything worked as expected. The only minor issue I found was that when the configuration is updated so that the resource no longer has __independent_subtree="2", the partial status remains until the service is restarted or relocated. In other words, the automated restart did not clear the flag. This has no real effect on the service behaviour; it just makes the output of clustat confusing.

Marking as verified, thank you again :).

rgmanager-2.0.52-21.el5 @x86_64
No problem there -- not clearing the partial flag is a known issue which will not be fixed; it's noted here: https://bugzilla.redhat.com/show_bug.cgi?id=605733#c14 Known Issues, item I.A.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html