Description of problem: Two Cluster Services are defined. The first service "sapgt1ci" starts th e database and the central SAP instance. The second cluster service "sapgt1ap1" starts another SAP application instance. "sapgt1ap1" has a hard dependency on sapgt1ci. While executing the cluster tests / failover tests we encountered a bug: Setup: Cluster Server lxcn1gt1 running no services and cluster server lxcn2gt1 running both sapgt1ci and sapgt1ap1. Cluster node lxcn2gt1 (the one running everything) gets switched off (power off). Now both services get migrated to the remaining node. The start sequence is correct and matches the configured dependency. But after both service s were successfully started sapgt1ap1 gets stopped again (!) while rgmanager prints a message about "lost dependency" into the syslog. Afte r sapgt1ap1 was successfully stopped it gets started again. This time everything stays up. The re-start of sapgt1ap1 should not occur. with the services swapped in /etc/cluster/cluster.conf (now sapgt1ap1 is defined first and the sapgtci second) we see a slightly different behavi or: If both services run on the same cluster node (in our test case lxcn2gt1-hb) and this node gets powered off....: The remaining node (lxcn1gt1-hb) first tries to start sapgt1ap1 (completely ignoring the hard dependency on sapgt1ci!!) - this start attempt fails without the central instance running. Then it starts sapgt1ci (SAP DB and central instance) and then it starts sapgt1ap1 again (which works this time). To recapture the initial problem with the service definition of sapgt1ci first in /etc/cluster/cluster.conf and sapgtap1 second: During the failover of both services the remaining node - first starts sapgt1ci - then starts sapgt1ap1 - then stops sapgt1ap1 (message about missing dependency) - then starts sapgt1ap1 again Both start sequences are flawed..... Nevertheless all instances were started in both cases (with an extra ste p in between). Version-Release number of selected component (if applicable): rgmanager-2.0.52-21.el5.x86_64 How reproducible: always
Created attachment 526209 [details] cluster.conf cluster.conf from customer
Created attachment 526212 [details] syslog entries during test showing the duplicate start of dependent service
Here, b depends on a -- both were on member 1, and when it failed: Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <info> Marking service service:a on down member 1 as stopped Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <notice> Starting service:a on [ 2 ] Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <debug> Sent remote-start request to 2 Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <notice> Starting stopped service service:a Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Service service:a is now running on member 2 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Evaluating service:b state=started owner=1 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <info> Marking service service:b on down member 1 as stopped Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Starting service:b on [ 2 ] Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Sent remote-start request to 2 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Starting stopped service service:b Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <notice> Service service:b started Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <notice> Service service:b is now running on member 2 Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Executing default service event handler Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Skipping service:b; restart not needed Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Executing default service event handler Jan 5 00:14:33 rhel5-2 last message repeated 2 times Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> 5 events processed The important line is the 'Skipping service:b' debug information - prior to the patches here, this is where service:b would have been restarted.
https://www.redhat.com/archives/cluster-devel/2012-January/msg00006.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00007.html
Some bugs with the patches, so I am submitting new ones. Test report (after uncommenting debug lines in default_event_script.sl) with fixed version for: * config #1: <rm central_processing="1" log_level="7" ... > <service "a" /> <service "b" depend="service:a" /> </rm> * failure case #1 a and b on node 1, kill node 1. Expected results, move A over, move B over, B does not cycle: Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Membership Change Event Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=1 Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <info> Marking service service:a on down member 1 as stopped Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Starting service:a on [ 2 ] Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2 Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:a Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Service service:a started Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Service service:a is now running on member 2 Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=1 Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <info> Marking service service:b on down member 1 as stopped Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ] Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2 Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2 Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> S:service:a trans 1325776885 Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> D:service:b trans 1325776886 Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> Skipping service:b; restart not needed Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> 5 events processed Result: PASS * failure case #2 a on node 1, b on node 2, kill node 1. Expected results, move a over, b restarts: Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Membership Change Event Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=1 Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <info> Marking service service:a on down member 1 as stopped Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Starting service:a on [ 2 ] Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2 Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:a Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Service service:a started Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Service service:a is now running on member 2 Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=2 Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> service:b is already running on best node Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <info> Dependency lost; stopping service:b Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Stopping service service:b Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is stopped Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <info> Dependency met; starting service:b Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=stopped owner=-1 Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ] Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2 Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2 Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> 5 events processed Result: PASS * failure case #3 b on node 1, a on node 2, kill node 1. Expected results: move b over Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=2 Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> service:a is already running on best node Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=1 Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <info> Marking service service:b on down member 1 as stopped Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ] Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2 Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2 Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> 3 events processed Result: PASS
* config #2 (similar to config #1 except services in opposite order) <rm central_processing="1" log_level="7" ... > <service "b" depend="service:a" /> <service "a" /> </rm> * failure case #1 (same expected results, but logs will differ) Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> service:b is not runnable; dependency not met Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=1 Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <info> Marking service service:a on down member 1 as stopped Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <notice> Starting service:a on [ 2 ] Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2 Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:a Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:a started Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:a is now running on member 2 Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <info> Dependency met; starting service:b Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=stopped owner=-1 Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ] Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2 Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2 Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> 5 events processed Result: PASS * failure case 2, same expected results: Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> service:b is not runnable; dependency not met Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Stopping service service:b Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is stopped Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=1 Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <info> Marking service service:a on down member 1 as stopped Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Starting service:a on [ 2 ] Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2 Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:a Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Service service:a started Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Service service:a is now running on member 2 Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <info> Dependency met; starting service:b Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=stopped owner=-1 Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ] Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2 Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2 Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> 5 events processed Result: PASS * failure case 3, same expected results: Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=started owner=1 Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <info> Marking service service:b on down member 1 as stopped Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ] Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2 Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2 Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=2 Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> service:a is already running on best node Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:51:21 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler Jan 5 10:51:21 rhel5-2 clurgmgrd[6800]: <debug> 3 events processed Result: PASS
https://www.redhat.com/archives/cluster-devel/2012-January/msg00033.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00034.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00035.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00036.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00037.html
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: A "stopped" event for a given service was being processed after the a service's dependent service had already been restarted. Consequence: The dependent service would erroneously restart. Fix: Ignore the "stopped" event if both services are running and the dependent service was started after the dependency. Result: The "stopped" event no longer causes the dependent service to restart.
*** Bug 679753 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0163.html