Bug 743214
| Summary: | Cluster problem: "hard" service dependencies | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Frank Danapfel <fdanapfe> | ||||||
| Component: | rgmanager | Assignee: | Lon Hohberger <lhh> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 5.7 | CC: | cluster-maint, costan, djansa, edamato, jentrena, rsteiger | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | 5.8 | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | rgmanager-2.0.52-27.el5 | Doc Type: | Bug Fix | ||||||
| Doc Text: |
Cause: A "stopped" event for a given service was being processed after the a service's dependent service had already been restarted.
Consequence: The dependent service would erroneously restart.
Fix: Ignore the "stopped" event if both services are running and the dependent service was started after the dependency.
Result: The "stopped" event no longer causes the dependent service to restart.
|
Story Points: | --- | ||||||
| Clone Of: | |||||||||
| : | 743218 (view as bug list) | Environment: | |||||||
| Last Closed: | 2012-02-21 06:19:28 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 743218, 772956 | ||||||||
| Attachments: |
|
||||||||
|
Description
Frank Danapfel
2011-10-04 09:24:46 UTC
Created attachment 526209 [details]
cluster.conf
cluster.conf from customer
Created attachment 526212 [details]
syslog entries during test showing the duplicate start of dependent service
Here, b depends on a -- both were on member 1, and when it failed: Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <info> Marking service service:a on down member 1 as stopped Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <notice> Starting service:a on [ 2 ] Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <debug> Sent remote-start request to 2 Jan 5 00:14:31 rhel5-2 clurgmgrd[3023]: <notice> Starting stopped service service:a Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Service service:a is now running on member 2 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Executing sortedservices node event handler for service: service:b Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Evaluating service:b state=started owner=1 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <info> Marking service service:b on down member 1 as stopped Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Starting service:b on [ 2 ] Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <debug> Sent remote-start request to 2 Jan 5 00:14:32 rhel5-2 clurgmgrd[3023]: <notice> Starting stopped service service:b Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <notice> Service service:b started Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <notice> Service service:b is now running on member 2 Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Executing default service event handler Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Skipping service:b; restart not needed Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> Executing default service event handler Jan 5 00:14:33 rhel5-2 last message repeated 2 times Jan 5 00:14:33 rhel5-2 clurgmgrd[3023]: <debug> 5 events processed The important line is the 'Skipping service:b' debug information - prior to the patches here, this is where service:b would have been restarted. https://www.redhat.com/archives/cluster-devel/2012-January/msg00006.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00007.html Some bugs with the patches, so I am submitting new ones.
Test report (after uncommenting debug lines in default_event_script.sl) with fixed version for:
* config #1:
<rm central_processing="1" log_level="7" ... >
<service "a" />
<service "b" depend="service:a" />
</rm>
* failure case #1
a and b on node 1, kill node 1.
Expected results, move A over, move B over, B does not cycle:
Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Membership Change Event
Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b
Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:21:24 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=1
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <info> Marking service service:a on down member 1 as stopped
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Starting service:a on [ 2 ]
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:a
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Service service:a started
Jan 5 10:21:25 rhel5-2 clurgmgrd[4759]: <notice> Service service:a is now running on member 2
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=1
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <info> Marking service service:b on down member 1 as stopped
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ]
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> S:service:a trans 1325776885
Jan 5 10:21:26 rhel5-2 clurgmgrd[4759]: <debug> D:service:b trans 1325776886
Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> Skipping service:b; restart not needed
Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:21:27 rhel5-2 clurgmgrd[4759]: <debug> 5 events processed
Result: PASS
* failure case #2
a on node 1, b on node 2, kill node 1.
Expected results, move a over, b restarts:
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Membership Change Event
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=1
Jan 5 10:28:27 rhel5-2 clurgmgrd[4759]: <info> Marking service service:a on down member 1 as stopped
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Starting service:a on [ 2 ]
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:a
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Service service:a started
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <notice> Service service:a is now running on member 2
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=2
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> service:b is already running on best node
Jan 5 10:28:28 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <info> Dependency lost; stopping service:b
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Stopping service service:b
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is stopped
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <info> Dependency met; starting service:b
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=stopped owner=-1
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ]
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:28:29 rhel5-2 clurgmgrd[4759]: <debug> 5 events processed
Result: PASS
* failure case #3
b on node 1, a on node 2, kill node 1.
Expected results: move b over
Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <debug> sorted_service_list service:a, service:b
Jan 5 10:31:18 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:a state=started owner=2
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> service:a is already running on best node
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Evaluating service:b state=started owner=1
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <info> Marking service service:b on down member 1 as stopped
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <notice> Starting service:b on [ 2 ]
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <debug> Sent remote-start request to 2
Jan 5 10:31:19 rhel5-2 clurgmgrd[4759]: <notice> Starting stopped service service:b
Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <notice> Service service:b started
Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <notice> Service service:b is now running on member 2
Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> Executing default service event handler
Jan 5 10:31:20 rhel5-2 clurgmgrd[4759]: <debug> 3 events processed
Result: PASS
* config #2 (similar to config #1 except services in opposite order)
<rm central_processing="1" log_level="7" ... >
<service "b" depend="service:a" />
<service "a" />
</rm>
* failure case #1 (same expected results, but logs will differ)
Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event
Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a
Jan 5 10:40:18 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> service:b is not runnable; dependency not met
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=1
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <info> Marking service service:a on down member 1 as stopped
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <notice> Starting service:a on [ 2 ]
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2
Jan 5 10:40:19 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:a
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:a started
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:a is now running on member 2
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <info> Dependency met; starting service:b
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=stopped owner=-1
Jan 5 10:40:20 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ]
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:40:21 rhel5-2 clurgmgrd[6800]: <debug> 5 events processed
Result: PASS
* failure case 2, same expected results:
Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event
Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a
Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:48:40 rhel5-2 clurgmgrd[6800]: <debug> service:b is not runnable; dependency not met
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Stopping service service:b
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is stopped
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=1
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <info> Marking service service:a on down member 1 as stopped
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <notice> Starting service:a on [ 2 ]
Jan 5 10:48:41 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:a
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Service service:a started
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Service service:a is now running on member 2
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <info> Dependency met; starting service:b
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=stopped owner=-1
Jan 5 10:48:42 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ]
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:48:43 rhel5-2 clurgmgrd[6800]: <debug> 5 events processed
Result: PASS
* failure case 3, same expected results:
Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <debug> Membership Change Event
Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <info> State change: rhel5-1.lhh.pvt DOWN
Jan 5 10:51:18 rhel5-2 clurgmgrd[6800]: <debug> sorted_service_list service:b, service:a
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:b
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:b state=started owner=1
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <info> Marking service service:b on down member 1 as stopped
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <notice> Starting service:b on [ 2 ]
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <debug> Sent remote-start request to 2
Jan 5 10:51:19 rhel5-2 clurgmgrd[6800]: <notice> Starting stopped service service:b
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:b started
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <notice> Service service:b is now running on member 2
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Executing sortedservices node event handler for service: service:a
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Evaluating service:a state=started owner=2
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> service:a is already running on best node
Jan 5 10:51:20 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:51:21 rhel5-2 clurgmgrd[6800]: <debug> Executing default service event handler
Jan 5 10:51:21 rhel5-2 clurgmgrd[6800]: <debug> 3 events processed
Result: PASS
https://www.redhat.com/archives/cluster-devel/2012-January/msg00033.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00034.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00035.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00036.html https://www.redhat.com/archives/cluster-devel/2012-January/msg00037.html
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: A "stopped" event for a given service was being processed after the a service's dependent service had already been restarted.
Consequence: The dependent service would erroneously restart.
Fix: Ignore the "stopped" event if both services are running and the dependent service was started after the dependency.
Result: The "stopped" event no longer causes the dependent service to restart.
*** Bug 679753 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0163.html |