Bug 247772
Summary: RFE: One service following another

| | | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Mark Hlawatschek <hlawatschek> |
| Component: | rgmanager | Assignee: | Lon Hohberger <lhh> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | cluster-maint, grimme, helge.deller, nphilipp, rdoty |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2008-0791 | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-07-25 19:15:14 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 247980, 250101 | | |
| Bug Blocks: | 251044, 367631 | | |
| Attachments: | | | |
Description
Mark Hlawatschek 2007-07-11 12:01:28 UTC
Created attachment 158940 [details]
basic idea of service following another
Hi Mark,

Going over this again, what I whiteboarded was something like:

      1           2           3
   +-----+     +-----+     +-----+
   |  A  |     |  B  |     |     |     A and B are on separate nodes
   |     |     |     |     |     |
   +-----+     +-----+     +-----+

Node 1 dies.

   +- - -+     +-----+     +-----+
   |  A  |     |  B  |     |     |     B is running; A is on dead node 1
   |     |     |     |     |     |
   + - - +     +-----+     +-----+

Node 1 is fenced. Node 2 starts A.

   +     +     +-----+     +-----+
               |  A  |     |     |
               |  B  |     |     |
   +     +     +-----+     +-----+

After A's startup is complete, node 2 stops B.

   +     +     +-----+     +-----+
               |  A  |     |     |
               |     |     |     |
   +     +     +-----+     +-----+

Finally, node 3 starts B.

   +     +     +-----+     +-----+
               |  A  |     |  B  |
               |     |     |     |
   +     +     +-----+     +-----+

Now, what I would like to know is: paint me a picture of what happens if node 2 failed instead of node 1. I imagine it's just "node 3 starts B".

Also, as far as I'm aware, in the particular instance we're concerned with (SAP), this is mostly an optimization, correct? It could be that we just start 'A' on node 3. Restoring from the replication server can occur over the network, but at a significant performance hit.

Also, the 'avoid' patch can more or less be done with the exclusive flag (or should be able to be done) in most cases, unless there are more services than nodes.

Hi Lon,

the following picture shows the case where node 2 fails:

      1           2           3
   +-----+     +-----+     +-----+
   |  A  |     |  B  |     |     |     A and B are on separate nodes
   |     |     |     |     |     |
   +-----+     +-----+     +-----+

Node 2 dies.

   +-----+     +- - -+     +-----+
   |  A  |     |  B  |     |     |     A is running; B is on dead node 2
   |     |     |     |     |     |
   +-----+     +- - -+     +-----+

Node 2 is fenced. Node 3 starts B.

   +-----+     +     +     +-----+
   |  A  |                 |  B  |     B is running on node 3
   |     |                 |     |
   +-----+     +     +     +-----+

The enqueue service must be started on the node where the replication service is running. The enqueue service will then attach the shared memory segment holding the data (lock tables). If the HA software does not support this feature, the "polling" concept must be used, i.e. the replication service must be started on all nodes in the failover domain.
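The follow-on recovery ordering described above (fence the dead node, start the enqueue service where the replication server runs, then move the replication server elsewhere) can be sketched as a small simulation. This is illustrative only; the function and service names are invented for this example and are not rgmanager's API:

```python
# Illustrative sketch of the "B follows A" recovery ordering; names are
# invented for this example, not part of rgmanager.

def fence(node):
    pass  # stand-in for the real fencing operation


def recover(placement, dead_node, spare_node):
    """placement maps service -> node; service B must follow service A.

    A is the enqueue service, B the replication server: on failure of
    A's node, A restarts where B runs (to attach B's shared memory),
    then B is stopped there and restarted on a spare node.
    """
    placement = dict(placement)
    failed = [svc for svc, node in placement.items() if node == dead_node]
    fence(dead_node)
    for svc in failed:
        if svc == "A":
            placement["A"] = placement["B"]  # node running B starts A
            placement["B"] = spare_node      # B is stopped, restarted elsewhere
        else:
            placement[svc] = spare_node      # e.g. node 2 dies: node 3 starts B
    return placement


# Node 1 (running A) dies: A follows to B's node; B moves to node 3.
print(recover({"A": 1, "B": 2}, dead_node=1, spare_node=3))  # {'A': 2, 'B': 3}
# Node 2 (running B) dies: node 3 simply starts B.
print(recover({"A": 1, "B": 2}, dead_node=2, spare_node=3))  # {'A': 1, 'B': 3}
```

The two calls at the end correspond to the two whiteboard scenarios above: the node-1 failure requires the full follow-on dance, while the node-2 failure reduces to "node 3 starts B".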
The drawback: multiple replication servers cause a significant performance loss for the enqueue service, because replication is done synchronously. A performance hit on the enqueue service would mean a performance hit for the whole SAP application.

I assume that technically the exclusive flag could be used to keep the replication service from starting on the same node where the enqueue service runs. But normally multiple cluster services are running on an SAP cluster. The enqueue replication service should be able to share a node with other services; it wouldn't normally be an option to dedicate exclusive servers to the enqueue replication.

Note: bug #247776 is the same for RHCS5. I'm not sure what version this should be targeted for - I set the flag for cluster-4.6. If this is wrong, please set the flag properly.

Created attachment 179501 [details]
Preliminary event parser specification.
Created attachment 231331 [details]
Updated specification w/ example script which is being tested
Note: the example script included there is actually overly complex; it's doing the work of three different event handlers:

* main server start
* replication queue server start
* node transition (node up)

The script language, despite being fairly complex, allows a whole lot of flexibility. For example, 'follows-push-away' logic could now be added to rgmanager trivially by customers.

Created attachment 231411 [details]
Patch against RHEL5
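The three-handler split described in the previous comment can be modeled with a simple registration table. This is a sketch only; the event names and handler functions are invented for illustration and do not reflect rgmanager's actual event interface:

```python
# Sketch of splitting one monolithic event script into the three
# handlers named above. Event and handler names are invented here.

handlers = {}


def on(event):
    """Register the decorated function as the handler for one event type."""
    def register(fn):
        handlers[event] = fn
        return fn
    return register


@on("main_server_start")
def handle_main_start(ctx):
    return f"start main server on node {ctx['node']}"


@on("replication_start")
def handle_replication_start(ctx):
    return f"start replication queue server on node {ctx['node']}"


@on("node_up")
def handle_node_up(ctx):
    return f"re-evaluate placement after node {ctx['node']} joined"


def dispatch(event, ctx):
    """Route an incoming event to its registered handler."""
    return handlers[event](ctx)


print(dispatch("node_up", {"node": 3}))
```

Keeping each concern in its own handler is what makes the monolithic example script in the attachment look more complex than the mechanism requires.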
TODO:
* User event processing (e.g. clusvcadm -r service)
* Relocate operation (relocate-or-migrate)
* Migration detection on service start

Created attachment 231421 [details]
Default catch-all script
TODO:
* Make this the default catch-all. Currently not part of the patch; install in
/usr/share/cluster and place:
<event name="catchall" priority="100"
file="/usr/share/cluster/default_event_script.sl"/> in cluster.conf
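A rough model of what the priority attribute in the snippet above implies: a more specific event entry can take precedence, with the high-numbered catch-all used as the fallback. This is purely illustrative Python (rgmanager evaluates S-Lang scripts from cluster.conf, and the exact matching semantics here are an assumption; the specific event name and script file besides the catch-all are invented):

```python
# Rough model of catch-all event dispatch, mirroring the
# <event name="catchall" priority="100" .../> entry above.
# Purely illustrative; the non-catchall entry is invented.

events = [
    # (name, priority, script) -- lower priority number is more specific
    ("service_a_start", 1, "handle_service_a.sl"),
    ("catchall", 100, "/usr/share/cluster/default_event_script.sl"),
]


def pick_script(event_name):
    """Return the most specific matching script, falling back to the catch-all."""
    matches = [e for e in events if e[0] in (event_name, "catchall")]
    name, prio, script = min(matches, key=lambda e: e[1])
    return script


print(pick_script("service_a_start"))  # handle_service_a.sl
print(pick_script("node_down"))        # falls back to the default event script
```

With priority 100, the default script loses to any lower-numbered entry, which is what makes it safe to install as a blanket default.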
* Possibility of adding an email-notification API to the script language

Created attachment 253261 [details]
Event scripting 0.7 - RHEL5
Created attachment 253271 [details]
Updated specification
rgmanager event scripting "RIND" v0.7 ("RIND Is Not Dependencies"). The patch is against the current RHEL5 branch of rgmanager and should apply.

Changes since 0.5 include:
* User request handling is centralized
* Recovery is centralized

Todo:
* Migration
* More testing
* clusvcadm doesn't get correct return codes yet
* Copyright / license stuff. It all falls under the GPL v2, though.

Requirements:
* You need to install slang and slang-devel to build with this patch.

Pushed to RHEL4 git branch.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html