| Summary: | Send alerts for events implied by fencing | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ken Gaillot <kgaillot> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.3 | CC: | abeekhof, cluster-maint, jkortus, kwenning |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-18 15:52:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Ken Gaillot
2016-09-22 16:54:53 UTC
The main argument against implementing this feature is that alerts are not intended to maintain a picture of the current cluster state, but rather to respond to important events that occur -- and an implied event didn't actually happen.

Additionally, it raises some complications:

* Resource action alerts are normally sent by the node that executed the action. If the node was fenced, we clearly must send the alert from a different node (presumably, the DC). This may not be reliable (has a DC been elected?), and it may be unexpected by alert agent writers (who may want to take some action on the local node).
* Implementation would be simple for fencing initiated by the cluster, but much more complicated for fencing initiated externally (such as by DLM or stonith_admin).

-1

Alerts are for events that occur in the cluster. They were never intended as a way to maintain a parallel copy of the system state. In this case, the stops (and demotions) did not happen, so there should be no alert.

I could imagine cluster initiated fencing alerts containing a list of resources that were thought to be active on the target node though.

I'd say that implied events did happen by definition :). Based on this happening, we can also run the service elsewhere.

Maintain a parallel copy of the system state... Isn't this what customers are doing with their monitoring software? :). We should be able to do that much more reliably, or at least get as close as possible.

From my point of view the only thing that customer cares for is if the HA service is up or not. Fencing might cause service disruption and even complete downtime if there is no other node to host that service (disabled, failcounts,...). In case we do not send any alert at all, the "parallel copy" of their universe still thinks it is up, which effectively masks the failure.

To be sure my service is up even after fencing event has been triggered, I'd have to manually check every service and that's a bit painful.

I can understand that the implementation might be tricky and also the cases that Ken highlighted as being problematic. Andrew, if you can imagine sending a list of resources that were thought to be active, could it possibly be split into several standalone alerts? Having two different alerts for a service being stopped could be confusing.

Couldn't it be just the node that has executed the fencing that will send out the alerts?

I appreciate your feedback on this and still hope that we can get some sort of resource-fencing alerts in :)

(In reply to Jaroslav Kortus from comment #3)
> I'd say that implied events did happen by definition :).

Untrue. If the fencing was a combination of network and disk fencing then the processes are almost certainly active. So it is not a universal truth.

> Based on this
> happening, we can also run the service elsewhere.
>
> Maintain a parallel copy of the system state... Isn't this what customers
> are doing with their monitoring software? :). We should be able to do that

They are free to re-ask the cluster or the nodes directly at any time. There is also a mechanism for maintaining a shadow of the CIB already, if that's what they really want. There is no reason to try and use notifications for this purpose.

> much more reliably, or at least get as close as possible.
>
> From my point of view the only thing that customer cares for is if the HA
> service is up or not.

A sum of past notifications is not and should not be a valid mechanism for viewing the current cluster status. That's what the pcs command line, web UI and REST interfaces are for. We shouldn't be inventing new mechanisms for this.

> Fencing might cause service disruption and even
> complete downtime if there is no other node to host that service (disabled,
> failcounts,...). In case we do not send any alert at all, the "parallel
> copy" of their universe still thinks it is up, which effectively masks the
> failure.
>
> To be sure my service is up even after fencing event has been triggered, I'd
> have to manually check every service and that's a bit painful.
>
> I can understand that the implementation might be tricky and also the cases
> that Ken highlighted as being problematic. Andrew, if you can imagine
> sending a list of resources that were thought to be active, could it
> possibly be split into several standalone alerts? Having two different alerts
> for a service being stopped could be confusing.
>
> Couldn't it be just the node that has executed the fencing that will send
> out the alerts?
>
> I appreciate your feedback on this and still hope that we can get some sort
> of resource-fencing alerts in :)

I do not think it is a good idea to send separate alerts for implied actions, but we may be able to include a list of implied actions in the fence alert.

This will not be addressed in the 7.4 timeframe.

Between limited development resources and the potential for this to do more harm than good, I'm taking this off our plate.
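
For anyone landing on this bug later: the alternative suggested in the thread (treat an alert as a trigger and re-ask the cluster, instead of summing up past notifications) could look roughly like the alert agent below. This is only an illustrative sketch, not anything shipped with Pacemaker. The function names `handle_alert` and `query_cluster_status` are made up for the example, and it assumes the agent is invoked with the usual `CRM_alert_*` environment variables and that `crm_mon` is on the PATH.

```python
import os
import subprocess
from typing import Callable, Mapping


def query_cluster_status() -> str:
    """Ask the cluster for its current state (assumes crm_mon is installed)."""
    # "crm_mon -1" prints a one-shot status snapshot and exits.
    result = subprocess.run(
        ["crm_mon", "-1"], capture_output=True, text=True, check=False
    )
    return result.stdout


def handle_alert(
    env: Mapping[str, str],
    query: Callable[[], str] = query_cluster_status,
) -> str:
    """Build a notification message from the alert environment (hypothetical helper)."""
    kind = env.get("CRM_alert_kind", "")
    if kind == "fencing":
        node = env.get("CRM_alert_node", "unknown")
        desc = env.get("CRM_alert_desc", "")
        # Do not infer resource stops from the fencing event itself;
        # re-query the cluster and report what it says right now.
        status = query()
        return (
            f"fencing alert for {node}: {desc}\n"
            f"current cluster status:\n{status}"
        )
    if kind == "resource":
        rsc = env.get("CRM_alert_rsc", "unknown")
        task = env.get("CRM_alert_task", "unknown")
        node = env.get("CRM_alert_node", "unknown")
        rc = env.get("CRM_alert_rc", "?")
        return f"resource alert: {rsc} {task} on {node} (rc={rc})"
    return f"{kind or 'unknown'} alert ignored"


if __name__ == "__main__":
    # Pacemaker passes alert details to the agent via environment variables.
    print(handle_alert(os.environ))
```

Passing the status query in as a parameter is just a convenience so the message-building logic can be exercised without a running cluster; a real agent would forward the message to whatever monitoring system is in use.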