Bug 1378537 - Send alerts for events implied by fencing
Summary: Send alerts for events implied by fencing
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-22 16:54 UTC by Ken Gaillot
Modified: 2017-07-18 15:52 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-18 15:52:55 UTC
Target Upstream Version:


Description Ken Gaillot 2016-09-22 16:54:53 UTC
Description of problem: As discussed in Bug 773656, pacemaker sends alerts for fencing events, but not for any resource demote/stop action that is implied by the fencing. Users might be interested in receiving alerts for these, even though they are not actions that were actually taken and completed successfully.


Version-Release number of selected component (if applicable): 7.3


How reproducible: Easily


Steps to Reproduce:
1. Configure a cluster with at least two nodes, at least one resource, and an alert agent.
2. Fence a node that is running a resource.
3. Check the alerts received by the alert agent.
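
For step 3, a minimal alert agent sketch that records each alert it receives could look like the following. It assumes the standard `CRM_alert_*` environment variables that Pacemaker passes to alert agents, and it assumes the configured recipient value is a writable file path. The fallback path `/tmp/alerts.log` and the helper name `format_alert` are illustrative only.

```python
#!/usr/bin/env python3
"""Alert agent sketch: append one line per received alert to a log file."""
import os

# Subset of the variables Pacemaker exports to alert agents.
# Variables that are unset or empty for a given alert are skipped.
FIELDS = ("timestamp", "kind", "node", "rsc", "task", "desc", "rc")


def format_alert(env):
    """Build a single 'key=value ...' line from the CRM_alert_* variables."""
    parts = []
    for name in FIELDS:
        value = env.get("CRM_alert_" + name)
        if value:
            parts.append(f"{name}={value}")
    return " ".join(parts)


def main(env=os.environ):
    # Assumption: the recipient configured for this alert is a file path.
    # The fallback path is purely illustrative.
    path = env.get("CRM_alert_recipient", "/tmp/alerts.log")
    with open(path, "a") as log:
        log.write(format_alert(env) + "\n")


# Only act when the alert variables are present, i.e. when run by the cluster.
if __name__ == "__main__" and "CRM_alert_kind" in os.environ:
    main()
```

With this agent in place, fencing a node that runs a resource would be expected to produce a line with `kind=fencing` for the fenced node, and no `kind=resource` line for the stops or demotes implied by the fencing.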

Actual results: An alert is received for the fencing.


Expected results: An alert is received for the fencing, and additionally some sort of alert is received for each demote action and stop action implied by the fencing.


Additional info: If implemented, implied event alerts should be clearly distinguishable from actual events.

Comment 1 Ken Gaillot 2016-09-22 17:00:45 UTC
The main argument against implementing this feature is that alerts are not intended to maintain a picture of the current cluster state, but rather to respond to important events that occur -- and an implied event didn't actually happen.

Additionally, it raises some complications:

* Resource action alerts are normally sent by the node that executed the action. If the node was fenced, we clearly must send the alert from a different node (presumably, the DC). This may not be reliable (has a DC been elected?), and it may be unexpected by alert agent writers (who may want to take some action on the local node).

* Implementation would be simple for fencing initiated by the cluster, but much more complicated for fencing initiated externally (such as by DLM or stonith_admin).

Comment 2 Andrew Beekhof 2016-09-22 22:52:50 UTC
-1

Alerts are for events that occur in the cluster.
They were never intended as a way to maintain a parallel copy of the system state.

In this case, the stops (and demotions) did not happen, so there should be no alert.
I could imagine cluster-initiated fencing alerts containing a list of resources that were thought to be active on the target node, though.

Comment 3 Jaroslav Kortus 2016-09-27 17:56:28 UTC
I'd say that implied events did happen by definition :). Based on this happening, we can also run the service elsewhere.

Maintain a parallel copy of the system state... Isn't this what customers are doing with their monitoring software? :). We should be able to do that much more reliably, or at least get as close as possible.

From my point of view, the only thing the customer cares about is whether the HA service is up or not. Fencing might cause service disruption and even complete downtime if there is no other node to host that service (disabled, failcounts, ...). In case we do not send any alert at all, the "parallel copy" of their universe still thinks it is up, which effectively masks the failure.

To be sure my service is up even after a fencing event has been triggered, I'd have to manually check every service, and that's a bit painful.

I can understand that the implementation might be tricky, and I see the cases that Ken highlighted as problematic. Andrew, if you can imagine sending a list of resources that were thought to be active, could it possibly be split into several standalone alerts? Having two different alerts for a service being stopped could be confusing.

Couldn't it be just the node that has executed the fencing that will send out the alerts?

I appreciate your feedback on this and still hope that we can get some sort of resource-fencing alerts in :)

Comment 4 Andrew Beekhof 2016-09-28 11:58:22 UTC
(In reply to Jaroslav Kortus from comment #3)
> I'd say that implied events did happen by definition :). 

Untrue. If the fencing was a combination of network and disk fencing, then the processes are almost certainly still active. So it is not a universal truth.

> Based on this
> happening, we can also run the service elsewhere.
> 
> Maintain a parallel copy of the system state... Isn't this what customers
> are doing with their monitoring software? :). We should be able to do that

They are free to re-ask the cluster or the nodes directly at any time.
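
As a rough sketch of what re-asking the cluster could look like from a monitoring script (for example, one triggered by a fencing alert): the code below runs `crm_mon` in one-shot XML mode and extracts where each resource is currently active. The element and attribute names (`resource`, `id`, `active`, nested `node` with `name`) are assumed from the `crm_mon` XML output of Pacemaker 1.1 and should be checked against the installed version. The function names are illustrative.

```python
import subprocess
import xml.etree.ElementTree as ET


def active_resources(crm_mon_xml):
    """Map each active resource id to the node names it is running on."""
    root = ET.fromstring(crm_mon_xml)
    placement = {}
    # iter() also finds resources nested inside groups and clones.
    for rsc in root.iter("resource"):
        if rsc.get("active") == "true":
            names = [node.get("name") for node in rsc.findall("node")]
            # Clone instances may share an id, so accumulate rather than overwrite.
            placement.setdefault(rsc.get("id"), []).extend(names)
    return placement


def query_cluster():
    """Ask the cluster for its current status instead of summing past alerts."""
    # Assumption: --as-xml is the Pacemaker 1.1 option; newer releases
    # use --output-as=xml instead.
    result = subprocess.run(
        ["crm_mon", "-1", "--as-xml"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return active_resources(result.stdout)
```

Calling `query_cluster()` after a fencing alert gives the cluster's own view of which services are up, rather than a view reconstructed from notifications.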

There is also a mechanism for maintaining a shadow of the CIB already, if that's what they really want. There is no reason to try to use notifications for this purpose.

> much more reliably, or at least get as close as possible.
> 
> From my point of view, the only thing the customer cares about is whether the
> HA service is up or not.

A sum of past notifications is not and should not be a valid mechanism for viewing the current cluster status.  That's what the pcs command line, web UI and REST interfaces are for.  We shouldn't be inventing new mechanisms for this.

> Fencing might cause service disruption and even
> complete downtime if there is no other node to host that service (disabled,
> failcounts,...). In case we do not send any alert at all, the "parallel
> copy" of their universe still thinks it is up, which effectively masks the
> failure.
> 
> To be sure my service is up even after a fencing event has been triggered, I'd
> have to manually check every service, and that's a bit painful.
> 
> I can understand that the implementation might be tricky, and I see the cases
> that Ken highlighted as problematic. Andrew, if you can imagine
> sending a list of resources that were thought to be active, could it
> possibly be split into several standalone alerts? Having two different alerts
> for a service being stopped could be confusing.
> 
> Couldn't it be just the node that has executed the fencing that will send
> out the alerts?
> 
> I appreciate your feedback on this and still hope that we can get some sort
> of resource-fencing alerts in :)

Comment 5 Ken Gaillot 2017-01-10 22:17:51 UTC
I do not think it is a good idea to send separate alerts for implied actions, but we may be able to include a list of implied actions in the fence alert.
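
To illustrate the shape of that idea only: the sketch below models a single fence alert that carries a list of implied actions, kept clearly separate from the fencing result itself. Nothing like this exists in Pacemaker. The class `FenceAlert`, its fields, and the output format are entirely hypothetical, and the choice to report implied actions only after a successful fence is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FenceAlert:
    """Hypothetical payload: one fencing event plus the actions it implies."""
    target: str        # node that was fenced
    succeeded: bool    # result of the fencing operation itself
    # (resource id, action) pairs inferred from the last known cluster state.
    # These are not actions that were executed on the target node.
    implied: List[Tuple[str, str]] = field(default_factory=list)

    def describe(self):
        status = "ok" if self.succeeded else "failed"
        lines = [f"fencing of {self.target}: {status}"]
        # Assumption: implied actions are only reported for a successful fence.
        if self.succeeded:
            lines += [f"  implied (not executed): {action} {rsc}"
                      for rsc, action in self.implied]
        return "\n".join(lines)
```

For example, `FenceAlert("node2", True, [("db", "demote"), ("vip", "stop")]).describe()` yields one fencing line followed by two lines marked as implied, so a consumer cannot mistake them for completed resource actions.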

This will not be addressed in the 7.4 timeframe.

Comment 6 Ken Gaillot 2017-07-18 15:52:55 UTC
Between limited development resources and the potential for this to do more harm than good, I'm taking this off our plate.

