Bug 1250314
| Summary: | Monitor fence agents with --port-as-ip | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Marek Grac <mgrac> |
| Component: | pacemaker | Assignee: | Klaus Wenninger <kwenning> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Priority: | medium |
| Version: | 8.0 | CC: | cluster-maint, fdinitto, kgaillot, kwenning, mgrac, michele |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Feature Request | Last Closed: | 2020-10-02 21:51:27 UTC |
Description (Marek Grac, 2015-08-05 06:37:27 UTC)
Klaus Wenninger:
What would the desired result of monitoring all IPMI devices (that is how I understand the request) be? If all are reachable, everything is fine. If none is reachable, monitoring fails, which is also fine. But what if two are reachable and one isn't? If we fail the monitor, the two working devices would probably not be used by pacemaker either, and that can't be the desired behaviour. What about a dynamic list of devices, so that the stonith RA can report to stonithd which machines it can actually fence? The targets that are potentially available could go into a separate attribute.

---

Ken Gaillot:
Here are the possibilities I see:

1. Configure a separate fence device for each IP. This is the easiest and most flexible solution, and its only drawback is a somewhat messier configuration. It returns a separate result for each IP, and allows individual IPs to be enabled/disabled or given unique options. (A concrete configuration sketch follows this exchange.)

2. Klaus' suggestion of fence_ipmilan supporting dynamic-list would be handled entirely within the fence_ipmilan agent. I think it would be a large overhead at fence time for the agent to poll all IPs and wait for a response. One possible way around that would be for the monitor action to set node attributes, so that at fence time it only has to check the attributes. When port_as_ip=1, the monitor operation could poll all IPs, set a special node attribute for each host saying whether its IPMI is available, and return success as long as any IPMI is available. Then the list operation would return all hosts with a good node attribute. However, this approach would hide failures until fence time.

3. We could define a new fence agent action, e.g. "portmonitor". If a fence agent declares support for this action, and pcmk_host_map is configured, pacemaker could issue a portmonitor call for each map entry, instead of a single monitor call, and pass the map entry as an environment variable. Pacemaker would need to track the results for each port, and consider ports failed/available rather than the entire device. My gut feeling is this would be a significant project with relatively little benefit over the first option above.

I'm inclined to close this bug WONTFIX if we go with option 1, reassign to fence-agents if we go with option 2, or leave this as a low-priority RFE if we go with option 3. Marek, what do you think?

---

Marek Grac:
1. It is the same as without --port-as-ip. I like the third option the most; it puts things together: pcmk_host_map is a pacemaker feature, so pacemaker should resolve the problem on its side. The second option is doable too, but it would impact more fence agents than just fence_ipmilan, and it would just be a wrapper that runs the fence agent multiple times. There can be issues with timeouts, because we can use up to nodes * standard check time. With the third option, we can make those connections in parallel.
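For illustration, a minimal sketch of option 1 (one configured fence device per IPMI IP) in pcs syntax; the node names, IPs, and credentials are invented for the example, and fence_ipmilan parameter names vary between fence-agents versions:

```sh
# One fence device per IPMI interface: each device returns its own
# monitor result, so one unreachable IPMI does not mark the others failed.
pcs stonith create fence-node1 fence_ipmilan \
    ip=10.0.0.101 username=admin password=secret \
    pcmk_host_list=node1 op monitor interval=60s

pcs stonith create fence-node2 fence_ipmilan \
    ip=10.0.0.102 username=admin password=secret \
    pcmk_host_list=node2 op monitor interval=60s
```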
---

Klaus Wenninger:
ad 2) When suggesting the dynamic-list I had of course implied that it would be filled prior to the actual fencing event, as suggested by Ken, so as not to delay fencing unnecessarily; sorry for not stating that. Of course the RA can do that memorizing in node attributes, but the reason I didn't suggest moving the issue over to fence_ipmilan was that I wanted to think over what we might add to pacemaker to get descriptive, standardized reporting/logging of what is working and what is not. Such an interface could then do the memorizing as well, since the knowledge is passed anyway. That would take up the suggestions from 3) as well.

But to be honest, as a generic approach I would leave it up to the RA whether to split the work into single actions or do it in one bigger action (some hardware might have the whole info available right away, and there is no sense in asking for the parts bit by bit). The RA might then decide to check the ports one by one in a loop if that is fast enough, do just one per call and switch to the next on the next call, open parallel processes to do it in parallel, or just query a daemon for its collection of results. For the last option, some help from pacemaker would be needed to fire up this daemon when pacemaker starts the stonith device and to shut it down when pacemaker stops it, a little more like the interface for non-stonith resources. In general I find this kind of daemon approach appealing, since such a daemon could keep TCP connections to the fencing device open and renew them from time to time (usually they wouldn't stay open forever, but they can be kept open for multiple transactions). That way, very accurate information about fencing devices being available or not can be generated without high-frequency polling.

---

Marek Grac:
@Klaus: in the case of normal 'ports' it is handled by the fence agent (as you suggest), but in the case of --port-as-ip it cannot be handled in one request, since several IP addresses have to be used.

---

Klaus Wenninger:
@Marek: In this special case you are right, of course; I meant one request from the pacemaker side. When breaking it down to the device-specific handling, the fence agent might split it into more than one request and schedule those as described above. But since we wouldn't want to bring anything into pacemaker that is proprietary to IPMI or --port-as-ip, a device that can report its ability to fence each node of a larger list in one request is definitely imaginable. So in my picture this would be the generic case, which the fence agent maps onto the IPMI-specific case, and not the other way around, where a fence agent for this fictional device would have to collect multiple calls from pacemaker.

---

Ken Gaillot:
I don't think we can do it with a single request from the pacemaker side, because of the result issue you first raised: we need individual pass/fail results for each port. I was thinking the fence agent could signal its support for per-port results by supporting either a new action (e.g. portmonitor) or perhaps a metadata flag (e.g. monitor-ports=true), and then pacemaker would call the action once per port. Since pacemaker only looks at the agent's exit status, I don't see any way to get per-port results from a single call.

---

Klaus Wenninger:
There is already an interface to ask the fence agent for a list of currently fenceable devices, and there is even a caching mechanism implemented, if I remember that part of the code correctly. What we would be missing is a trigger to update this cache periodically. It would probably make sense to make this cache persistency time configurable per stonith resource. If pacemaker got both a list of nodes that a stonith resource should be able to fence and a list of those it thinks it can fence right now, it could log/report the discrepancy accordingly (potentially via plugin scripts, referring to rhbz#773656).
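The interface Klaus refers to is the fence agents' list action, which pacemaker can use when the host check is dynamic-list. As a rough illustration (device address, credentials, and plug names invented; exact output format depends on the agent):

```sh
# Ask the agent which plugs it controls; the list action typically prints
# one "plug,alias" pair per line, which pacemaker parses for dynamic-list.
fence_apc --ip=apc.example.com --username=admin --password=secret --action=list
# Example output:
#   1,node1
#   2,node2
#   3,node3
```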
---

Marek Grac:
I'm lost a bit, so I will try to summarize what I understand. Fence agents have a 'monitor' action that tests whether the fence device is working. We also have 'list' ('list-status'), which tells us which ports are available (and on/off). Complexity comes with the option --port-as-ip, which allows ports (plugs/VM UUIDs/...) to be entered even for fence agents that are used for single devices only (ilo, ipmi, drac, ...). As long as we are outside pacemaker, we are fine: 'monitor' is executed for the entered port/IP address. Pacemaker calls the monitor action on the defined fence agent. It does not pass any pcmk_* variables, because all information is defined directly on the fence agent. With --port-as-ip, the IP address is not entered in the fence agent configuration; it has to be obtained from pcmk_host_map. Currently, such information is not exported to the fence agent.

IMHO we are mixing fence agent device features (APC, iLO) with port-based features (plug ABC is ON). The problem with --port-as-ip is that even when we talk about a port, we are talking about an IP address.

---

Ken Gaillot:
I wasn't familiar with list-status; pacemaker currently does not use that action. The difficulties from pacemaker's point of view are (1) --port-as-ip is a fence agent parameter, about which pacemaker has no intelligence; and (2) a monitor result currently marks an entire device as usable or not, but here we want to mark individual ports as usable or not.

Perhaps, if a fence agent advertises support for list-status, we could use that instead of monitor. My main concern there is that the documentation says it can take a long time on some devices, so I think we'd want a new metadata option to control whether to use it. My next concern would be whether the output format is standardized and reliable.

Klaus' idea (using dynamic-list to report which ports are active) is another possibility. However, currently when pcmk_host_map is used, the fence agents in those cases do not know node names, only port names, so the check is forced to be static-list. We could potentially change that so that when dynamic-list is used, the agent reports port names and pacemaker reverse-maps those to node names. We'd have to maintain backward compatibility, though, so maybe we'd need a new check type, e.g. dynamic-port-list. My main concern here is that recurring monitors would still be a problem, and we would not discover problems until fence time.

Alternatively, my idea was to support a new metadata option, say monitor-ports; if set, pacemaker would call the monitor action once per port, passing the port in an environment variable.

Finally, another possibility (and the simplest) would be for pacemaker to pass the list of ports from pcmk_host_map to the agent's monitor action in an environment variable. Then the agent can monitor them all. The only drawback here is that we get a single result, so the entire device gets marked as available or failed. It might be acceptable to return success as long as any port is available, and simply log warnings about unavailable ports. We wouldn't discover problems until fence time, but it's probably the easiest step to take.
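To make the --port-as-ip case Marek summarizes above concrete, here is roughly what a monitor call looks like when the plug value is treated as the IP address to contact (IP and credentials invented; option spellings can differ between fence-agents versions):

```sh
# With --port-as-ip, the plug/port value is interpreted as the IP address
# of the (single-node) device, so one agent definition can serve many IPs.
fence_ipmilan --port-as-ip --plug=10.0.0.101 \
    --username=admin --password=secret --action=monitor
```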
---

Marek Grac:
* list-status is advertised in the XML metadata in <actions>, but it very likely won't help you. During monitoring you don't care whether the machine is ON or OFF; all you care about is whether the fence device answers your request. 'list-status' may take a long time because for some devices we have to obtain a list of plugs and then query each of those plugs (= N requests). But if an agent has --port-as-ip, then it does not have a 'list' (or 'list-status') action, because this option is used exclusively with devices that cover only one node (e.g. ipmi, ilo). In this case, 'monitor' just checks whether it can get the power status of the node; again, we don't care whether it is ON or OFF.

* dynamic-list: Fence devices are not able to return the name of the plug in every case (e.g. ipmi does not have any name inside, as it is used with a single node; other agents have just numbers without human-readable aliases).

* So, a question: previously, 'monitor' checked whether the fence device is working. You want to change it to check whether a 'port' is ON? IMHO monitor-ports will be very similar to checking whether 'port_as_ip' is entered. I would prefer to pass it the standard way instead of via the environment (but this is just a technicality).

* Last idea: if we would not discover problems until fence time, then monitoring is broken; we could put pcmk_action_monitor=metadata there. We will have to stop when there is a node that we cannot fence.

---

Klaus Wenninger:
Ken, I like your last suggestion and would suggest combining it with the dynamic list: monitor provides the port list, the agent returns the list of working ports via stdout, and pacemaker feeds the result into the dynamic list. Doing this, pacemaker can compare the port list it provided with the list returned, map that to nodes, and do standardized logging, so we don't get different things in the logs from different agents. The agent just returns a negative result if it e.g. has a connection problem, so that it thinks it can't acquire the status of the port, or the like. The returned list goes into the cache in pacemaker, so that it is usually already available when fencing is started; thus we wouldn't have to worry about it all taking a bit longer.

---

Ken Gaillot:
Marek, I agree that monitor should only check whether the device/port is available, not "on" in the sense of a power outlet; and any new variable should be passed the usual way as input.

Klaus, as far as I know, the dynamic-list output is not cached but always queried at fence time. Also, each node is queried separately for whether it can access the device and fence the target.

Maybe it's just a bad idea to specify multiple IPMI IPs with a single fencing device, because they are in fact independent devices. Maybe what we're really looking for here is a configuration shortcut for creating similar fence devices.

---

Klaus Wenninger:
I remember that I made the caching time configurable via a parameter passed from the agent in a product I built based on 1.1.10. But let me check the code to see how it is done at the moment and whether my memory betrays me here.

Maybe IPMI is actually not the best example of multiple nodes being fenced by one device; you just end up with fewer stonith devices. A switch with multiple ethernet ports, or a switchable multi-socket outlet, is closer to the physical reality that inspires this.

Quickly checked regarding caching: can_fence_host_with_device fills a list of targets in the device structure if target_age is < 60s. There might still be a mechanism that calls it again directly before fencing in any case, which I don't see right now, but basically caching is implemented.

---

Ken Gaillot:
Ah, somehow I missed the 60s; that makes sense now.

With IPMI (when used with multiple IPs), one configured fence device actually represents multiple physical devices, so the availability of each one is independent. Thus, I'm thinking that our representation is incorrect. Perhaps one configured fence device should represent one real device, precisely so the monitor result makes sense.

Here's another good reason to use multiple devices instead of one: if we configure a single device that monitors all IPs, the host that runs it will wind up monitoring its own IPMI, which is less than ideal. If we configure multiple devices, negative location constraints (sketched below) can ensure each host doesn't monitor its own IPMI.
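A minimal sketch of those negative location constraints, assuming the per-node fence devices from the earlier example (resource and node names invented):

```sh
# Keep each fence device off the node it fences, so a node is never
# the one monitoring (or preferred to run) the device for its own IPMI.
pcs constraint location fence-node1 avoids node1
pcs constraint location fence-node2 avoids node2
```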
I believe this same issue is the reason for a request on the mailing list (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022164.html) where someone wanted to clone a PDU fence device. However, I'm not sure cloning is a good approach. Cloning with globally-unique=true and clone-max=(number of pcmk_host_map entries) would be better, but still feels not quite right.

---

Klaus Wenninger:
As said, IPMI isn't the best example actually ;-) Anyway, what I wanted to ask is why you wouldn't want to monitor your own IPMI. When you see glitches in the monitoring, you can still use that information to narrow them down, e.g. to problems with links between rooms/datacenters or whatever.

---

Ken Gaillot:
The question of this bz is how we can monitor multiple IPMI IPs with a single configured fence device. When the DC chooses a node to execute a fencing action, it prefers the node that is monitoring the device. However, a node can't fence itself unless there are no other options, so even if it's monitoring the device that fences it, some other node will be used if possible. So it would be better if another node were doing the monitoring. And from a real-world perspective, if IPMI is not functional, there's a good chance the machine isn't functional either (e.g. a power supply failure).

---

Ken Gaillot:
After discussing with Andrew Beekhof, I think the following approach will be good. A key point to remember is that monitor failures do not prevent a fence device from being used; a failure simply shows up in status output, and the node loses its preference to execute the device.

We can add a new option, pcmk_monitor_ports. The default is false, which follows the current behavior (one monitor operation for the entire device). If set to true, the cluster will run one monitor operation per target, and pass the target (appropriately mapped if pcmk_host_map is used) as an input variable.

A small difference from how targets are passed to other fence actions arises when port lists are used for a single target. Port lists aren't used with IPMI, so it's largely irrelevant here, but an example usage is pcmk_host_map="node1:1,2", where "1,2" is the list of ports required to fence node1. Whereas a reboot in such a case would generate two reboot operations (for ports 1 and 2 separately), a monitor would generate a single monitor operation (with "1,2" as the target list). There would still be multiple monitors, one per target, but any one target would get only one monitor, even if it has multiple ports listed. The monitor operation would be considered failed if any of its sub-monitors fails.

This approach still has the drawbacks that a node monitors its own fence device and that any sub-monitor failure voids verified access to all targets, but those don't prevent fencing from working, and users have the option of configuring separate devices if they really want to avoid them.

---

This won't be ready in the 7.3 timeframe.

Capacity constrained; will reconsider for 7.5.

Due to time constraints, this will not make 7.5. Moving to RHEL 8.

Due to developer time constraints and the availability of a workaround (defining a separate device for each target), this issue has been reported upstream, and this report will be closed.
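For reference, the pcmk_monitor_ports proposal above was never implemented (the bug was closed WONTFIX), but a configuration using it would presumably have looked something like this sketch; pcmk_monitor_ports is the hypothetical option from the proposal, and the node names, IPs, and credentials are invented:

```sh
# Hypothetical: one fence_ipmilan device covering two nodes via
# port_as_ip, with the proposed (never implemented) pcmk_monitor_ports
# telling the cluster to run one monitor per mapped target.
pcs stonith create fence-ipmi fence_ipmilan \
    port_as_ip=1 username=admin password=secret \
    pcmk_host_map="node1:10.0.0.101;node2:10.0.0.102" \
    pcmk_monitor_ports=true op monitor interval=60s
```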