Bug 1169007 - [RFE] New fence agent fence_beaker
Summary: [RFE] New fence agent fence_beaker
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: fence-agents
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: pre-dev-freeze
Target Release: ---
Assignee: Marek Grac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-11-28 20:39 UTC by Jan Pokorný [poki]
Modified: 2016-10-10 14:17 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-10 14:17:02 UTC
Target Upstream Version:
Embargoed:



Description Jan Pokorný [poki] 2014-11-28 20:39:35 UTC
The following proposal came from a discussion between Marek (CC'd) and me,
and parts of it are based on his deeper insights.

The use case to be addressed
----------------------------

One wants to prepare a multihost task [1] in a way that allows an arbitrary
host to forcibly stop/restart any other host involved in the same task.
Such an operation may not be readily available for bare-metal machines,
but there is a good chance it is for virtual machines, perhaps even wrapped
in some API method.

In particular, this is something that would be useful for test deployments
of cluster software (as in the RHEL High Availability Add-on).  One part of
this suite is the fence-agents [*] component, a collection of scripts
facilitating exactly this kind of stop/restart action (or preventing
data/network/bus access; all of this together is referred to as "fencing"),
backed either by physical devices/controllers or by other software
components listening for such signals.  Hence, provided that the Beaker
controller is already capable of such manipulation of the machines it
manages, it would be great to propagate this functionality back to the
managed hosts themselves (restricted, for security reasons, to hosts in
the same task), and furthermore to wrap it in a new fence agent
(likely called fence_beaker) adhering to the respective API [2].
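
For illustration only, a minimal sketch of what such an agent could look
like, assuming the "name=value on stdin" convention from the FenceAgentAPI
[2]; beaker_power() is a placeholder for whatever mechanism Beaker ends up
exposing, and the option names (action/port) are just the usual fence-agents
conventions, not anything this RFE has settled on:

#!/usr/bin/env python
# Hypothetical fence_beaker sketch -- not existing code.
import sys


def read_options(stream):
    # Parse "name=value" lines from stdin into a dict (FenceAgentAPI style).
    opts = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        opts[name.strip()] = value.strip()
    return opts


def beaker_power(target, action):
    # Placeholder: ask the Beaker controller to power-control 'target'.
    # The actual mechanism is exactly what this RFE asks Beaker folks about.
    raise NotImplementedError


def main():
    opts = read_options(sys.stdin)
    action = opts.get("action", opts.get("option", "reboot"))
    target = opts.get("port", opts.get("plug"))  # host to be fenced
    if action in ("on", "off", "reboot"):
        return beaker_power(target, action)
    return 1  # status/monitor/list would need extra Beaker support


if __name__ == "__main__":
    sys.exit(main())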


Alternatives
------------

Cluster QA (also CC'd, especially Jarda K.) currently uses a Beaker instance
customized to their needs, and from what I've heard from them, one of the
extensions is fence_virtd [3] running alongside the Beaker controller so
that it can listen to (multicast) fencing requests triggered via the
fence_xvm fence agent.  It would therefore be possible to replace the
sketched-out fence_beaker with this provision, but the extra daemon,
fence_virtd, would be required "behind the scenes" even when not strictly
needed.


What's next
-----------

Folks with knowledge of Beaker internals, please consider whether the
proposal is doable with Beaker's (existing or new) API, especially the part
exposed to the particular hosts "under control".  The restriction that
"only fellow nodes in the same task can be fenced" could, for the sake of
simplicity, be implemented via common knowledge of some hash (key) that
would have to be included in the actual fencing request.
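
Just to illustrate that idea (hypothetical names, not an existing Beaker
API): the shared per-task key could be used to HMAC-sign each fencing
request, so the controller can check that the requester belongs to the same
task before acting on it:

import hashlib
import hmac


def sign_request(shared_key, payload):
    # Digest the controller can recompute to verify the request comes
    # from a host holding the per-task shared key.
    return hmac.new(shared_key, payload, hashlib.sha256).hexdigest()


def verify_request(shared_key, payload, digest):
    return hmac.compare_digest(sign_request(shared_key, payload), digest)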

If there is no show stopper, then once everything is ready on the Beaker
side, I guess Marek could take care of putting the fence_beaker agent
together.  Perhaps not the agent itself but some more suitable API call
wrapper might then be useful even for non-cluster multihost tasks, e.g. to
quickly reset another misbehaving node while still within the task run.


[1] https://beaker-project.org/docs/user-guide/multihost.html
[2] https://fedorahosted.org/cluster/wiki/FenceAgentAPI
[3] https://fedorahosted.org/cluster/wiki/FenceVirt
    authoritative repo should be https://github.com/ryan-mccabe/fence-virt
[*] IIRC Marek mentioned once that Beaker actually depends on fence-agents

Comment 1 Dan Callaghan 2014-11-30 22:17:25 UTC
There is already the rhts-power command which a task can use to power on/off/reboot another system in the recipe set:

https://beaker-project.org/docs/user-guide/task-environment.html#rhts-power

But I guess when you talk about "fencing" you mean just isolating from the network temporarily -- not rebooting the entire system?

Comment 2 Jan Pokorný [poki] 2014-12-01 13:51:14 UTC
Dan, thanks for pointing me to rhts-power.  One question though, does it
take effect instantaneously, without any "gracefulness"?  If so, it would
be a good fit.

> But I guess when you talk about "fencing" you mean just isolating from
> the network temporarily -- not rebooting the entire system?

Basically, fencing for me is either hard (power), I/O (network and/or
storage), or combined, incl. suicides (cf. fence_sanlock).  I was looking
primarily at the first kind; I am not sure how much trouble the network
isolation would be, nor do I need it at the moment.

So it looks like the hypothetical fence_beaker could utilize rhts-power right
away, which is perfect.  Are there any limitations with rhts-power that
should be considered?  I suppose one cannot accidentally kill a machine
completely unrelated to the job at hand.
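
For illustration, a rough sketch of such a wrapper (hypothetical; the exact
rhts-power invocation -- argument order and accepted actions -- should be
double-checked against the docs linked in comment 1, "FQDN ACTION" is only
assumed here):

import subprocess


def fence_via_rhts_power(target_fqdn, action):
    # Map fence actions onto rhts-power; assumes "rhts-power FQDN ACTION".
    # Returns 0 on success, non-zero on failure, as fence agents should.
    if action not in ("on", "off", "reboot"):
        return 1  # no rhts-power equivalent for status/monitor/list
    return subprocess.call(["rhts-power", target_fqdn, action])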

Comment 3 Dan Callaghan 2014-12-02 03:39:09 UTC
(In reply to Jan Pokorný from comment #2)
> Dan, thanks for pointing me to rhts-power.  One question though, does it
> take effect instantaneously, without any "gracefulness"?  If so, it would
> be a good fit.

It's not immediate, because the LC only polls for power commands periodically. rhts-power queues the power command on the server, but it doesn't start until the LC picks it up. The polling period is 20 seconds so you can expect the power action to start <= 20 seconds after rhts-power returns.

The power command itself can also take some time -- it depends on many things, like how that particular system is power-controlled and how fast its management controller is (if it has one). S/390 VMs in particular take a while to power cycle.

One other strange thing I have noticed recently is that sometimes the power off commands *are* graceful. That is, the system seems to get a normal ACPI shutdown signal and systemd cleanly stops services. Normally the Beaker power commands do not have this effect. I suspect that some of the BMCs are trying to be nice to the system by shutting it down cleanly when told to power off. I haven't yet had time to investigate when or why this is happening or whether it's correct or desirable.

> I suppose one cannot accidentally kill a machine
> completely unrelated to the job at hand.

One can, and so care should be taken not to do that. We have an old open RFE for authenticating lab controller API requests: bug 843687.

Comment 4 Marek Grac 2014-12-03 12:57:46 UTC
@Dan:

If any 'standard' fence agent performs a 'graceful' power off, then it is an error and we can fix it. In some cases, using a fence agent is slow because we do not trust those devices (like IPMI) too much, so we do:

power off / wait until it is really off / power on / wait until it is really on

This is slower than a normal reboot, but our approach allows us to verify whether fencing happened or not, which is usually not possible with a plain reboot.
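
A small sketch of that off/verify/on/verify approach, for illustration only
(not the actual fence-agents code; power() and get_status() are placeholders
for whatever backend is used, and the timeout is deliberately generous to
also cover the ~20 s lab-controller polling delay mentioned in comment 3):

import time


def wait_for_status(get_status, wanted, timeout=120.0, poll=5.0):
    # Poll the target's power status until it reports 'wanted' or we time out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == wanted:
            return True
        time.sleep(poll)
    return False


def verified_reboot(power, get_status):
    # Return 0 only if we can confirm the host really went down and came back.
    power("off")
    if not wait_for_status(get_status, "off"):
        return 1  # fencing NOT confirmed -- must be treated as a failure
    power("on")
    if not wait_for_status(get_status, "on"):
        return 1
    return 0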

Comment 5 Dan Callaghan 2014-12-05 00:51:00 UTC
Okay, thanks for the info Marek. I will keep an eye out for any systems which are doing a "graceful" shutdown.

Many Beaker systems use the ipmitool power script, which calls "ipmitool power off" directly rather than using the fence_* scripts.

https://git.beaker-project.org/cgit/beaker/tree/LabController/src/bkr/labcontroller/power-scripts/ipmitool

Comment 6 Jan Pokorný [poki] 2014-12-09 15:51:31 UTC
Not to let this whole thing go stale for too long: it seems that, in
principle, nothing blocks creating the proposed fence_beaker agent
(for use from machines under Beaker's control, presumably in
multihost tasks).

So Marek, if you agree, could you please reassign this bug to
fence-agents as a request for fence_beaker?  If there is not enough
throughput, I could put my hands on it myself; it seems like quite a
simple task.

The only things not doable at this point seem to be the "list", "monitor",
and "status" commands.

