1942363 – fence_gce: change default method to cycle

Bug 1942363 - fence_gce: change default method to cycle

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	fence-agents
Sub Component:
Version:	8.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	8.5
Assignee:	Oyvind Albrigtsen
QA Contact:	Brandon Perkins
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-24 09:22 UTC by Oyvind Albrigtsen
Modified:	2021-11-18 02:05 UTC
CC List:	5 users
Fixed In Version:	fence-agents-4.2.1-70.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-11-09 17:35:30 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Article)	3479821	None	None	None	2021-03-25 00:19:43 UTC
Red Hat Knowledge Base (Solution)	5644441	None	None	None	2021-03-25 00:19:43 UTC
Red Hat Product Errata	RHBA-2021:4148	None	None	None	2021-11-09 17:35:52 UTC

Description Oyvind Albrigtsen 2021-03-24 09:22:14 UTC

Description of problem:
In https://bugzilla.redhat.com/show_bug.cgi?id=1906978 we changed the default method to onoff to try to solve the issue of a node rejoining the cluster before fencing has completed.

According to the upstream PR, the reset operation has been improved to shorten the time it takes before the API reports the reset as done.

If it still isn't quick enough, the workaround is either to disable starting pacemakerd on boot or to use a systemd drop-in file to add a delay before starting it.
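
A minimal sketch of what such a drop-in could look like, written as a small Python helper that renders the file contents. The unit name (pacemaker.service), the drop-in file name, and the 60-second delay are assumptions for illustration and are not taken from this report; adjust them to the actual environment.

```python
from pathlib import PurePosixPath

# Assumptions (not from this report): the unit name, the drop-in file name,
# and the delay value used below are illustrative placeholders.
UNIT = "pacemaker.service"
DROPIN_NAME = "startup-delay.conf"


def dropin_path(unit: str = UNIT, name: str = DROPIN_NAME) -> PurePosixPath:
    """Return the administrator drop-in location systemd reads for `unit`."""
    return PurePosixPath("/etc/systemd/system") / f"{unit}.d" / name


def render_dropin(delay_seconds: int) -> str:
    """Render a drop-in that sleeps before the unit's main process starts."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return f"[Service]\nExecStartPre=/bin/sleep {delay_seconds}\n"


if __name__ == "__main__":
    # Print the target path as a comment, followed by the file contents.
    print(f"# {dropin_path()}")
    print(render_dropin(60), end="")
```

After a file like this is written, systemd only picks it up once `systemctl daemon-reload` has been run.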


Additional info:
https://github.com/ClusterLabs/fence-agents/pull/389

Comment 1 Oyvind Albrigtsen 2021-03-24 09:25:21 UTC

We need to change the default away from onoff because the off action is a soft-off, and there are also no off/on actions for bare-metal instances.

Comment 2 Reid Wahl 2021-03-24 23:41:14 UTC

FWIW, it looks like a startup delay is no longer needed at all for the common case, although it adds an extra safety net in case of unforeseen issues.

If Google is able to guarantee that the node will not boot until after the API returns success for the reset, then we can eliminate the systemd drop-in delay. They never claimed to guarantee that, but in the tests below, it worked out that way.


Fencing was initiated at 23:17:47 and completed at 23:17:59.

Mar 24 23:17:47 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' [1840] (call 4 from pacemaker-controld.1326) for host 'node2' with device 'gce_fence2' returned: 0 (OK)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' targeting node2 on node1 for pacemaker-controld.1326: OK
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Stonith operation 4/1:3:0:739869d8-d4f7-4dad-9be6-2411d9d6dd3c: OK (0)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK


The fenced node booted up at 23:18:36.

Mar 24 23:18:36 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020



A second test:

Mar 24 23:36:53 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:37:05 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK


Mar 24 23:37:40 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020



In both of the above tests, the fenced node booted up 35 to 37 seconds after the API returned "success" for the reset. I'm going to ask Tim in the upstream PR for more details about the implementation and whether we can count on this or not. It sure would be nice not to require the drop-in.
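
For reference, the intervals can be recomputed from the timestamps in the log excerpts above. This is only a quick sketch: the times are copied by hand from the logs, and the helper name is made up for illustration.

```python
from datetime import datetime

# (fencing initiated, fencing reported OK, first kernel line on the fenced node)
# Times are copied from the log excerpts in this comment; all on Mar 24.
TESTS = [
    ("23:17:47", "23:17:59", "23:18:36"),  # first test
    ("23:36:53", "23:37:05", "23:37:40"),  # second test
]


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds from `start` to `end` (same day, HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# How long the fence operation took, and how long after "OK" the node booted.
fence_durations = [seconds_between(start, done) for start, done, _ in TESTS]
boot_gaps = [seconds_between(done, boot) for _, done, boot in TESTS]

print(fence_durations)  # [12, 12]
print(boot_gaps)        # [37, 35]
```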

Comment 3 Reid Wahl 2021-03-25 00:19:46 UTC

Added notes about the need to set method=cycle explicitly to the following KB articles. After the fix for this BZ is released in a zStream, both of these articles should be updated to reflect that RHEL 8.4 package releases after fence-agents-gce-x.y.z don't need method=cycle set explicitly (or that they default to method=cycle).
  - A node shuts down Pacemaker after getting fenced and rejoining the cluster on Google Cloud Platform (https://access.redhat.com/solutions/5644441)
  - Installing and Configuring a Red Hat Enterprise Linux 7.6 (and later) High-Availability Cluster on Google Compute Cloud (https://access.redhat.com/articles/3479821)

Comment 9 errata-xmlrpc 2021-11-09 17:35:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (fence-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4148
