Bug 1942363

Summary:	fence_gce: change default method to cycle
Product:	Red Hat Enterprise Linux 8	Reporter:	Oyvind Albrigtsen <oalbrigt>
Component:	fence-agents	Assignee:	Oyvind Albrigtsen <oalbrigt>
Status:	CLOSED ERRATA	QA Contact:	Brandon Perkins <bperkins>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	8.4	CC:	agk, bperkins, cluster-maint, fdinitto, nwahl
Target Milestone:	rc	Keywords:	Triaged
Target Release:	8.5	Flags:	pm-rhel: mirror+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	fence-agents-4.2.1-70.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-11-09 17:35:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Oyvind Albrigtsen 2021-03-24 09:22:14 UTC

Description of problem:
In https://bugzilla.redhat.com/show_bug.cgi?id=1906978 we changed the default method to onoff to try to solve an issue where a node rejoined the cluster before fencing had completed.

According to the upstream PR (linked under Additional info), this has been improved so that the API takes less time to report the reset as done.

If it still isn't quick enough, the workaround is either to disable pacemakerd start on boot or to use a systemd drop-in file to add a delay before starting it.
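
For illustration, the drop-in workaround could look roughly like the output of the sketch below. It only renders the file contents and prints them. The 60-second delay and the file name 10-start-delay.conf are assumptions chosen for the example, not values taken from this bug; the drop-in directory follows the usual systemd layout for pacemaker.service.

```python
# Sketch: render a systemd drop-in that delays Pacemaker startup.
# The delay value and the file name are assumptions, not values from this bug.

# Hypothetical file name inside the standard drop-in directory for the unit.
DROPIN_PATH = "/etc/systemd/system/pacemaker.service.d/10-start-delay.conf"


def render_dropin(delay_seconds: int) -> str:
    """Return drop-in text that sleeps before pacemaker.service starts."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return (
        "[Service]\n"
        f"ExecStartPre=/bin/sleep {delay_seconds}\n"
    )


if __name__ == "__main__":
    # Print instead of writing to disk, so the sketch has no side effects.
    print(f"# {DROPIN_PATH}")
    print(render_dropin(60), end="")
```

After placing such a file, systemd needs a `systemctl daemon-reload` to pick it up. The delay should stay below the unit's start timeout, otherwise the start job itself may time out.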


Additional info:
https://github.com/ClusterLabs/fence-agents/pull/389

Comment 1 Oyvind Albrigtsen 2021-03-24 09:25:21 UTC

We need to change it from onoff because the off action is a soft-off, and there are also no off/on actions for bare-metal instances.

Comment 2 Reid Wahl 2021-03-24 23:41:14 UTC

FWIW, it looks like a startup delay is no longer needed at all for the common case, although it adds an extra safety net in case of unforeseen issues.

If Google is able to guarantee that the node will not boot until after the API returns success for the reset, then we can eliminate the systemd drop-in delay. They never claimed to guarantee that, but in the tests below, it worked out that way.


Fencing was initiated at 23:17:47 and completed at 23:17:59.

Mar 24 23:17:47 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' [1840] (call 4 from pacemaker-controld.1326) for host 'node2' with device 'gce_fence2' returned: 0 (OK)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' targeting node2 on node1 for pacemaker-controld.1326: OK
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Stonith operation 4/1:3:0:739869d8-d4f7-4dad-9be6-2411d9d6dd3c: OK (0)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK


The fenced node booted up at 23:18:36.

Mar 24 23:18:36 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020



A second test:

Mar 24 23:36:53 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:37:05 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK


Mar 24 23:37:40 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020



In the two tests above, the fenced node booted 37 and 35 seconds, respectively, after the API returned "success" for the reset. I'm going to ask Tim in the upstream PR for more details about the implementation and whether we can count on this. It sure would be nice not to require the drop-in.
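
The timing can be checked from the log timestamps quoted in this comment. The sketch below is only a convenience calculation: the timestamps are copied from the log lines above, the year 2021 is taken from the comment date because syslog omits it, and the helper names are made up for the example.

```python
from datetime import datetime

# Timestamps copied from the log lines quoted above.
# Each tuple: (fencing requested, API returned OK, fenced node kernel boot)
TESTS = [
    ("Mar 24 23:17:47", "Mar 24 23:17:59", "Mar 24 23:18:36"),
    ("Mar 24 23:36:53", "Mar 24 23:37:05", "Mar 24 23:37:40"),
]


def parse(stamp, year=2021):
    """Parse a syslog-style timestamp; the year is an assumption (syslog omits it)."""
    # Note: %b expects English month abbreviations (C/English locale).
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps(requested, completed, booted):
    """Return (seconds to fence, seconds from API success to boot)."""
    t_req, t_done, t_boot = parse(requested), parse(completed), parse(booted)
    return (
        int((t_done - t_req).total_seconds()),
        int((t_boot - t_done).total_seconds()),
    )


for test in TESTS:
    fence_s, boot_s = gaps(*test)
    print(f"fence took {fence_s}s, node booted {boot_s}s after API success")
```

With the timestamps above, this reports 12 seconds for the fence operation in both tests, and 37 and 35 seconds between API success and the first kernel log line.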

Comment 3 Reid Wahl 2021-03-25 00:19:46 UTC

Added notes about the need to set method=cycle explicitly to the following KB articles. After the fix for this BZ is released in a zStream, both of these articles should be updated to reflect that RHEL 8.4 package releases after fence-agents-gce-x.y.z don't need method=cycle set explicitly (or that they default to method=cycle).
  - A node shuts down Pacemaker after getting fenced and rejoining the cluster on Google Cloud Platform (https://access.redhat.com/solutions/5644441)
  - Installing and Configuring a Red Hat Enterprise Linux 7.6 (and later) High-Availability Cluster on Google Compute Cloud (https://access.redhat.com/articles/3479821)
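
As a rough illustration of setting method=cycle explicitly, the sketch below builds a `pcs stonith update` command line for an existing fence device. The device ID gce_fence2 is taken from the logs in comment 2 and will differ per cluster; the helper name is hypothetical. The sketch only prints the command and does not run it.

```python
import shlex


def set_method_cycle_cmd(device_id: str) -> list:
    """Build the argv that sets method=cycle on an existing stonith device."""
    return ["pcs", "stonith", "update", device_id, "method=cycle"]


# gce_fence2 is the device name seen in the logs above; substitute your own.
cmd = set_method_cycle_cmd("gce_fence2")
print(shlex.join(cmd))
```

Once a package with the new default is installed, the explicit setting should no longer be necessary.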

Comment 9 errata-xmlrpc 2021-11-09 17:35:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (fence-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4148