Bug 1942363
| Summary: | fence_gce: change default method to cycle | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Oyvind Albrigtsen <oalbrigt> |
| Component: | fence-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | Brandon Perkins <bperkins> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.4 | CC: | agk, bperkins, cluster-maint, fdinitto, nwahl |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.5 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | fence-agents-4.2.1-70.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-11-09 17:35:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Oyvind Albrigtsen
2021-03-24 09:22:14 UTC
We need to change the default method from onoff to cycle: the off action is a soft-off, and the bare-metal instances have no off/on actions at all.

FWIW, it looks like a startup delay is no longer needed at all for the common case, although it adds an extra safety net in case of unforeseen issues. If Google is able to guarantee that the node will not boot until after the API returns success for the reset, then we can eliminate the systemd drop-in delay. They never claimed to guarantee that, but in the tests below, it worked out that way.

Fencing was initiated at 23:17:47 and completed at 23:17:59.

Mar 24 23:17:47 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' [1840] (call 4 from pacemaker-controld.1326) for host 'node2' with device 'gce_fence2' returned: 0 (OK)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: Operation 'reboot' targeting node2 on node1 for pacemaker-controld.1326: OK
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Stonith operation 4/1:3:0:739869d8-d4f7-4dad-9be6-2411d9d6dd3c: OK (0)
Mar 24 23:17:59 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK

The fenced node booted up at 23:18:36.

Mar 24 23:18:36 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020

A second test:

Mar 24 23:36:53 nwahl-rhel8-node1 pacemaker-fenced[1322]: notice: gce_fence2 is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
Mar 24 23:37:05 nwahl-rhel8-node1 pacemaker-controld[1326]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1326: OK
Mar 24 23:37:40 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020

In both of the above tests, the fenced node booted up about 35 seconds after the API returned "success" for the reset (37 seconds in the first test, 35 in the second). I'm going to ask Tim in the upstream PR for more details about the implementation and whether we can count on this or not. It sure would be nice not to require the drop-in.

Added notes about the need to set method=cycle explicitly to the following KB articles. After the fix for this BZ is released in a z-stream, both of these articles should be updated to reflect that RHEL 8.4 package releases after fence-agents-gce-x.y.z don't need method=cycle set explicitly (or that they default to method=cycle).

- A node shuts down Pacemaker after getting fenced and rejoining the cluster on Google Cloud Platform (https://access.redhat.com/solutions/5644441)
- Installing and Configuring a Red Hat Enterprise Linux 7.6 (and later) High-Availability Cluster on Google Compute Cloud (https://access.redhat.com/articles/3479821)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (fence-agents bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4148
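For anyone who wants to re-check the timing from the two fencing tests quoted earlier in this report, here is a minimal Python sketch that recomputes the intervals from the log timestamps. The timestamps are copied from the log excerpts; the year 2021 is an assumption taken from the report date (syslog lines omit it), and the names `TESTS`, `parse` and `gaps` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this report.
# fence_start: "is eligible to fence" line on node1
# fence_ok:    "Peer node2 was terminated ... OK" line on node1
# boot:        first kernel line on node2
TESTS = {
    "test 1": {
        "fence_start": "Mar 24 23:17:47",
        "fence_ok": "Mar 24 23:17:59",
        "boot": "Mar 24 23:18:36",
    },
    "test 2": {
        "fence_start": "Mar 24 23:36:53",
        "fence_ok": "Mar 24 23:37:05",
        "boot": "Mar 24 23:37:40",
    },
}


def parse(stamp, year=2021):
    # The year is assumed; "%b" expects English month names (C/en locale).
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps(test):
    """Return (seconds spent fencing, seconds from fence success to boot)."""
    start = parse(test["fence_start"])
    ok = parse(test["fence_ok"])
    boot = parse(test["boot"])
    return int((ok - start).total_seconds()), int((boot - ok).total_seconds())


for name, test in TESTS.items():
    fence_secs, boot_secs = gaps(test)
    print(f"{name}: fencing took {fence_secs}s, node booted {boot_secs}s after success")
```

Two samples are of course not a guarantee; this only confirms the arithmetic behind the "about 35 seconds" observation.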