Bug 1906978

Summary: fence_gce: method=cycle doesn't complete before fenced node rejoins cluster [RHEL 8]
Product: Red Hat Enterprise Linux 8
Component: fence-agents
Version: 8.3
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED ERRATA
Keywords: Triaged
Target Milestone: rc
Target Release: 8.0
Reporter: Reid Wahl <nwahl>
Assignee: Oyvind Albrigtsen <oalbrigt>
QA Contact: Brandon Perkins <bperkins>
CC: cfeist, cherrylegler, cluster-maint, franklegler
Flags: pm-rhel: mirror+
Fixed In Version: fence-agents-4.2.1-64.el8
Doc Type: If docs needed, set a value
Clone Of:
: 1906999 (view as bug list)
Bug Blocks: 1906999
Type: Bug
Last Closed: 2021-05-18 15:15:26 UTC

Description Reid Wahl 2020-12-12 03:47:35 UTC
Description of problem:

fence_gce uses method=cycle by default. (That should probably be dropped as the default as well, but that's a separate topic.)

When method=cycle is used, it sends a `reset` call to the Google Cloud API. It then checks the status of the reset action once per second.

The problem is that the Google Cloud API doesn't declare the `reset` call to be a success until long after the node has already been powered back on. This means the node may rejoin the cluster before the fence action is declared a success.

Then, when Pacemaker declares the fence operation a success, the fenced node shuts down Pacemaker and Corosync because it has received notification that it was fenced.

This seems to be purely due to the behavior of the Google Cloud API. The fence agent is written in a logical way, but the API does **not** mark the action as a success as soon as the VM is powered on, as we would expect.
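
For reference, the logic in question boils down to something like the following. This is a simplified sketch of the reset-and-poll pattern using the Google API Python client, not the actual fence_gce code; the credential handling, the project/zone/instance values, and the one-second poll interval are illustrative.

~~~
# Simplified sketch of the method=cycle flow (not the actual fence_gce code).
# Assumes application-default credentials; project/zone/instance are placeholders.
import time
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

def cycle_instance(project, zone, instance):
    # Ask the Cloud API for a hard reset. The call returns immediately with an
    # operation resource that starts out RUNNING (compare the API log below,
    # which shows "status": "RUNNING", "progress": "0").
    operation = compute.instances().reset(
        project=project, zone=zone, instance=instance).execute()

    # Poll the zone operation about once per second until it reports DONE.
    # This is where the problem shows up: the VM can be powered back on (and
    # corosync restarted) well before this status flips to DONE.
    while True:
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation["name"]).execute()
        if result.get("status") == "DONE":
            return True
        time.sleep(1)
~~~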

(Note: The logs below are from RHEL 7, but the relevant code hasn't changed in RHEL 8. I'll try to confirm with RHEL 8 and upstream later.)

/var/log/messages:
~~~
# Node 1
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
...
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
...
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)


# Node 2
Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
...
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366: OK
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
~~~


API logs:
~~~
# # 23:27:16: Reset signal sent
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      ...
    },
    "requestMetadata": {
      "callerIp": "35.203.138.0",
      "callerSuppliedUserAgent": "google-api-python-client/1.6.3 (gzip),gzip(gfe)",
      "callerNetwork": "//compute.googleapis.com/projects/cee-ha-testing/global/networks/__unknown__",
      "requestAttributes": {
        "time": "2020-12-11T23:27:16.726991Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.reset",
    "authorizationInfo": [
      ...
    ],
    "resourceName": "projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
    "request": {
      "@type": "type.googleapis.com/compute.instances.reset"
    },
    "response": {
      "zone": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a",
      "selfLinkWithId": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/operations/3635313982828035771",
      "targetLink": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
      "targetId": "4552659091029868098",
      "startTime": "2020-12-11T15:27:16.595-08:00",
      "id": "3635313982828035771",
      "progress": "0",
      "user": "900909399857-compute.com",
      "operationType": "reset",
      "insertTime": "2020-12-11T15:27:16.585-08:00",
      "name": "operation-1607729236249-5b638a2058d82-2c72c252-8c026005",
      "status": "RUNNING",
      "@type": "type.googleapis.com/operation",
      "selfLink": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/operations/operation-1607729236249-5b638a2058d82-2c72c252-8c026005"
    },
    "resourceLocation": {
      "currentLocations": [
        "us-west1-a"
      ]
    }
  },
  "insertId": "sb6pjne1w4x6",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "4552659091029868098",
      "project_id": "cee-ha-testing",
      "zone": "us-west1-a"
    }
  },
  "timestamp": "2020-12-11T23:27:16.297668Z",
  "severity": "NOTICE",
  "logName": "projects/cee-ha-testing/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "operation-1607729236249-5b638a2058d82-2c72c252-8c026005",
    "producer": "compute.googleapis.com",
    "first": true
  },
  "receiveTimestamp": "2020-12-11T23:27:17.078720089Z"
}


# # 23:27:21 to 23:27:25: Node boots back up
# # There are Google Cloud logs for this, but I'm
# # omitting them for space.


# # 23:28:11: Reset action completes
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      ...
    },
    "requestMetadata": {
      "callerIp": "35.203.138.0",
      "callerSuppliedUserAgent": "google-api-python-client/1.6.3 (gzip),gzip(gfe)",
      "callerNetwork": "//compute.googleapis.com/projects/cee-ha-testing/global/networks/__unknown__",
      "requestAttributes": {},
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.reset",
    "resourceName": "projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
    "request": {
      "@type": "type.googleapis.com/compute.instances.reset"
    }
  },
  "insertId": "-oeldhxd4im2",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "4552659091029868098",
      "zone": "us-west1-a",
      "project_id": "cee-ha-testing"
    }
  },
  "timestamp": "2020-12-11T23:28:11.418632Z",
  "severity": "NOTICE",
  "logName": "projects/cee-ha-testing/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "operation-1607729236249-5b638a2058d82
~~~

-----

Version-Release number of selected component (if applicable):

fence-agents-gce-4.2.1-53.el8_3.1

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Configure a fence_gce stonith device with method=cycle (the default).
2. Fence a node.

-----

Actual results:

Fenced node reboots and rejoins the cluster before fencing is complete. It shuts itself down when fencing is complete.

-----

Expected results:

The fence action completes as soon as the fenced node has been powered back on.

-----

Additional info:

A workaround is to set method=onoff.

Comment 2 Reid Wahl 2020-12-12 07:08:53 UTC
Confirmed that the issue still exists on RHEL 8 as expected.

~~~
Dec 12 07:06:17 nwahl-rhel8-node1 pacemaker-fenced[1317]: notice: Requesting peer fencing (reboot) targeting node2
Dec 12 07:06:17 nwahl-rhel8-node1 pacemaker-fenced[1317]: notice: gce_fence is eligible to fence (reboot) node2 (aka. 'nwahl-rhel8-node2'): static-list
...
Dec 12 07:06:30 nwahl-rhel8-node1 corosync[1047]:  [QUORUM] Members[1]: 1
...
Dec 12 07:07:15 nwahl-rhel8-node1 pacemaker-fenced[1317]: notice: Operation 'reboot' [1689] (call 2 from stonith_admin.1688) for host 'node2' with device 'gce_fence' returned: 0 (OK)
...
Dec 12 07:07:15 nwahl-rhel8-node1 corosync[1047]:  [QUORUM] Members[1]: 1


Dec 12 07:07:15 nwahl-rhel8-node2 pacemaker-fenced[1310]: notice: Operation 'reboot' targeting node2 on node1 for stonith_admin.1688: OK
Dec 12 07:07:15 nwahl-rhel8-node2 pacemaker-controld[1314]: crit: We were allegedly just fenced by node1 for node1!
~~~

Comment 5 Reid Wahl 2020-12-22 10:39:16 UTC
Using method=onoff actually isn't a great workaround. I think that if the API worked exactly how we would expect, the decision to use method=cycle as the default would be clearly correct.
  - The off signal sends an ACPI shutdown signal, which by default tells the guest OS to shut down gracefully[1]. We may be able to override that by setting HandlePowerKey=ignore[2], though I recall that approach not working with AWS. Ideally, the API would provide an "immediate hard shutdown" endpoint; currently it does not.
  - The reset signal sends an immediate hard reset signal[3]. This is what we want, even though we would generally prefer to do that via a hard-power-off followed by a power-on.


We should still strive to improve method=onoff in any way possible, but if we can get cycle working correctly, that's probably our best bet in this case.

The issue, as described in comment 0, is that the reset operation doesn't complete immediately upon the guest OS booting back up. I don't currently know how we can work around this at the fence agent level, but workarounds are possible in other aspects of the configuration: we can instruct users to add a GRUB_TIMEOUT, or to delay corosync startup via a systemd drop-in file (sketched below). Additionally, a relatively long delay (e.g., 20 seconds or longer) may be required in order to keep both nodes from fencing each other before one of them gets reset. (This could break down in the event of API connection issues.)
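
For reference, the drop-in would look something like this (a sketch only; the file name is arbitrary, and the 60-second value matches the delay suggested later in this bug):

~~~
# /etc/systemd/system/corosync.service.d/delay-start.conf  (illustrative path)
[Service]
# Give the fence action time to be reported complete before corosync starts
# and the node rejoins the cluster.
ExecStartPre=/bin/sleep 60
~~~

followed by a `systemctl daemon-reload`.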

@Frank, I see that you've added yourself to the CC list. Do you have any idea why the instance reset API call takes so long to complete? On average, the guest OS has been booted back up, with the corosync service started, for about 20 seconds before the status of the reset operation becomes "DONE". For fencing to work as expected, we need the reset operation to be marked "DONE" as soon as the guest OS has begun booting back up.

[1] Stopping an instance (https://cloud.google.com/compute/docs/instances/instance-life-cycle#stopping_an_instance)
[2] 9.13.2. Disabling ACPI Soft-Off in the logind.conf file (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_high_availability_clusters/assembly_configuring-fencing-configuring-and-managing-high-availability-clusters#s2-acpi-disable-logind-CA)
[3] Resetting an instance (https://cloud.google.com/compute/docs/instances/instance-life-cycle#resetting_an_instance)

-----

CC'ing Brandon Perkins (Red Hat) in case he has any ideas about how to approach this problem.

Comment 6 Reid Wahl 2021-01-06 03:05:07 UTC
From Frank via the support case:

~~~
"Hi Reid,

I hope you are doing well. My name is Frank and I am working with Cherry on this case. We have been discussing this issue with our engineering team and have a few updates.

On the reset operation:
The reset operation reports DONE after the VM has been reset, but as you observed not necessarily at the same time. It is doing so by internally checking the status of the actual VM reset. The issue for this use case is that the status checks are asynchronous and only start after about 50 seconds which can lead to the situation that the reset is done already before the operation reports DONE. Even shortening the initial check could still lead to the situation that the VM is reset before the operation reports DONE.

One how to address this:
We discussed this issue with our SAP on Google Cloud engineering team. Given the limitations of monitoring the reset operation, the team is currently investigating options on how we could reliably verify if the reset of the VM was successful. We would like to work on that fix and then push it to the repo https://github.com/ClusterLabs/fence-agents (Tim from that team earlier contributed to the fence_gce.py for some other functionality).

Would this approach be okay with you? If it is okay, I was wondering if the workaround that was earlier suggested could be documented until a fix can be provided (i.e. delay startup of Corosync for 60 seconds)?

Regards,
Frank"
~~~

--------------------

Hi, Frank. Thank you for the insight!

I've updated https://access.redhat.com/solutions/5644441 to document the `ExecStartPre` workaround.

> Even shortening the initial check could still lead to the situation that the VM is reset before the operation reports DONE.

Sure, but if the check is shorter, then it's much less likely that the VM will reset and rejoin the cluster before our query returns the "reset DONE" status. And that of course is what matters. For example, if the reset operation enters "DONE" state two seconds after the VM was reset, then there should be plenty of time to detect that and mark the fence operation as complete before the fenced node finishes booting and rejoins the cluster.

In my testing, it seems to take about 10 to 20 seconds between when a node begins booting and when it starts corosync and rejoins the cluster.

From the initial test on RHEL 7:

    Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
    ...
    Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M)
    ...
    Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
    ...
    Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)


I just ran a couple more tests with RHEL 8.

    # # Boot at 02:44:13, corosync start at 02:44:29
    Jan  6 02:44:13 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020
    Jan  6 02:44:29 nwahl-rhel8-node2 corosync[1056]:  [TOTEM ] kronosnet crypto initialized: aes256/sha256

    # # Boot at 02:46:21, corosync start at 02:46:37
    Jan  6 02:46:21 nwahl-rhel8-node2 kernel: Linux version 4.18.0-240.1.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Fri Oct 16 13:36:46 EDT 2020
    Jan  6 02:46:37 nwahl-rhel8-node2 corosync[1018]:  [TOTEM ] kronosnet crypto initialized: aes256/sha256


So if we were able to get a "reset DONE" status within about 10 seconds of when the VM starts booting up, I think we ought to be good the majority of the time. If we could get a "DONE" status within 5 seconds of the VM booting, we'd be in even better shape.

Would it be at all feasible to start the asynchronous checks sooner? If doing this across the board would break other use cases or cause too much traffic, then maybe we could parameterize it in the reset API call (defaulting to the current behavior)?

-----

> Given the limitations of monitoring the reset operation, the team is currently investigating options on how we could reliably verify if the reset of the VM was successful. We would like to work on that fix and then push it to the repo https://github.com/ClusterLabs/fence-agents. ...
> Would this approach be okay with you? 

That sounds good to me. What approach did you have in mind?
  - Begin the async status checks for the reset operation sooner -- automatically or via an argument to the reset call?
  - Provide a way to force-check the operation status (i.e., an API endpoint that triggers the internal async status check before the time when it would normally start)?
  - Something else?

With a conventional "onoff"-style reboot operation, we would poll the node's power status once per second after sending the "power on" signal. That approach (from the fence agent's end) won't work with fence_gce when method="cycle", because the instances remain in an "ON" power state the whole time. So we really do have to check completion via the reset operation status, rather than the VM's power status.
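
To make that concrete, an onoff-style check would amount to polling something like the sketch below, and during a reset it never observes a transition, because the instance status stays RUNNING the whole time (placeholders and compute client as in the earlier sketch):

~~~
# Why onoff-style verification doesn't help with method=cycle: the GCE instance
# status stays RUNNING across a reset, so a power-status poll never sees an
# off->on transition. (Sketch only; reuses the compute client from above.)
def get_power_status(project, zone, instance):
    result = compute.instances().get(
        project=project, zone=zone, instance=instance).execute()
    # Reported as RUNNING before, during, and after the reset.
    return "on" if result.get("status") == "RUNNING" else "off"
~~~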

And as I probably mentioned earlier -- using method="onoff" instead of method="cycle" isn't a great alternative here, because it triggers a graceful shutdown via ACPI before booting back up. We need an immediate hard power-off (which GCE doesn't seem to offer) or an immediate reset (which is what we're discussing now).

Comment 7 franklegler 2021-01-14 06:01:27 UTC
Thanks for updating the workaround.

The current fence_gce implementation calls the Cloud API reset() method asynchronously and then watches the operation for a DONE status, polling the operation status once per second. The underlying Cloud API reset call will not report DONE on the operation until it can guarantee it is complete. This is what is leading to the increased delay on the fence_gce side. The Cloud API team is currently working to bring the DONE status of the reset operation closer to the actual start of the targeted GCE instance, but we will always have the possibility of the machine being up before the operation reports DONE. This may be below 10 seconds in some cases, but we cannot guarantee it.

Our approach to resolve this will be to continue to monitor the reset operation status from fence_gce, but also to add an additional mechanism that lets the GCE instance being reset report its startup in its metadata. When fence_gce sees either the reset operation complete or the instance report that it has started, it will return from the reset. This should keep the time between the start and end of the reset invocation to less than 10 seconds.
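
A rough sketch of that dual-condition wait, for illustration only (this is not the code from the eventual pull request, and the "startup-marker" metadata key is a made-up name):

~~~
# Sketch of the proposed approach: consider the reset complete as soon as EITHER
# the reset operation reports DONE OR the instance itself reports, via its
# metadata, that it has started back up. The "startup-marker" key is
# hypothetical; the real guest-side mechanism may differ.
def wait_for_reset(project, zone, instance, operation_name):
    while True:
        op = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation_name).execute()
        if op.get("status") == "DONE":
            return True

        inst = compute.instances().get(
            project=project, zone=zone, instance=instance).execute()
        items = inst.get("metadata", {}).get("items", [])
        if any(item.get("key") == "startup-marker" for item in items):
            return True

        time.sleep(1)
~~~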

Comment 16 franklegler 2021-03-16 16:49:00 UTC
This bug was closed with https://github.com/ClusterLabs/fence-agents/pull/377. Unfortunately, that fix causes issues due to the behaviour of onoff on Google Cloud, and we need to move back to powercycle (https://github.com/ClusterLabs/fence-agents/pull/389).

Here is some additional information from my colleague in https://github.com/ClusterLabs/fence-agents/pull/389:
The on/off functionality does not operate like a reset. The API call to off is not a hard power off; it waits for a clean shutdown from the host. If the host is hung, the off call will never return. The only call that produces a hard power off is the reset call (powercycle). Google also does not offer on/off API calls for the bare metal solution; only the reset API is available.

In the meantime, there have also been changes on the Google API side to shorten the time between requesting 'reset' and having the API report that the operation is complete. There can still be some time difference, but in most cases it will be much less than what was observed when this change was introduced. Google will still recommend a startup delay for Pacemaker.

Comment 18 errata-xmlrpc 2021-05-18 15:15:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (fence-agents bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1745