Bug 1906999 - fence_gce: method=cycle doesn't complete before fenced node rejoins cluster [RHEL 7]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: fence-agents
Version: 7.9
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.9
Assignee: Oyvind Albrigtsen
QA Contact: Brandon Perkins
URL:
Whiteboard:
Depends On: 1906978
Blocks:
 
Reported: 2020-12-12 07:30 UTC by Reid Wahl
Modified: 2024-06-13 23:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1906978
Environment:
Last Closed: 2021-10-15 11:20:50 UTC
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Links:
Red Hat Knowledge Base (Solution) 5644441, last updated 2020-12-12 10:01:29 UTC

Description Reid Wahl 2020-12-12 07:30:50 UTC
+++ This bug was initially created as a clone of Bug #1906978 +++

Description of problem:

fence_gce uses method=cycle by default. (That default should probably be changed as well, but that's a separate topic.)

When method=cycle is used, the agent sends a `reset` call to the Google Cloud API and then checks the status of the reset operation once per second.

The problem is that the Google Cloud API doesn't declare the `reset` call to be a success until long after the node has already been powered back on. This means the node may rejoin the cluster before the fence action is declared a success.

Then, when Pacemaker declares the fence operation a success, the fenced node shuts down Pacemaker and Corosync because it has received notification that it was fenced.

This seems to be purely due to the behavior of the Google Cloud API. The fence agent is written in a logical way, but the API does **not** mark the action as a success as soon as the VM is powered on, as we would expect.
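To make the flow concrete, here is a minimal sketch of the cycle logic, assuming the google-api-python-client ("compute", "v1") interface that fence_gce is built on. The function and variable names are illustrative, not the agent's actual code:

~~~
# Minimal sketch of method=cycle, assuming the google-api-python-client
# ("compute", "v1") API. Illustrative only, not the agent's actual code.
import time
import googleapiclient.discovery

def cycle_instance(project, zone, instance):
    conn = googleapiclient.discovery.build("compute", "v1")
    # Ask GCE to reset the instance; the API returns a zone operation handle.
    op = conn.instances().reset(
        project=project, zone=zone, instance=instance).execute()
    # Poll the operation once per second until GCE reports it DONE.
    # The problem described above: GCE only reports DONE long after the
    # instance is running again, so the fenced node can rejoin the cluster
    # before this loop returns success.
    while True:
        result = conn.zoneOperations().get(
            project=project, zone=zone, operation=op["name"]).execute()
        if result["status"] == "DONE":
            return True
        time.sleep(1)
~~~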

/var/log/messages:
~~~
# Node 1
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Client stonith_admin.1366.66468bec wants to fence (reboot) 'nwahl-rhel7-node2' with device '(any)'
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Requesting peer fencing (reboot) of nwahl-rhel7-node2
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:15 nwahl-rhel7-node1 stonith-ng[1158]:  notice: gce_fence can fence (reboot) nwahl-rhel7-node2: static-list
Dec 11 23:27:22 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A processor failed, forming new configuration.
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] A new membership (10.138.0.2:169) was formed. Members left: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [CPG   ] downlist left_list: 1 received
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[1]: 1
Dec 11 23:27:23 nwahl-rhel7-node1 corosync[990]: [MAIN  ] Completed service synchronization, ready to provide service.
...
Dec 11 23:27:36 nwahl-rhel7-node1 corosync[990]: [QUORUM] Members[2]: 1 2
...
Dec 11 23:28:12 nwahl-rhel7-node1 stonith-ng[1158]:  notice: Operation 'reboot' [1367] (call 2 from stonith_admin.1366) for host 'nwahl-rhel7-node2' with device 'gce_fence' returned: 0 (OK)


# Node 2
Dec 11 23:26:44 nwahl-rhel7-node2 systemd: Started Session 1 of user nwahl.
Dec 11 23:27:25 nwahl-rhel7-node2 journal: Runtime journal is using 8.0M (max allowed 365.8M, trying to leave 548.7M free of 3.5G available → current limit 365.8M).
...
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:  notice: Operation reboot of nwahl-rhel7-node2 by nwahl-rhel7-node1 for stonith_admin.1366: OK
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
Dec 11 23:28:12 nwahl-rhel7-node2 stonith-ng[1106]: warning: Can't create a sane reply
Dec 11 23:28:12 nwahl-rhel7-node2 crmd[1110]:    crit: We were allegedly just fenced by nwahl-rhel7-node1 for nwahl-rhel7-node1!
Dec 11 23:28:12 nwahl-rhel7-node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
~~~


API logs:
~~~
# # 23:27:16: Reset signal sent
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      ...
    },
    "requestMetadata": {
      "callerIp": "35.203.138.0",
      "callerSuppliedUserAgent": "google-api-python-client/1.6.3 (gzip),gzip(gfe)",
      "callerNetwork": "//compute.googleapis.com/projects/cee-ha-testing/global/networks/__unknown__",
      "requestAttributes": {
        "time": "2020-12-11T23:27:16.726991Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.reset",
    "authorizationInfo": [
      ...
    ],
    "resourceName": "projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
    "request": {
      "@type": "type.googleapis.com/compute.instances.reset"
    },
    "response": {
      "zone": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a",
      "selfLinkWithId": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/operations/3635313982828035771",
      "targetLink": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
      "targetId": "4552659091029868098",
      "startTime": "2020-12-11T15:27:16.595-08:00",
      "id": "3635313982828035771",
      "progress": "0",
      "user": "900909399857-compute.com",
      "operationType": "reset",
      "insertTime": "2020-12-11T15:27:16.585-08:00",
      "name": "operation-1607729236249-5b638a2058d82-2c72c252-8c026005",
      "status": "RUNNING",
      "@type": "type.googleapis.com/operation",
      "selfLink": "https://www.googleapis.com/compute/v1/projects/cee-ha-testing/zones/us-west1-a/operations/operation-1607729236249-5b638a2058d82-2c72c252-8c026005"
    },
    "resourceLocation": {
      "currentLocations": [
        "us-west1-a"
      ]
    }
  },
  "insertId": "sb6pjne1w4x6",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "4552659091029868098",
      "project_id": "cee-ha-testing",
      "zone": "us-west1-a"
    }
  },
  "timestamp": "2020-12-11T23:27:16.297668Z",
  "severity": "NOTICE",
  "logName": "projects/cee-ha-testing/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "operation-1607729236249-5b638a2058d82-2c72c252-8c026005",
    "producer": "compute.googleapis.com",
    "first": true
  },
  "receiveTimestamp": "2020-12-11T23:27:17.078720089Z"
}


# # 23:27:21 to 23:27:25: Node boots back up
# # There are Google Cloud logs for this, but I'm
# # omitting them for space.


# # 23:28:11: Reset action completes
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      ...
    },
    "requestMetadata": {
      "callerIp": "35.203.138.0",
      "callerSuppliedUserAgent": "google-api-python-client/1.6.3 (gzip),gzip(gfe)",
      "callerNetwork": "//compute.googleapis.com/projects/cee-ha-testing/global/networks/__unknown__",
      "requestAttributes": {},
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.reset",
    "resourceName": "projects/cee-ha-testing/zones/us-west1-a/instances/nwahl-rhel7-node2",
    "request": {
      "@type": "type.googleapis.com/compute.instances.reset"
    }
  },
  "insertId": "-oeldhxd4im2",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "instance_id": "4552659091029868098",
      "zone": "us-west1-a",
      "project_id": "cee-ha-testing"
    }
  },
  "timestamp": "2020-12-11T23:28:11.418632Z",
  "severity": "NOTICE",
  "logName": "projects/cee-ha-testing/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "operation-1607729236249-5b638a2058d82
~~~

-----

Version-Release number of selected component (if applicable):

fence-agents-gce-4.2.1-41.el7_9.2

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Configure a fence_gce stonith device with method=cycle (the default).
2. Fence a node.
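
For example (hypothetical project, zone, and host-to-instance names; the device name matches the logs above):

~~~
pcs stonith create gce_fence fence_gce \
    project=myproject zone=us-west1-a \
    pcmk_host_map="node1:instance1;node2:instance2" method=cycle
pcs stonith fence node2
~~~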

-----

Actual results:

The fenced node reboots and rejoins the cluster before the fence action is declared complete. Once the action is declared complete, the node shuts its cluster services down because it learns it was fenced.

-----

Expected results:

The fence action completes as soon as the fenced node has been powered back on.

-----

Additional info:

A workaround is to set method=onoff.
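
For example, on the device from the logs above:

~~~
# Use separate off and on calls instead of a single reset, so the fence
# action is confirmed before the node is powered back on.
pcs stonith update gce_fence method=onoff
~~~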

Comment 3 Oyvind Albrigtsen 2021-10-15 11:20:50 UTC
Closing due to issue with softoff: https://bugzilla.redhat.com/show_bug.cgi?id=1942363

