Bug 1844500 - unfencing failed after node has rejoined cluster (fence_scsi)
Summary: unfencing failed after node has rejoined cluster (fence_scsi)
Keywords:
Status: CLOSED DUPLICATE of bug 1797579
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-05 14:35 UTC by michal novacek
Modified: 2022-09-13 15:16 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-18 20:31:05 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
'pcs cluster report' taken after the problem occurred (332.85 KB, application/x-bzip)
2020-06-05 14:35 UTC, michal novacek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-45770 0 None None None 2022-09-13 15:16:24 UTC
Red Hat Knowledge Base (Solution) 6534651 0 None None None 2022-09-13 15:12:45 UTC

Description michal novacek 2020-06-05 14:35:31 UTC
Created attachment 1695447
'pcs cluster report' taken after the problem occurred

Description of problem:

We have an automated test that does the following (a command-level sketch of
steps 2-5 follows the list):

 1) install rhel
 2) configure a two-node cluster with fence_scsi and auto tie-breaker
 3) cause a network split using iptables
 4) second node is evicted from the cluster and has its scsi key removed
 5) verify that the second node cannot write to shared storage
 6) second node is rebooted manually
 7) wait for the second node to rejoin the cluster
 8) verify that the second node's key is re-added to shared storage
 9) verify that the second node can write again to the shared storage
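
A minimal sketch of how steps 2-5 might look with RHEL 8 pcs syntax, assuming
hypothetical node names node1/node2 and a shared disk /dev/sdb (the report
does not name either):

  # 2) two-node cluster with fence_scsi and auto tie-breaker
  pcs host auth node1 node2
  pcs cluster setup my_cluster node1 node2
  pcs quorum update auto_tie_breaker=1     # run while the cluster is stopped
  pcs cluster start --all
  pcs stonith create scsi-fence fence_scsi \
      pcmk_host_list="node1 node2" devices="/dev/sdb" \
      meta provides=unfencing
  # 3) on node1, drop all traffic to/from node2 to cause a network split
  iptables -A INPUT -s node2 -j DROP
  iptables -A OUTPUT -d node2 -j DROP
  # 5) on the evicted node2, a direct write should now fail with a
  #    SCSI reservation conflict because its key was removed
  dd if=/dev/zero of=/dev/sdb bs=512 count=1 oflag=direct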

In rare cases it happens that unfencing of the victim node fails: its scsi key
is not added to the shared disk and it is unable to write to shared storage
even though it is part of the cluster.

The problem happens in about one run out of a hundred tests.
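
For steps 8 and 9, the key registration and write access can be checked along
these lines (sg_persist comes from sg3_utils; /dev/sdb is again an assumed
device name):

  # 8) list the persistent reservation keys registered on the shared disk;
  #    after a successful unfence, both nodes' keys should be present
  sg_persist -n -i -k -d /dev/sdb
  # 9) a direct write from the rejoined node succeeds only if its key is back
  dd if=/dev/zero of=/dev/sdb bs=512 count=1 oflag=direct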


Version-Release number of selected component (if applicable):
pacemaker-2.0.3-5.el8_2.1.x86_64
corosync-3.0.3-2.el8.x86_64
fence-agents-all-4.2.1-41.el8.x86_64

How reproducible: very rare, ~2% of runs 

Steps to Reproduce:
1. see description of the problem

Actual results: node not unfenced after it rejoins the cluster

Expected results: node unfenced after rejoining the cluster following eviction

Additional info:
I never saw the problem happen on cluster start; it always happened at step
8), after the node tries to rejoin the cluster.
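
When a run ends up in this state, unfencing can in principle be retried by
hand with pacemaker's stonith_admin; a hedged sketch, assuming the stuck node
is virt-214 and the shared disk is /dev/sdb:

  # ask the fencer to unfence the node again, re-adding its key via fence_scsi
  stonith_admin --unfence virt-214
  # then verify the key is registered on the shared disk
  sg_persist -n -i -k -d /dev/sdb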

Comment 1 Ken Gaillot 2020-06-18 20:31:05 UTC
This is Bug 1797579 -- a failed meta-data command can leave the cluster not realizing a fence device is capable of unfencing.

When the problem occurs, virt-213 is DC and schedules virt-214 for unfencing. virt-213's fencer queries both nodes to find out what devices they have that are capable of unfencing. However, virt-214's meta-data command when registering fence_scsi had failed:

Jun 04 16:08:37 virt-214.cluster-qe.lab.eng.brq.redhat.com pacemaker-fenced    [1535] (action_synced_wait)      warning: fence_scsi_metadata_1:1894 - timed out after 5000ms

Therefore, virt-214 did not know fence_scsi could unfence, and did not reply with any devices. virt-213 did have the meta-data, and since the meta-data requires unfencing to be performed on the target, it could not perform the unfencing either.
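
The capability in question comes from the agent's metadata: fence_scsi's XML
advertises the "on" action with on_target="1" and automatic="1", and the
fencer only schedules unfencing for devices whose metadata it has
successfully read. A quick way to inspect this outside the cluster (a
standard fence-agent action; output details may vary by version):

  # dump the agent's metadata XML and look at the declared "on" action
  fence_scsi -o metadata | grep 'action name="on"'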

Per the original bz, we will need a better log message when the meta-data command fails, and hopefully we can also retry failed meta-data commands when meta-data is needed later.

Why fence_scsi timed out for a meta-data command is another question, outside of pacemaker. There's no indication in the logs that anything was going wrong on that node at the time.

*** This bug has been marked as a duplicate of bug 1797579 ***

