Bug 1905820 - LVM-activate: Unexpected behavior when data disks get DID_BAD_TARGET on Passive Node
Summary: LVM-activate: Unexpected behavior when data disks get DID_BAD_TARGET on Passi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.2
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 2102126
 
Reported: 2020-12-09 07:19 UTC by sreekantha
Modified: 2023-05-16 08:04 UTC (History)
CC List: 12 users

Fixed In Version: resource-agents-4.9.0-30.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2102126 (view as bug list)
Environment:
Last Closed: 2023-05-16 08:04:04 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
var/log/message and pacemaker logs of all the 3 nodes (159.53 KB, application/zip)
2020-12-09 08:26 UTC, sreekantha
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github ClusterLabs resource-agents pull 1783 0 None open LVM-activate: Fix return codes 2022-06-23 23:43:53 UTC
Red Hat Issue Tracker CLUSTERQE-6139 0 None None None 2022-11-11 20:31:02 UTC
Red Hat Issue Tracker KCSOPP-1649 0 None None None 2022-06-27 21:35:01 UTC
Red Hat Knowledge Base (Solution) 6967600 0 None None None 2022-07-13 18:38:18 UTC
Red Hat Product Errata RHBA-2023:2735 0 None None None 2023-05-16 08:04:55 UTC

Description sreekantha 2020-12-09 07:19:35 UTC
Description of problem:
Unexpected behavior when data disks get DID_BAD_TARGET on Passive Node

This occurs in a 3-node Red Hat cluster configured with an SQL FCI setup.

After injecting the DID_BAD_TARGET error on the cluster-configured shared disks (pRDM) of the passive node (the node which does not own the cluster resources), the node does not go OFFLINE; it still shows as online even though its data disks report DID_BAD_TARGET.

Moving the resources to the error-injected node put them into the "STOPPED" state, and they were not even restarted on the other healthy nodes which still had access to the storage. The error logged was:
“Preventing my_lvm from restarting anywhere because of fatal failure (not configured: Volume group[sql_vg] doesn't exist, or not visible on this node!)”

As per RHEL cluster behavior, restart of the resources on a healthy node is prevented by the location constraint that the move command creates automatically. If that constraint is removed before the failure limit is reached, the resources should be restarted on the other healthy nodes; if not, the user is also required to do a resource cleanup (see the example below).
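
For illustration, a sketch of the commands involved, using the resource names from this report (exact syntax may vary with the pcs version in use):

# Show location constraints, including the one auto-created by "pcs resource move"
pcs constraint location

# Remove the constraints created by move/ban for the group
pcs resource clear SQL_Cluster

# Clear the recorded failures so Pacemaker re-evaluates placement
pcs resource cleanup SQL_Cluster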


We tried various steps as per the recommendations:
a. We attempted to delete the move constraint while the move operation was in progress, so that things could be cleaned up before the failover threshold was hit. This did not help, since the move command gets aborted in that case. We also attempted to delete the constraint right after the move command was issued, but that did not help either.
b. Once the resources went to the “STOPPED” state after being moved to the bad node, merely deleting the constraint and running a “resource cleanup” did not help. The resources never restarted on a healthy node in spite of the cleanup. We had to explicitly move the resources to a healthy node in addition to the resource cleanup.

Sample sequence of steps that worked:
pcs resource cleanup <resource>; pcs resource move <resource> <healthy node>

While this sequence helps the user get the resources back into a running state, the resources should be restarted automatically on a healthy node; the sequence is also not intuitive (see the concrete example below).
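
Using the group and node names from this report, the working sequence looks roughly like this (illustrative; cluster1 stands in for any healthy node):

pcs resource cleanup SQL_Cluster; pcs resource move SQL_Cluster cluster1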

Version-Release number of selected component (if applicable):




Additional info:

After inducing the DID_BAD_TARGET on the passive node, we tried moving the resources to that node. After the move command, the SQL_Cluster resource status shows "Stopped" with exitreason='Volume group[my_vg] doesn't exist, or not visible on this node!' for the passive node.

1. In my testbed:
Node List:
* Online: [ cluster1 cluster2 cluster3 ]

Here, cluster1 is the active node.

2. Induced DID_BAD_TARGET on the shared RDM of cluster3. Timestamp when the error was induced: "Sun Sep 27 11:51:20 UTC 2020"

After inducing DID_BAD_TARGET, all 3 nodes show online and the "sg_persist --in -k -d /dev/sdb" command on cluster1 shows all 3 registered keys. On cluster3 the command failed:
[root@cluster3 ~]# sg_persist --in -k -d /dev/sdb
DGC VRAID 5003
Peripheral device type: disk
Persistent reservation in: transport: Host_status=0x04 [DID_BAD_TARGET]
Driver_status=0x00 [DRIVER_OK]

PR generation=0x0, there are NO registered reservation keys
[root@cluster3 ~]#


3. Moved the SQL_Cluster resource to cluster3 (on which DID_BAD_TARGET was induced). Date and time this command was performed: "Sun Sep 27 11:54:03 UTC 2020"
pcs status shows the following:
[root@cluster3 ~]# pcs status
Cluster name: rhel8Cluster
Cluster Summary:
* Stack: corosync
* Current DC: cluster3 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
* Last updated: Sun Sep 27 12:18:40 2020
* Last change: Sun Sep 27 11:54:03 2020 by root via crm_resource on cluster1
* 3 nodes configured
* 5 resource instances configured

Node List:
* Online: [ cluster1 cluster2 cluster3 ]

Full List of Resources:
* scsifence (stonith:fence_scsi): Started cluster1
* Resource Group: SQL_Cluster:
* my_lvm (ocf::heartbeat:LVM-activate): Stopped
* my_fs (ocf::heartbeat:Filesystem): Stopped
* sql_ip (ocf::heartbeat:IPaddr2): Stopped
* MSSQL_HA (ocf::mssql:fci): Stopped

Failed Resource Actions:
* my_lvm_start_0 on cluster3 'not configured' (6): call=74, status='complete', exitreason='Volume group[my_vg] doesn't exist, or not visible on this node!', last-rc-change='2020-09-27 11:54:04Z', queued=0ms, exec=108ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@cluster3 ~]#

In the pacemaker logs of cluster3, we observe messages like:

Sep 27 11:54:04 cluster3 LVM-activate(my_lvm)[185076]: ERROR: Volume group[my_vg] doesn't exist, or not visible on this node!
Sep 27 11:54:04 cluster3 pacemaker-execd[1741]: notice: my_lvm_start_0:185076:stderr [ ocf-exit-reason:Volume group[my_vg] doesn't exist, or not visible on this node! ]

"error: Preventing my_lvm from restarting anywhere because of fatal failure (not configured: Volume group[my_vg] doesn't exist, or not visible on this node!)"

================================================
OCF_CHECK_LEVEL was set to 20 for the filesystem resource:
[root@cluster1 ~]# pcs resource config my_fs
Resource: my_fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/my_vg/my_lv directory=/var/opt/mssql/data fstype=ext4
Operations: monitor interval=20s timeout=40s (my_fs-monitor-interval-20s)
start interval=0s timeout=60s (my_fs-start-interval-0s)
stop interval=0s timeout=60s (my_fs-stop-interval-0s)
monitor interval=61s OCF_CHECK_LEVEL=20 (my_fs-monitor-interval-61s)
[root@cluster1 ~]#
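
For reference, a monitor operation like the 61s/OCF_CHECK_LEVEL=20 one above can typically be added with a command along these lines (hypothetical example; adjust the interval to your configuration):

pcs resource op add my_fs monitor interval=61s OCF_CHECK_LEVEL=20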

Comment 1 Reid Wahl 2020-12-09 07:32:52 UTC
Thanks for reporting this. It happens because the LVM-activate resource returns OCF_ERR_CONFIGURED if it determines that the volume group doesn't exist. OCF_ERR_CONFIGURED is a fatal error, which prevents the resource from starting **anywhere** until the failure is cleared.

I too found this issue while working on some improvements for BZ1902433. I'm planning to submit a PR, which will likely change the failure code to OCF_ERR_GENERIC, to address this issue.

I think I understand what the resource agent author was trying to do here -- if the volume group doesn't exist, then the resource can't run -- although even then, I think OCF_ERR_INSTALLED would be more appropriate since the VG might be present on another node.

However, this neglects a couple of edge cases, such as:
  - PVs temporarily missing due to an iSCSI connection issue
  - PV with DID_BAD_TARGET error

In cases like that, there could be an error that's transient in nature or only occurring on one node. Currently the resource agent responds to that by preventing the resource from running anywhere after a single failure.
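
To make this concrete, here is a minimal sketch of the kind of change being discussed in the agent's volume-group check (illustrative only, not the actual patch; variable and helper names such as $VG and ocf_exit_reason follow common resource-agents conventions):

# LVM-activate (shell) -- illustrative sketch of the proposed behavior
if ! vgs --foreign "$VG" >/dev/null 2>&1; then
    ocf_exit_reason "Volume group[${VG}] doesn't exist, or not visible on this node!"
    # OCF_ERR_GENERIC is a soft error: Pacemaker retries the resource or fails it
    # over to another node, instead of treating the failure as fatal the way
    # OCF_ERR_CONFIGURED does (which bans the resource from running anywhere).
    exit $OCF_ERR_GENERIC
fi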


You can find more info about the return codes and how Pacemaker handles them here:
  - https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted

Comment 2 Reid Wahl 2020-12-09 07:45:54 UTC
Reassigning to resource-agents. Not a Pacemaker issue.

Comment 4 sreekantha 2020-12-09 08:26:07 UTC
Created attachment 1737852 [details]
var/log/message and pacemaker logs of all the 3 nodes

Comment 5 sreekantha 2020-12-24 05:17:01 UTC
Any updates on this bug?

Comment 6 Reid Wahl 2021-01-02 10:59:55 UTC
(In reply to sreekantha from comment #5)
> Any updates on this bug?

I have a draft patch at the following URL: https://github.com/ClusterLabs/resource-agents/pull/1600

You're welcome to try it on a test system.

At the time of this writing (as I explain in the current latest comment[1]), there is still more work to be done so that the resource operations work well when the partial_activation attribute is set to true. However, if you leave partial_activation unset (as most users do), then that shouldn't matter to you.

[1] https://github.com/ClusterLabs/resource-agents/pull/1600#issuecomment-753458579

Comment 7 sreekantha 2021-01-22 12:34:41 UTC
Reid Wahl,
Since the fix is still in progress, can we have this documented in a KB article in the meantime?

Comment 20 errata-xmlrpc 2023-05-16 08:04:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2735

