Bug 1905820
| Field | Value |
|---|---|
| Summary | LVM-activate: Unexpected behavior when data disks get DID_BAD_TARGET on Passive Node |
| Product | Red Hat Enterprise Linux 8 |
| Reporter | sreekantha <sreekantha> |
| Component | resource-agents |
| Assignee | Oyvind Albrigtsen <oalbrigt> |
| Status | CLOSED ERRATA |
| QA Contact | cluster-qe <cluster-qe> |
| Severity | high |
| Priority | unspecified |
| Version | 8.2 |
| CC | agk, cluster-maint, fdinitto, jobaker, mjuricek, nwahl, phagara, rraghotham, sdivya, singhrobin, sramasamy, tutikas |
| Target Milestone | rc |
| Keywords | Triaged |
| Target Release | 8.0 |
| Hardware | Unspecified |
| OS | Linux |
| Fixed In Version | resource-agents-4.9.0-30.el8 |
| Clones | 2102126 (view as bug list) |
| Bug Blocks | 2102126 |
| Type | Bug |
| Last Closed | 2023-05-16 08:04:04 UTC |
| Attachments | attachment 1737852: var/log/message and pacemaker logs of all 3 nodes |
Description sreekantha 2020-12-09 07:19:35 UTC

Thanks for reporting this. It happens because the LVM-activate resource agent returns OCF_ERR_CONFIGURED if it determines that the volume group doesn't exist. OCF_ERR_CONFIGURED is a fatal error, which prevents the resource from starting **anywhere** until the failure is cleared.

I found the same issue while working on some improvements for BZ1902433. I'm planning to submit a PR, which will likely change the failure code to OCF_ERR_GENERIC, to address this issue.

I think I understand what the resource agent author was trying to do here -- if the volume group doesn't exist, then the resource can't run -- although even then, OCF_ERR_INSTALLED would be more appropriate, since the VG might be present on another node. However, this neglects a couple of edge cases, such as:

- PVs temporarily missing due to an iSCSI connection issue
- a PV with a DID_BAD_TARGET error

In cases like these, the error can be transient or can occur on only one node, yet the resource agent currently responds by preventing the resource from running anywhere after a single failure.

More information about the return codes and how Pacemaker handles them is available here:

- https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted

Reassigning to resource-agents. Not a Pacemaker issue.

Created attachment 1737852 [details]
var/log/message and pacemaker logs of all the 3 nodes
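To make the return-code discussion above concrete, here is a minimal sketch of a volume-group existence check in an OCF resource agent. This is illustrative only -- it is not the actual LVM-activate code, nor necessarily the exact change in the eventual patch. The `OCF_ERR_*` constants and `ocf_exit_reason` come from the resource-agents shell library (ocf-shellfuncs); the `vg_exists_check` function name is made up for this example.

```sh
#!/bin/sh
# Illustrative sketch only -- not the real LVM-activate agent.
: ${OCF_ROOT=/usr/lib/ocf}
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

VG="$OCF_RESKEY_vgname"

vg_exists_check() {
    # vgs exits nonzero if the volume group cannot be found.
    if ! vgs --noheadings -o vg_name "$VG" >/dev/null 2>&1; then
        ocf_exit_reason "Volume group $VG not found"
        # Old behavior: OCF_ERR_CONFIGURED is fatal, so a single transient
        # failure (e.g. a DID_BAD_TARGET disk error on one node) bans the
        # resource from starting on ALL nodes.
        #exit $OCF_ERR_CONFIGURED
        # Proposed behavior: OCF_ERR_GENERIC is a soft error, letting
        # Pacemaker retry locally or fail the resource over to another node.
        exit $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

vg_exists_check
```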
Any updates on this bug?

(In reply to sreekantha from comment #5)
> Any updates on this bug?

I have a draft patch at the following URL: https://github.com/ClusterLabs/resource-agents/pull/1600

You're welcome to try it on a test system. At the time of this writing (as I explain in the current latest comment [1]), there is still more work to be done so that the resource operations work well when the partial_activation attribute is set to true. However, if you leave partial_activation unset (as most users do), then that shouldn't matter to you.

[1] https://github.com/ClusterLabs/resource-agents/pull/1600#issuecomment-753458579

Reid Wahl, since the fix is in progress, can we have this documented in a KB article until the fix is in place?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2735
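For reference, until a fixed resource-agents package is installed, a resource that has already hit the fatal OCF_ERR_CONFIGURED failure can be allowed to start again by clearing its failure state manually, once the underlying disk problem is resolved. A minimal sketch using standard pcs/Pacemaker commands, assuming a hypothetical resource named `my_lvm`:

```sh
# Show recorded failures for the resource (hypothetical name "my_lvm").
pcs resource failcount show my_lvm

# Clear the failure state so the cluster will attempt to start it again.
pcs resource cleanup my_lvm

# Equivalent lower-level Pacemaker command:
crm_resource --cleanup --resource my_lvm
```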