Bug 1854340

Summary: When the root volume is unavailable on the DC node, the node keeps running but does not function as expected.
Product: Red Hat Enterprise Linux 8
Reporter: Seunghwan Jung <jseunghw>
Component: pacemaker
Assignee: Klaus Wenninger <kwenning>
Status: POST
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: high
Version: 8.6
CC: cfeist, cluster-maint, kgaillot, khuh, kwenning, msmazova, nwahl, phagara, sbradley
Target Milestone: rc
Keywords: Triaged
Target Release: 8.6
Hardware: All
OS: All
Type: Bug

Description Seunghwan Jung 2020-07-07 08:46:11 UTC
Description of problem:

When the root volume becomes unavailable on the DC node, the node keeps running as a cluster member but does not function as expected.


Version-Release number of selected component (if applicable):

pacemaker-2.0.3-5.el8_2.1.x86_64
corosync-3.0.3-2.el8.x86_64
pcs-0.10.4-6.el8_2.1.x86_64


How reproducible:

Always (from my testing)


Steps to Reproduce:


Testing on a non-DC node.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs constraint --full
Location Constraints:
  Resource: test1
    Enabled on:
      Node: ha8node1 (score:INFINITY) (id:location-test1-ha8node1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
[root@ha8node2 ~]# 



[root@ha8node1 ~]# watch -n 1 pcs status

Every 1.0s: pcs status                                                                                 ha8node1: Tue Jul  7 16:19:57 2020

Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 16:19:57 2020
  * Last change:  Tue Jul  7 16:19:36 2020 by root via cibadmin on ha8node2
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1 
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


[root@ha8node1 ~]# pcs resource config test1 
 Resource: test1 (class=ocf provider=pacemaker type=Dummy)
  Operations: migrate_from interval=0s timeout=20s (test1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20s (test1-migrate_to-interval-0s)
              monitor interval=5 on-fail=fence timeout=5 (test1-monitor-interval-5)
              reload interval=0s timeout=20s (test1-reload-interval-0s)
              start interval=0s timeout=20s (test1-start-interval-0s)
              stop interval=0s timeout=20s (test1-stop-interval-0s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
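
For reference, a roughly equivalent configuration for the test1 resource and its location constraint could be created as sketched below. This is a hedged reconstruction from the output above, not the reporter's actual commands:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Dummy resource with a 5-second monitor that fences the node on failure
pcs resource create test1 ocf:pacemaker:Dummy \
    op monitor interval=5 timeout=5 on-fail=fence

# Preference matching the location-test1-ha8node1-INFINITY constraint
pcs constraint location test1 prefers ha8node1=INFINITY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~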



Disable root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]# 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


A few minutes later, nothing has happened in the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 16:21:30 2020
  * Last change:  Tue Jul  7 16:19:36 2020 by root via cibadmin on ha8node2
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]              <============= 

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1 
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Enable root volume on node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup resume rhel-root
[root@ha8node2 ~]# 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Still nothing happens in the cluster:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 16:22:40 2020
  * Last change:  Tue Jul  7 16:19:36 2020 by root via cibadmin on ha8node2
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1 
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Disable root volume on node2 again
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# dmsetup suspend rhel-root
[root@ha8node2 ~]# 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Trying to move resource 'test1' to node2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node1' with a score of -INFINITY for resource test1 on ha8node1.
	This will prevent test1 from running on ha8node1 until the constraint is removed
	This will be the case even if ha8node1 is the last node in the cluster
[root@ha8node1 ~]# 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


'test1' did not move to node2 for a few minutes; it went to 'Stopped' and then 'FAILED' status,
node2 was fenced in the end, and the resource started on it. It takes some time, but it works!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 16:24:49 2020
  * Last change:  Tue Jul  7 16:23:37 2020 by root via crm_resource on ha8node1
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Stopped           <==============

...

Later, the resource test1 goes to 'FAILED' status, then to 'Stopped', and the node gets fenced.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



However, here is what happens if you test it on the DC node.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 17:03:51 2020
  * Last change:  Tue Jul  7 17:01:49 2020 by root via cibadmin on ha8node1
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@ha8node1 ~]# pcs constraint --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup suspend rhel-root
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nothing happens for about 10 minutes... 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 17:10:45 2020
  * Last change:  Tue Jul  7 17:01:49 2020 by root via cibadmin on ha8node1
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Trying to move the resource 'test1' to node1, which is the DC with the unavailable root volume.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node2 ~]# pcs resource move test1
Warning: Creating location constraint 'cli-ban-test1-on-ha8node2' with a score of -INFINITY for resource test1 on ha8node2.
	This will prevent test1 from running on ha8node2 until the constraint is removed
	This will be the case even if ha8node2 is the last node in the cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nothing happens.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 17:25:57 2020
  * Last change:  Tue Jul  7 17:11:19 2020 by root via crm_resource on ha8node2
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node2

..
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


It starts working again as if there was no issue once the root volume is made available again.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@ha8node1 ~]# dmsetup resume rhel-root
[root@ha8node1 ~]# 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cluster name: ha8_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: ha8node1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Tue Jul  7 17:26:34 2020
  * Last change:  Tue Jul  7 17:11:19 2020 by root via crm_resource on ha8node2
  * 2 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ ha8node1 ha8node2 ]

Full List of Resources:
  * xvmfence1   (stonith:fence_xvm):    Started ha8node1
  * xvmfence2   (stonith:fence_xvm):    Started ha8node2
  * Resource Group: webservice:
    * VIP	(ocf::heartbeat:IPaddr2):	Started ha8node1
    * WebSite   (ocf::heartbeat:apache):        Started ha8node1
    * lvm	(ocf::heartbeat:LVM-activate):  Started ha8node1
    * cluster_fs        (ocf::heartbeat:Filesystem):    Started ha8node1
  * test1	(ocf::pacemaker:Dummy): Started ha8node1

Failed Resource Actions:
  * VIP_monitor_10000 on ha8node1 'error' (1): call=43, status='Timed Out', exitreason='', last-rc-change='2020-07-07 17:26:24 +09:00', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




Actual results:
The DC node with an unavailable root volume remains a member of the cluster but is not functioning (it is unable to host new resources).

Expected results:
The DC node with an unavailable root volume should be fenced or removed from the cluster.
Perhaps one of the non-DC nodes should check whether the CIB on the DC is working correctly.

Additional info:

This is similar to Bug 1725236, which has been fixed.

Comment 2 Ken Gaillot 2020-07-07 15:22:39 UTC
Hi Seunghwan,

This should have been fixed with Bug 1596125; the corresponding fix for RHEL 8 landed via rebase in RHEL 8.1. Can you attach an sosreport from the DC node from when the problem occurred?

Longer explanation:

It is expected behavior that loss of the root volume (or any other disk volume) will not be detected automatically. If that is desired, a cluster resource must be configured to monitor it. There is no agent currently available for that purpose, so it would have to be written (you could create an RFE BZ for that if desired). Such an agent would work like the ocf:heartbeat:ethmonitor agent -- it would not mount and unmount the root volume (as ocf:heartbeat:Filesystem would), but would only run a recurring monitor and set a node attribute accordingly. That node attribute could then be used in location constraints if you want to move resources away from a node that loses the monitored disk volume.
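
As an illustration only, the setup described above might look roughly like this once such an agent exists. The agent name (root-vol-mon) and the node attribute name (root_vol_ok) are hypothetical; only the clone and constraint-rule mechanics are existing pcs functionality:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical ethmonitor-style agent, cloned so every node monitors its own
# root volume and publishes a node attribute (e.g. root_vol_ok=1 or 0)
pcs resource create root_vol_mon ocf:heartbeat:root-vol-mon \
    op monitor interval=10s clone

# Keep the webservice group off any node whose root-volume check has failed
pcs constraint location webservice rule score=-INFINITY \
    root_vol_ok lt 1 or not_defined root_vol_ok
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~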

Without such a resource, Pacemaker will detect a lost disk volume only if it needs to write to that volume for its own purposes, or if a resource monitor that depends on the volume fails. This should eventually happen if /var/lib/pacemaker is on the volume. The DC node will eventually attempt to write the current CIB to disk, which will fail, then Pacemaker on the DC should immediately exit without restarting, and the other nodes should take over and fence the former DC.

Comment 3 Seunghwan Jung 2020-07-08 00:40:11 UTC
Hi Ken,

Thanks for the update.

Yes, the agent you mentioned is actually the one the customer needs in their production environment. I will create an RFE for that.

I found the issue while testing the fix for Bug 1725236 (RHEA-2020:1609).

To collect the sosreport which I will upload, I ran the test again starting around 08:33:10 on 8 Jul 2020.
After I ran 'dmsetup suspend rhel-root' on the DC node, which was ha8node2, the node hung, so I powered the node off and on again.
Please let me know if you need any other data.

Regards,
Hwanii

Comment 5 Seunghwan Jung 2020-07-08 00:43:12 UTC
Comment on attachment 1700234 [details]
sosreport from dc

md5sum 018b2c863c806571e27724f3223ceec6

Comment 6 Reid Wahl 2020-07-08 07:16:02 UTC
I think this is caused by a difference in I/O behavior between quiescing the underlying physical volume (/dev/sda in BZ 1725236) and suspending the logical volume with `dmsetup suspend <VG>-<LV>` as in this BZ.

I reproduced QE's test from BZ1725236 (with quiesce) and it worked as expected. (Note that the node only gets fenced if pacemaker-controld is waiting for a response from the scheduler, so we cause a resource failure by removing the Dummy resource's state file.)

~~~
[root@fastvm-rhel-8-0-24 pacemaker]# pcs status
Cluster name: testcluster
Cluster Summary:
  * Stack: corosync
  * Current DC: fastvm-rhel-8-0-23 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Wed Jul  8 00:01:29 2020
  * Last change:  Wed Jul  8 00:01:13 2020 by root via crm_resource on fastvm-rhel-8-0-23
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ fastvm-rhel-8-0-23 fastvm-rhel-8-0-24 ]

Full List of Resources:
  * xvm	(stonith:fence_xvm):	Started fastvm-rhel-8-0-23
  * dummy	(ocf::heartbeat:Dummy):	Started fastvm-rhel-8-0-23

[root@fastvm-rhel-8-0-24 pacemaker]# date && ssh root@fastvm-rhel-8-0-23 'echo quiesce > /sys/class/block/sda/device/state' && ssh root@fastvm-rhel-8-0-23 'rm -f /run/resource-agents/Dummy-dummy.state'
Wed Jul  8 00:07:46 PDT 2020


Jul  8 00:07:55 fastvm-rhel-8-0-24 pacemaker-attrd[435286]: notice: Setting fail-count-dummy#monitor_10000[fastvm-rhel-8-0-23]: (unset) -> 1
Jul  8 00:07:55 fastvm-rhel-8-0-24 pacemaker-attrd[435286]: notice: Setting last-failure-dummy#monitor_10000[fastvm-rhel-8-0-23]: (unset) -> 1594192075

Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-controld[435288]: notice: Our peer on the DC (fastvm-rhel-8-0-23) is dead
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-controld[435288]: notice: State transition S_NOT_DC -> S_ELECTION
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-controld[435288]: notice: State transition S_ELECTION -> S_INTEGRATION
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-schedulerd[435287]: warning: Cluster node fastvm-rhel-8-0-23 will be fenced: peer process is no longer available
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-schedulerd[435287]: warning: Node fastvm-rhel-8-0-23 is unclean
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-schedulerd[435287]: warning: Unexpected result (not running: No process state file found) was recorded for monitor of dummy on fastvm-rhel-8-0-23 at Jul  8 00:07:55 2020
Jul  8 00:09:55 fastvm-rhel-8-0-24 pacemaker-schedulerd[435287]: warning: Scheduling Node fastvm-rhel-8-0-23 for STONITH
~~~


However, when I try to repeat the test using `dmsetup suspend`, everything just hangs:

~~~
# # Node fastvm-rhel-8-0-23 is DC

[root@fastvm-rhel-8-0-24 pacemaker]# date && ssh root@fastvm-rhel-8-0-23 '/usr/sbin/dmsetup suspend r8vg-root_lv' && ssh root@fastvm-rhel-8-0-23 'rm -f /run/resource-agents/Dummy-dummy.state'
Wed Jul  8 00:11:35 PDT 2020

<hangs indefinitely and does not return to prompt; node 23 has not been fenced after waiting five minutes>
~~~

Comment 7 Ken Gaillot 2020-07-29 23:08:56 UTC
I reproduced the issue a couple of times -- once with remote syslogging enabled on the target node, and once with /var/log/pacemaker on a separate logical volume. With remote syslogging, everything appeared to freeze, and I never got any logs after the suspend. With logging to a separate volume, I could see the logs continuing to come in, and the node was fenced after the fix in Bug 1596125 kicked in.

I am not sure if the logging difference was related to the difference in outcome, or if that was just a timing issue somewhere. We could try reproducing both ways a bunch of times to see, but it doesn't really change the problem.

The controller already has timeouts on various aspects of communication (reading the CIB, contacting the scheduler, etc.) that should prevent this situation in most cases, but I suspect there is some disk I/O (possibly logging) that does not have a timer and is hanging.

We already have Bug 1707851 to make pacemakerd monitor its subdaemons more thoroughly. With that, pacemakerd should be able to tell that the controller is not healthy, and take an action such as killing and respawning it, or exiting and staying down, which should lead to fencing.

However if pacemakerd is also stuck, the issue will remain. If the host uses sbd for fencing, then the fix in Bug 1718324 will allow sbd to detect the problem and panic the host (even if sbd itself hangs too). That leaves the situation where both pacemakerd and the controller are stuck but sbd is not in use.

For that situation, one possibility would be to establish a heartbeat between the DC and non-DC controllers, and if a non-DC controller doesn't receive a heartbeat in time, it would call for a new DC election. However I am not sure this is a good idea; corosync already provides a cluster heartbeat, the pacemakerd monitoring of subdaemons will be a second heartbeat, and layering a third heartbeat over those seems questionable. In addition, if the problem actually lies with the non-DC controller, we could endlessly spawn new elections that interfere with cluster operation. I'd like to think about that some more.

This will likely wait until Bug 1707851 is complete (expected in the 8.4 time frame) to see how effective that approach is.

Fortunately, I think dmsetup suspend is an artificial way of simulating disk failure, since the kernel handles the suspension gracefully (there is no I/O error). A real failure would more likely present like Bug 1596125 and be handled correctly. Of course Pacemaker should handle both situations, but I suspect the real-world risk is low.

Comment 8 Klaus Wenninger 2020-07-30 05:29:08 UTC
Yep - that is definitely a case where a fix to Bug 1707851 could shine.

Would be interesting to see if the fix for Bug 1718324 already detected the issue.

A quick test could be done by enabling SBD with defaults (just check beforehand that there is a working hardware watchdog - sbd query-watchdog / test-watchdog).
You don't need a shared disk (although it might be interesting to see whether using a partition of the disk that becomes unresponsive as the sbd device would be of some use here - e.g. configure sbd without pacemaker integration, running purely with a single disk. That should trigger self-fencing if the disk becomes unresponsive. For this kind of misuse case the disk wouldn't even have to be visible to other nodes.).
Not even stonith-watchdog-timeout would have to be set in pacemaker (that would make pacemaker use watchdog fencing).
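
A minimal sketch of that quick test (watchdog-only SBD, no shared disk) might look like this; run the watchdog checks first and note that test-watchdog will reset the machine if the watchdog works:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Check that a usable hardware watchdog is present
sbd query-watchdog
# CAUTION: this resets the machine if the watchdog is working
sbd test-watchdog

# Enable sbd with defaults; the cluster must be restarted to pick it up
pcs stonith sbd enable
pcs cluster stop --all && pcs cluster start --all
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~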

An alternative for layered/parallel heartbeats could be to add something to the DC negotiation protocol that requires some kind of renegotiation from time to time. Would require a bit more thinking how to embed that with fixes for Bug 1707851 & 1718324 and support for rolling upgrades.

Although the chance that in real-world scenarios we stay without an I/O error forever might be low, the timeout until we get an error might be unacceptable for a cluster that is expected to react quickly. IIRC SATA commands time out after 30s or so. With hardware that is expected to do firmware updates while operational, it might be even longer from what I've heard.

Comment 38 Ken Gaillot 2021-12-07 23:05:44 UTC
For added context, Bug 1509319 (RHEL 8.5) created the ocf:heartbeat:storage-mon health agent for monitoring low-level storage devices. That will be sufficient in many cases. However, if the resource agent itself or its shell interpreter is on the disk of interest, that solution will not work consistently, so a Pacemaker-internal solution could still be useful.
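
For completeness, using that agent would look something like the sketch below. The parameter names (drives, io_timeout) and the node-health integration are assumptions; check `pcs resource describe ocf:heartbeat:storage-mon` for the authoritative interface:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Inspect the agent's actual parameters first
pcs resource describe ocf:heartbeat:storage-mon

# Clone the monitor so each node checks its own local device
# (drives/io_timeout are assumed parameter names)
pcs resource create storage_mon ocf:heartbeat:storage-mon \
    drives=/dev/sda io_timeout=10 op monitor interval=30s clone

# Let the cluster react to the node-health attribute the agent maintains
pcs property set node-health-strategy=only-green
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~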

Comment 39 Ken Gaillot 2022-01-05 16:41:00 UTC
Moving to 9.0 since the necessary libqb support will not be in 8.6. The fix should land in 8.7 via rebase.

Comment 41 Ken Gaillot 2022-01-12 20:55:55 UTC
Moving back to RHEL 8 (8.7) since this will not make 9.0 either; it is now expected in 9.1 via rebase