Bug 2032997

Summary:	When a fence-scsi storage device becomes unavailable, pcs cannot remove it from configuration and add a running one
Product:	Red Hat Enterprise Linux 8	Reporter:	Nina Hostakova <nhostako>
Component:	pcs	Assignee:	Miroslav Lisik <mlisik>
Status:	CLOSED ERRATA	QA Contact:	cluster-qe <cluster-qe>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	8.5	CC:	cluster-maint, idevat, kmalyjur, mlisik, mpospisi, omular, tojeline
Target Milestone:	rc	Keywords:	Triaged
Target Release:	8.6
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pcs-0.10.12-3.el8	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	2033248		Environment:
Last Closed:	2022-05-10 14:50:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2033248

Description Nina Hostakova 2021-12-15 16:37:19 UTC

Description of problem:
When fence_scsi is configured with a disk that suddenly becomes unavailable, pcs can neither remove that disk and add a new one, nor simply add a new working disk (using the 'pcs stonith update-scsi-devices' add/remove syntax).

Version-Release number of selected component (if applicable):
pcs-0.10.12-1.el8.

How reproducible:
always

Steps to Reproduce:

[root@virt-033 ~]# export DISK1=/dev/disk/by-id/scsi-360014055b5a24c2c777487cb03a19c98
[root@virt-033 ~]# export DISK2=/dev/disk/by-id/scsi-36001405c04d8aed56e04cd1a3b0aabe8
[root@virt-033 ~]# export DISK3=/dev/disk/by-id/scsi-36001405d95ea40f8b2a4a3d86d866e20

1. Configure fence_scsi with sda shared disk
[root@virt-033 ~]# pcs stonith create scsi-fencing fence_scsi devices="$DISK1" pcmk_host_check="static-list" pcmk_host_list="virt-033 virt-034 virt-035" pcmk_reboot_action="off" meta provides="unfencing"

2. Make the disk unavailable
[root@virt-033 ~]# echo "offline" > /sys/block/sda/device/state

3. Try to replace the unavailable disk, or just add a new disk
[root@virt-033 ~]# pcs stonith update-scsi-devices scsi-fencing remove $DISK1 add $DISK3
Error: virt-033: Unfencing failed, unable to check status of device '/dev/disk/by-id/scsi-360014055b5a24c2c777487cb03a19c98': 2021-12-15 12:10:35,797 ERROR: Cannot get registration keys

2021-12-15 12:10:35,798 ERROR: Please use '-h' for usage
Error: Errors have occurred, therefore pcs is unable to continue
[root@virt-033 ~]# echo $?
1

[root@virt-033 ~]# pcs stonith update-scsi-devices scsi-fencing add $DISK3
Error: virt-033: Unfencing failed, unable to check status of device '/dev/disk/by-id/scsi-360014055b5a24c2c777487cb03a19c98': 2021-12-15 12:11:06,424 ERROR: Cannot get registration keys

2021-12-15 12:11:06,425 ERROR: Please use '-h' for usage
Error: Errors have occurred, therefore pcs is unable to continue
[root@virt-033 ~]# echo $?
1
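
Step 2 writes to /sys/block/sda directly, which assumes that $DISK1 resolves to sda on this node. A minimal sketch of deriving the sysfs state file from the by-id symlink instead (the same idea as the basename/readlink pipeline used in comment 3). `scsi_state_path` is a hypothetical helper name, not part of pcs, and it assumes the link points at a whole disk rather than a partition:

```shell
# scsi_state_path DEVICE
# Hypothetical helper: print the sysfs state file for a block device given
# by any path or symlink (for example a /dev/disk/by-id/... link).
# Assumes the link points at a whole disk, not a partition.
scsi_state_path() {
    # Resolve the symlink chain to the real device node, e.g. /dev/sda
    real_dev=$(readlink -f "$1") || return 1
    printf '/sys/block/%s/device/state\n' "$(basename "$real_dev")"
}

# Usage (as root): take the disk behind $DISK1 offline, then confirm.
# echo offline > "$(scsi_state_path "$DISK1")"
# cat "$(scsi_state_path "$DISK1")"
```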


Expected results:
It should be possible to remove an unavailable disk or add a new available disk.

Additional info:
This bug is also present in 8.5 and was introduced by the fix to prevent unfencing of nodes without quorum (bz1991654)

Comment 2 Miroslav Lisik 2021-12-16 09:43:34 UTC

There are 2 cases:

1) adding a new disk and removing the unavailable disk
This can be fixed.
There is a workaround: update the disks in two steps. First remove the unavailable disks, then add a new disk.

2) adding a disk to a cluster with unavailable disks
It is acceptable for this to fail, because it is not possible to check whether the node was fenced.
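
The two-step workaround for case 1 could be scripted roughly as follows. This is only a sketch: `replace_scsi_device` is a hypothetical wrapper, the `PCS` variable exists solely so the sequence can be dry-run, and the report does not show whether pcs accepts the first step when it would leave the stonith resource with no devices.

```shell
# Command to run; override with PCS=echo to print the commands instead.
PCS="${PCS:-pcs}"

# replace_scsi_device STONITH_ID OLD_DEVICE NEW_DEVICE
# Hypothetical wrapper around the two-step workaround.
replace_scsi_device() {
    stonith_id=$1
    old_dev=$2
    new_dev=$3
    # Step 1: remove the unavailable device; stop if pcs reports an error
    $PCS stonith update-scsi-devices "$stonith_id" remove "$old_dev" || return 1
    # Step 2: add the replacement device
    $PCS stonith update-scsi-devices "$stonith_id" add "$new_dev"
}

# Usage, with the names from the reproducer:
# replace_scsi_device scsi-fencing "$DISK1" "$DISK3"
```

Stopping after a failed first step keeps the second command from running against a configuration that still contains the unavailable disk.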

Comment 3 Miroslav Lisik 2022-01-13 13:23:36 UTC

Proposed fix:
https://github.com/ClusterLabs/pcs/commit/b3332366d280379edfe0dc95e8f789b02aac1166

Test:

Case 1:

[root@r8-node-01 pcs]# export disk1=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b
[root@r8-node-01 pcs]# export disk2=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe
[root@r8-node-01 pcs]# export disk3=/dev/disk/by-id/scsi-360014052f8c6f3de01047c29b72040f4

[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

[root@r8-node-01 pcs]# echo offline > /sys/block/$(basename $(readlink $disk1))/device/state;
[root@r8-node-01 pcs]# cat /sys/block/$(basename $(readlink $disk1))/device/state;
offline

[root@r8-node-01 pcs]# lpcs stonith update-scsi-devices fence-scsi add $disk2 remove $disk1
[root@r8-node-01 pcs]# echo $?
0
[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)


Case 2:

[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

[root@r8-node-01 pcs]# echo offline > /sys/block/$(basename $(readlink $disk2))/device/state;
[root@r8-node-01 pcs]# cat /sys/block/$(basename $(readlink $disk2))/device/state;
offline

[root@r8-node-01 pcs]# lpcs stonith update-scsi-devices fence-scsi add $disk3
Error: r8-node-01: Unfencing failed, unable to check status of device '/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe': 2022-01-13 14:10:52,938 ERROR: Cannot get registration keys

2022-01-13 14:10:52,939 ERROR: Please use '-h' for usage
Error: Errors have occurred, therefore pcs is unable to continue
(pcs) [root@r8-node-01 pcs]# echo $?
1
[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

Comment 4 Miroslav Lisik 2022-01-14 14:43:39 UTC

DevTestResults:

[root@r8-node-01 pcs]# rpm -q pcs
pcs-0.10.12-3.el8.x86_64

[root@r8-node-01 pcs]# export disk1=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b
[root@r8-node-01 pcs]# export disk2=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe

[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

[root@r8-node-01 pcs]# echo offline > /sys/block/$(basename $(readlink $disk1))/device/state
[root@r8-node-01 pcs]# cat /sys/block/$(basename $(readlink $disk1))/device/state
offline

[root@r8-node-01 pcs]# pcs stonith update-scsi-devices fence-scsi add $disk2 remove $disk1
[root@r8-node-01 pcs]# echo $?
0
[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

Comment 10 errata-xmlrpc 2022-05-10 14:50:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:1978