Bug 1991654

Summary: update-scsi-devices command unfence a node without quorum

Product: Red Hat Enterprise Linux 8
Component: pcs
Version: 8.5
Target Release: 8.5
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: beta
Keywords: Triaged
Reporter: Michal Mazourek <mmazoure>
Assignee: Miroslav Lisik <mlisik>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, idevat, kmalyjur, lmiksik, mlisik, mpospisi, nhostako, omular, sbradley, tojeline
Fixed In Version: pcs-0.10.10-4.el8
Doc Type: Bug Fix
Doc Text: The plan is to get the fix done before the bugged pcs packages are released.
Bug Blocks: 2003066
Type: Bug
Last Closed: 2021-11-09 17:34:53 UTC

Attachments:
proposed fix + tests (no flags)

Description Michal Mazourek 2021-08-09 15:34:32 UTC
Description of problem:
A new command, 'pcs stonith update-scsi-devices', unfences a node that is not quorate.


Version-Release number of selected component (if applicable):
pcs-0.10.8-4.el8


How reproducible:
always


Steps to Reproduce:

## Having scsi fencing set up (3 disks, 3 nodes)

[root@virt-499 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401,/dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c,/dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767 pcmk_host_check=static-list pcmk_host_list="virt-499 virt-504 virt-519" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767
  PR generation=0x27, 3 registered reservation keys follow:
    0x2e6e0000
    0x2e6e0002
    0x2e6e0001
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c
  PR generation=0x1c, 3 registered reservation keys follow:
    0x2e6e0002
    0x2e6e0000
    0x2e6e0001
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401
  PR generation=0x1a, 3 registered reservation keys follow:
    0x2e6e0002
    0x2e6e0001
    0x2e6e0000
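
(For reference, a fence_scsi configuration like the one above is typically created with a single pcs command; this is only a sketch, based on the attributes shown here and on the create command used in comment 11 below:)

pcs stonith create scsi-fencing fence_scsi \
    devices="/dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401,/dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c,/dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767" \
    pcmk_host_check="static-list" pcmk_host_list="virt-499 virt-504 virt-519" \
    pcmk_reboot_action="off" meta provides="unfencing"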


## Fence one node by blocking corosync ports

[root@virt-519 ~]# ip6tables -A INPUT ! -i lo -p udp --dport 5404 -j DROP && ip6tables -A INPUT ! -i lo -p udp --dport 5405 -j DROP && ip6tables -A OUTPUT ! -o lo -p udp --sport 5404 -j DROP && ip6tables -A OUTPUT ! -o lo -p udp --sport 5405 -j DROP

[root@virt-519 ~]# corosync-quorumtool | grep Quorate
Quorate:          No

# on other nodes
[root@virt-499 ~]# corosync-quorumtool | grep Quorate
Quorate:          Yes
Flags:            Quorate 
[root@virt-504 ~]# corosync-quorumtool | grep Quorate
Quorate:          Yes
Flags:            Quorate 


## Checking registered keys on the disks

[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767
  PR generation=0x28, 2 registered reservation keys follow:
    0x2e6e0000
    0x2e6e0001
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c
  PR generation=0x1d, 2 registered reservation keys follow:
    0x2e6e0000
    0x2e6e0001
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401
  PR generation=0x1b, 2 registered reservation keys follow:
    0x2e6e0001
    0x2e6e0000

> So far OK: the fencing is recognized and the node's registration key was deleted from the disks
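
(Side note, not part of the captured output: one way to additionally confirm the fencing event is the fence history, assuming pcs 0.10 syntax:)

pcs stonith history show virt-519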


## Updating fence_scsi disks, while one node is still fenced

[root@virt-499 ~]# pcs stonith update-scsi-devices scsi-fencing set /dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767 /dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c /dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401
[root@virt-499 ~]# echo $?
0

[root@virt-519 ~]# corosync-quorumtool | grep Quorate
Quorate:          No

[root@virt-499 ~]# pcs status | grep Node -A 2
Node List:
  * Online: [ virt-499 virt-504 ]
  * OFFLINE: [ virt-519 ]


## Checking registered keys on the disks again

[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405e9ba8116b7a944cfb4b88b767
  PR generation=0x2a, 3 registered reservation keys follow:
    0x2e6e0000
    0x2e6e0001
    0x2e6e0002
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x6001405978d3d55b2f34d3481433377c
  PR generation=0x1f, 3 registered reservation keys follow:
    0x2e6e0000
    0x2e6e0001
    0x2e6e0002
[root@virt-499 ~]# sg_persist -n -i -k -d /dev/disk/by-id/wwn-0x60014050ca58fa9f66b488491466c401
  PR generation=0x1d, 3 registered reservation keys follow:
    0x2e6e0001
    0x2e6e0000
    0x2e6e0002


Actual results:
The update unfenced the node without quorum


Expected results:
The update preferably should not unfence a node that is not quorate.
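
(For comparison, the fixed behavior, as verified later in comments 7 and 11, keeps the configuration update but skips unfencing on the fenced node and prints a warning of roughly this form:)

<node>: Unfencing skipped, device '<device>' is fenced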

Comment 6 Miroslav Lisik 2021-09-24 08:52:30 UTC
Created attachment 1825859 [details]
proposed fix + tests

Updated command:
* pcs stonith update-scsi-devices

Test:
* setup a cluster with a fence_scsi stonith resource
* setup resources running on each node
* block corosync traffic on one cluster node and wait until the node is fenced
* add scsi devices using `pcs stonith update-scsi-devices add` or `pcs stonith update-scsi-devices set`
* check the result: devices should be unfenced only on nodes that are not fenced, and resources should not be restarted (a condensed command sketch follows below)
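
A condensed command sequence for the test above (a sketch only; the stonith id, node names, and device paths are placeholders):

# on the node that should get fenced: block corosync traffic
iptables -A INPUT ! -i lo -p udp --dport 5404 -j DROP && \
iptables -A INPUT ! -i lo -p udp --dport 5405 -j DROP && \
iptables -A OUTPUT ! -o lo -p udp --sport 5404 -j DROP && \
iptables -A OUTPUT ! -o lo -p udp --sport 5405 -j DROP

# on a quorate node, once the blocked node is reported as fenced/offline:
pcs stonith update-scsi-devices <stonith-id> add <new-device>

# verify: the fenced node's key must not be registered on any of the devices
sg_persist -n -i -k -d <device>

# verify: resources must not be restarted (the start operation timestamp is unchanged)
crm_resource --list-all-operations --resource <resource-id> | grep start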

Comment 7 Miroslav Lisik 2021-09-24 11:12:30 UTC
DevTestResults:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.10-4.el8.x86_64

Environment: Cluster with a fence_scsi stonith resource and resources running on each node.


[root@r8-node-01 pcs]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)
[root@r8-node-01 pcs]# pcs resource
  * d-01        (ocf::pacemaker:Dummy):  Started r8-node-02
  * d-02        (ocf::pacemaker:Dummy):  Started r8-node-03
  * d-03        (ocf::pacemaker:Dummy):  Started r8-node-01
  * d-04        (ocf::pacemaker:Dummy):  Started r8-node-02
  * d-05        (ocf::pacemaker:Dummy):  Started r8-node-03
  * d-06        (ocf::pacemaker:Dummy):  Started r8-node-01
[root@r8-node-01 pcs]# echo $disk{1..3}
/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b /dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe /dev/disk/by-id/scsi-360014052f8c6f3de01047c29b72040f4
[root@r8-node-01 pcs]# for disk in $disk{1..3}; do sg_persist -n -i -k -d $disk; done
  PR generation=0x8, 3 registered reservation keys follow:
    0x14080000
    0x14080001
    0x14080002
  PR generation=0x5, there are NO registered reservation keys
  PR generation=0x4, there are NO registered reservation keys

### Block corosync traffic:

[root@r8-node-03 ~]# iptables -A INPUT ! -i lo -p udp --dport 5404 -j DROP && iptables -A INPUT ! -i lo -p udp --dport 5405 -j DROP && iptables -A OUTPUT ! -o lo -p udp --sport 5404 -j DROP && iptables -A OUTPUT ! -o lo -p udp --sport 5405 -j DROP
[root@r8-node-03 ~]# pcs status nodes
Pacemaker Nodes:
 Online: r8-node-03
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline: r8-node-01 r8-node-02
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
[root@r8-node-01 pcs]# for disk in $disk{1..3}; do sg_persist -n -i -k -d $disk; done
  PR generation=0x8, 3 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x5, there are NO registered reservation keys
  PR generation=0x4, there are NO registered reservation keys

[root@r8-node-01 pcs]# pcs status nodes
Pacemaker Nodes:
 Online: r8-node-01 r8-node-02
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline: r8-node-03
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:

### Add scsi devices

[root@r8-node-01 pcs]# pcs stonith update-scsi-devices fence-scsi add $disk2 $disk3
r8-node-03: Unfencing skipped, device '/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b' is fenced
[root@r8-node-01 pcs]# echo $?
0

### Check registration keys on the disks

[root@r8-node-01 pcs]# for disk in $disk{1..3}; do sg_persist -n -i -k -d $disk; done
  PR generation=0x9, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x7, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x6, 2 registered reservation keys follow:
    0x14080000
    0x14080001

There is no key of the fenced node.

Comment 11 Michal Mazourek 2021-09-30 13:52:01 UTC
AFTER:
======

[root@virt-488 ~]# rpm -q pcs
pcs-0.10.10-4.el8.x86_64


## Environment: cluster with 3 nodes, 3 shared disks, dummy resource and scsi fencing

# nodes
[root@virt-488 ~]# pcs status nodes
Pacemaker Nodes:
 Online: virt-488 virt-489 virt-527
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:

# disks
[root@virt-488 ~]# ls -lr /dev/disk/by-id/ | grep -m 3 "sda\|sdb\|sdc" 
lrwxrwxrwx. 1 root root  9 Sep 30 10:54 wwn-0x60014057d85bd87407f4f498e819029a -> ../../sdc
lrwxrwxrwx. 1 root root  9 Sep 30 10:54 wwn-0x60014057d43430762ed4fbfbc895e26e -> ../../sdb
lrwxrwxrwx. 1 root root  9 Sep 30 10:54 wwn-0x600140566f7eadb8310437c8a08d9309 -> ../../sda
[root@virt-488 ~]# export DISK1=/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309
[root@virt-488 ~]# export DISK2=/dev/disk/by-id/wwn-0x60014057d43430762ed4fbfbc895e26e
[root@virt-488 ~]# export DISK3=/dev/disk/by-id/wwn-0x60014057d85bd87407f4f498e819029a

[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys

# scsi fencing
[root@virt-488 ~]# pcs stonith create scsi-fencing fence_scsi devices="$DISK1" pcmk_host_check="static-list" pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action="off" meta provides="unfencing"
[root@virt-488 ~]# echo $?
0
[root@virt-488 ~]# pcs stonith
  * scsi-fencing	(stonith:fence_scsi):	 Started virt-488
[root@virt-488 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309 pcmk_host_check=static-list pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)

# keys on the disks
[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x3, 3 registered reservation keys follow:
    0xc5370000
    0xc5370002
    0xc5370001
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys

# resource with its start time
[root@virt-488 ~]# pcs resource create dummy1 ocf:heartbeat:Dummy
[root@virt-488 ~]# crm_resource --list-all-operations --resource dummy1 | grep start
dummy1	(ocf::heartbeat:Dummy):	 Started: dummy1_start_0 (node=virt-489, call=65, rc=0, last-rc-change=Thu Sep 30 14:22:42 2021, exec=20ms): complete


## Fencing one node by blocking corosync traffic

[root@virt-527 ~]# ip6tables -A INPUT ! -i lo -p udp --dport 5404 -j DROP && ip6tables -A INPUT ! -i lo -p udp --dport 5405 -j DROP && ip6tables -A OUTPUT ! -o lo -p udp --sport 5404 -j DROP && ip6tables -A OUTPUT ! -o lo -p udp --sport 5405 -j DROP
[root@virt-527 ~]# echo $?
0
[root@virt-527 ~]# corosync-quorumtool | grep Quorate
Quorate:          No

# checking nodes
[root@virt-488 ~]# pcs status nodes
Pacemaker Nodes:
 Online: virt-488 virt-489
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline: virt-527
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:

# checking the keys on the devices
[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x4, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys


## Adding scsi device 

# update-scsi-devices add
[root@virt-488 ~]# pcs stonith update-scsi-devices scsi-fencing add $DISK2
virt-527: Unfencing skipped, device '/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309' is fenced
[root@virt-488 ~]# echo $?
0

> OK: A warning message notifies that unfencing was skipped for the device on the fenced node

[root@virt-488 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309,/dev/disk/by-id/wwn-0x60014057d43430762ed4fbfbc895e26e pcmk_host_check=static-list pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)

[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x4, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x0, there are NO registered reservation keys

> OK: A node without quorum wasn't unfenced

# update-scsi-devices set
[root@virt-488 ~]# pcs stonith update-scsi-devices scsi-fencing set $DISK3
virt-527: Unfencing skipped, devices '/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309', '/dev/disk/by-id/wwn-0x60014057d43430762ed4fbfbc895e26e' are fenced
[root@virt-488 ~]# echo $?
0

[root@virt-488 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x60014057d85bd87407f4f498e819029a pcmk_host_check=static-list pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)

[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x4, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370001
    0xc5370000

> OK: Fenced node's key isn't registered

# combination of add and remove
[root@virt-488 ~]# pcs stonith update-scsi-devices scsi-fencing remove $DISK3 add $DISK1
virt-527: Unfencing skipped, device '/dev/disk/by-id/wwn-0x60014057d85bd87407f4f498e819029a' is fenced
[root@virt-488 ~]# echo $?
0
[root@virt-488 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309 pcmk_host_check=static-list pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)
[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x5, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370001
    0xc5370000

> OK


## Rebooting the fenced node
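
(Assumption, since the recovery step itself is not captured in the transcript: rebooting virt-527 clears the non-persistent ip6tables rules; on startup the node rejoins the cluster and, because the stonith resource has meta provides=unfencing, it is unfenced again.)

reboot    # run on virt-527; the ip6tables -A rules added above do not survive a reboot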

[root@virt-527 ~]# pcs status nodes
Pacemaker Nodes:
 Online: virt-488 virt-489 virt-527
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:

[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x6, 3 registered reservation keys follow:
    0xc5370000
    0xc5370001
    0xc5370002
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370000
    0xc5370001
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370001
    0xc5370000

> OK: Key of the node was added to the configured disk

[root@virt-488 ~]# pcs stonith update-scsi-devices scsi-fencing add $DISK2
[root@virt-488 ~]# echo $?
0
[root@virt-488 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/wwn-0x600140566f7eadb8310437c8a08d9309,/dev/disk/by-id/wwn-0x60014057d43430762ed4fbfbc895e26e pcmk_host_check=static-list pcmk_host_list="virt-488 virt-489 virt-527" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)
[root@virt-488 ~]# for DISK in $DISK{1..3}; do sg_persist -n -i -k -d $DISK; done
  PR generation=0x6, 3 registered reservation keys follow:
    0xc5370000
    0xc5370001
    0xc5370002
  PR generation=0x4, 3 registered reservation keys follow:
    0xc5370000
    0xc5370001
    0xc5370002
  PR generation=0x2, 2 registered reservation keys follow:
    0xc5370001
    0xc5370000

> OK


## Checking that the resource has not restarted

[root@virt-488 ~]# crm_resource --list-all-operations --resource dummy1 | grep start
dummy1	(ocf::heartbeat:Dummy):	 Started: dummy1_start_0 (node=virt-489, call=65, rc=0, last-rc-change=Thu Sep 30 14:22:42 2021, exec=20ms): complete

> OK: Start time stayed the same


Marking as VERIFIED for pcs-0.10.10-4.el8

Comment 13 errata-xmlrpc 2021-11-09 17:34:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: pcs security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4142