Bug 1872378

Summary: [RFE] Provide a way to add a scsi fencing device to a cluster without requiring a restart of all cluster resources
Product: Red Hat Enterprise Linux 8
Reporter: Chris Feist <cfeist>
Component: pcs
Assignee: Miroslav Lisik <mlisik>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact: Steven J. Levine <slevine>
Priority: urgent
Version: 8.3
CC: cfeist, cluster-maint, cluster-qe, idevat, kgaillot, mlisik, mmazoure, mpospisi, nhostako, omular, sbradley, slevine, tojeline
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: 8.5
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.10.8-4.el8
Doc Type: Enhancement
Doc Text:
.New `pcs` command to update SCSI fencing device without causing restart of all other resources

Updating a SCSI fencing device with the `pcs stonith update` command causes a restart of all resources running on the same node where the stonith resource was running. The new `pcs stonith update-scsi-devices` command allows you to update SCSI devices without causing a restart of other cluster resources.
Story Points: ---
Clone Of: 1872376
: 2023845 2035332
Environment:
Last Closed: 2021-11-09 17:33:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1872376
Bug Blocks: 1894575, 2023845, 2024522, 2035332
Attachments:
Description               Flags
proposed fix + tests      none
additional fixes + tests  none

Description Chris Feist 2020-08-25 15:40:49 UTC
+++ This bug was initially created as a clone of Bug #1872376 +++

Provide a way to add a scsi fencing device to a cluster without requiring a restart of all cluster resources

---

We are still in the early stages of planning this, but pcs would need to add keys to a scsi device (or call the resource agent to add the new keys) and then call pacemaker with a special command to update the stonith device without unfencing.

We would need to add a special pcs command to add a scsi device to the scsi fencing stonith agent.

Comment 1 Ken Gaillot 2020-08-25 21:15:23 UTC
See Bug 1872376 for how I'm thinking this might work.

pcs would unfence each node by directly executing the agent, and make a note of the current timestamp. Then it would call a new crm_resource option to generate a hash of the new resource parameters. Then it could make the parameter changes in the CIB, simultaneously updating the op-digest (and potentially op-secure-digest) attributes of the lrm_rsc_op for the fence device's start operation on the node where it's currently active, as well as the #node-unfenced, #digests-all, and #digests-secure node attributes for any node that has them, using the saved timestamp and generated hash.

It's ugly but I can't think of anything better.

Comment 6 Ken Gaillot 2021-01-12 18:37:53 UTC
FYI, the required Pacemaker support has been added upstream and should land in 8.4. Bug 1872376 has an example and the supported feature set.

Comment 11 Miroslav Lisik 2021-07-08 14:18:48 UTC
Created attachment 1799690
proposed fix + tests

Added command:
pcs stonith update-scsi-devices

Test:

# pcs stonith update-scsi-devices scsi-fencing set <device-path1> <device-path2>

Expected result: the SCSI devices are unfenced and updated, and no resources are restarted.

Comment 12 Miroslav Lisik 2021-07-09 07:21:28 UTC
Test:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.8-3.el8.x86_64

export disk1=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b
export disk2=/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe
export disk3=/dev/disk/by-id/scsi-360014052f8c6f3de01047c29b72040f4
export disk4=/dev/disk/by-id/scsi-360014058c228bdd68b1499c89b426063
[root@r8-node-01 ~]# for d in $disk{1..4}; do sg_persist -n -i -k -d $d ; done
  PR generation=0x23, there are NO registered reservation keys
  PR generation=0xf, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys
[root@r8-node-01 ~]# pcs stonith create scsi-fencing fence_scsi devices="$disk1" pcmk_host_check="static-list" pcmk_host_list="r8-node-01 r8-node-02" pcmk_reboot_action="off" meta provides="unfencing"
[root@r8-node-01 ~]# for d in $disk{1..4}; do sg_persist -n -i -k -d $d ; done
  PR generation=0x25, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0xf, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys
  PR generation=0x0, there are NO registered reservation keys
[root@r8-node-01 ~]# for i in $(seq -w 01 04); do pcs resource create d-$i ocf:pacemaker:Dummy; done
[root@r8-node-01 ~]# pcs stonith
  * scsi-fencing        (stonith:fence_scsi):    Started r8-node-01
[root@r8-node-01 ~]# pcs resource
  * d-01        (ocf::pacemaker:Dummy):  Started r8-node-02
  * d-02        (ocf::pacemaker:Dummy):  Started r8-node-01
  * d-03        (ocf::pacemaker:Dummy):  Started r8-node-02
  * d-04        (ocf::pacemaker:Dummy):  Started r8-node-01
[root@r8-node-01 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)


### Updating scsi devices (adding 3 new devices):

[root@r8-node-01 ~]# pcs stonith update-scsi-devices scsi-fencing set $disk1 $disk2 $disk3 $disk4

Result: The SCSI devices have been updated, no resources have been restarted, and the devices are unfenced: keys from each node are registered on the devices.

[root@r8-node-01 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b,/dev/disk/by-id/scsi-360014052f8c6f3de01047c29b72040f4,/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe,/dev/disk/by-id/scsi-360014058c228bdd68b1499c89b426063 pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)
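
To compare the configured device list before and after an update, the `devices` attribute can be pulled out of the `pcs stonith config` output shown above. This is a minimal sketch, assuming the `Attributes: devices=a,b,c ...` layout from this transcript and device paths without whitespace; `list_fence_devices` is a hypothetical helper name, not a pcs command.

```shell
# Hypothetical helper: print one device path per line from
# `pcs stonith config` output read on stdin.
# Assumes the "Attributes: devices=a,b,c ..." layout shown in this comment.
list_fence_devices() {
    sed -n 's/^[[:space:]]*Attributes:.*devices=\([^[:space:]]*\).*/\1/p' \
        | tr ',' '\n'
}
```

It could be used as `pcs stonith config | list_fence_devices`, which for the output above would list the four device paths on separate lines.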


[root@r8-node-01 ~]# tail -n 0 -f /var/log/messages
Jul  8 15:22:19 r8-node-01 pacemaker-controld[624029]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul  8 15:22:20 r8-node-01 pacemaker-schedulerd[624028]: notice: Calculated transition 8, saving inputs in /var/lib/pacemaker/pengine/pe-input-163.bz2
Jul  8 15:22:20 r8-node-01 pacemaker-controld[624029]: notice: Transition 8 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-163.bz2): Complete
Jul  8 15:22:20 r8-node-01 pacemaker-controld[624029]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jul  8 15:22:39 r8-node-01 kernel: sdd: sdd1
Jul  8 15:22:39 r8-node-01 kernel: sda: sda1
Jul  8 15:22:39 r8-node-01 kernel: sdd: sdd1
Jul  8 15:22:40 r8-node-01 pacemaker-controld[624029]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul  8 15:22:40 r8-node-01 pacemaker-fenced[624025]: notice: Added 'scsi-fencing' to device list (1 active device)
Jul  8 15:22:40 r8-node-01 pacemaker-schedulerd[624028]: notice: Calculated transition 9, saving inputs in /var/lib/pacemaker/pengine/pe-input-164.bz2
Jul  8 15:22:40 r8-node-01 pacemaker-controld[624029]: notice: Transition 9 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-164.bz2): Complete
Jul  8 15:22:40 r8-node-01 pacemaker-controld[624029]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

[root@r8-node-02 ~]# tail -n 0 -f /var/log/messages
Jul  8 15:22:39 r8-node-02 kernel: sdd: sdd1
Jul  8 15:22:39 r8-node-02 kernel: sd 6:0:0:0: Parameters changed
Jul  8 15:22:39 r8-node-02 kernel: sda: sda1
Jul  8 15:22:39 r8-node-02 kernel: sdd: sdd1
Jul  8 15:22:40 r8-node-02 pacemaker-fenced[577901]: notice: Added 'scsi-fencing' to device list (1 active device)
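
The "no resources have been restarted" claim can be cross-checked against the controller log lines above, where every transition reports `Complete=0`. The following is a rough sketch, assuming that a zero `Complete=` count means the transition executed no actions (my reading of these log lines, not something stated in this bug); `count_nonempty_transitions` is a hypothetical helper name.

```shell
# Hypothetical helper: read pacemaker-controld log lines on stdin and print
# how many "Transition N (Complete=X, ...)" summaries have a non-zero X.
# Assumption: Complete=0 means no actions were executed in that transition.
count_nonempty_transitions() {
    grep -E 'Transition [0-9]+ \(Complete=[0-9]+' \
        | grep -c -v -E 'Transition [0-9]+ \(Complete=0,' || true
}
```

A result of `0` for the log window around the update would match the transcript above; any other value would be a reason to look at the log more closely.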

[root@r8-node-01 ~]# for d in $disk{1..4}; do sg_persist -n -i -k -d $d ; done
  PR generation=0x26, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x11, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x2, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x2, 2 registered reservation keys follow:
    0x14080000
    0x14080001
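
Checking four key listings by eye is error-prone, so the count of registered keys per device can be extracted instead. This is a minimal sketch, assuming the `sg_persist -n -i -k` output format shown above (one indented `0x...` key per line); `count_pr_keys` is a hypothetical helper name, not part of sg3_utils or pcs.

```shell
# Hypothetical helper: count registered reservation keys in
# `sg_persist -n -i -k` output read on stdin.
# Assumes each key is printed alone on an indented line as 0x<hex>.
count_pr_keys() {
    grep -c -E '^[[:space:]]+0x[0-9a-fA-F]+[[:space:]]*$' || true
}
```

For example, `sg_persist -n -i -k -d "$disk2" | count_pr_keys` would be expected to print one key per unfenced node, which is 2 in the two-node output above.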

### Updating scsi devices (removing 2 devices):


[root@r8-node-01 ~]# pcs stonith update-scsi-devices scsi-fencing set $disk3 $disk4

Result: The SCSI devices have been updated, no resources have been restarted, and the devices are unfenced: keys from each node are registered on the devices.
NOTE: The command does not undo unfencing on the removed devices (their keys stay registered), which is consistent with previous cluster behavior.

[root@r8-node-01 ~]# pcs stonith config
 Resource: scsi-fencing (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052f8c6f3de01047c29b72040f4,/dev/disk/by-id/scsi-360014058c228bdd68b1499c89b426063 pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (scsi-fencing-monitor-interval-60s)
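
As the two invocations in this comment show, `set` takes the complete desired device list, so it can be useful to see which devices a planned call would add or drop relative to the current list. This is a minimal sketch, assuming comma-separated lists of paths without whitespace; `device_changes` is a hypothetical helper name, not a pcs feature.

```shell
# Hypothetical helper: compare two comma-separated device lists and print
# "+ <dev>" for devices only in NEW and "- <dev>" for devices only in OLD.
# Usage: device_changes OLD_LIST NEW_LIST
device_changes() (
    set -f                      # no globbing while splitting the lists
    old=$1
    new=$2
    for d in $(printf '%s' "$new" | tr ',' ' '); do
        case ",$old," in
            *",$d,"*) ;;                          # already configured
            *) printf '%s %s\n' '+' "$d" ;;       # would be added
        esac
    done
    for d in $(printf '%s' "$old" | tr ',' ' '); do
        case ",$new," in
            *",$d,"*) ;;                          # stays configured
            *) printf '%s %s\n' '-' "$d" ;;       # would be removed
        esac
    done
)
```

For the removal step above, calling it with the four-device list as OLD and `$disk3,$disk4` as NEW would print two `-` lines, one each for `$disk1` and `$disk2`.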

[root@r8-node-01 ~]# tail -n 0 -f /var/log/messages
Jul  8 15:34:48 r8-node-01 pacemaker-controld[624029]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul  8 15:34:48 r8-node-01 pacemaker-fenced[624025]: notice: Added 'scsi-fencing' to device list (1 active device)
Jul  8 15:34:48 r8-node-01 pacemaker-schedulerd[624028]: notice: Calculated transition 10, saving inputs in /var/lib/pacemaker/pengine/pe-input-165.bz2
Jul  8 15:34:48 r8-node-01 pacemaker-controld[624029]: notice: Transition 10 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-165.bz2): Complete
Jul  8 15:34:48 r8-node-01 pacemaker-controld[624029]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

[root@r8-node-02 ~]# tail -n 0 -f /var/log/messages
Jul  8 15:34:48 r8-node-02 pacemaker-fenced[577901]: notice: Added 'scsi-fencing' to device list (1 active device)


[root@r8-node-01 ~]# for d in $disk{1..4}; do sg_persist -n -i -k -d $d ; done
  PR generation=0x26, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x11, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x3, 2 registered reservation keys follow:
    0x14080000
    0x14080001
  PR generation=0x3, 2 registered reservation keys follow:
    0x14080000
    0x14080001

### Fence node r8-node-02:

[root@r8-node-01 ~]# pcs stonith fence r8-node-02
Node: r8-node-02 fenced
[root@r8-node-01 ~]# for d in $disk{3,4}; do sg_persist -n -i -k -d $d ; done
  PR generation=0x4, 1 registered reservation key follows:
    0x14080000
  PR generation=0x4, 1 registered reservation key follows:
    0x14080000

Comment 17 Miroslav Lisik 2021-07-21 11:20:26 UTC
Created attachment 1804075
additional fixes + tests

Fixes for 3 special cases.

Modified command:
* pcs stonith update-scsi-devices


Test:

Environment:
A 3-node cluster with configured fence_scsi fencing and resources running on each node.

A)
1. Stop the cluster on one of the cluster nodes.
2. Run 'pcs stonith update-scsi-devices' to update the SCSI devices.
3. The command succeeds.

B)
1. Shut down a cluster node or stop pcsd on it.
2. Run 'pcs stonith update-scsi-devices' to update the SCSI devices, first without and then with the --skip-offline option.
3. The command succeeds only when --skip-offline is used.

C)
1. On one cluster node, fail a shared device that is going to be added.
2. Run 'pcs stonith update-scsi-devices' to update the SCSI devices.
3. The command fails and the devices are not updated.

Comment 18 Miroslav Lisik 2021-07-21 11:29:38 UTC
DevTestResults:

[root@r8-node-01 pcs]# rpm -q pcs
pcs-0.10.8-4.el8.x86_64

A)

[root@r8-node-01 pcs]# pcs status pcsd
  r8-node-03: Online
  r8-node-02: Online
  r8-node-01: Online
[root@r8-node-01 pcs]# pcs status nodes corosync
Corosync Nodes:
 Online: r8-node-01 r8-node-02
 Offline: r8-node-03
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)


[root@r8-node-01 ~]# pcs stonith update-scsi-devices fence-scsi set $disk1 $disk2
[root@r8-node-01 ~]# echo $?
0
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b,/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

B)

[root@r8-node-01 pcs]# pcs status pcsd
  r8-node-03: Offline
  r8-node-02: Online
  r8-node-01: Online
[root@r8-node-01 pcs]# pcs status nodes corosync
Corosync Nodes:
 Online: r8-node-01 r8-node-02
 Offline: r8-node-03
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

[root@r8-node-01 ~]# pcs stonith update-scsi-devices fence-scsi set $disk1 $disk2
Error: Unable to connect to r8-node-03 (Failed to connect to r8-node-03 port 2224: Connection refused), use --skip-offline to override
Error: Errors have occurred, therefore pcs is unable to continue
[root@r8-node-01 ~]# echo $?
1
[root@r8-node-01 ~]# pcs stonith update-scsi-devices fence-scsi set $disk1 $disk2 --skip-offline
Warning: Unable to connect to r8-node-03 (Failed to connect to r8-node-03 port 2224: Connection refused)
[root@r8-node-01 ~]# echo $?
0
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b,/dev/disk/by-id/scsi-3600140547721f8ee2774aa8bac6d8ebe pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)

C)

[root@r8-node-01 pcs]# pcs status pcsd
  r8-node-03: Online
  r8-node-02: Online
  r8-node-01: Online
[root@r8-node-01 pcs]# pcs status nodes corosync
Corosync Nodes:
 Online: r8-node-01 r8-node-02 r8-node-03
 Offline:
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)


### Failing $disk2 on r8-node-03:
[root@r8-node-03 ~]# echo offline > /sys/block/sda/device/state


[root@r8-node-01 ~]# pcs stonith update-scsi-devices fence-scsi set $disk1 $disk2
Error: r8-node-03: Unfencing failed:
2021-07-20 19:40:02,530 ERROR: Cannot get registration keys

2021-07-20 19:40:02,531 ERROR: Please use '-h' for usage
Error: Errors have occurred, therefore pcs is unable to continue
[root@r8-node-01 ~]# echo $?
1
[root@r8-node-01 ~]# pcs stonith config
 Resource: fence-scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/disk/by-id/scsi-360014052bc36324cf7d4a709a959340b pcmk_host_check=static-list pcmk_host_list="r8-node-01 r8-node-02 r8-node-03" pcmk_reboot_action=off
  Meta Attrs: provides=unfencing
  Operations: monitor interval=60s (fence-scsi-monitor-interval-60s)
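
Cases B and C fail with different messages, and a wrapper script may want to tell them apart before deciding what to do next. This is a minimal sketch, assuming the exact error wording shown in the transcripts above (which may differ in other pcs versions); `classify_update_error` and its labels are hypothetical names, not pcs output.

```shell
# Hypothetical helper: classify the error text of a failed
# `pcs stonith update-scsi-devices` run, read on stdin.
# The matched phrases are copied from the transcripts in this comment.
classify_update_error() {
    msg=$(cat)
    case $msg in
        *"use --skip-offline to override"*) echo "offline-node" ;;
        *"Unfencing failed"*)               echo "unfencing-failed" ;;
        *)                                  echo "other" ;;
    esac
}
```

An `offline-node` result corresponds to case B, where rerunning with --skip-offline succeeded; `unfencing-failed` corresponds to case C, where the configuration was left unchanged and the device problem has to be fixed first.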

Comment 24 errata-xmlrpc 2021-11-09 17:33:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: pcs security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4142