Bug 1590273

Summary: Pacemaker fence_scsi device parameter updation (addition/deletion) - restarts all resource
Product: Red Hat Enterprise Linux 7
Reporter: Sandeep Yadav <sanyadav>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.5
CC: abeekhof, cluster-maint, jruemker, rdave, sanyadav, sbradley
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-27 17:23:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
sosreport none

Description Sandeep Yadav 2018-06-12 10:48:00 UTC
Description of problem: Updating a fence_scsi device parameter causes a cluster-wide restart of resources. The customer expects a more dynamic fence_scsi device update that does not cause other resource groups to restart. The customer reports that he often has to add or delete resource groups consisting of LVM/FS/IP/App resources and must update the fence_scsi device parameter accordingly, which results in a cluster-wide restart of resources. The customer does not want the addition or deletion of a resource group to disrupt the services of other resource groups.

- /var/log/cluster/corosync.log shows the change in fence_scsi parameters causing resources to restart:

May 22 05:35:09 [3511] node2    pengine:     info: rsc_action_digest_cmp:      Parameters to ig-scsi-fnc_start_0 on node1 changed: was 5c6b7753cebc986370b0b74dd3e45abb vs. now 6f915787eee272807dae881764428afd (reload:3.0.14) 0:0;4:3:0:3de2b23c-09d6-4f7c-8709-5ceae047e66f



May 22 05:35:09 [3511] node2    pengine:   notice: LogNodeActions:      * Fence (on) node1 'Device parameters changed (reload)'
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    ig-scsi-fnc     (                   node1 )   due to resource definition change
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    db1_igt_lvm     (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    bck_igt_lvm     (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    db2_igt_lvm     (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    db1_igt_fs      (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    online_igt_fs   (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    exp_igt_fs      (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    db2_igt_fs      (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    ora_igt_vip     (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    ora_igt_ap      (                   node1 )   due to required stonith
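
When a log like the one above is long, it can help to group the restarted resources by the reason the policy engine reports. The following is a minimal sketch, not a supported tool: the regular expression and the function name summarize_actions are assumptions based only on the line layout visible in this excerpt, and other Pacemaker versions may format these lines differently.

```python
import re

# Matches pengine "LogAction" lines in the layout seen in this report, e.g.
#   ... LogAction:   * Restart    db1_igt_lvm     (   node1 )   due to required stonith
# The pattern is an assumption derived from this excerpt only.
ACTION_RE = re.compile(
    r"LogAction:\s+\*\s+(?P<action>\w+)\s+(?P<resource>\S+)"
    r"\s+\(\s*(?P<node>\S+)\s*\)\s+due to (?P<reason>.+)$"
)


def summarize_actions(log_text):
    """Group (resource, node) pairs by the reason given in the log line."""
    by_reason = {}
    for line in log_text.splitlines():
        match = ACTION_RE.search(line)
        if match:
            reason = match["reason"].strip()
            by_reason.setdefault(reason, []).append(
                (match["resource"], match["node"])
            )
    return by_reason


SAMPLE = """\
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    ig-scsi-fnc     (                   node1 )   due to resource definition change
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    db1_igt_lvm     (                   node1 )   due to required stonith
May 22 05:35:09 [3511] node2    pengine:   notice: LogAction:   * Restart    ora_igt_vip     (                   node1 )   due to required stonith
"""

if __name__ == "__main__":
    for reason, items in summarize_actions(SAMPLE).items():
        print(reason, items)
```

In this excerpt only the fence device itself restarts because of a definition change; every other resource restarts because of the required stonith (unfencing) action.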


The customer is asking whether an independent fence_scsi device can be created for every new resource group in order to avoid a cluster-wide restart of resources.


Version-Release number of selected component (if applicable):
RHEL 7.5
pacemaker-1.1.18-11.el7.x86_64
fence-agents-scsi-4.0.11-86.el7.x86_64  

How reproducible: Updating a Pacemaker fence_scsi device parameter (addition/deletion) restarts all resources


Steps to Reproduce:
Update the value (add or delete a device) of the devices attribute.
# pcs stonith update scsi-pcmk devices="/dev/mapper/mpathb,/dev/mapper/mpathc"

Actual results: All resources restart


Expected results: A cluster-wide resource restart should not happen after updating the fence_scsi configuration.


Additional info:

Below is the closest bug that was fixed in 7.5:
https://bugzilla.redhat.com/show_bug.cgi?id=1427648

Comment 2 Ken Gaillot 2018-06-12 16:18:34 UTC
This does sound like Bug 1427648. Can you attach sosreports?

Comment 3 Sandeep Yadav 2018-06-12 16:36:55 UTC
Created attachment 1450570 [details]
sosreport

The sosreport is attached to this Bugzilla.

Comment 4 Ken Gaillot 2018-06-12 21:40:40 UTC
This is expected and unfortunately necessary behavior.

When a fence_scsi device is modified, the unfence action must be repeated, since new configuration values will be passed to it. Because all other resources are implicitly ordered after unfencing, they must first be stopped, then unfencing can proceed, then the resources can be started.

Comment 5 Sandeep Yadav 2018-06-14 10:15:01 UTC
Hello Ken,

Thank you for your reply. I have shared it with the customer. However, the customer who requested this RFE is looking for more detailed information: he wants to know how the fence_scsi functionality works in the background and what technical limitation Red Hat faces in implementing this change. Can you please elaborate on this?


Thank you in advance.

Regards,
Sandeep

Comment 6 Ken Gaillot 2018-06-14 21:09:51 UTC
fence_scsi works by having each node register a unique key with each SCSI device. It then uses a SCSI-3 persistent reservation to ensure that only registrants can write to the device. Fencing a node is simply removing its key, which makes it no longer able to write.
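
To make the key model concrete, here is a toy sketch in Python. It does not talk to any real SCSI device and is not how fence_scsi is implemented; the class SharedDisk and the key values are hypothetical and exist only to show that writing is allowed for registrants and that fencing amounts to removing a key.

```python
class SharedDisk:
    """Toy stand-in for a shared device that only lets registrants write."""

    def __init__(self, name):
        self.name = name
        self.registered_keys = set()

    def register(self, key):
        # Corresponds to "unfencing": the node registers its key.
        self.registered_keys.add(key)

    def remove(self, key):
        # Corresponds to "fencing": the node's key is removed.
        self.registered_keys.discard(key)

    def can_write(self, key):
        # Only current registrants may write to the device.
        return key in self.registered_keys


if __name__ == "__main__":
    disk = SharedDisk("/dev/mapper/mpathb")
    # Hypothetical per-node keys; real keys are generated by the fence agent.
    node_keys = {"node1": "key-node1", "node2": "key-node2"}
    for key in node_keys.values():
        disk.register(key)

    disk.remove(node_keys["node1"])  # fence node1
    print("node1 can write:", disk.can_write(node_keys["node1"]))
    print("node2 can write:", disk.can_write(node_keys["node2"]))
```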

To make that model work, fence_scsi has to support "unfencing", which in its case is the key registration. Whenever a node first joins the cluster, it must be unfenced (i.e. be able to write to the devices) before it can run resources.

When a fence_scsi device's configuration changes, unfencing must be reapplied to all nodes. Keep in mind that pacemaker doesn't know the details of how a fence device works -- the fence agent is the abstract interface to the actual device. Pacemaker only knows that the previous unfencing may now be insufficient, and must be redone with the new parameters. Due to the usual ordering of actions, all resources must be stopped before unfencing can be done, and then they can be restarted after unfencing succeeds.

In fence_scsi's case, when you add a new device, it's easy to see that every node must register a key with that device. So, unfencing must be done. There may be resources that do not depend on that particular device and thus do not really need to be restarted, but that knowledge is not available to the cluster.
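
The ordering described above can be sketched as a small planning function. This is an illustration under stated assumptions, not Pacemaker's scheduler: the function names params_digest and plan_transition are made up, SHA-256 over a sorted key=value string stands in for whatever digest Pacemaker actually computes, and the plan treats every resource alike because, as noted, the cluster does not know which resources depend on which device.

```python
import hashlib


def params_digest(params):
    """Stable hash of a parameter dict (a stand-in for Pacemaker's digest)."""
    canonical = ";".join(f"{k}={v}" for k, v in sorted(params.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


def plan_transition(recorded_digest, new_params, nodes, resources):
    """Return ordered actions after a fence device's parameters are updated."""
    if params_digest(new_params) == recorded_digest:
        return []  # parameters unchanged, so the earlier unfencing still holds
    actions = [("stop", rsc) for rsc in reversed(resources)]  # stop everything
    actions += [("unfence", node) for node in nodes]          # redo unfencing
    actions += [("start", rsc) for rsc in resources]          # start again
    return actions


if __name__ == "__main__":
    old = {"devices": "/dev/mapper/mpathb"}
    new = {"devices": "/dev/mapper/mpathb,/dev/mapper/mpathc"}
    plan = plan_transition(
        params_digest(old), new,
        nodes=["node1", "node2"],
        resources=["db1_igt_lvm", "db1_igt_fs"],
    )
    for action in plan:
        print(action)
```

Because the sketch has no notion of which disk a resource uses, adding one device produces a stop and start for every resource, which mirrors the behavior reported in this bug.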

Comment 7 John Ruemker 2018-11-27 17:23:13 UTC
As Ken noted - this is behaving as expected and is required functionality.  

If the customer cannot work within the limitations this feature introduces, or if they need additional information about how the product operates, then please work with the experts in the HA support team to address their needs. 

I am closing this.

Comment 8 Ken Gaillot 2020-09-29 15:55:22 UTC
For the record: We hope to have a workaround for this for RHEL 8. Pacemaker Bug 1872376 would allow pcs to compute the hashes Pacemaker would use to determine whether resource configuration changed, and pcs Bug 1872378 would provide an interface specifically for adding a disk to fence_scsi devices that would modify the hashes at the same time as the configuration.
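
The idea of modifying the hashes together with the configuration can be illustrated with a toy model. Everything here is hypothetical: the class FenceDeviceRecord, its methods, and the SHA-256 digest are assumptions for illustration and do not reflect the actual Pacemaker or pcs interfaces. The sketch also says nothing about how keys get registered on the newly added disk.

```python
import hashlib


def digest(params):
    """Stable hash of a parameter dict (a stand-in for the real digest)."""
    canonical = ";".join(f"{k}={v}" for k, v in sorted(params.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


class FenceDeviceRecord:
    """Hypothetical pairing of device parameters with their recorded digest."""

    def __init__(self, params):
        self.params = dict(params)
        self.recorded_digest = digest(self.params)

    def looks_changed(self):
        # The cluster redoes unfencing when the current parameters no longer
        # match the digest that was recorded earlier.
        return digest(self.params) != self.recorded_digest

    def plain_update(self, **changes):
        # Configuration changes, recorded digest does not: a change is seen.
        self.params.update(changes)

    def update_with_digest(self, **changes):
        # Configuration and recorded digest change together: no change is seen.
        self.params.update(changes)
        self.recorded_digest = digest(self.params)


if __name__ == "__main__":
    dev = FenceDeviceRecord({"devices": "/dev/mapper/mpathb"})
    dev.update_with_digest(devices="/dev/mapper/mpathb,/dev/mapper/mpathc")
    print("change detected:", dev.looks_changed())
```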