Bug 1872376

Summary:           Add command line option to calculate Pacemaker resource operation digest
Product:           Red Hat Enterprise Linux 8
Component:         pacemaker
Version:           8.3
Hardware:          All
OS:                All
Status:            CLOSED ERRATA
Severity:          medium
Priority:          high
Reporter:          Chris Feist <cfeist>
Assignee:          Ken Gaillot <kgaillot>
QA Contact:        cluster-qe <cluster-qe>
CC:                cluster-maint, msmazova, sbradley
Keywords:          FutureFeature, Triaged
Target Milestone:  rc
Target Release:    8.4
Flags:             pm-rhel: mirror+
Fixed In Version:  pacemaker-2.0.5-5.el8
Doc Type:          No Doc Update
Doc Text:          This new feature exists primarily for pcs's use; end users will not find it useful.
Type:              Feature Request
Clones:            1872378 (view as bug list)
Bug Blocks:        1872378, 1894575, 2023845, 2024522
Last Closed:       2021-05-18 15:26:40 UTC
Description
Chris Feist
2020-08-25 15:38:34 UTC
Pacemaker restarts resources when unfencing device configuration changes, because the resources have a dependency (both in code and in reality) on unfencing with the current parameters.

Pacemaker detects configuration changes by saving a hash of the resource parameters in the operation history recorded for any action. When checking current conditions, it compares that recorded hash with a re-calculated hash of the current parameters.

Unfortunately, neither Pacemaker nor the fence agent has enough information to know when a parameter change is "safe" to perform without restarting resources. Such safety depends on the particular capabilities of a given fence device. Neither Pacemaker nor the agent knows what the previous parameter values were (though Pacemaker knows the hash). Therefore, there cannot be a general, automated solution to the problem.

However, I can think of a higher-level workaround. Pacemaker could provide a new command-line option, crm_resource -r <rsc> --digest <op>, that would show what the operation digests would be for the given resource and operation if run at that moment.

pcs could use the existing "stonith_admin --unfence" command to execute the unfencing directly, without going through the usual cluster management. It could then make two CIB changes simultaneously: the desired change in the fence device configuration, plus updates to the operation hashes and node attributes that Pacemaker uses to detect changes, so that Pacemaker does not see or react to the change (other than rescheduling any recurring monitor on the device, which is still desirable).

This would be a dangerous ability, since it would disable Pacemaker's response to changes, leaving it entirely up to the caller to ensure the change is safe. It would also be somewhat brittle, since it would involve changes to Pacemaker's status section, which does not have a guaranteed schema. In short, it's a bad idea, but I don't see a better way.
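
As background for the hash comparison described above, here is a toy sketch of the idea, assuming hypothetical parameter names; it is not Pacemaker's actual implementation (the real op-digest is computed over an XML representation of the parameters), just an illustration of why any parameter change produces a different recorded hash:

    # Illustrative only: a toy version of the "hash of resource parameters" idea.
    import hashlib

    def param_digest(params):
        # Sort parameters so the digest does not depend on ordering, then hash
        # a stable "name=value" serialization of them.
        canonical = ";".join("%s=%s" % (k, v) for k, v in sorted(params.items()))
        return hashlib.md5(canonical.encode()).hexdigest()

    # Hypothetical fence-device parameters before and after an edit.
    old = {"delay": "20", "pcmk_host_list": "node1"}
    new = {"delay": "10", "pcmk_host_list": "node1"}

    # A recorded digest that no longer matches the re-calculated one is what
    # triggers the restart (or re-unfencing) described above.
    print(param_digest(old) == param_digest(new))   # False -> configuration changed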
> pcs could use the existing "stonith_admin --unfence" command to execute the unfencing directly

Correction: that would use the originally configured parameters. Instead, pcs would most likely have to run the agent directly, with the new parameters. We might be able to come up with some "direct execute" option in stonith_admin to abstract that, but I'm not keen on that idea. Also, fence agents are executed by the cluster as root, but stonith_admin runs as whatever user runs it, which could complicate some situations.
> crm_resource -r <rsc> --digest <op>
This command would take either -r <rsc> (to use the existing parameters) or --class/--agent/--option (same as --validate currently) to hash arbitrary parameters.
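
Since (per the Doc Text above) this option exists primarily for pcs's use, a minimal sketch of how a higher-level tool might invoke and parse it is shown below. The option names and XML layout follow the examples later in this bug; the resource, node, and parameter names are placeholders:

    import subprocess
    import xml.etree.ElementTree as ET

    def get_digests(rsc, node, overrides=None):
        # Build the command shown in the examples below; NAME=VALUE arguments
        # override the configured parameters for the calculation only.
        cmd = ["crm_resource", "--digests", "--output-as=xml",
               "--resource", rsc, "--node", node]
        cmd += ["%s=%s" % (k, v) for k, v in (overrides or {}).items()]
        out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
        # Map each digest type ("all", "nonprivate", "nonreloadable") to its hash.
        return {d.get("type"): d.get("hash")
                for d in ET.fromstring(out).iter("digest")}

    # Hypothetical resource/node/parameter names:
    print(get_digests("rsc1", "node1", {"param1": "value1"}))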
Implemented upstream as of commit 4e726eb.

Example of how to change a resource parameter without causing a restart:

1. The following configuration items are needed (examples):
   - resource ID (rsc1)
   - parameter name (param1)
   - desired parameter value (value1)
   - resource's monitor interval (10s)
   - resource's monitor timeout, if specified (20s)

2. Determine where rsc1 is running with "crm_resource --locate -r rsc1" (example: node1).

3. Show what the new digests would be for a one-time operation:

       crm_resource --digests --output-as=xml -r rsc1 -N node1 param1=value1

   Example output:

       <pacemaker-result api-version="2.3" request="crm_resource --digests --output-as=xml -r rsc1 -N node1 param1=value1">
         <digests resource="rsc1" node="node1" task="monitor" interval="0ms">
           <digest type="all" hash="f2317cad3d54cec5d7d7aa7d0bf35cf8">
             <parameters/>
           </digest>
           <digest type="nonprivate" hash="f2317cad3d54cec5d7d7aa7d0bf35cf8">
             <parameters/>
           </digest>
           <digest type="nonreloadable" hash="f2317cad3d54cec5d7d7aa7d0bf35cf8">
             <parameters/>
           </digest>
         </digests>
         <status code="0" message="OK"/>
       </pacemaker-result>

4. Repeat for the recurring monitor:

       crm_resource --digests --output-as=xml -r rsc1 -N node1 param1=value1 CRM_meta_interval=10000 CRM_meta_timeout=20000

   Output will be similar.

5. Update the CIB:

   5a. Dump the entire CIB to a file.
   5b. Edit the resource configuration to have the new parameter value.
   5c. Find the resource history on the appropriate node (the section starting <lrm_resource id="rsc1"> inside <node_state id="node1">), and look at each <lrm_rsc_op> entry in it. If "operation" is "monitor", use the digests obtained using the monitor parameters; otherwise use the digests obtained from the first digest command. Replace any "op-digest" with the type="all" digest, any "op-secure-digest" with the type="nonprivate" digest, and any "op-restart-digest" with the type="nonreloadable" digest.
   5d. Load the file back into the cluster.

The risks of doing that are that it is easy to make a mistake and thus cause restarts to happen anyway, and that making such a change without separately ensuring the new values are effective for the service means the service will remain running with the old values. Also, the status section syntax is not guaranteed to stay the same across Pacemaker releases.

Support for this feature can be determined by checking that Pacemaker's CRM feature set is >= 3.6.4.
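
A rough sketch of the manual edit in step 5c follows, using the digest mapping described above. The file names, node/resource IDs, and hash values are hypothetical placeholders; this is not a supported interface and carries the same caveats about the status section schema:

    import xml.etree.ElementTree as ET

    # Which crm_resource digest type replaces which recorded attribute (per 5c).
    ATTR_BY_TYPE = {"all": "op-digest",
                    "nonprivate": "op-secure-digest",
                    "nonreloadable": "op-restart-digest"}

    def patch_history(cib_in, cib_out, node, rsc, digests):
        # digests = {"monitor": {...}, "other": {...}}, taken from the two
        # crm_resource --digests calls in steps 3 and 4.
        tree = ET.parse(cib_in)
        state = tree.find(".//node_state[@id='%s']" % node)
        history = state.find(".//lrm_resource[@id='%s']" % rsc)
        for op in history.findall("lrm_rsc_op"):
            # Recurring monitors get the monitor digests; everything else gets
            # the one-time-operation digests.
            chosen = digests["monitor" if op.get("operation") == "monitor" else "other"]
            for dtype, attr in ATTR_BY_TYPE.items():
                if op.get(attr) is not None and dtype in chosen:
                    op.set(attr, chosen[dtype])
        tree.write(cib_out)

    # Hypothetical file names, IDs, and hashes:
    patch_history("cib-original.xml", "cib-new.xml", "node1", "rsc1",
                  {"other":   {"all": "f2317cad3d54cec5d7d7aa7d0bf35cf8"},
                   "monitor": {"all": "aa308fdc91fcd3c64737036870fd86f4"}})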
> [root@virt-148 ~]# rpm -q pacemaker
> pacemaker-2.0.5-5.el8.x86_64

Check that version of Pacemaker's CRM feature set >= 3.6.4:

> [root@virt-148 ~]# pacemakerd --features
> Pacemaker 2.0.5-5.el8 (Build: ba59be7122)
>  Supporting v3.7.0: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls cibsecrets

Check man/help:

> [root@virt-148 ~]# man crm_resource | grep -A 4 digests
>        --digests
>               (Advanced) Show parameter hashes that Pacemaker uses to detect configuration changes
>               (only accurate if there is resource history on the specified node). Required:
>               --resource, --node. Optional: any NAME=VALUE parameters will be used to override
>               the configuration (to see what the hash would be with those changes).
> [root@virt-148 ~]# crm_resource --help-all | grep -A 5 digests
>   --digests                     (Advanced) Show parameter hashes that Pacemaker uses to detect
>                                 configuration changes (only accurate if there is resource
>                                 history on the specified node). Required: --resource, --node.
>                                 Optional: any NAME=VALUE parameters will be used to override
>                                 the configuration (to see what the hash would be with those
>                                 changes).

> [root@virt-148 ~]# pcs status
> Cluster name: STSRHTS31149
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-149 (version 2.0.5-5.el8-ba59be7122) - partition with quorum
>   * Last updated: Wed Feb 3 16:21:48 2021
>   * Last change: Wed Feb 3 16:18:50 2021 by root via cibadmin on virt-148
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-148 virt-149 ]
> Full List of Resources:
>   * fence-virt-148 (stonith:fence_xvm): Started virt-148
>   * fence-virt-149 (stonith:fence_xvm): Started virt-149
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Check resource configuration for a resource parameter that will be updated. In this reproducer, the attribute "delay" will be updated from 20 to 10.

> [root@virt-148 ~]# pcs stonith config fence-virt-148
>  Resource: fence-virt-148 (class=stonith type=fence_xvm)
>   Attributes: delay=20 pcmk_host_check=static-list pcmk_host_list=virt-148 pcmk_host_map=virt-148:virt-148.cluster-qe.lab.eng.brq.redhat.com
>   Operations: monitor interval=60s (fence-virt-148-monitor-interval-60s)

List all fence-virt-148 operations:

> [root@virt-148 ~]# crm_resource --list-all-operations --resource fence-virt-148
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_monitor_0 (node=virt-149, call=5, rc=7, last-rc-change=Mon Feb 1 11:21:55 2021, exec=24ms): complete
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_start_0 (node=virt-148, call=1363, rc=0, last-rc-change=Tue Feb 2 15:54:51 2021, exec=363ms): complete
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_monitor_60000 (node=virt-148, call=1365, rc=0, last-rc-change=Tue Feb 2 15:54:51 2021, exec=401ms): complete

Determine where the resource is running:

> [root@virt-148 ~]# crm_resource --locate --resource fence-virt-148
> resource fence-virt-148 is running on: virt-148

Show what the new digests would be for a one-time operation:

> [root@virt-148 ~]# crm_resource --digests --output-as=xml --resource fence-virt-148 --node virt-148 delay=10
> <pacemaker-result api-version="2.3" request="crm_resource --digests --output-as=xml --resource fence-virt-148 --node virt-148 delay=10">
>   <digests resource="fence-virt-148" node="virt-148" task="start" interval="0ms">
>     <digest type="all" hash="80ac5753667128b490ba6fb7b4001d67">
>       <parameters delay="10" pcmk_host_check="static-list" pcmk_host_map="virt-148:virt-148.cluster-qe.lab.eng.brq.redhat.com" pcmk_host_list="virt-148"/>
>     </digest>
>     <digest type="nonprivate" hash="bfd73b9a2527a3b3944d09490855b2f2">
>       <parameters delay="10"/>
>     </digest>
>   </digests>
>   <status code="0" message="OK"/>
> </pacemaker-result>

Show what the new digests would be for the recurring monitor:

> [root@virt-148 ~]# crm_resource --digests --output-as=xml --resource fence-virt-148 --node virt-148 delay=10 CRM_meta_interval=10000 CRM_meta_timeout=20000
> <pacemaker-result api-version="2.3" request="crm_resource --digests --output-as=xml --resource fence-virt-148 --node virt-148 delay=10 CRM_meta_interval=10000 CRM_meta_timeout=20000">
>   <digests resource="fence-virt-148" node="virt-148" task="start" interval="10000ms">
>     <digest type="all" hash="aa308fdc91fcd3c64737036870fd86f4">
>       <parameters delay="10" pcmk_host_check="static-list" pcmk_host_map="virt-148:virt-148.cluster-qe.lab.eng.brq.redhat.com" pcmk_host_list="virt-148" CRM_meta_timeout="20000"/>
>     </digest>
>     <digest type="nonprivate" hash="bfd73b9a2527a3b3944d09490855b2f2">
>       <parameters delay="10"/>
>     </digest>
>   </digests>
>   <status code="0" message="OK"/>
> </pacemaker-result>

Dump the entire CIB to a file:

> [root@virt-148 ~]# pcs cluster cib > cib-original.xml
> [root@virt-148 ~]# cp cib-original.xml cib-new.xml

Edit the resource configuration as described in Comment 5 (section 5b, 5c):

> [root@virt-148 ~]# vim cib-new.xml
> [root@virt-148 ~]# diff cib-original.xml cib-new.xml
> 20c20
> <         <nvpair id="fence-virt-148-instance_attributes-delay" name="delay" value="20"/>
> ---
> >         <nvpair id="fence-virt-148-instance_attributes-delay" name="delay" value="10"/>
> 71,72c71,72
> <           <lrm_rsc_op id="fence-virt-148_last_0" operation_key="fence-virt-148_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.0" transition-key="4:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" transition-magic="0:0;4:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" exit-reason="" on_node="virt-148" call-id="1363" rc-code="0" op-status="0" interval="0" last-rc-change="1612277691" last-run="1612277691" exec-time="363" queue-time="0" op-digest="bd02e4f8cfe532fb1c9d5807b72b193b"/>
> <           <lrm_rsc_op id="fence-virt-148_monitor_60000" operation_key="fence-virt-148_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.0" transition-key="2:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" transition-magic="0:0;2:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" exit-reason="" on_node="virt-148" call-id="1365" rc-code="0" op-status="0" interval="60000" last-rc-change="1612277691" exec-time="401" queue-time="0" op-digest="811d822164fd020e36ef256380f696d8"/>
> ---
> >           <lrm_rsc_op id="fence-virt-148_last_0" operation_key="fence-virt-148_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.0" transition-key="4:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" transition-magic="0:0;4:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" exit-reason="" on_node="virt-148" call-id="1363" rc-code="0" op-status="0" interval="0" last-rc-change="1612277691" last-run="1612277691" exec-time="363" queue-time="0" op-digest="80ac5753667128b490ba6fb7b4001d67"/>
> >           <lrm_rsc_op id="fence-virt-148_monitor_60000" operation_key="fence-virt-148_monitor_60000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.0" transition-key="2:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" transition-magic="0:0;2:751:0:d9791968-ac53-4d94-9d23-b3ede3e40c27" exit-reason="" on_node="virt-148" call-id="1365" rc-code="0" op-status="0" interval="60000" last-rc-change="1612277691" exec-time="401" queue-time="0" op-digest="aa308fdc91fcd3c64737036870fd86f4"/>

Load the file back into the cluster:

> [root@virt-148 ~]# pcs cluster cib-push cib-new.xml diff-against=cib-original.xml
> CIB updated

Check that the resource did not restart:

> [root@virt-148 ~]# crm_resource --list-all-operations --resource fence-virt-148
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_monitor_0 (node=virt-149, call=5, rc=7, last-rc-change=Mon Feb 1 11:21:55 2021, exec=24ms): complete
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_start_0 (node=virt-148, call=1363, rc=0, last-rc-change=Tue Feb 2 15:54:51 2021, exec=363ms): complete
> fence-virt-148 (stonith:fence_xvm): Started: fence-virt-148_monitor_60000 (node=virt-148, call=1365, rc=0, last-rc-change=Tue Feb 2 15:54:51 2021, exec=401ms): complete
> [root@virt-148 ~]# pcs status
> Cluster name: STSRHTS31149
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-149 (version 2.0.5-5.el8-ba59be7122) - partition with quorum
>   * Last updated: Wed Feb 3 16:29:26 2021
>   * Last change: Wed Feb 3 16:28:19 2021 by root via cibadmin on virt-148
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-148 virt-149 ]
> Full List of Resources:
>   * fence-virt-148 (stonith:fence_xvm): Started virt-148
>   * fence-virt-149 (stonith:fence_xvm): Started virt-149
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Check that the resource attribute "delay" is updated with the new value:

> [root@virt-148 ~]# pcs stonith config fence-virt-148
>  Resource: fence-virt-148 (class=stonith type=fence_xvm)
>   Attributes: delay=10 pcmk_host_check=static-list pcmk_host_list=virt-148 pcmk_host_map=virt-148:virt-148.cluster-qe.lab.eng.brq.redhat.com
>   Operations: monitor interval=60s (fence-virt-148-monitor-interval-60s)

Marking verified in pacemaker-2.0.5-5.el8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782