Bug 1881537

Summary: Pacemaker Remote nodes cannot run resources that have CIB secrets configured
Product: Red Hat Enterprise Linux 8
Reporter: Markéta Smazová <msmazova>
Component: pacemaker
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: medium
Version: 8.3
CC: cluster-maint, kgaillot, phagara
Target Milestone: rc
Keywords: Triaged
Target Release: 8.4
Flags: pm-rhel: mirror+
Hardware: All
OS: All
Fixed In Version: pacemaker-2.0.5-6.el8
Doc Type: No Doc Update
Doc Text:
The Pacemaker capability being fixed was added in 8.3, but the pcs interface is not yet available (or documented), so I think we are OK not mentioning this in documentation.
Last Closed: 2021-05-18 15:26:40 UTC
Type: Bug

Description Markéta Smazová 2020-09-22 16:11:23 UTC
Description of problem:
When a resource parameter is configured as a secret (set with the `cibsecret` command), the resource fails when it runs on a Pacemaker Remote node.

Version-Release number of selected component (if applicable):
pacemaker-2.0.4-6

How reproducible:
consistent

Steps to Reproduce:
1. Configure a cluster with at least one Pacemaker Remote node.
2. Configure a resource on the Pacemaker Remote node.
3. Set one of its parameters as a secret with the `cibsecret` command. Make sure to run `cibsecret` from a full cluster node (the command does not work on remote nodes). A minimal sketch of these steps is shown below.
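
For reference, a minimal reproducer along these lines (a sketch only; `remote1` and `dummy` are example names, the remote node is assumed to be already authenticated for pcs, and fencing setup is omitted):

> # run on a full cluster node
> pcs cluster node add-remote remote1             # integrate the Pacemaker Remote node
> pcs resource create dummy ocf:pacemaker:Dummy   # a test resource
> pcs constraint location dummy prefers remote1   # place it on the remote node
> cibsecret set dummy delay 10                    # turn the parameter into a CIB secret
> pcs status                                      # on pacemaker-2.0.4 the resource fails on remote1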

Actual results:
The resource fails and is stopped. It remains stopped until its secret parameter is deleted or converted back to a regular (non-secret) parameter.

Expected results:
The remote node gets access to the resource's secret parameter (e.g. by querying a full cluster node) and starts the resource.

Additional info:
If a fix is not possible, this should be documented as another known limitation in the help/man page and also in the future pcs interface.

Comment 1 Ken Gaillot 2020-09-22 17:44:03 UTC
The problem is that currently, Pacemaker's executor daemon (pacemaker-execd on cluster nodes and pacemaker-remoted on remote nodes) is the one that replaces the "lrm://" placeholders with the secret values. This means that the secrets must be available locally on the node running the executor daemon, but for remote nodes, that is not the case.
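
To illustrate the current mechanism (the resource/parameter names are just examples; paths are those used by cibsecret, and the .sign checksum file is my recollection of its layout): the CIB itself carries only the placeholder, while the real value lives in a root-owned file on each node that has the secret.

> # what the CIB stores for a secret parameter:
> #   <nvpair id="dummy-instance_attributes-delay" name="delay" value="lrm://"/>
> # what the executor reads locally at execution time:
> cat /var/lib/pacemaker/lrm/secrets/dummy/delay        # the actual value, e.g. 10
> cat /var/lib/pacemaker/lrm/secrets/dummy/delay.sign   # checksum used by 'cibsecret check'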

I believe the fix will be to make the controller daemon (pacemaker-controld) provide the secret values to the executor when requesting execution.

It would be simpler to make the controller substitute the secret values before requesting execution, but that would change the parameter hash used to detect configuration changes, causing affected resources to restart after a rolling upgrade. So instead, I think the controller can pass the parameters as it does now (with the "lrm://" placeholders) and provide the secret values in a separate (new) part of the request. The executor can then perform the substitution based on the provided values rather than on locally stored values, while still computing the same parameter hash.

This approach means that both the cluster nodes and the remote nodes must be running a Pacemaker version with the fix in order for it to work. If either side has an older version, it will simply behave as it does today.

Comment 3 Ken Gaillot 2020-10-06 23:21:35 UTC
The proposed fix in Comment 1 wouldn't work as described. The controller runs as hacluster, not root, so it doesn't have access to the secrets.

My next idea is to modify the cibsecret tool to sync secrets to Pacemaker Remote nodes as well. Remote nodes are more likely than cluster nodes to be down when the secret is set, or to lack ssh access from the cluster nodes, and the host's secrets will need to be exported into containers in order to work with bundles, but we can document those limitations.

Comment 4 Ken Gaillot 2021-01-28 21:02:11 UTC
The fix has been merged upstream as of commit 240b9ec0.

We went with the approach of syncing secrets to remote and guest nodes. It does not work with bundles, though that capability could be added if demand arises. The remote or guest node must be reachable via ssh using its node name (specifically, if the node name differs from the local host name, the node name must be added to /etc/hosts or similar).
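
For illustration, making a remote node reachable under its node name and re-pushing secrets could look like this (the address is an example; `cibsecret sync` re-distributes all locally stored secrets):

> # on the cluster node holding the secrets, if the remote node name does not resolve:
> echo "192.0.2.50  virt-050" >> /etc/hosts
> # re-sync every local secret to the other nodes, now including remote/guest nodes
> cibsecret sync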

Comment 8 Patrik Hagara 2021-02-15 10:18:55 UTC
env: cluster consisting of 3 full nodes and 1 remote node, dummy resource with a location constraint on the remote node


before (pacemaker-2.0.4-6.el8_3.1)
==================================

> [root@virt-047 ~]# pcs status
> Cluster name: STSRHTS11440
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-047 (version 2.0.4-6.el8_3.1-2deceaa3ae) - partition with quorum
>   * Last updated: Mon Feb 15 10:50:26 2021
>   * Last change:  Mon Feb 15 10:50:09 2021 by root via cibadmin on virt-047
>   * 4 nodes configured
>   * 6 resource instances configured
> 
> Node List:
>   * Online: [ virt-047 virt-048 virt-049 ]
>   * RemoteOnline: [ virt-050 ]
> 
> Full List of Resources:
>   * fence-virt-047	(stonith:fence_xvm):	 Started virt-048
>   * fence-virt-048	(stonith:fence_xvm):	 Started virt-049
>   * fence-virt-049	(stonith:fence_xvm):	 Started virt-049
>   * fence-virt-050	(stonith:fence_xvm):	 Started virt-047
>   * virt-050	(ocf::pacemaker:remote):	 Started virt-047
>   * dummy	(ocf::pacemaker:Dummy):	 Started virt-050
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-047 ~]# pcs constraint
> Location Constraints:
>   Resource: dummy
>     Enabled on:
>       Node: virt-050 (score:INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:


change the dummy resource's delay attribute to be a secret:

> [root@virt-047 ~]# cibsecret set dummy delay 10
> INFO: syncing /var/lib/pacemaker/lrm/secrets/dummy/delay to  virt-048 virt-049  ...
> Set 'dummy' option: id=dummy-instance_attributes-delay set=dummy-instance_attributes name=delay value=lrm://

notice the secret was synced only to full nodes and not the remote node.
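
A quick way to confirm that (an additional check, not part of the captured run; requires ssh access to the remote node):

> ssh virt-050 ls /var/lib/pacemaker/lrm/secrets/dummy/
> # fails with "No such file or directory" on 2.0.4, since the secret was never synced there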

verify the secret attribute:

> [root@virt-047 ~]# cibsecret get dummy delay
> 10
> [root@virt-047 ~]# pcs resource config dummy
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: delay=lrm://
>   Operations: migrate_from interval=0s timeout=20s (dummy-migrate_from-interval-0s)
>               migrate_to interval=0s timeout=20s (dummy-migrate_to-interval-0s)
>               monitor interval=10s timeout=20s (dummy-monitor-interval-10s)
>               reload interval=0s timeout=20s (dummy-reload-interval-0s)
>               start interval=0s timeout=20s (dummy-start-interval-0s)
>               stop interval=0s timeout=20s (dummy-stop-interval-0s)


dummy resource fails on the remote node due to not having access to the secret:

> [root@virt-047 ~]# pcs status
> Cluster name: STSRHTS11440
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-047 (version 2.0.4-6.el8_3.1-2deceaa3ae) - partition with quorum
>   * Last updated: Mon Feb 15 10:55:42 2021
>   * Last change:  Mon Feb 15 10:55:06 2021 by root via crm_resource on virt-047
>   * 4 nodes configured
>   * 6 resource instances configured
> 
> Node List:
>   * Online: [ virt-047 virt-048 virt-049 ]
>   * RemoteOnline: [ virt-050 ]
> 
> Full List of Resources:
>   * fence-virt-047	(stonith:fence_xvm):	 Started virt-048
>   * fence-virt-048	(stonith:fence_xvm):	 Started virt-049
>   * fence-virt-049	(stonith:fence_xvm):	 Started virt-049
>   * fence-virt-050	(stonith:fence_xvm):	 Started virt-047
>   * virt-050	(ocf::pacemaker:remote):	 Started virt-047
>   * dummy	(ocf::pacemaker:Dummy):	 Stopped
> 
> Failed Resource Actions:
>   * dummy_start_0 on virt-050 'not configured' (6): call=18, status='complete', exitreason='', last-rc-change='2021-02-15 10:55:06 +01:00', queued=0ms, exec=6ms
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

result: remote nodes are unable to host resources with secret attributes



after (pacemaker-2.0.5-6.el8)
=============================

> [root@virt-042 ~]# pcs status
> Cluster name: STSRHTS26313
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-043 (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>   * Last updated: Mon Feb 15 11:03:13 2021
>   * Last change:  Mon Feb 15 11:02:17 2021 by root via cibadmin on virt-042
>   * 4 nodes configured
>   * 6 resource instances configured
> 
> Node List:
>   * Online: [ virt-042 virt-043 virt-044 ]
>   * RemoteOnline: [ virt-045 ]
> 
> Full List of Resources:
>   * fence-virt-042	(stonith:fence_xvm):	 Started virt-043
>   * fence-virt-043	(stonith:fence_xvm):	 Started virt-044
>   * fence-virt-044	(stonith:fence_xvm):	 Started virt-044
>   * fence-virt-045	(stonith:fence_xvm):	 Started virt-042
>   * virt-045	(ocf::pacemaker:remote):	 Started virt-042
>   * dummy	(ocf::pacemaker:Dummy):	 Started virt-045
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-042 ~]# pcs constraint
> Location Constraints:
>   Resource: dummy
>     Enabled on:
>       Node: virt-045 (score:INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:


change the dummy resource's delay attribute to be a secret:

> [root@virt-042 ~]# cibsecret set dummy delay 10
> INFO: syncing /var/lib/pacemaker/lrm/secrets/dummy/delay to  virt-043 virt-044 virt-045  ...
> Set 'dummy' option: id=dummy-instance_attributes-delay set=dummy-instance_attributes name=delay value=lrm://

notice the secret was synced to all nodes, including the remote one.
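
The same check as in the "before" run now shows the secret present on the remote node (an additional check, not captured in the original run):

> ssh virt-045 cat /var/lib/pacemaker/lrm/secrets/dummy/delay
> # -> 10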

verify secret attribute:

> [root@virt-042 ~]# cibsecret get dummy delay
> 10
> [root@virt-042 ~]# pcs resource config dummy
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: delay=lrm://
>   Operations: migrate_from interval=0s timeout=20s (dummy-migrate_from-interval-0s)
>               migrate_to interval=0s timeout=20s (dummy-migrate_to-interval-0s)
>               monitor interval=10s timeout=20s (dummy-monitor-interval-10s)
>               reload interval=0s timeout=20s (dummy-reload-interval-0s)
>               start interval=0s timeout=20s (dummy-start-interval-0s)
>               stop interval=0s timeout=20s (dummy-stop-interval-0s)


verify the dummy resource is happily running:

> [root@virt-042 ~]# pcs status
> Cluster name: STSRHTS26313
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-043 (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>   * Last updated: Mon Feb 15 11:04:12 2021
>   * Last change:  Mon Feb 15 11:03:56 2021 by root via crm_resource on virt-042
>   * 4 nodes configured
>   * 6 resource instances configured
> 
> Node List:
>   * Online: [ virt-042 virt-043 virt-044 ]
>   * RemoteOnline: [ virt-045 ]
> 
> Full List of Resources:
>   * fence-virt-042	(stonith:fence_xvm):	 Started virt-043
>   * fence-virt-043	(stonith:fence_xvm):	 Started virt-044
>   * fence-virt-044	(stonith:fence_xvm):	 Started virt-044
>   * fence-virt-045	(stonith:fence_xvm):	 Started virt-042
>   * virt-045	(ocf::pacemaker:remote):	 Started virt-042
>   * dummy	(ocf::pacemaker:Dummy):	 Started virt-045
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled


the logs show that the resource configuration changed and was reloaded.

log excerpt from a full node:

> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Forwarding cib_modify operation for section resources to all (origin=local/crm_resource/6)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: --- 0.15.9 2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: +++ 0.16.0 cce549ac945c5a82c0b01d029f486781
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib:  @epoch=16, @num_updates=0
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: ++ /cib/configuration/resources/primitive[@id='dummy']:  <instance_attributes id="dummy-instance_attributes"/>
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: ++                                                         <nvpair id="dummy-instance_attributes-delay" name="delay" value="lrm://"/>
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: ++                                                       </instance_attributes>
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Completed cib_modify operation for section resources: OK (rc=0, origin=virt-042/crm_resource/6, version=0.16.0)
> Feb 15 11:03:56 virt-042 pacemaker-fenced    [49708] (update_cib_stonith_devices_v2)    info: Updating device list from the cib: create primitive[@id='dummy']
> Feb 15 11:03:56 virt-042 pacemaker-fenced    [49708] (cib_devices_update)       info: Updating devices to version 0.16.0
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_file_backup)  info: Archived previous version as /var/lib/pacemaker/cib/cib-17.raw
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_file_write_with_digest)       info: Wrote version 0.16.0 of the CIB to disk (digest: 9098cee6303e9bd84023dd3c701730fc)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_file_write_with_digest)       info: Reading cluster configuration file /var/lib/pacemaker/cib/cib.uQ1cr1 (digest: /var/lib/pacemaker/cib/cib.Q467K0)
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (lrmd_tls_recv_reply)      info: queueing notify
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (lrmd_tls_recv_reply)      info: notify trigger set.
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (do_lrm_rsc_op)    notice: Requesting local execution of reload operation for dummy on virt-045 | transition_key=7:8:0:41ebb136-fb99-4de9-a120-949760e3b442 op_key=dummy_reload_0
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Forwarding cib_modify operation for section status to all (origin=local/crmd/55)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: --- 0.16.0 2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: +++ 0.16.1 (null)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib:  @num_updates=1
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib/status/node_state[@id='virt-045']:  @crm-debug-origin=do_update_resource
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib/status/node_state[@id='virt-045']/lrm[@id='virt-045']/lrm_resources/lrm_resource[@id='dummy']/lrm_rsc_op[@id='dummy_last_0']:  @operation_key=dummy_monitor_0, @operation=monitor, @transition-key=7:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @transition-magic=-1:193;7:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1613383436, @last-run=1613
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-042/crmd/55, version=0.16.1)
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (process_lrm_event)        info: Result of monitor operation for dummy on virt-045: Cancelled | call=8 key=dummy_monitor_10000 confirmed=true
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (process_lrm_event)        notice: Result of reload operation for dummy on virt-045: ok | rc=0 call=12 key=dummy_reload_0 confirmed=true cib-update=56
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Forwarding cib_modify operation for section status to all (origin=local/crmd/56)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: --- 0.16.1 2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: +++ 0.16.2 (null)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib:  @num_updates=2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib/status/node_state[@id='virt-045']/lrm[@id='virt-045']/lrm_resources/lrm_resource[@id='dummy']/lrm_rsc_op[@id='dummy_last_0']:  @operation_key=dummy_start_0, @operation=start, @transition-magic=0:0;7:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @call-id=12, @rc-code=0, @op-status=0, @exec-time=43
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-042/crmd/56, version=0.16.2)
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (do_lrm_rsc_op)    notice: Requesting local execution of monitor operation for dummy on virt-045 | transition_key=6:8:0:41ebb136-fb99-4de9-a120-949760e3b442 op_key=dummy_monitor_10000
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Forwarding cib_modify operation for section status to all (origin=local/crmd/57)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: --- 0.16.2 2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: +++ 0.16.3 (null)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib:  @num_updates=3
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib/status/node_state[@id='virt-045']/lrm[@id='virt-045']/lrm_resources/lrm_resource[@id='dummy']/lrm_rsc_op[@id='dummy_monitor_10000']:  @transition-key=6:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @transition-magic=-1:193;6:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1613383436, @exec-time=0, @op-digest=9e2e2def26ff3a0cea4121713c3108b4
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-042/crmd/57, version=0.16.3)
> Feb 15 11:03:56 virt-042 pacemaker-controld  [49712] (process_lrm_event)        notice: Result of monitor operation for dummy on virt-045: ok | rc=0 call=13 key=dummy_monitor_10000 confirmed=false cib-update=58
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Forwarding cib_modify operation for section status to all (origin=local/crmd/58)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: --- 0.16.3 2
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: Diff: +++ 0.16.4 (null)
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib:  @num_updates=4
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_perform_op)   info: +  /cib/status/node_state[@id='virt-045']/lrm[@id='virt-045']/lrm_resources/lrm_resource[@id='dummy']/lrm_rsc_op[@id='dummy_monitor_10000']:  @transition-magic=0:0;6:8:0:41ebb136-fb99-4de9-a120-949760e3b442, @call-id=13, @rc-code=0, @op-status=0, @exec-time=26
> Feb 15 11:03:56 virt-042 pacemaker-based     [49707] (cib_process_request)      info: Completed cib_modify operation for section status: OK (rc=0, origin=virt-042/crmd/58, version=0.16.4)

and on the remote node:

> Feb 15 11:03:56 virt-045 pacemaker-remoted   [55200] (cancel_recurring_action) 	info: Cancelling ocf operation dummy_monitor_10000
> Feb 15 11:03:56 virt-045 pacemaker-remoted   [55200] (log_execute) 	info: executing - rsc:dummy action:reload call_id:12
> Feb 15 11:03:56  Dummy(dummy)[55414]:    ERROR: Reloading...
> Feb 15 11:03:56 virt-045 pacemaker-remoted   [55200] (log_finished) 	info: dummy reload (call 12, PID 55414) exited with status 0 (execution time 43ms, queue time 0ms)

result: remote nodes can successfully host resources with secret attributes configured.

Comment 10 errata-xmlrpc 2021-05-18 15:26:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782
