Bug 1381754
| Summary: | pacemaker / pacemaker_remote do not give a warning when /etc/pacemaker/authkey is not readable | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andreas Karis <akaris> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | Ofer Blaut <oblaut> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | abeekhof, cluster-maint, fdinitto, jpokorny, mnovacek |
| Target Milestone: | rc | | |
| Target Release: | 7.5 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.18-4.el7 | Doc Type: | No Doc Update |
| Doc Text: | Not a highly user-visible change | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-10 15:28:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Andreas Karis, 2016-10-04 22:41:39 UTC)
In this context, would it make sense to also consider enforcing sane privileges on the key file (e.g., readable only by a particular user/group), as is commonly done elsewhere (sshd, booth)? There is a log message indicating that the key could not be read, but it is rather obscure to the end user, and Pacemaker does not quickly detect and report the error in a helpful manner.

Some sanity checking per Comment 3 would also be useful, though most likely we should only log a warning on problems, to avoid invalidating existing setups. This will not be ready in the 7.4 timeframe.

Some notes:

* pacemaker_remote runs as root, so it needs no special permissions to read the key file. The cluster node that hosts the remote connection will connect as the hacluster user, so on the cluster nodes at least, the key file needs to be readable by that user. The key file should *not* be world-readable.

* From the current docs: "As of Red Hat Enterprise Linux 7.4, the pcs cluster node add-guest command sets up the authkey for guest nodes and the pcs cluster node add-remote command sets up the authkey for remote nodes." If those commands are used, the correct permissions will be set.

* For <7.4, the docs suggest manually running "mkdir -p --mode=0750 /etc/pacemaker" and "chgrp haclient /etc/pacemaker", which ensures that the entire directory is protected, so the authkey permissions can be left with the world-readable bit on.

* With the current 7.5 build, the error is logged on the cluster node hosting the remote connection. The following messages will repeat:

Oct 25 14:15:43 rhel7-1 crmd[9918]: error: No valid lrmd remote key found at /etc/pacemaker/authkey
Oct 25 14:15:43 rhel7-1 crmd[9918]: warning: Setup of the key failed (rc=-1) for remote node [...]
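The manual setup quoted from the pre-7.4 docs can be sketched as a short script. Only the mkdir and chgrp commands come from the docs cited above; generating the key with dd from /dev/urandom and the 640 mode on the file itself are illustrative assumptions. A scratch directory stands in for /etc/pacemaker so the sketch runs without root (the chgrp is left commented out for the same reason).

```shell
# Sketch of the pre-7.4 manual authkey setup. PCMK_DIR would be
# /etc/pacemaker on a real node; a scratch path is used here so this
# can run unprivileged.
PCMK_DIR="${PCMK_DIR:-./pacemaker-scratch}"

mkdir -p --mode=0750 "$PCMK_DIR"    # directory accessible to owner/group only
# chgrp haclient "$PCMK_DIR"        # on a real node: let hacluster read via group

# Assumed key generation: any random key file works; 4096 bytes is arbitrary.
dd if=/dev/urandom of="$PCMK_DIR/authkey" bs=4096 count=1 2>/dev/null
chmod 640 "$PCMK_DIR/authkey"       # not world-readable
ls -l "$PCMK_DIR/authkey"
```

Because the directory itself is 0750 and group-owned by haclient, even a world-readable authkey inside it stays protected, which is why the docs protect the directory rather than the file.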
The cluster will continue attempting the connection in that manner until the remote start operation timeout is hit, at which point it shows a failure in the status and recovers according to the configured policy (by default, re-attempting the start on the same cluster node 1,000,000 times before moving on to another cluster node).

* I think what the cluster *should* do (and would be the fix for this bz) is to return OCF_ERR_ARGS immediately (and set a meaningful exit reason) upon not being able to read the file, which would ban the current node from hosting the connection.

Fixed upstream as of commit b9f61dd4

Test procedure (and description of changes):

1. Configure and start a cluster with a Pacemaker Remote node.

2. Test an unavailable key on the remote node:

2a. Disable the remote resource, then stop pacemaker_remote on the remote node.
2b. Move the remote authentication key (default /etc/pacemaker/authkey) to a different location. (Simply changing the permissions is not enough, because pacemaker_remote runs as root.)
2c. Start pacemaker_remote.
2d. Before the change, there will be no difference in the logs until the cluster attempts a connection (after re-enabling the resource). After the change, there will be log messages at start-up like:

pacemaker_remoted: error: lrmd_tls_set_key: No valid lrmd remote key found at /etc/pacemaker/authkey
pacemaker_remoted: warning: lrmd_init_remote_tls_server: A cluster connection will not be possible until the key is available

The log will be the only difference. Both before and after the change, if the key is made available before the cluster connects, everything will be fine; if not, the connection will repeatedly re-attempt and fail.
2e. When done testing, make sure the key is available again and the remote resource is enabled.

3. Test an unavailable key on a cluster node:

3a. Use a location constraint to prefer one particular cluster node for the remote connection resource, for ease of testing.
3b.
Disable the remote connection resource.
3c. On the test node, make the key unavailable, either by moving it to a different location or by making it unreadable by the hacluster:haclient user/group.
3d. Re-enable the resource. Before the fix, the cluster node will repeatedly fail and re-attempt the connection (as shown in its logs) until the remote connection's start operation timeout is reached, at which point cluster status will show the resource as failed, and recovery will proceed (by default, re-attempting on the same node 1,000,000 times). After the fix, cluster status will immediately show a resource failure after the first failed connection attempt, like:

Failed Actions:
* remote-rhel7-2_start_0 on rhel7-1 'invalid parameter' (2): call=8, status=Error, exitreason='Authentication key not readable', last-rc-change='Thu Oct 26 09:33:42 2017', queued=0ms, exec=0ms

and recovery will not be attempted on the same node (the resource will move to another node if possible).

I have verified that a missing or unreadable /etc/pacemaker/authkey is clearly reported in the log for both cluster node and remote node with pacemaker-1.1.18-6.el7.x86_64

---

Common part:

* Configure the cluster to run virtual domains with a public IP address, able to run on both nodes of the cluster.
* Disable all VirtualDomain resources.
* Remove /etc/pacemaker/authkey on virtual nodes and on cluster nodes.

[root@bucek-01 ~]# date
Wed Dec 6 15:13:18 CET 2017

cluster node test
=================

> enable resource

[root@bucek-01 ~]# grep lrmd /var/log/messages
...
Dec 6 15:12:00 bucek-01 crmd[17694]: error: No valid lrmd remote key found at /etc/pacemaker/authkey

Also shown in pcs status for both nodes:

Failed Actions:
* pool-10-37-166-86_start_0 on bucek-02 'invalid parameter' (2): call=17, status=Error, exitreason='Authentication key not readable', last-rc-change='Wed Dec 6 15:25:11 2017', queued=0ms, exec=0ms
* pool-10-37-166-86_start_0 on bucek-01 'invalid parameter' (2): call=8, status=Error, exitreason='Authentication key not readable', last-rc-change='Wed Dec 6 15:25:07 2017', queued=0ms, exec=0ms

[root@bucek-02 ~]# pcs resource
...
 R-pool-10-37-166-86 (ocf::heartbeat:VirtualDomain): Stopped

> disable resource

[root@bucek-02 ~]# pcs resource disable R-pool-10-37-166-86
[root@bucek-02 ~]# pcs resource
...
 R-pool-10-37-166-86 (ocf::heartbeat:VirtualDomain): Stopped (disabled)

remote node test
================

# ls -l /etc/pacemaker/authekey
ls: cannot access /etc/pacemaker/authekey: No such file or directory
# systemctl is-active pacemaker_remote
inactive
# systemctl start pacemaker_remote
# systemctl is-active pacemaker_remote
active
# systemctl status pacemaker_remote
● pacemaker_remote.service - Pacemaker Remote Service
   Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2017-12-06 09:52:06 EST; 8s ago
     Docs: man:pacemaker_remoted
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
 Main PID: 1218 (pacemaker_remot)
   CGroup: /system.slice/pacemaker_remote.service
           └─1218 /usr/sbin/pacemaker_remoted

Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Started Pacemaker Remote Service.
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Starting Pacemaker Remote Service...
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]: notice: Additional logging available in /var/log/pacemaker.log
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]: notice: Starting TLS listener on port 3121
> Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]: error: No valid lrmd remote key found at /etc/pacemaker/authkey
> Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]: warning: A cluster connection will not be possible until the key is available
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]: notice: Listening on address ::

---

> (1) pcs config

[root@bucek-01 ~]# pcs config
Cluster Name: STSRHTS27314
Corosync Nodes:
 bucek-01 bucek-02
Pacemaker Nodes:
 bucek-01 bucek-02

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Clone: shared-vg-clone
  Meta Attrs: clone-max=2 interleave=true
  Resource: shared-vg (class=ocf provider=heartbeat type=LVM)
   Attributes: exclusive=false partial_activation=false volgrpname=shared
   Operations: methods interval=0s timeout=5 (shared-vg-methods-interval-0s)
               monitor interval=10 timeout=30 (shared-vg-monitor-interval-10)
               start interval=0s timeout=30 (shared-vg-start-interval-0s)
               stop interval=0s timeout=30 (shared-vg-stop-interval-0s)
 Clone:
 etc-libvirt-clone
  Meta Attrs: clone-max=2 interleave=true
  Resource: etc-libvirt (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
   Operations: monitor interval=30s (etc-libvirt-monitor-interval-30s)
               notify interval=0s timeout=60 (etc-libvirt-notify-interval-0s)
               start interval=0s timeout=60 (etc-libvirt-start-interval-0s)
               stop interval=0s timeout=60 (etc-libvirt-stop-interval-0s)
 Clone: images-clone
  Meta Attrs: clone-max=2 interleave=true
  Resource: images (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
   Operations: monitor interval=30s (images-monitor-interval-30s)
               notify interval=0s timeout=60 (images-notify-interval-0s)
               start interval=0s timeout=60 (images-start-interval-0s)
               stop interval=0s timeout=60 (images-stop-interval-0s)
 Resource: R-pool-10-37-166-86 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-166-86.xml hypervisor=qemu:///system
  Meta Attrs: remote-connect-timeout=60 remote-node=pool-10-37-166-86 target-role=Stopped
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0s timeout=60 (R-pool-10-37-166-86-migrate_from-interval-0s)
              migrate_to interval=0s timeout=120 (R-pool-10-37-166-86-migrate_to-interval-0s)
              monitor interval=10 timeout=30 (R-pool-10-37-166-86-monitor-interval-10)
              start interval=0s timeout=90 (R-pool-10-37-166-86-start-interval-0s)
              stop interval=0s timeout=90 (R-pool-10-37-166-86-stop-interval-0s)

Stonith Devices:
 Resource: fence-bucek-01 (class=stonith type=fence_ipmilan)
  Attributes: delay=5 ipaddr=bucek-01-ilo login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=bucek-01
  Operations: monitor interval=60s (fence-bucek-01-monitor-interval-60s)
 Resource: fence-bucek-02 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=bucek-02-ilo login=admin passwd=admin pcmk_host_check=static-list
pcmk_host_list=bucek-02
  Operations: monitor interval=60s (fence-bucek-02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: clvmd-clone
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-clvmd-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-clvmd-clone-R-pool-10-37-166-86--INFINITY)
  Resource: dlm-clone
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-dlm-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-dlm-clone-R-pool-10-37-166-86--INFINITY)
  Resource: etc-libvirt-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-etc-libvirt-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-etc-libvirt-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-etc-libvirt-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-etc-libvirt-clone-R-pool-10-37-166-86--INFINITY)
  Resource: images-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-images-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-images-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-images-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-images-clone-R-pool-10-37-166-86--INFINITY)
  Resource: shared-vg-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-shared-vg-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-shared-vg-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-shared-vg-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-shared-vg-clone-R-pool-10-37-166-86--INFINITY)

Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start shared-vg-clone
(kind:Mandatory)
  start shared-vg-clone then start etc-libvirt-clone (kind:Mandatory)
  start shared-vg-clone then start images-clone (kind:Mandatory)
  start etc-libvirt-clone then start R-pool-10-37-166-86 (kind:Mandatory)
  start images-clone then start R-pool-10-37-166-86 (kind:Mandatory)

Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  shared-vg-clone with clvmd-clone (score:INFINITY)
  images-clone with shared-vg-clone (score:INFINITY)
  etc-libvirt-clone with shared-vg-clone (score:INFINITY)
  R-pool-10-37-166-86 with images-clone (score:INFINITY)
  R-pool-10-37-166-86 with etc-libvirt-clone (score:INFINITY)

Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS27314
 dc-version: 1.1.18-6.el7-2b07d5c5a9
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860
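The fail-fast behavior verified above (an unreadable key immediately produces an OCF_ERR_ARGS failure with exit reason "Authentication key not readable", instead of retrying until the start timeout) can be sketched as a small shell function. This is an illustrative sketch, not Pacemaker's actual code: the function name and demo file paths are hypothetical; the exit code 2 matches the 'invalid parameter' (2) shown in the status output.

```shell
# Illustrative sketch of the readability check the fix performs before a
# remote connection is attempted (hypothetical names, not Pacemaker code).
OCF_SUCCESS=0
OCF_ERR_ARGS=2   # reported as 'invalid parameter' (2) in cluster status

check_authkey() {
    # Key path defaults to the standard location; overridable for the demo.
    local key="${1:-/etc/pacemaker/authkey}"
    if [ ! -r "$key" ]; then
        echo "error: Authentication key not readable: $key" >&2
        return "$OCF_ERR_ARGS"   # fail fast so this node is banned immediately
    fi
    return "$OCF_SUCCESS"
}

# Demo against scratch files rather than the real key:
touch ./authkey.ok && chmod 600 ./authkey.ok
check_authkey ./authkey.ok && echo "key readable"
check_authkey ./authkey.missing 2>/dev/null || echo "start rejected, rc=$?"
```

Returning a hard error here is what lets the policy engine skip the 1,000,000 same-node retries and move the resource to another node right away.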