Bug 1381754

Summary: pacemaker / pacemaker_remote do not give a warning when /etc/pacemaker/authkey is not readable

Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Andreas Karis <akaris>
Assignee: Ken Gaillot <kgaillot>
QA Contact: Ofer Blaut <oblaut>
CC: abeekhof, cluster-maint, fdinitto, jpokorny, mnovacek
Target Milestone: rc
Target Release: 7.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.18-4.el7
Doc Type: No Doc Update
Doc Text: Not a highly user-visible change
Type: Bug
Last Closed: 2018-04-10 15:28:37 UTC

Description Andreas Karis 2016-10-04 22:41:39 UTC
Description of problem:
pacemaker / pacemaker_remote do not give a warning when /etc/pacemaker/authkey is not readable

Version-Release number of selected component (if applicable):
latest

How reproducible:
all the time

Steps to Reproduce:
1. Create /etc/pacemaker/authkey as root.
2. Remove world-readable permissions from the key (so the hacluster user cannot read it).
3. Start a Pacemaker Remote connection that depends on the key.
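
A minimal command sketch of the reproduction steps on a cluster node (the dd key-generation command is an assumption reflecting common practice, not something stated in this report):

    dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1   # step 1: create the key as root
    chmod o-r /etc/pacemaker/authkey                               # step 2: remove world-readable permission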

Actual results:
The Pacemaker Remote connection fails without a clear warning.

Expected results:
pacemaker should clearly complain in the logs that it cannot access /etc/pacemaker/authkey

Additional info:

Comment 3 Jan Pokorný [poki] 2016-11-08 15:18:24 UTC
In this context, would it make sense to also consider enforcing sane
permissions on the key file (e.g., readable only by a particular user/group),
as is commonly done elsewhere (sshd, booth)?

Comment 4 Ken Gaillot 2017-01-31 15:53:21 UTC
There is a log message indicating that the key could not be read, but it is rather obscure to the end user, and pacemaker does not detect and report the error quickly or in a helpful manner. Some sanity checking per Comment 3 would also be useful, though we should most likely only log a warning on problems, to avoid invalidating existing setups.

Comment 5 Ken Gaillot 2017-03-06 23:28:25 UTC
This will not be ready in the 7.4 timeframe.

Comment 6 Ken Gaillot 2017-10-25 19:22:17 UTC
Some notes:

* pacemaker_remote runs as root, so it needs no special permissions to read the key file. The cluster node that hosts the remote connection connects as the hacluster user, so on the cluster nodes, at least, the key file needs to be readable by that user. The key file should *not* be world-readable.

* From the current docs: "As of Red Hat Enterprise Linux 7.4, the pcs cluster node add-guest command sets up the authkey for guest nodes and the pcs cluster node add-remote command sets up the authkey for remote nodes." If those commands are used, the correct permissions are set automatically.

* For releases before 7.4, the docs suggest manually running "mkdir -p --mode=0750 /etc/pacemaker"
and "chgrp haclient /etc/pacemaker", which ensures that the entire directory is protected, so the authkey permissions can be left with the world-readable bit on (see the command sketch at the end of this comment).

* With the current 7.5 build, the error is logged on the cluster node hosting the remote connection. The following messages will repeat:

Oct 25 14:15:43 rhel7-1 crmd[9918]:   error: No valid lrmd remote key found at /etc/pacemaker/authkey
Oct 25 14:15:43 rhel7-1 crmd[9918]: warning: Setup of the key failed (rc=-1) for remote node [...]

The cluster will continue attempting the connection in that manner until the remote start operation's timeout is hit, at which point it shows a failure in the status and recovers according to the configured policy (by default, re-attempting the start on the same cluster node 1,000,000 times before moving on to another cluster node).

* I think what the cluster *should* do (and would be the fix for this bz) is to return OCF_ERR_ARGS immediately (and set a meaningful exit reason) upon not being able to read the file, which would ban the current node from hosting the connection.
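
For reference, a condensed sketch of the manual setup described above, run as root on each node. The mkdir/chgrp commands are taken from the docs quoted above; the dd command for generating the key is an assumption reflecting common practice:

    mkdir -p --mode=0750 /etc/pacemaker                            # directory readable only by root and the haclient group
    chgrp haclient /etc/pacemaker
    dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1   # hypothetical key-generation step

With the directory protected this way, the hacluster user (a member of haclient) can still read the key on cluster nodes even if the key file itself keeps default permissions.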

Comment 7 Ken Gaillot 2017-10-27 16:08:30 UTC
Fixed upstream as of commit b9f61dd4

Test procedure (and description of changes):

1. Configure and start a cluster with a Pacemaker Remote node.

2. Test unavailable key on the remote node:
2a. Disable the remote resource, then stop pacemaker_remote on the remote node.
2b. Move the remote authentication key (default /etc/pacemaker/authkey) to a different location. (Simply changing the permissions is not enough because pacemaker_remote runs as root.)
2c. Start pacemaker_remote.
2d. Before the change, there will be no difference in the logs until the cluster attempts a connection (after re-enabling the resource). After the change, there will be log messages at start-up like:

pacemaker_remoted:    error: lrmd_tls_set_key:   No valid lrmd remote key found at /etc/pacemaker/authkey
pacemaker_remoted:  warning: lrmd_init_remote_tls_server:        A cluster connection will not be possible until the key is available

The log will be the only difference. Both before and after the change, if the key is made available before the cluster connects, everything will be fine, and if not, the connection will repeatedly re-attempt and fail.
2e. When done testing, make sure the key is available again, and the remote resource is enabled.
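
A condensed command sketch of step 2, assuming a hypothetical remote connection resource named R-remote-example and a scratch backup path (adjust names to your cluster):

    pcs resource disable R-remote-example         # 2a: disable the remote resource
    systemctl stop pacemaker_remote               # 2a: on the remote node
    mv /etc/pacemaker/authkey /root/authkey.bak   # 2b: make the key unavailable
    systemctl start pacemaker_remote              # 2c
    grep pacemaker_remoted /var/log/messages      # 2d: look for the new error/warning at start-up
    mv /root/authkey.bak /etc/pacemaker/authkey   # 2e: restore the key
    pcs resource enable R-remote-example          # 2e: re-enable the resource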

3. Test unavailable key on a cluster node:
3a. Use a location constraint to prefer one particular cluster node for the remote connection resource, for ease of testing.
3b. Disable the remote connection resource.
3c. On the test node, make the key unavailable either by moving it to a different location or making it unreadable by the hacluster:haclient user/group.
3d. Re-enable the resource. Before the fix, the cluster node will repeatedly fail and re-attempt connection (as shown in its logs), until the remote connection's start operation timeout is reached, at which point cluster status will show the resource as failed, and recovery will proceed (by default, re-attempting on the same node 1,000,000 times). After the fix, the cluster status will immediately show a resource failure after the first failed connection attempt, like:

Failed Actions:

    remote-rhel7-2_start_0 on rhel7-1 'invalid parameter' (2):
    call=8, status=Error,
    exitreason='Authentication key not readable',
    last-rc-change='Thu Oct 26 09:33:42 2017', queued=0ms, exec=0ms

and recovery will not be attempted on the same node (the resource will move to another node if possible).
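
Similarly, a condensed sketch of step 3, again with hypothetical resource and node names:

    pcs constraint location R-remote-example prefers cluster-node-1   # 3a: prefer one cluster node
    pcs resource disable R-remote-example                             # 3b
    chown root:root /etc/pacemaker/authkey                            # 3c: on cluster-node-1, make the key
    chmod 600 /etc/pacemaker/authkey                                  #     unreadable by hacluster:haclient
    pcs resource enable R-remote-example                              # 3d
    pcs status                                                        # 3d: expect 'Authentication key not readable'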

Comment 9 michal novacek 2017-12-06 14:57:46 UTC
I have verified that a missing or unreadable /etc/pacemaker/authkey is clearly
reported in the log for both the cluster node and the remote node with
pacemaker-1.1.18-6.el7.x86_64.

---

Common part:

* Configure the cluster to run virtual domains with a public IP address, able to
    run on both nodes of the cluster.

* Disable all VirtualDomain resources.

* Remove /etc/pacemaker/authkey on the virtual (remote) nodes and on the cluster nodes.
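
Condensed commands for the common part, using the resource name from the pcs config at the end of this comment:

    pcs resource disable R-pool-10-37-166-86   # repeat for any other VirtualDomain resources
    rm -f /etc/pacemaker/authkey               # run on the virtual (remote) nodes and on the cluster nodes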

[root@bucek-01 ~]# date
Wed Dec  6 15:13:18 CET 2017

cluster node test
=================

> enable resource

[root@bucek-01 ~]# grep lrmd /var/log/messages
...
Dec  6 15:12:00 bucek-01 crmd[17694]:   error: No valid lrmd remote key found at /etc/pacemaker/authkey

Also shown in pcs status for both nodes: 
Failed Actions:
* pool-10-37-166-86_start_0 on bucek-02 'invalid parameter' (2): call=17, status=Error, exitreason='Authentication key not readable',
    last-rc-change='Wed Dec  6 15:25:11 2017', queued=0ms, exec=0ms
* pool-10-37-166-86_start_0 on bucek-01 'invalid parameter' (2): call=8, status=Error, exitreason='Authentication key not readable',
    last-rc-change='Wed Dec  6 15:25:07 2017', queued=0ms, exec=0ms

[root@bucek-02 ~]# pcs resource 
...
 R-pool-10-37-166-86    (ocf::heartbeat:VirtualDomain): Stopped


> disable resource
[root@bucek-02 ~]# pcs resource disable R-pool-10-37-166-86
[root@bucek-02 ~]# pcs resource 
...
 R-pool-10-37-166-86    (ocf::heartbeat:VirtualDomain): Stopped (disabled)


remote node test
================
# ls -l /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

# systemctl is-active pacemaker_remote
inactive

# systemctl start pacemaker_remote

# systemctl is-active pacemaker_remote
active

# systemctl status pacemaker_remote
● pacemaker_remote.service - Pacemaker Remote Service
   Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2017-12-06 09:52:06 EST; 8s ago
     Docs: man:pacemaker_remoted
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
 Main PID: 1218 (pacemaker_remot)
   CGroup: /system.slice/pacemaker_remote.service
           └─1218 /usr/sbin/pacemaker_remoted

Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Started Pacemaker Remote Service.
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com systemd[1]: Starting Pacemaker Remote Service...
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]:   notice: Additional logging available in /var/log/pacemaker.log
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]:   notice: Starting TLS listener on port 3121
> Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]:    error: No valid lrmd remote key found at /etc/pacemaker/authkey
> Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]:  warning: A cluster connection will not be possible until the key is available
Dec 06 09:52:06 pool-10-37-166-86.cluster-qe.lab.eng.brq.redhat.com pacemaker_remoted[1218]:   notice: Listening on address ::

---

> (1) pcs config
[root@bucek-01 ~]# pcs config
Cluster Name: STSRHTS27314
Corosync Nodes:
 bucek-01 bucek-02
Pacemaker Nodes:
 bucek-01 bucek-02

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Clone: shared-vg-clone
  Meta Attrs: clone-max=2 interleave=true 
  Resource: shared-vg (class=ocf provider=heartbeat type=LVM)
   Attributes: exclusive=false partial_activation=false volgrpname=shared
   Operations: methods interval=0s timeout=5 (shared-vg-methods-interval-0s)
               monitor interval=10 timeout=30 (shared-vg-monitor-interval-10)
               start interval=0s timeout=30 (shared-vg-start-interval-0s)
               stop interval=0s timeout=30 (shared-vg-stop-interval-0s)
 Clone: etc-libvirt-clone
  Meta Attrs: clone-max=2 interleave=true 
  Resource: etc-libvirt (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
   Operations: monitor interval=30s (etc-libvirt-monitor-interval-30s)
               notify interval=0s timeout=60 (etc-libvirt-notify-interval-0s)
               start interval=0s timeout=60 (etc-libvirt-start-interval-0s)
               stop interval=0s timeout=60 (etc-libvirt-stop-interval-0s)
 Clone: images-clone
  Meta Attrs: clone-max=2 interleave=true 
  Resource: images (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
   Operations: monitor interval=30s (images-monitor-interval-30s)
               notify interval=0s timeout=60 (images-notify-interval-0s)
               start interval=0s timeout=60 (images-start-interval-0s)
               stop interval=0s timeout=60 (images-stop-interval-0s)
 Resource: R-pool-10-37-166-86 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-166-86.xml hypervisor=qemu:///system
  Meta Attrs: remote-connect-timeout=60 remote-node=pool-10-37-166-86 target-role=Stopped 
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0s timeout=60 (R-pool-10-37-166-86-migrate_from-interval-0s)
              migrate_to interval=0s timeout=120 (R-pool-10-37-166-86-migrate_to-interval-0s)
              monitor interval=10 timeout=30 (R-pool-10-37-166-86-monitor-interval-10)
              start interval=0s timeout=90 (R-pool-10-37-166-86-start-interval-0s)
              stop interval=0s timeout=90 (R-pool-10-37-166-86-stop-interval-0s)

Stonith Devices:
 Resource: fence-bucek-01 (class=stonith type=fence_ipmilan)
  Attributes: delay=5 ipaddr=bucek-01-ilo login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=bucek-01
  Operations: monitor interval=60s (fence-bucek-01-monitor-interval-60s)
 Resource: fence-bucek-02 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=bucek-02-ilo login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=bucek-02
  Operations: monitor interval=60s (fence-bucek-02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: clvmd-clone
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-clvmd-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-clvmd-clone-R-pool-10-37-166-86--INFINITY)
  Resource: dlm-clone
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-dlm-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-dlm-clone-R-pool-10-37-166-86--INFINITY)
  Resource: etc-libvirt-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-etc-libvirt-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-etc-libvirt-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-etc-libvirt-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-etc-libvirt-clone-R-pool-10-37-166-86--INFINITY)
  Resource: images-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-images-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-images-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-images-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-images-clone-R-pool-10-37-166-86--INFINITY)
  Resource: shared-vg-clone
    Enabled on: bucek-01 (score:INFINITY) (id:location-shared-vg-clone-bucek-01-INFINITY)
    Enabled on: bucek-02 (score:INFINITY) (id:location-shared-vg-clone-bucek-02-INFINITY)
    Disabled on: R-pool-10-37-165-245 (score:-INFINITY) (id:location-shared-vg-clone-R-pool-10-37-165-245--INFINITY)
    Disabled on: R-pool-10-37-166-86 (score:-INFINITY) (id:location-shared-vg-clone-R-pool-10-37-166-86--INFINITY)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start shared-vg-clone (kind:Mandatory)
  start shared-vg-clone then start etc-libvirt-clone (kind:Mandatory)
  start shared-vg-clone then start images-clone (kind:Mandatory)
  start etc-libvirt-clone then start R-pool-10-37-166-86 (kind:Mandatory)
  start images-clone then start R-pool-10-37-166-86 (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  shared-vg-clone with clvmd-clone (score:INFINITY)
  images-clone with shared-vg-clone (score:INFINITY)
  etc-libvirt-clone with shared-vg-clone (score:INFINITY)
  R-pool-10-37-166-86 with images-clone (score:INFINITY)
  R-pool-10-37-166-86 with etc-libvirt-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS27314
 dc-version: 1.1.18-6.el7-2b07d5c5a9
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

Comment 12 errata-xmlrpc 2018-04-10 15:28:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860