Bug 1564536

Summary: Cluster should not start with empty configuration if /var/lib/pacemaker/* subtree has wrong permissions
Product: Red Hat Enterprise Linux 7 Reporter: Ondrej Benes <obenes>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.4 CC: abeekhof, cluster-maint, mnovacek, sbradley
Target Milestone: rc   
Target Release: 7.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.18-13.el7 Doc Type: No Doc Update
Doc Text:
Most users will never encounter this issue
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 07:57:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ondrej Benes 2018-04-06 14:54:46 UTC
Description of problem: 
If the /var/lib/pacemaker/* subtree has wrong permissions (such as ownership root:root instead of hacluster), pacemaker starts the cluster with an empty configuration. From corosync's point of view, the cluster membership is formed correctly (the corosync configuration is untouched), but previously configured resources are not visible because the cib could not be read.

As a result, each node participates in the quorum vote but cannot work with resources or fence any other node (no stonith devices are defined because there is no cib). If another node goes down, the cluster remains quorate because corosync still has its vote.

One possible way to tackle this would be to try to fix the access problem several times; if those attempts fail, the cluster should stop on that node instead of continuing to run. If the cluster fails to fix the access problem, retrying later is unlikely to succeed either.


 
Version-Release number of selected component (if applicable):

pacemaker-1.1.16-12.el7.x86_64

How reproducible:


Steps to Reproduce:
1. # chown root:root /var/lib/pacemaker -R
2. start cluster

Actual results:

The cluster starts with an empty configuration: "cib:  warning: readCibXmlFile:    Continuing with an empty configuration."

===========
  Feb 05 07:21:22 [2942] node2        cib:    error: crm_is_writable:   /var/lib/pacemaker/cib/cib.xml must be owned and r/w by user hacluster
  Feb 05 07:21:22 [2942] node2        cib:     info: retrieveCib:       Reading cluster configuration file /var/lib/pacemaker/cib/cib.xml (digest: /var/lib/pacemaker/cib/cib.xml.sig)
  Feb 05 07:21:22 [2942] node2        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:22 [2942] node2        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:22 [2942] node2        cib:    error: filename2xml:      Couldn't parse /var/lib/pacemaker/cib/cib.xml
  Feb 05 07:21:22 [2942] node2        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:21:22 [2942] node2        cib:  warning: retrieveCib:       Continuing but /var/lib/pacemaker/cib/cib.xml will NOT be used.
  ...
  Feb 05 07:21:22 [2942] node2        cib:  warning: readCibXmlFile:    Continuing with an empty configuration.
  ...
  Feb 05 07:21:22 [2942] node2        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:22 [2942] node2        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:22 [2942] node2        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:21:22 [2942] node2        cib:    error: cib_file_write_with_digest:        /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
  Feb 05 07:21:22 [2942] node2        cib:    error: cib_diskwrite_complete:    Disk write process exited (pid=2961, rc=208)
  Feb 05 07:21:22 [2942] node2        cib:    error: cib_diskwrite_complete:    Disabling disk writes after write failure
==========
  Feb 05 07:21:35 [2928] node3        cib:    error: crm_is_writable:   /var/lib/pacemaker/cib/cib.xml must be owned and r/w by user hacluster
  Feb 05 07:21:35 [2928] node3        cib:     info: retrieveCib:       Reading cluster configuration file /var/lib/pacemaker/cib/cib.xml (digest: /var/lib/pacemaker/cib/cib.xml.sig)
  Feb 05 07:21:35 [2928] node3        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:35 [2928] node3        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:35 [2928] node3        cib:    error: filename2xml:      Couldn't parse /var/lib/pacemaker/cib/cib.xml
  Feb 05 07:21:35 [2928] node3        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:21:35 [2928] node3        cib:  warning: retrieveCib:       Continuing but /var/lib/pacemaker/cib/cib.xml will NOT be used.
  Feb 05 07:21:35 [2928] node3        cib:  warning: readCibXmlFile:    Primary configuration corrupt or unusable, trying backups in /var/lib/pacemaker/cib
  ...
  Feb 05 07:21:35 [2928] node3        cib:  warning: readCibXmlFile:    Continuing with an empty configuration.
  ...
  Feb 05 07:21:35 [2928] node3        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:35 [2928] node3        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:21:35 [2928] node3        cib:    error: filename2xml:      Couldn't parse /var/lib/pacemaker/cib/cib.xml
  Feb 05 07:21:35 [2928] node3        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:21:35 [2928] node3        cib:    error: cib_file_write_with_digest:        /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
  Feb 05 07:21:35 [2928] node3        cib:    error: cib_diskwrite_complete:    Disk write process exited (pid=2951, rc=208)
  Feb 05 07:21:35 [2928] node3        cib:    error: cib_diskwrite_complete:    Disabling disk writes after write failure
==========
  Feb 05 07:20:51 [3031] node1        cib:    error: crm_is_writable:   /var/lib/pacemaker/cib/cib.xml must be owned and r/w by user hacluster
  Feb 05 07:20:51 [3031] node1        cib:     info: retrieveCib:       Reading cluster configuration file /var/lib/pacemaker/cib/cib.xml (digest: /var/lib/pacemaker/cib/cib.xml.sig)
  Feb 05 07:20:51 [3031] node1        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:20:51 [3031] node1        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:20:51 [3031] node1        cib:    error: filename2xml:      Couldn't parse /var/lib/pacemaker/cib/cib.xml
  Feb 05 07:20:51 [3031] node1        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:20:51 [3031] node1        cib:  warning: retrieveCib:       Continuing but /var/lib/pacemaker/cib/cib.xml will NOT be used.
  ...
  Feb 05 07:20:51 [3031] node1        cib:  warning: readCibXmlFile:    Continuing with an empty configuration.
  ...
  Feb 05 07:20:51 [3031] node1        cib:    error: crm_xml_err:       XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:20:51 [3031] node1        cib:    error: filename2xml:      Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
  Feb 05 07:20:51 [3031] node1        cib:    error: filename2xml:      Couldn't parse /var/lib/pacemaker/cib/cib.xml
  Feb 05 07:20:51 [3031] node1        cib:  warning: cib_file_read_and_verify:  Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
  Feb 05 07:20:51 [3031] node1        cib:    error: cib_file_write_with_digest:        /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
  Feb 05 07:20:51 [3031] node1        cib:    error: cib_diskwrite_complete:    Disk write process exited (pid=3058, rc=208)
  Feb 05 07:20:51 [3031] node1        cib:    error: cib_diskwrite_complete:    Disabling disk writes after write failure
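For triage, the affected nodes can be picked out of a log like the one above by searching for the "Continuing with an empty configuration" warning. The following is a minimal sketch in Python, not part of pacemaker; the function name and the regular expression are illustrative assumptions, and it only handles the corosync.log line layout quoted above (timestamp, [pid], node name, "cib:").

```python
import re

# Message text taken verbatim from the readCibXmlFile warning quoted above.
EMPTY_CIB_MARKER = "Continuing with an empty configuration"

# Assumed line layout: "<timestamp> [<pid>] <node>   cib:  <level>: ..."
NODE_RE = re.compile(r"\[\d+\]\s+(\S+)\s+cib:")


def nodes_with_empty_cib(log_lines):
    """Return a sorted list of node names whose cib logged the empty-config warning."""
    nodes = set()
    for line in log_lines:
        if EMPTY_CIB_MARKER not in line:
            continue
        match = NODE_RE.search(line)
        if match:
            nodes.add(match.group(1))
    return sorted(nodes)
```

Passing the lines of the log file (for example, an open file object) returns the node names to inspect; lines in a different format, such as /var/log/messages, are ignored by this sketch.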



Expected results:

If the cluster cannot fix the permissions of /var/lib/pacemaker/*, it should not start on that node.

Additional info:

Comment 2 Ken Gaillot 2018-04-27 21:25:13 UTC
The cib does not run as root, so it cannot try to correct the permissions, but we can exit and stay down rather than continue with an empty CIB.

Note that currently pacemaker never tells corosync to shut down, so the node could continue to survive in the corosync membership and show up as "pending" in the pacemaker membership (resources will not be scheduled on it). This might be changed in the future so that pacemaker tells corosync to shut down as well.

QA: You can test various permission scenarios. The cib should exit with an appropriate log message if /var/lib/pacemaker/cib does not exist or is not writeable by at least one of the hacluster user and the haclient group, and should also exit if cib.xml exists but is not writeable by at least one of hacluster/haclient.
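When preparing test scenarios, the rule above (owned and writable by at least one of user hacluster and group haclient) can be approximated outside pacemaker with a small script. This is a hedged sketch in Python, not the actual check from the fix: the names FileInfo, writable_by_cluster and check_path are hypothetical, and the real cib code may consider more than the owner/group write bits.

```python
import grp
import os
import pwd
import stat
from typing import NamedTuple


class FileInfo(NamedTuple):
    """Owner, group and mode bits of a file (the subset of os.stat() used here)."""
    uid: int
    gid: int
    mode: int


def writable_by_cluster(info: FileInfo, cluster_uid: int, cluster_gid: int) -> bool:
    """True if the file is owned and writable by the cluster user,
    or owned and writable by the cluster group."""
    user_ok = info.uid == cluster_uid and bool(info.mode & stat.S_IWUSR)
    group_ok = info.gid == cluster_gid and bool(info.mode & stat.S_IWGRP)
    return user_ok or group_ok


def check_path(path: str, user: str = "hacluster", group: str = "haclient") -> bool:
    """Apply the rule to a real path. Raises KeyError if the user or group
    does not exist, and OSError if the path cannot be stat()ed."""
    st = os.stat(path)
    info = FileInfo(st.st_uid, st.st_gid, st.st_mode)
    return writable_by_cluster(
        info, pwd.getpwnam(user).pw_uid, grp.getgrnam(group).gr_gid
    )
```

For example, running check_path("/var/lib/pacemaker/cib/cib.xml") as root before starting the cluster would be expected to return False after the chown to root:root from the reproducer, assuming the mode bits were left unchanged.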

Comment 4 Ken Gaillot 2018-06-01 15:37:30 UTC
Fix is upstream as of commit 96c8d58f

FYI the issue of corosync not shutting down when pacemaker shuts down abnormally is addressed by Bug 1448221

Comment 6 michal novacek 2018-08-01 13:19:37 UTC
I have verified with pacemaker-1.1.19-3.el7.x86_64 that pacemaker will not
start when /var/lib/pacemaker is not readable by user hacluster or group
haclient.

----

[root@host-002 pacemaker]# chown -R root:root /var/lib/pacemaker

[root@host-002 pacemaker]# pcs status
Error: cluster is not currently running on this node


Before the fix (1.1.18-12.el7.x86_64)
-------------------------------------
[root@host-003 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...

[root@host-003 ~]# pcs status
Cluster name: STSRHTS5529

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: unknown
Current DC: NONE
Last updated: Wed Aug  1 08:12:41 2018
Last change: Wed Aug  1 08:12:34 2018 by hacluster via crmd on host-003

2 nodes configured
0 resources configured

Node host-002: UNCLEAN (offline)
Node host-003: UNCLEAN (offline)

No resources


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@host-003 ~]# cat /var/log/messages
...
Aug  1 08:12:33 host-003 cib[32262]:  notice: Connecting to cluster infrastructure: corosync
Aug  1 08:12:33 host-003 cib[32262]:  notice: Node host-003 state is now member
>> Aug  1 08:12:33 host-003 cib[32268]:   error: XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
Aug  1 08:12:33 host-003 cib[32268]:   error: Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
Aug  1 08:12:33 host-003 cib[32268]:   error: Couldn't parse /var/lib/pacemaker/cib/cib.xml
Aug  1 08:12:33 host-003 cib[32268]: warning: Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
Aug  1 08:12:33 host-003 cib[32268]:   error: /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
Aug  1 08:12:33 host-003 cib[32262]:   error: Disk write process exited (pid=32268, rc=208)
Aug  1 08:12:33 host-003 cib[32262]:   error: Disabling disk writes after write failure
Aug  1 08:12:34 host-003 crmd[32267]:  notice: Connecting to cluster infrastructure: corosync
Aug  1 08:12:34 host-003 crmd[32267]: warning: Quorum lost
Aug  1 08:12:34 host-003 crmd[32267]:  notice: Node host-003 state is now member
Aug  1 08:12:34 host-003 crmd[32267]: warning: Support for 'notification-agent' and 'notification-target' cluster options is deprecated and will be removed in a future release (use alerts feature instead)
Aug  1 08:12:34 host-003 crmd[32267]:  notice: The local CRM is operational
Aug  1 08:12:34 host-003 crmd[32267]:  notice: State transition S_STARTING -> S_PENDING
Aug  1 08:12:55 host-003 crmd[32267]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Aug  1 08:12:55 host-003 crmd[32267]:  notice: State transition S_ELECTION -> S_INTEGRATION
Aug  1 08:12:55 host-003 crmd[32267]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Aug  1 08:12:55 host-003 pengine[32266]: warning: Fencing and resource management disabled due to lack of quorum
Aug  1 08:12:55 host-003 pengine[32266]:   error: Resource start-up disabled since no STONITH resources have been defined
Aug  1 08:12:55 host-003 pengine[32266]:   error: Either configure some or disable STONITH with the stonith-enabled option
Aug  1 08:12:55 host-003 pengine[32266]:   error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Aug  1 08:12:55 host-003 pengine[32266]:  notice: Delaying fencing operations until there are resources to manage
Aug  1 08:12:55 host-003 pengine[32266]: warning: Node host-002 is unclean!
Aug  1 08:12:55 host-003 pengine[32266]:  notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Aug  1 08:12:55 host-003 pengine[32266]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
Aug  1 08:12:55 host-003 pengine[32266]:  notice: Configuration ERRORs found during PE processing.  Please run "crm_verify -L" to identify issues.
Aug  1 08:12:55 host-003 crmd[32267]:  notice: Transition 0 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Complete
>> Aug  1 08:12:55 host-003 crmd[32267]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE

After the fix (pacemaker-1.1.19-3.el7.x86_64)
---------------------------------------------
[root@host-002 ~]# date && pcs cluster start
Wed Aug  1 07:50:44 CDT 2018
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...

[root@host-002 ~]# pcs cluster status
Error: cluster is not currently running on this node

[root@host-002 ~]# cat /var/log/messages
...
Aug  1 07:50:55 host-002 cib[1115]:  notice: Additional logging available in /var/log/cluster/corosync.log
>> Aug  1 07:50:55 host-002 cib[1115]:  notice: /var/lib/pacemaker/cib/cib.xml is not owned by user hacluster
>> Aug  1 07:50:55 host-002 cib[1115]:  notice: /var/lib/pacemaker/cib/cib.xml is not owned by group haclient
Aug  1 07:50:55 host-002 cib[1115]:   error: /var/lib/pacemaker/cib/cib.xml must be owned and writable by either user hacluster or group haclient
Aug  1 07:50:55 host-002 cib[1115]:   error: startCib: Triggered fatal assert at main.c:559 : cib != NULL
Aug  1 07:50:55 host-002 abrt-hook-ccpp: Process 1115 (cib) of user 189 killed by SIGABRT - ignoring (repeated crash)
Aug  1 07:50:55 host-002 abrt-hook-ccpp: Saved core dump of pid 1115 to core.1115 at /var/lib/pacemaker/cores (1527808 bytes)
Aug  1 07:50:55 host-002 pacemakerd[857]:   error: Managed process 1115 (cib) dumped core
Aug  1 07:50:55 host-002 pacemakerd[857]:   error: The cib process (1115) terminated with signal 6 (core=1)
Aug  1 07:50:55 host-002 pacemakerd[857]:   error: Child respawn count exceeded by cib
Aug  1 07:50:55 host-002 crmd[863]: warning: Couldn't complete CIB registration 4 times... pause and retry
Aug  1 07:50:55 host-002 stonith-ng[859]:   error: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Aug  1 07:50:55 host-002 stonith-ng[859]:  notice: Node host-003 state is now member
Aug  1 07:50:58 host-002 crmd[863]: warning: Couldn't complete CIB registration 5 times... pause and retry
Aug  1 07:51:01 host-002 crmd[863]: warning: Couldn't complete CIB registration 6 times... pause and retry
Aug  1 07:51:04 host-002 crmd[863]: warning: Couldn't complete CIB registration 7 times... pause and retry
Aug  1 07:51:07 host-002 crmd[863]: warning: Couldn't complete CIB registration 8 times... pause and retry
Aug  1 07:51:10 host-002 crmd[863]: warning: Couldn't complete CIB registration 9 times... pause and retry
Aug  1 07:51:13 host-002 crmd[863]: warning: Couldn't complete CIB registration 10 times... pause and retry
Aug  1 07:51:16 host-002 crmd[863]: warning: Couldn't complete CIB registration 11 times... pause and retry
Aug  1 07:51:19 host-002 crmd[863]: warning: Couldn't complete CIB registration 12 times... pause and retry
Aug  1 07:51:22 host-002 crmd[863]: warning: Couldn't complete CIB registration 13 times... pause and retry
Aug  1 07:51:25 host-002 crmd[863]: warning: Couldn't complete CIB registration 14 times... pause and retry
Aug  1 07:51:28 host-002 crmd[863]: warning: Couldn't complete CIB registration 15 times... pause and retry
Aug  1 07:51:30 host-002 attrd[861]:   error: Signon to CIB failed: Transport endpoint is not connected (-107)
Aug  1 07:51:30 host-002 pacemakerd[857]: warning: The attrd process (861) can no longer be respawned, shutting the cluster down.
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Shutting down Pacemaker
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Stopping crmd
Aug  1 07:51:30 host-002 crmd[863]: warning: Couldn't complete CIB registration 16 times... pause and retry
Aug  1 07:51:30 host-002 crmd[863]:  notice: Caught 'Terminated' signal
Aug  1 07:51:30 host-002 crmd[863]:  notice: Shutting down cluster resource manager
Aug  1 07:51:30 host-002 crmd[863]: warning: Input I_SHUTDOWN received in state S_STARTING from crm_shutdown
Aug  1 07:51:30 host-002 crmd[863]:  notice: State transition S_STARTING -> S_STOPPING
Aug  1 07:51:30 host-002 crmd[863]:  notice: Disconnected from Corosync
Aug  1 07:51:30 host-002 crmd[863]:  notice: Disconnected from the CIB
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Stopping pengine
Aug  1 07:51:30 host-002 pengine[862]:  notice: Caught 'Terminated' signal
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Stopping lrmd
Aug  1 07:51:30 host-002 lrmd[860]:  notice: Caught 'Terminated' signal
Aug  1 07:51:30 host-002 stonith-ng[859]:  notice: Caught 'Terminated' signal
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Stopping stonith-ng
>> Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Shutdown complete
Aug  1 07:51:30 host-002 pacemakerd[857]:  notice: Attempting to inhibit respawning after fatal error

Comment 8 errata-xmlrpc 2018-10-30 07:57:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055