Bug 1564536
| Summary: | Cluster should not start with empty configuration if /var/lib/pacemaker/* subtree has wrong permissions | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ondrej Benes <obenes> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | abeekhof, cluster-maint, mnovacek, sbradley |
| Target Milestone: | rc | | |
| Target Release: | 7.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.18-13.el7 | Doc Type: | No Doc Update |
| Doc Text: | Most users will never encounter this issue | | |
| Story Points: | --- | Clone Of: | |
| Environment: | | Last Closed: | 2018-10-30 07:57:56 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
Description (Ondrej Benes, 2018-04-06 14:54:46 UTC)
---------------------------------------------------
The cib daemon does not run as root, so it cannot try to correct the permissions, but it can exit and stay down rather than continue with an empty CIB. Note that pacemaker currently never tells corosync to shut down, so the node can continue to survive in the corosync membership and show up as "pending" in the pacemaker membership (resources will not be scheduled on it). This might be changed in the future so that pacemaker tells corosync to shut down as well.

QA: You can test various permission scenarios. The cib daemon should exit with an appropriate log message if /var/lib/pacemaker/cib does not exist or is not writable by at least one of the hacluster user and the haclient group, and should likewise exit if cib.xml exists but is not writable by at least one of hacluster/haclient.

The fix is upstream as of commit 96c8d58f.

FYI, the issue of corosync not shutting down when pacemaker shuts down abnormally is addressed by Bug 1448221.

I have verified that pacemaker will not start when /var/lib/pacemaker is not readable by user hacluster or group haclient with pacemaker-1.1.19-3.el7.x86_64.

----

[root@host-002 pacemaker]# chown -R root:root /var/lib/pacemaker
[root@host-002 pacemaker]# pcs status
Error: cluster is not currently running on this node

Before the fix (1.1.18-12.el7.x86_64)
-------------------------------------

[root@host-003 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
[root@host-003 ~]# pcs status
Cluster name: STSRHTS5529

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: unknown
Current DC: NONE
Last updated: Wed Aug 1 08:12:41 2018
Last change: Wed Aug 1 08:12:34 2018 by hacluster via crmd on host-003

2 nodes configured
0 resources configured

Node host-002: UNCLEAN (offline)
Node host-003: UNCLEAN (offline)

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@host-003 ~]# cat /var/log/messages
...
Aug 1 08:12:33 host-003 cib[32262]: notice: Connecting to cluster infrastructure: corosync
Aug 1 08:12:33 host-003 cib[32262]: notice: Node host-003 state is now member
>> Aug 1 08:12:33 host-003 cib[32268]: error: XML Error: Permission deniedPermission deniedI/O warning : failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
Aug 1 08:12:33 host-003 cib[32268]: error: Parsing failed (domain=8, level=1, code=1549): failed to load external entity "/var/lib/pacemaker/cib/cib.xml"
Aug 1 08:12:33 host-003 cib[32268]: error: Couldn't parse /var/lib/pacemaker/cib/cib.xml
Aug 1 08:12:33 host-003 cib[32268]: warning: Cluster configuration file /var/lib/pacemaker/cib/cib.xml is corrupt (unparseable as XML)
Aug 1 08:12:33 host-003 cib[32268]: error: /var/lib/pacemaker/cib/cib.xml was manually modified while the cluster was active!
Aug 1 08:12:33 host-003 cib[32262]: error: Disk write process exited (pid=32268, rc=208)
Aug 1 08:12:33 host-003 cib[32262]: error: Disabling disk writes after write failure
Aug 1 08:12:34 host-003 crmd[32267]: notice: Connecting to cluster infrastructure: corosync
Aug 1 08:12:34 host-003 crmd[32267]: warning: Quorum lost
Aug 1 08:12:34 host-003 crmd[32267]: notice: Node host-003 state is now member
Aug 1 08:12:34 host-003 crmd[32267]: warning: Support for 'notification-agent' and 'notification-target' cluster options is deprecated and will be removed in a future release (use alerts feature instead)
Aug 1 08:12:34 host-003 crmd[32267]: notice: The local CRM is operational
Aug 1 08:12:34 host-003 crmd[32267]: notice: State transition S_STARTING -> S_PENDING
Aug 1 08:12:55 host-003 crmd[32267]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Aug 1 08:12:55 host-003 crmd[32267]: notice: State transition S_ELECTION -> S_INTEGRATION
Aug 1 08:12:55 host-003 crmd[32267]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Aug 1 08:12:55 host-003 pengine[32266]: warning: Fencing and resource management disabled due to lack of quorum
Aug 1 08:12:55 host-003 pengine[32266]: error: Resource start-up disabled since no STONITH resources have been defined
Aug 1 08:12:55 host-003 pengine[32266]: error: Either configure some or disable STONITH with the stonith-enabled option
Aug 1 08:12:55 host-003 pengine[32266]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Aug 1 08:12:55 host-003 pengine[32266]: notice: Delaying fencing operations until there are resources to manage
Aug 1 08:12:55 host-003 pengine[32266]: warning: Node host-002 is unclean!
Aug 1 08:12:55 host-003 pengine[32266]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
Aug 1 08:12:55 host-003 pengine[32266]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
Aug 1 08:12:55 host-003 pengine[32266]: notice: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
Aug 1 08:12:55 host-003 crmd[32267]: notice: Transition 0 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Complete
>> Aug 1 08:12:55 host-003 crmd[32267]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

After the fix (pacemaker-1.1.19-3.el7.x86_64)
---------------------------------------------

[root@host-002 ~]# date && pcs cluster start
Wed Aug 1 07:50:44 CDT 2018
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
[root@host-002 ~]# pcs cluster status
Error: cluster is not currently running on this node
[root@host-002 ~]# cat /var/log/messages
...
Aug 1 07:50:55 host-002 cib[1115]: notice: Additional logging available in /var/log/cluster/corosync.log
>> Aug 1 07:50:55 host-002 cib[1115]: notice: /var/lib/pacemaker/cib/cib.xml is not owned by user hacluster
>> Aug 1 07:50:55 host-002 cib[1115]: notice: /var/lib/pacemaker/cib/cib.xml is not owned by group haclient
Aug 1 07:50:55 host-002 cib[1115]: error: /var/lib/pacemaker/cib/cib.xml must be owned and writable by either user hacluster or group haclient
Aug 1 07:50:55 host-002 cib[1115]: error: startCib: Triggered fatal assert at main.c:559 : cib != NULL
Aug 1 07:50:55 host-002 abrt-hook-ccpp: Process 1115 (cib) of user 189 killed by SIGABRT - ignoring (repeated crash)
Aug 1 07:50:55 host-002 abrt-hook-ccpp: Saved core dump of pid 1115 to core.1115 at /var/lib/pacemaker/cores (1527808 bytes)
Aug 1 07:50:55 host-002 pacemakerd[857]: error: Managed process 1115 (cib) dumped core
Aug 1 07:50:55 host-002 pacemakerd[857]: error: The cib process (1115) terminated with signal 6 (core=1)
Aug 1 07:50:55 host-002 pacemakerd[857]: error: Child respawn count exceeded by cib
Aug 1 07:50:55 host-002 crmd[863]: warning: Couldn't complete CIB registration 4 times... pause and retry
Aug 1 07:50:55 host-002 stonith-ng[859]: error: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Aug 1 07:50:55 host-002 stonith-ng[859]: notice: Node host-003 state is now member
Aug 1 07:50:58 host-002 crmd[863]: warning: Couldn't complete CIB registration 5 times... pause and retry
Aug 1 07:51:01 host-002 crmd[863]: warning: Couldn't complete CIB registration 6 times... pause and retry
Aug 1 07:51:04 host-002 crmd[863]: warning: Couldn't complete CIB registration 7 times... pause and retry
Aug 1 07:51:07 host-002 crmd[863]: warning: Couldn't complete CIB registration 8 times... pause and retry
Aug 1 07:51:10 host-002 crmd[863]: warning: Couldn't complete CIB registration 9 times... pause and retry
Aug 1 07:51:13 host-002 crmd[863]: warning: Couldn't complete CIB registration 10 times... pause and retry
Aug 1 07:51:16 host-002 crmd[863]: warning: Couldn't complete CIB registration 11 times... pause and retry
Aug 1 07:51:19 host-002 crmd[863]: warning: Couldn't complete CIB registration 12 times... pause and retry
Aug 1 07:51:22 host-002 crmd[863]: warning: Couldn't complete CIB registration 13 times... pause and retry
Aug 1 07:51:25 host-002 crmd[863]: warning: Couldn't complete CIB registration 14 times... pause and retry
Aug 1 07:51:28 host-002 crmd[863]: warning: Couldn't complete CIB registration 15 times... pause and retry
Aug 1 07:51:30 host-002 attrd[861]: error: Signon to CIB failed: Transport endpoint is not connected (-107)
Aug 1 07:51:30 host-002 pacemakerd[857]: warning: The attrd process (861) can no longer be respawned, shutting the cluster down.
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Shutting down Pacemaker
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Stopping crmd
Aug 1 07:51:30 host-002 crmd[863]: warning: Couldn't complete CIB registration 16 times... pause and retry
Aug 1 07:51:30 host-002 crmd[863]: notice: Caught 'Terminated' signal
Aug 1 07:51:30 host-002 crmd[863]: notice: Shutting down cluster resource manager
Aug 1 07:51:30 host-002 crmd[863]: warning: Input I_SHUTDOWN received in state S_STARTING from crm_shutdown
Aug 1 07:51:30 host-002 crmd[863]: notice: State transition S_STARTING -> S_STOPPING
Aug 1 07:51:30 host-002 crmd[863]: notice: Disconnected from Corosync
Aug 1 07:51:30 host-002 crmd[863]: notice: Disconnected from the CIB
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Stopping pengine
Aug 1 07:51:30 host-002 pengine[862]: notice: Caught 'Terminated' signal
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Stopping lrmd
Aug 1 07:51:30 host-002 lrmd[860]: notice: Caught 'Terminated' signal
Aug 1 07:51:30 host-002 stonith-ng[859]: notice: Caught 'Terminated' signal
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Stopping stonith-ng
>> Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Shutdown complete
Aug 1 07:51:30 host-002 pacemakerd[857]: notice: Attempting to inhibit respawning after fatal error

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055
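The QA scenarios above all reduce to one condition: the on-disk CIB must be owned and writable by either the hacluster user or the haclient group. A minimal shell sketch of that kind of ownership/write-bit check follows; the function name `writable_by` is invented for illustration, and this is only an approximation of the real check, which lives in pacemaker's C code (upstream commit 96c8d58f). It assumes GNU coreutils `stat`.

```shell
#!/bin/sh
# Sketch only: approximate the ownership/write-bit test the fixed cib
# daemon performs before trusting an on-disk configuration file.

# writable_by PATH USER GROUP
# Succeeds (exit 0) if PATH is owned by USER with the owner-write bit set,
# or group-owned by GROUP with the group-write bit set; fails otherwise.
writable_by() {
    path=$1 user=$2 group=$3
    owner=$(stat -c %U "$path") || return 1   # owning user name
    grp=$(stat -c %G "$path")                 # owning group name
    mode=$(stat -c %A "$path")                # e.g. -rw-r--r--
    ow=$(printf '%s' "$mode" | cut -c3)       # owner write bit ('w' or '-')
    gw=$(printf '%s' "$mode" | cut -c6)       # group write bit ('w' or '-')
    [ "$owner" = "$user" ] && [ "$ow" = w ] && return 0
    [ "$grp" = "$group" ] && [ "$gw" = w ] && return 0
    return 1
}

# Example: the scenario from this bug. After "chown -R root:root
# /var/lib/pacemaker", cib.xml fails this check, and the fixed daemon
# refuses to start instead of continuing with an empty CIB.
# writable_by /var/lib/pacemaker/cib/cib.xml hacluster haclient \
#     || echo "cib would exit: wrong ownership/permissions"
```

Checking the write bits as well as ownership matters: a file owned by hacluster but mode 0444 is just as unusable to the daemon as one owned by root.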