Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 906529

Summary: Cluster configuration fails using cluster manager pcs
Product: Red Hat Enterprise Linux 7
Reporter: Justin Payne <jpayne>
Component: pcs
Assignee: Chris Feist <cfeist>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.1
CC: bugproxy, cfeist, cluster-maint, jkachuck, jkortus, wgomerin
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 894232
Environment:
Last Closed: 2013-02-20 17:58:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 894232
Bug Blocks: 744225

Description Justin Payne 2013-01-31 19:42:48 UTC
+++ This bug was initially created as a clone of Bug #894232 +++

Problem Description
------------------------------
While configuring a two-node cluster on P7 LPARs using the cluster manager pcs, I noticed that both nodes fail to come online, and querying the pcs properties produces the error shown below.

[root@c57f1ju0203 ~]# pcs status
Last updated: Thu Jan 10 11:38:42 2013
Last change: Thu Jan 10 11:03:51 2013 via crmd on c57f1ju0203
Current DC: NONE
2 Nodes configured, unknown expected votes
0 Resources configured.


Node c57f1ju0203 (1): UNCLEAN (offline)
Node c57f1ju0204 (2): UNCLEAN (offline)
Full list of resources:


[root@c57f1ju0203 ~]# pcs property
ERROR: Unable to get crm_config
Call cib_query failed (-62): Timer expired
<null>

I also get the following error when trying to set a property using pcs.
[root@c57f1ju0203 ~]# pcs property set no-quorum-policy=ignore
Unable to get crm_config, is pacemaker running?


[root@c57f1ju0203 ~]# ps axf | egrep "corosync|pacemaker" | grep -v egrep
15825 ?        Ssl    0:15 corosync
15832 ?        Ssl    0:00 /usr/sbin/pacemakerd -f
15833 ?        Ssl    0:00  \_ /usr/libexec/pacemaker/cib
15835 ?        Ss     0:00  \_ /usr/libexec/pacemaker/stonithd
15836 ?        Ss     0:00  \_ /usr/libexec/pacemaker/lrmd
15837 ?        Ss     0:00  \_ /usr/libexec/pacemaker/attrd
15838 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pengine

Note that /usr/libexec/pacemaker/crmd is not running, although the document "Clusters from Scratch" (pcs edition) indicates it should be.
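As a quick cross-check, the set of expected Pacemaker child daemons can be compared against the `ps` output. A minimal sketch (the daemon list matches the processes shown above; the helper name is illustrative, and on this node `crmd` is the one reported missing):

```shell
# Report which expected Pacemaker child daemons are absent from a process list.
# $1: newline-separated process names, e.g. the output of `ps axo comm=`.
missing_pcmk_daemons() {
    expected="cib stonithd lrmd attrd pengine crmd"
    missing=""
    for d in $expected; do
        printf '%s\n' "$1" | grep -qx "$d" || missing="$missing $d"
    done
    printf 'missing:%s\n' "$missing"
}

# On a live node one would run:
#   missing_pcmk_daemons "$(ps axo comm=)"
```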

[root@c57f1ju0203 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR c57f1ju0203
         2          1         NR c57f1ju0204


--- corosync.conf---
[root@c57f1ju0203 ~]# cat /etc/corosync/corosync.conf
totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}

nodelist {
  node {
        ring0_addr: c57f1ju0203
        nodeid: 1
       }
  node {
        ring0_addr: c57f1ju0204
        nodeid: 2
       }
}

quorum {
provider: corosync_votequorum
}

logging {
to_syslog: yes
}
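Since each cluster member appears as a `ring0_addr:` line in the nodelist, a trivial sanity check is to count those lines; a two-node cluster like this one should report exactly 2. A hedged sketch (the function name is illustrative):

```shell
# Count cluster members declared in a corosync.conf-style file.
# $1: path to the config file.
corosync_node_count() {
    grep -c 'ring0_addr:' "$1"
}

# On a live node:
#   corosync_node_count /etc/corosync/corosync.conf   # expect 2 for this cluster
```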



[root@c57f1ju0203 ~]# systemctl status pacemaker.service
pacemaker.service - Pacemaker High Availability Cluster Manager
	  Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled)
	  Active: active (running) since Thu, 2013-01-10 11:03:50 EST; 38min ago
	Main PID: 15832 (pacemakerd)
	  CGroup: name=systemd:/system/pacemaker.service
		  ├─15832 /usr/sbin/pacemakerd -f
		  ├─15833 /usr/libexec/pacemaker/cib
		  ├─15835 /usr/libexec/pacemaker/stonithd
		  ├─15836 /usr/libexec/pacemaker/lrmd
		  ├─15837 /usr/libexec/pacemaker/attrd
		  └─15838 /usr/libexec/pacemaker/pengine

Jan 10 11:03:53 c57f1ju0203 cib[15833]: warning: qb_ipcs_event_sendv: new_event_notification (15833-15941-10): Broken pipe (32)
Jan 10 11:03:53 c57f1ju0203 cib[15833]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Jan 10 11:03:53 c57f1ju0203 pacemakerd[15832]: error: pcmk_child_exit: Child process crmd exited (pid=15941, rc=2)
Jan 10 11:03:53 c57f1ju0203 crmd[15942]: error: check_dead_member: We're not part of the cluster anymore
Jan 10 11:03:53 c57f1ju0203 crmd[15942]: error: do_log: FSA: Input I_ERROR from check_dead_member() received in state S_STARTING
Jan 10 11:03:53 c57f1ju0203 crmd[15942]: warning: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=check_dead_member ]
Jan 10 11:03:53 c57f1ju0203 cib[15833]: warning: qb_ipcs_event_sendv: new_event_notification (15833-15942-10): Broken pipe (32)
Jan 10 11:03:53 c57f1ju0203 cib[15833]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Jan 10 11:03:53 c57f1ju0203 pacemakerd[15832]: error: pcmk_child_exit: Child process crmd exited (pid=15942, rc=2)
Jan 10 11:03:53 c57f1ju0203 pacemakerd[15832]: error: pcmk_child_exit: Child respawn count exceeded by crmd


[root@c57f1ju0203 ~]# systemctl status corosync.service
corosync.service - Corosync Cluster Engine
	  Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled)
	  Active: active (running) since Thu, 2013-01-10 11:03:50 EST; 39min ago
	 Process: 15818 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
	Main PID: 15825 (corosync)
	  CGroup: name=systemd:/system/corosync.service
		  └─15825 corosync

Jan 10 11:03:49 c57f1ju0203 corosync[15825]: [QUORUM] Members[1]: 1
Jan 10 11:03:49 c57f1ju0203 corosync[15825]: [TOTEM ] A processor joined or left the membership and a new membership (9.114.32.144:4) was formed.
Jan 10 11:03:49 c57f1ju0203 corosync[15825]: [MAIN  ] Completed service synchronization, ready to provide service.
Jan 10 11:03:50 c57f1ju0203 corosync[15818]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Jan 10 11:03:50 c57f1ju0203 systemd[1]: Started Corosync Cluster Engine.
Jan 10 11:03:50 c57f1ju0203 corosync[15825]: [YKD   ] Members[2]: 1 2
Jan 10 11:03:50 c57f1ju0203 corosync[15825]: [TOTEM ] A processor joined or left the membership and a new membership (9.114.32.144:8) was formed.
Jan 10 11:03:50 c57f1ju0203 corosync[15825]: [QUORUM] This node is within the primary component and will provide service.
Jan 10 11:03:50 c57f1ju0203 corosync[15825]: [YKD   ] Members[2]: 1 2
Jan 10 11:03:50 c57f1ju0203 corosync[15825]: [YKD   ] Completed service synchronization, ready to provide service.


[root@c57f1ju0203 ~]# uname -a
Linux c57f1ju0203 3.6.10-4.fc18.ppc64p7 #1 SMP Wed Dec 12 16:08:02 MST 2012 ppc64 ppc64 ppc64 GNU/Linux

--- Attached log files ---
var-log-messages.txt
dmesg.txt
strace-pcs-property.txt

Here are the package versions installed on the system:
----------------------------------------------------------------------
[root@c57f1ju0203 ~]# rpm -qa | grep -i pcs
pcsc-lite-doc-1.8.7-1.fc18.noarch
pcsc-lite-libs-1.8.7-1.fc18.ppc64
pcs-0.9.27-3.fc18.ppc64
pcsc-lite-1.8.7-1.fc18.ppc64
pcsc-lite-openct-0.6.20-5.fc18.ppc64
pcsc-tools-1.4.17-4.fc18.ppc64
pcsc-lite-devel-1.8.7-1.fc18.ppc64
pcsc-perl-1.4.12-5.fc18.ppc64
pcsc-lite-ccid-1.4.8-1.fc18.ppc64

[root@c57f1ju0203 ~]# rpm -qa | grep -i corosync
corosync-2.1.0-1.fc18.ppc64
corosynclib-devel-2.1.0-1.fc18.ppc64
corosynclib-2.1.0-1.fc18.ppc64

[root@c57f1ju0203 ~]# rpm -qa | grep -i pacemaker
pacemaker-1.1.8-3.fc18.ppc64
pacemaker-cluster-libs-1.1.8-3.fc18.ppc64
pacemaker-libs-devel-1.1.8-3.fc18.ppc64
pacemaker-cts-1.1.8-3.fc18.ppc64
pacemaker-cli-1.1.8-3.fc18.ppc64
pacemaker-libs-1.1.8-3.fc18.ppc64
pacemaker-doc-1.1.8-3.fc18.ppc64

[root@c57f1ju0203 ~]# pacemakerd --features
Pacemaker 1.1.8-3.fc18 (Build: 394e906)
 Supporting:  generated-manpages agent-manpages ncurses libqb-logging libqb-ipc upstart systemd  corosync-native

--- Additional comment from IBM Bug Proxy on 2013-01-11 01:11:00 EST ---

Created attachment 676666 [details]
dmesg.txt

--- Additional comment from IBM Bug Proxy on 2013-01-11 01:11:08 EST ---

Created attachment 676667 [details]
var-log-messages.txt

--- Additional comment from IBM Bug Proxy on 2013-01-11 01:11:18 EST ---

Created attachment 676668 [details]
strace-pcs-property

--- Additional comment from Chris Feist on 2013-01-11 15:09:35 EST ---

Can you check and see if there are any errors in /var/log/audit/audit.log?  It appears as though pacemaker is not connecting properly (which is why you're getting the pcs errors).

If there are errors, can you try temporarily disabling selinux and the firewall to verify that they're not causing issues?

To disable selinux, edit /etc/sysconfig/selinux and change the 'SELINUX=' line to 'SELINUX=disabled' and restart.

To disable the firewall run:
systemctl stop iptables.service
systemctl disable iptables.service
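For debugging, both states can be checked in one pass before restarting the cluster. A minimal sketch (the helper name is illustrative; `getenforce` and `systemctl` are the standard tools, and only the "Enforcing" state can actually deny IPC):

```shell
# Decide whether SELinux could be denying actions: only "Enforcing" blocks;
# "Permissive" and "Disabled" merely log or do nothing.
# $1: output of `getenforce` (Enforcing | Permissive | Disabled).
selinux_may_block() {
    [ "$1" = "Enforcing" ]
}

# On a live node:
#   selinux_may_block "$(getenforce)" && echo "SELinux is enforcing - check audit.log"
#   systemctl is-active iptables.service firewalld.service
```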

--- Additional comment from IBM Bug Proxy on 2013-01-13 23:10:58 EST ---

------- Comment From maknayak.com 2013-01-14 04:06 EDT-------
(In reply to comment #10)

Hello Chris,

> Can you check and see if there are any errors in /var/log/audit/audit.log?

In /var/log/audit/audit.log there are no errors, but many failures are reported, as shown below:

[root@c57f1ju0203 ~]# cat /var/log/audit/audit.log | grep -i fail
type=SERVICE_STOP msg=audit(1357827937.860:17): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg=' comm="rngd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
type=USER_LOGIN msg=audit(1357828127.823:317): pid=1003 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login acct=28756E6B6E6F776E207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357828169.503:322): pid=1005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd" hostname=9.79.215.218 addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357828169.503:323): pid=1005 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=password acct="root" exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1357832401.047:405): pid=15557 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login acct=28756E6B6E6F776E207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.114.32.145 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1357836968.293:518): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login acct=28756E6B6E6F776E207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357836974.523:519): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=PAM:authentication acct="?" exe="/usr/sbin/sshd" hostname=9.79.215.218 addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357836974.523:520): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=password acct=28696E76616C6964207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357836980.233:521): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=PAM:authentication acct="?" exe="/usr/sbin/sshd" hostname=9.79.215.218 addr=9.79.215.218 terminal=ssh res=failed'
type=USER_AUTH msg=audit(1357836980.233:522): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=password acct=28696E76616C6964207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1357836980.233:526): pid=16288 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 msg='op=login acct=28696E76616C6964207573657229 exe="/usr/sbin/sshd" hostname=? addr=9.79.215.218 terminal=ssh res=failed'

Please see the attached audit.log file for full details.
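Note that auditd hex-encodes the `acct=` field whenever the account name contains spaces or other special characters, so the values above can be decoded. A small sketch using bash's `printf %b` (which expands `\xHH` escapes); decoding shows these are failed SSH logins for unknown or invalid users, apparently unrelated to the cluster problem:

```shell
# Decode an auditd hex-encoded field value (e.g. acct=28756E6B...) to text.
# Requires bash: printf %b must expand \xHH escape sequences.
hexdecode() {
    printf '%b' "$(printf '%s' "$1" | sed 's/../\\x&/g')"
}

hexdecode 28756E6B6E6F776E207573657229   # -> (unknown user)
hexdecode 28696E76616C6964207573657229   # -> (invalid user)
```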

> It appears as though pacemaker is not connecting properly (which is why
> you're getting the pcs errors).
>
> If there are errors, can you try temporarily disabling selinux and the
> firewall to verify that they're not causing issues?
>
> To disable selinux, edit /etc/sysconfig/selinux and change the 'SELINUX='
> line to 'SELINUX=disabled' and restart.
>
> To disable the firewall run:
> systemctl stop iptables.service
> systemctl disable iptables.service

SELinux and the firewall are already disabled.

[root@c57f1ju0203 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

[root@c57f1ju0203 ~]# systemctl status firewalld.service
firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled)
Active: inactive (dead)
CGroup: name=systemd:/system/firewalld.service

Jan 10 09:25:39 c57f1ju0203.ppd.pok.ibm.com systemd[1]: Started firewalld - dynamic firewall daemon.
Jan 10 09:55:30 c57f1ju0203.ppd.pok.ibm.com systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jan 10 09:55:36 c57f1ju0203.ppd.pok.ibm.com systemd[1]: Stopped firewalld - dynamic firewall daemon.

[root@c57f1ju0203 ~]# getenforce
Permissive

Thanks...
Manas

Comment 1 Justin Payne 2013-01-31 19:44:20 UTC
[root@dash-01 ~]# /etc/init.d/pacemaker status
pacemaker.service - Pacemaker High Availability Cluster Manager
          Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled)
          Active: active (running) since Thu 2013-01-31 13:33:37 CST; 4s ago
        Main PID: 7380 (pacemakerd)
          CGroup: name=systemd:/system/pacemaker.service
                  ├─7380 /usr/sbin/pacemakerd -f
                  └─7383 /usr/libexec/pacemaker/stonithd

Jan 31 13:33:37 dash-01.lab.msp.redhat.com attrd[7385]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pengine[7386]: error: crm_is_writable: /var/lib/pacemaker/pengine must exist and be a directory
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pengine[7386]: error: main: Bad permissions on /var/lib/pacemaker/pengine. Terminating
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pacemakerd[7380]: error: pcmk_child_exit: Child process pengine exited (pid=7386, rc=100)
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pacemakerd[7380]: warning: pcmk_child_exit: Pacemaker child process pengine no longer wishes to be respawned. Shutting ourselves down.
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pacemakerd[7380]: notice: stop_child: Stopping attrd: Sent -15 to process 7385
Jan 31 13:33:37 dash-01.lab.msp.redhat.com attrd[7385]: notice: main: Starting mainloop...
Jan 31 13:33:37 dash-01.lab.msp.redhat.com attrd[7385]: notice: main: Exiting...
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pacemakerd[7380]: notice: stop_child: Stopping lrmd: Sent -15 to process 7384
Jan 31 13:33:37 dash-01.lab.msp.redhat.com pacemakerd[7380]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 7383


[root@dash-01 ~]# ls -ld /var/lib/pacemaker/pengine
drwxr-x---. 2 hacluster hacluster 6 Oct 25 05:46 /var/lib/pacemaker/pengine
[root@dash-01 ~]# getenforce
Disabled
[root@dash-01 ~]# systemctl status firewalld.service
firewalld.service - firewalld - dynamic firewall daemon
          Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled)
          Active: inactive (dead)

[root@dash-01 ~]# systemctl status iptables.service
iptables.service
          Loaded: error (Reason: No such file or directory)
          Active: inactive (dead)


[root@dash-01 ~]# chmod 777 /var/lib/pacemaker/pengine
[root@dash-01 ~]# /etc/init.d/pacemaker restart
Restarting pacemaker (via systemctl):                      [  OK  ]
[root@dash-01 ~]# /etc/init.d/pacemaker status
pacemaker.service - Pacemaker High Availability Cluster Manager
          Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled)
          Active: active (running) since Thu 2013-01-31 13:39:21 CST; 5s ago
        Main PID: 8264 (pacemakerd)
          CGroup: name=systemd:/system/pacemaker.service
                  ├─8264 /usr/sbin/pacemakerd -f
                  └─8268 /usr/libexec/pacemaker/stonithd

Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: notice: stop_child: Stopping pengine: Sent -15 to process 8271
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pengine[8271]: error: crm_is_writable: /var/lib/pacemaker/pengine must exist and be a directory
Jan 31 13:39:22 dash-01.lab.msp.redhat.com attrd[8270]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: error: pcmk_child_exit: Child process pengine exited (pid=8271, rc=100)
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: warning: pcmk_child_exit: Pacemaker child process pengine no longer wishes to be respawned. Shutting ourselves down.
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: notice: stop_child: Stopping attrd: Sent -15 to process 8270
Jan 31 13:39:22 dash-01.lab.msp.redhat.com attrd[8270]: notice: main: Starting mainloop...
Jan 31 13:39:22 dash-01.lab.msp.redhat.com attrd[8270]: notice: main: Exiting...
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: notice: stop_child: Stopping lrmd: Sent -15 to process 8269
Jan 31 13:39:22 dash-01.lab.msp.redhat.com pacemakerd[8264]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 8268
[root@dash-01 ~]# ls -ld /var/lib/pacemaker/pengine
drwxrwxrwx. 2 hacluster hacluster 6 Oct 25 05:46 /var/lib/pacemaker/pengine

Comment 3 Justin Payne 2013-01-31 20:10:38 UTC
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q pcs; done
pcs-0.9.30-1.el7.x86_64
pcs-0.9.30-1.el7.x86_64
pcs-0.9.30-1.el7.x86_64

Comment 4 Justin Payne 2013-02-20 17:58:06 UTC
Closing this as it was not a bug. I had created a second hacluster user, which caused permissions issues during the cluster setup. The correct behavior is as follows:

[root@dash-01 ~]# ls -ld /var/lib/pacemaker/pengine
drwxr-x---. 2 hacluster haclient 6 Oct 25 05:46 /var/lib/pacemaker/pengine

[root@dash-01 ~]# id hacluster
uid=999(hacluster) gid=999(haclient) groups=999(haclient)

Once the UID and GID of the hacluster user were corrected, things worked as they should.
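A quick way to catch this class of problem is to compare a Pacemaker state directory's owner, group, and mode against the expected `hacluster:haclient 750` shown above. A hedged sketch (GNU `stat -c` assumed; the helper name is illustrative):

```shell
# Check that a directory matches an expected "owner:group mode" spec.
# $1: directory path; $2: expected spec, e.g. "hacluster:haclient 750".
dir_matches() {
    [ "$(stat -c '%U:%G %a' "$1")" = "$2" ]
}

# On a cluster node:
#   dir_matches /var/lib/pacemaker/pengine "hacluster:haclient 750" || echo "fix ownership"
```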