Description of problem:
Clone of upstream bz.
The deployment fails because (in this case) overcloud-controller-2 cannot join the cluster.
The messages you see in the controller logs are:
Error: /sbin/pcs cluster start --all returned 1 instead of one of 0
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0
Checking the status of the cluster shows that one node did not join:
[root@overcloud-controller-0 deployed]# pcs status
Cluster name: tripleo_cluster
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Wed Mar 29 08:08:12 2017 Last change: Wed Mar 29 07:25:23 2017 by hacluster via crmd on overcloud-controller-1
3 nodes and 0 resources configured
Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]
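To spot this condition quickly when re-running the job, the missing nodes can be pulled out of the `pcs status` output. This is only a minimal sketch: the helper name `offline_nodes` is hypothetical, and it assumes the `OFFLINE: [ ... ]` line format shown above.

```shell
# Print the names of nodes listed as OFFLINE in `pcs status` output (read from stdin),
# one name per line. Assumes the "OFFLINE: [ node-a node-b ]" layout shown in this report.
offline_nodes() {
    awk '/^[[:space:]]*OFFLINE:/ {
        for (i = 1; i <= NF; i++)
            if ($i != "OFFLINE:" && $i != "[" && $i != "]") print $i
    }'
}

# Usage on a controller:
#   pcs status | offline_nodes
```

An empty result means no node is reported as offline.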
On the missing node, corosync is fine (see https://thirdparty-logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-56/overcloud-controller-2/var/log/cluster/corosync.log.gz) and the quorum status is good:
[root@overcloud-controller-2 deployed]# corosync-quorumtool
Date: Wed Mar 29 08:23:37 2017
Quorum provider: corosync_votequorum
Node ID: 3
Ring ID: 2/12
Expected votes: 3
Highest expected: 3
Total votes: 3
Nodeid Votes Name
2 1 overcloud-controller-1
1 1 overcloud-controller-0
3 1 overcloud-controller-2 (local)
The pacemaker process, however, is not running on that host.
This looks like a race: a further *identical* test went fine, so it appears to be a timing problem. Maybe one machine gets deployed too early or too late and the cluster sync fails.
In fact, this is the log from controller-0:
Mar 29 07:25:56 - controller-0 -> Error connecting to overcloud-controller-2 - (HTTP error: 400)
Mar 29 08:27:33 - controller-2 -> cluster is not currently running on this node
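If this really is a timing race, one way to test the theory by hand is to retry the cluster start a few times instead of failing on the first error. The sketch below is only an illustration, not the change that was reviewed for puppet-pacemaker; the `retry` helper, the attempt count and the delay are all hypothetical.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or ATTEMPTS tries have been made,
# sleeping DELAY_SECONDS between tries. Returns 0 on success, 1 otherwise.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example on a controller (values are arbitrary):
#   retry 5 10 /sbin/pcs cluster start --all
```

If the start succeeds on a later attempt, that supports the idea that the remote node was simply not ready when the first request arrived.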
Just to verify, could you run iptables -nL (and ip6tables -nL if it was an IPv6 deployment)? I just want to make sure this isn't a different bug.
The review has merged upstream in master, and puppet-pacemaker has no stable branches (?)
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.