Description of problem:
-----------------------
Clone of upstream bz.

The deployment fails because (in this case) overcloud-controller-2 cannot join the cluster. The messages you see in the controller logs are:

Error: /sbin/pcs cluster start --all returned 1 instead of one of 0
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0

Checking the status of the cluster, you see that one node did not join:

[root@overcloud-controller-0 deployed]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Wed Mar 29 08:08:12 2017
Last change: Wed Mar 29 07:25:23 2017 by hacluster via crmd on overcloud-controller-1

3 nodes and 0 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

On the missing node, corosync is fine:

https://thirdparty-logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-56/overcloud-controller-2/var/log/cluster/corosync.log.gz

and its quorum status is good:

[root@overcloud-controller-2 deployed]# corosync-quorumtool
Quorum information
------------------
Date:             Wed Mar 29 08:23:37 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          3
Ring ID:          2/12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         2          1 overcloud-controller-1
         1          1 overcloud-controller-0
         3          1 overcloud-controller-2 (local)

The pacemaker process, however, is not running on that host.

This is a race: a further *identical* test went fine, so it looks like a timing problem. Maybe one machine gets deployed too early or too late and the cluster sync fails. In fact, this is the log from controller-0:

Mar 29 07:25:56 - controller-0 -> Error connecting to overcloud-controller-2 - (HTTP error: 400)
Mar 29 08:27:33 - controller-2 -> cluster is not currently running on this node
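
For anyone trying to catch this state in a CI job, the Online/OFFLINE lines of the pcs status output can be checked mechanically. The following is only a rough sketch, assuming the output keeps the layout pasted in this report; parse_node_states is a hypothetical helper name, not part of pcs or puppet-pacemaker, and SAMPLE is an excerpt of the output from the description.

```python
import re


def parse_node_states(pcs_status_text):
    """Return (online, offline) node-name lists from `pcs status` text.

    Hypothetical helper: assumes lines shaped like
    'Online: [ a b ]' and 'OFFLINE: [ c ]' as in this report.
    """
    online, offline = [], []
    for line in pcs_status_text.splitlines():
        match = re.match(r"\s*(Online|OFFLINE):\s*\[\s*(.*?)\s*\]", line)
        if not match:
            continue
        names = match.group(2).split()
        if match.group(1) == "Online":
            online.extend(names)
        else:
            offline.extend(names)
    return online, offline


# Excerpt of the pcs status output quoted in the description.
SAMPLE = """\
3 nodes and 0 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]
"""

online, offline = parse_node_states(SAMPLE)
print("online:", online)
print("offline:", offline)
```

A non-empty offline list after "pcs cluster start --all" would flag the condition described here, without saying anything about why the node failed to join.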
Just to verify, could you run iptables -nL, and ip6tables -nL if it was an IPv6 deployment? I just want to make sure this isn't a different bug.
The review has merged upstream in master, and puppet-pacemaker has no stable branches (?)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245