Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1437417

Summary: Overcloud deployment fails on controller step1 because one controller cannot connect to the cluster
Product: Red Hat OpenStack Reporter: Yurii Prokulevych <yprokule>
Component: puppet-pacemaker Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA QA Contact: Arik Chernetsky <achernet>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata) CC: agurenko, aschultz, dnavale, fdinitto, jcoufal, jjoyce, jschluet, lbezdick, mburns, mcornea, michele, mkrcmari, rhel-osp-director-maint, samccann, slinaber, tvignaud, ushkalim
Target Milestone: beta Keywords: Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: puppet-pacemaker-0.5.0-3.el7ost Doc Type: Bug Fix
Doc Text:
Previously, a deployment sometimes failed with the following error: Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0 With this update, a small race condition where puppet-pacemaker could fail during cluster setup was closed. As a result, the deployment works correctly without errors.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-17 20:16:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1394025    

Description Yurii Prokulevych 2017-03-30 09:43:13 UTC
Description of problem:
-----------------------
Clone of upstream bz.

The deployment fails because (in this case) overcloud-controller-2 cannot join the cluster.
The messages in the controller logs are:

Error: /sbin/pcs cluster start --all returned 1 instead of one of 0
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0
Checking the cluster status shows that one node did not join the cluster:

[root@overcloud-controller-0 deployed]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Wed Mar 29 08:08:12 2017 Last change: Wed Mar 29 07:25:23 2017 by hacluster via crmd on overcloud-controller-1

3 nodes and 0 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

On the missing node, corosync is fine (https://thirdparty-logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-56/overcloud-controller-2/var/log/cluster/corosync.log.gz) and its quorum status is good:

[root@overcloud-controller-2 deployed]# corosync-quorumtool
Quorum information
------------------
Date: Wed Mar 29 08:23:37 2017
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 3
Ring ID: 2/12
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
         2 1 overcloud-controller-1
         1 1 overcloud-controller-0
         3 1 overcloud-controller-2 (local)

However, the pacemaker process is not running on that host.

This is a race: a subsequent *identical* test passed, so it looks like a timing problem. Perhaps one machine gets deployed too early or too late, and the cluster sync fails.
In fact, this is the log from controller-0:

Mar 29 07:25:56 - controller-0 -> Error connecting to overcloud-controller-2 - (HTTP error: 400)
Mar 29 08:27:33 - controller-2 -> cluster is not currently running on this node
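One way to paper over such a timing race from the command line (a workaround sketch only; the actual fix landed in puppet-pacemaker, see "Fixed In Version" above) is to retry the cluster start with a bounded back-off instead of failing on the first attempt. The retry helper below is hypothetical and not part of pcs or puppet-pacemaker:

```shell
#!/bin/sh
# Hypothetical helper: retry a command up to MAX_TRIES times,
# sleeping DELAY seconds between attempts.
# Usage: retry MAX_TRIES DELAY CMD [ARGS...]
retry() {
    max=$1; delay=$2; shift 2
    try=1
    while ! "$@"; do
        # Give up once the attempt budget is exhausted.
        [ "$try" -ge "$max" ] && return 1
        try=$((try + 1))
        sleep "$delay"
    done
    return 0
}

# Example invocation on a controller (pcs assumed installed):
# retry 5 10 /sbin/pcs cluster start --all
```

This only masks the symptom (a node that is briefly unreachable over HTTP); it does not address the underlying ordering problem in the deployment.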

Comment 1 Lukas Bezdicka 2017-03-30 10:04:31 UTC
Just to verify, could you run iptables -nL (and ip6tables -nL if it was an IPv6 deployment)? I just want to make sure this isn't a different bug.

Comment 3 Michele Baldessari 2017-04-03 14:42:55 UTC
The review has merged upstream in master, and puppet-pacemaker has no stable branches (?)

Comment 7 errata-xmlrpc 2017-05-17 20:16:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245