Bug 1437417 - Overcloud deployment fails on controller step1 because one controller cannot connect to the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 11.0 (Ocata)
Assignee: Michele Baldessari
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks: 1394025
Reported: 2017-03-30 09:43 UTC by Yurii Prokulevych
Modified: 2017-09-15 03:07 UTC (History)

Fixed In Version: puppet-pacemaker-0.5.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, a deployment sometimes failed with the following error:
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0
With this update, a small race condition that could cause puppet-pacemaker to fail during cluster setup was closed. As a result, the deployment completes without errors.
Clone Of:
Environment:
Last Closed: 2017-05-17 20:16:21 UTC


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2017:1245 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory | 2017-05-17 23:01:50 UTC
OpenStack gerrit | 451828 | None | None | None | 2017-03-31 06:42:32 UTC
Launchpad | 1677312 | None | None | None | 2017-03-30 09:43:13 UTC

Description Yurii Prokulevych 2017-03-30 09:43:13 UTC
Description of problem:
-----------------------
Clone of upstream bz.

The deployment fails because (in this case) overcloud-controller-2 cannot join the cluster.
The messages in the controller logs are:

Error: /sbin/pcs cluster start --all returned 1 instead of one of 0
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0

Checking the cluster status shows that one node did not join the cluster:

[root@overcloud-controller-0 deployed]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Wed Mar 29 08:08:12 2017 Last change: Wed Mar 29 07:25:23 2017 by hacluster via crmd on overcloud-controller-1

3 nodes and 0 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
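
The Online/OFFLINE lines in the output above can also be checked mechanically. What follows is a minimal sketch, assuming the `pcs status` text has already been captured into a string. The function name `parse_node_states` is hypothetical, and it only understands the two line formats quoted in this report.

```python
import re


def parse_node_states(pcs_status_text):
    """Return {"online": [...], "offline": [...]} from `pcs status` text.

    Assumption: node lists appear as `Online: [ a b ]` and `OFFLINE: [ c ]`,
    exactly as in the output quoted in this report.
    """
    states = {"online": [], "offline": []}
    pattern = re.compile(r"^\s*(Online|OFFLINE):\s*\[(.*?)\]\s*$")
    for line in pcs_status_text.splitlines():
        match = pattern.match(line)
        if match:
            key = match.group(1).lower()  # "Online" -> "online", "OFFLINE" -> "offline"
            states[key].extend(match.group(2).split())
    return states


# Sample lines copied from the `pcs status` output in this report.
SAMPLE = """\
Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]
"""

if __name__ == "__main__":
    print(parse_node_states(SAMPLE))
```

A non-empty "offline" list right after the cluster start step would flag the situation described here.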

On the missing node, corosync is fine (see https://thirdparty-logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-56/overcloud-controller-2/var/log/cluster/corosync.log.gz) and its quorum status is good:

[root@overcloud-controller-2 deployed]# corosync-quorumtool
Quorum information
------------------
Date: Wed Mar 29 08:23:37 2017
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 3
Ring ID: 2/12
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
         2 1 overcloud-controller-1
         1 1 overcloud-controller-0
         3 1 overcloud-controller-2 (local)

The pacemaker process, however, is not running on that host.
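
The mismatch described above (a node that is a corosync member but is not online in pacemaker) can be detected by comparing the two outputs. This is a rough sketch, not a tool from puppet-pacemaker: the helper names `corosync_members` and `missing_from_pacemaker` are hypothetical, and the parser assumes the membership table format shown in the `corosync-quorumtool` output above.

```python
def corosync_members(quorumtool_text):
    """Extract node names from the 'Membership information' table.

    Assumption: rows look like `<nodeid> <votes> <name> [(local)]`,
    following a `Nodeid Votes Name` header line.
    """
    members = []
    in_table = False
    for line in quorumtool_text.splitlines():
        fields = line.split()
        if fields[:3] == ["Nodeid", "Votes", "Name"]:
            in_table = True
            continue
        if in_table and len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            members.append(fields[2])
    return members


def missing_from_pacemaker(members, pacemaker_online):
    """Return nodes that corosync lists but pacemaker does not report online."""
    online = set(pacemaker_online)
    return [name for name in members if name not in online]


# Membership table copied from the corosync-quorumtool output in this report.
QUORUMTOOL_SAMPLE = """\
Membership information
----------------------
    Nodeid Votes Name
         2 1 overcloud-controller-1
         1 1 overcloud-controller-0
         3 1 overcloud-controller-2 (local)
"""

if __name__ == "__main__":
    members = corosync_members(QUORUMTOOL_SAMPLE)
    online = ["overcloud-controller-0", "overcloud-controller-1"]
    print(missing_from_pacemaker(members, online))
```

With the data from this report, the only node reported is overcloud-controller-2, which matches the manual diagnosis.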

This looks like a race: a further *identical* test went fine, so the problem appears to be one of timing. Maybe one machine gets deployed too early or too late and the cluster sync fails.
In fact, these are the relevant log entries:

Mar 29 07:25:56 - controller-0 -> Error connecting to overcloud-controller-2 - (HTTP error: 400)
Mar 29 08:27:33 - controller-2 -> cluster is not currently running on this node
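
For timing-dependent failures of this kind, one common general mitigation is to retry the failing step a few times with a pause in between. The sketch below only illustrates that technique; it is not taken from the linked review and does not describe the actual fix. The name `retry` and its parameters are hypothetical.

```python
import time


def retry(action, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `action` until it returns True or the attempts run out.

    `action` is any zero-argument callable that returns True on success.
    Returns the number of attempts used; raises RuntimeError if all fail.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay)  # pause before the next try, but not after the last one
    raise RuntimeError(f"action failed after {attempts} attempts")


if __name__ == "__main__":
    # Simulated step that fails twice and then succeeds.
    outcomes = iter([False, False, True])
    print(retry(lambda: next(outcomes), delay=0.0))
```

In a real deployment the `action` could, as an assumption, wrap the `/sbin/pcs cluster start --all` call from the error message and return True when its exit code is 0.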

Comment 1 Lukas Bezdicka 2017-03-30 10:04:31 UTC
Just to verify, could you run iptables -nL (and ip6tables -nL if it was an IPv6 deployment)? I just want to ensure this isn't a different bug.

Comment 3 Michele Baldessari 2017-04-03 14:42:55 UTC
The review has merged upstream in master, and puppet-pacemaker has no stable branches (?)

Comment 7 errata-xmlrpc 2017-05-17 20:16:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245

