Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1437417

Summary: Overcloud deployment fails on controller step1 because one controller cannot connect to the cluster
Product: Red Hat OpenStack Reporter: Yurii Prokulevych <yprokule>
Component: puppet-pacemaker Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA QA Contact: Arik Chernetsky <achernet>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata) CC: agurenko, aschultz, dnavale, fdinitto, jcoufal, jjoyce, jschluet, lbezdick, mburns, mcornea, michele, mkrcmari, rhel-osp-director-maint, samccann, slinaber, tvignaud, ushkalim
Target Milestone: beta Keywords: Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: puppet-pacemaker-0.5.0-3.el7ost Doc Type: Bug Fix
Doc Text:
Previously, a deployment sometimes failed with the following error: Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0 With this update, a small race condition where puppet-pacemaker could fail during cluster setup was closed. As a result, the deployment works correctly without errors.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-17 20:16:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1394025    

Description Yurii Prokulevych 2017-03-30 09:43:13 UTC
Description of problem:
-----------------------
Clone of upstream bz.

The deployment fails because (in this case) overcloud-controller-2 cannot join the cluster.
The messages in the controller logs are:

Error: /sbin/pcs cluster start --all returned 1 instead of one of 0
Error: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster start --all returned 1 instead of one of 0
Checking the cluster status shows that one node did not join the cluster:

[root@overcloud-controller-0 deployed]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Wed Mar 29 08:08:12 2017 Last change: Wed Mar 29 07:25:23 2017 by hacluster via crmd on overcloud-controller-1

3 nodes and 0 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-2 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

On the missing node, corosync is fine (https://thirdparty-logs.rdoproject.org/jenkins-oooq-ocata-rdo_trunk-bmu-haa01-lab-float_nic_with_vlans-56/overcloud-controller-2/var/log/cluster/corosync.log.gz) and its quorum status is good:

[root@overcloud-controller-2 deployed]# corosync-quorumtool
Quorum information
------------------
Date: Wed Mar 29 08:23:37 2017
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 3
Ring ID: 2/12
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
         2 1 overcloud-controller-1
         1 1 overcloud-controller-0
         3 1 overcloud-controller-2 (local)

However, the pacemaker process is not running on that host.

This is a race: a subsequent *identical* test passed, so it looks like a timing problem. Perhaps one machine gets deployed too early or too late, and the cluster sync fails.
In fact, this is the log from controller-0:

Mar 29 07:25:56 - controller-0 -> Error connecting to overcloud-controller-2 - (HTTP error: 400)
Mar 29 08:27:33 - controller-2 -> cluster is not currently running on this node
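One way to paper over such a timing race from the command line (a workaround sketch only; the actual fix landed in puppet-pacemaker, see "Fixed In Version" above) is to retry the cluster start with a bounded back-off instead of failing on the first attempt. The retry helper below is hypothetical and not part of pcs or puppet-pacemaker:

```shell
#!/bin/sh
# Hypothetical helper: retry a command up to MAX_TRIES times,
# sleeping DELAY seconds between attempts.
# Usage: retry MAX_TRIES DELAY CMD [ARGS...]
retry() {
    max=$1; delay=$2; shift 2
    try=1
    while ! "$@"; do
        # Give up once the attempt budget is exhausted.
        [ "$try" -ge "$max" ] && return 1
        try=$((try + 1))
        sleep "$delay"
    done
    return 0
}

# Example invocation on a controller (pcs assumed installed):
# retry 5 10 /sbin/pcs cluster start --all
```

This only masks the symptom (a node that is briefly unreachable over HTTP); it does not address the underlying ordering problem in the deployment.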

Comment 1 Lukas Bezdicka 2017-03-30 10:04:31 UTC
Just to verify, could you run iptables -nL (and ip6tables -nL if it was an IPv6 deployment)? I just want to make sure this isn't a different bug.

Comment 3 Michele Baldessari 2017-04-03 14:42:55 UTC
The review has merged upstream in master, and puppet-pacemaker has no stable branches (?)

Comment 7 errata-xmlrpc 2017-05-17 20:16:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245