Bug 1451842

Summary: pcsd race condition on overcloud deploy while calling cluster_destroy
Product: Red Hat OpenStack
Component: puppet-pacemaker
Version: 11.0 (Ocata)
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Raoul Scarazzini <rscarazz>
Assignee: Michele Baldessari <michele>
QA Contact: Marian Krcmarik <mkrcmari>
CC: chjones, fdinitto, jjoyce, jschluet, michele, rscarazz, slinaber, tvignaud, ushkalim
Keywords: Triaged
Target Milestone: Upstream M2
Target Release: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Fixed In Version: puppet-pacemaker-0.6.1-0.20170609092028.4b2c5aa.el7ost
Type: Bug
Last Closed: 2017-12-13 21:28:17 UTC

Description Raoul Scarazzini 2017-05-17 15:34:28 UTC

Description of problem:

Overcloud deploy fails while trying to add a messaging node in a composable environment.
On the controller node the failure seems related to the cluster setup command:

Error: /sbin/pcs cluster setup --wait --name tripleo_cluster controller-0 controller-1 controller-2 messaging-0 messaging-1 messaging-2 galera-0 galera-1 galera-2 --token 10000 returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster setup --wait --name tripleo_cluster controller-0 controller-1 controller-2 messaging-0 messaging-1 messaging-2 galera-0 galera-1 galera-2 --token 10000 returned 1 instead of one of [0]

Looking at the nodes (sosreports will be attached), the error is related to the messaging-1 node:

Error: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]

The pcsd call that fails is specifically this one:

::ffff:172.17.0.24 - - [17/May/2017:01:46:41 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0181

This call returns 401 (Unauthorized) and causes the entire deployment to fail.
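
To find requests like this one across the collected logs, something along these lines may help. It is only a sketch: the regular expression is derived from the single access-log line quoted above, and the function name and its defaults are made up for illustration, not part of pcs or pcsd.

```python
import re

# Pattern built from the access-log line quoted in this report:
# client, two unused fields, [timestamp], "METHOD path protocol",
# status code, response size, request duration.
ACCESS_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) (?P<duration>\S+)$'
)


def find_rejected_requests(lines, path="/remote/cluster_destroy", status=401):
    """Return parsed entries for requests to `path` answered with `status`.

    Hypothetical helper for illustration only.
    """
    hits = []
    for line in lines:
        match = ACCESS_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the access-log format
        entry = match.groupdict()
        entry["status"] = int(entry["status"])
        if entry["path"] == path and entry["status"] == status:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    sample = [
        '::ffff:172.17.0.24 - - [17/May/2017:01:46:41 +0000] '
        '"GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0181',
    ]
    for hit in find_rejected_requests(sample):
        print(hit["timestamp"], hit["client"], hit["status"])
```

Feeding it the pcsd log lines from each node's sosreport should show which node rejected the cluster_destroy call and when.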

Version-Release number of selected component (if applicable):

The tested puddle is 2017-05-09.2

How reproducible:

It's a race, so no specific test is needed; it only takes repeated deployments on the same environment.
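
A repeated-deployment loop could be sketched as below. This is an assumption-laden example, not the procedure actually used: `./deploy-overcloud.sh` is a hypothetical wrapper script that would have to run the deploy and clean up the previous stack, and the function names are illustrative.

```python
import subprocess


def run_until_failure(run_once, max_runs=20):
    """Call run_once() up to max_runs times.

    Returns the 1-based index of the first run that returned a non-zero
    exit code, or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        if run_once() != 0:
            return attempt
    return None


def deploy_once():
    # Hypothetical wrapper script; replace with the real deploy command.
    return subprocess.run(["./deploy-overcloud.sh"]).returncode


if __name__ == "__main__":
    failed_at = run_until_failure(deploy_once)
    if failed_at is None:
        print("no failure seen")
    else:
        print("first failure at run", failed_at)
```

Stopping at the first failure leaves the environment in the failed state, so logs can be collected before anything is redeployed.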

Actual results:

Deploy fails.

Expected results:

Deploy succeeds.

Additional info:

This race happened once in a run of 20 consecutive deployments, so it can be considered "rare".
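
As a rough guide to how many deployments a reproduction attempt might need, the following sketch treats each deployment as an independent trial with a fixed failure rate. Both are assumptions, and 1 in 20 is a single observation, so the result is only an order-of-magnitude estimate.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes independent runs with a constant failure rate.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    # 1 failure in 20 runs, as observed in this report
    print(runs_needed(1 / 20))  # prints 59
```

Under these assumptions, about 59 deployments would be needed for a 95% chance of hitting the race at least once.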

Comment 1 Raoul Scarazzini 2017-05-17 16:39:11 UTC

Here are the sosreports: http://file.rdu.redhat.com/~rscarazz/BZ1451842/

Comment 8 errata-xmlrpc 2017-12-13 21:28:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462