1421883 – Control Plane Services go down/have outage while rerunning openstack overcloud deploy

Summary: Control Plane Services go down/have outage while rerunning openstack overclou...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	z3
Target Release:	10.0 (Newton)
Assignee:	Emilien Macchi
QA Contact:	Gurenko Alex
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1436728
Depends On:	1426434 1426439 1438099 1438602 1438886 1441736 1441738 1441757 1441760 1441769
Blocks:

Reported:	2017-02-14 00:10 UTC by Graeme Gillies
Modified:	2021-03-11 14:57 UTC
CC List:	18 users
Fixed In Version:	openstack-tripleo-heat-templates-5.2.0-18.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-06-28 14:44:12 UTC
Target Upstream Version:
Embargoed:

Attachments
output of loop listing servers (115.27 KB, text/plain) 2017-02-14 00:10 UTC, Graeme Gillies	no flags
log of overcloud deploy command (159.41 KB, text/plain) 2017-02-14 00:11 UTC, Graeme Gillies	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1664650	None	None	None	2017-02-14 17:30:03 UTC
OpenStack gerrit	450900	None	MERGED	Only set EnableConfigPurge on major upgrades	2021-01-04 08:19:27 UTC
Red Hat Product Errata	RHBA-2017:1585	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 director Bug Fix Advisory	2017-06-28 18:42:51 UTC

Description Graeme Gillies 2017-02-14 00:10:43 UTC

Created attachment 1250080
output of loop listing servers

Hi,

I have a RHOS 10 environment which is completely deployed and functional.

I then ran the following loop in a bash window:

while true;do date;OS_CLOUD=rhosops-test-ggillies openstack server list;done 2>&1 | tee server_list_loop.log

I have attached the log. I have also attached the log of the overcloud deploy.

While this was running, I reran my openstack overcloud deploy command with no changes. This should essentially be a no-op, or at the very least it should not cause any control plane outage, since absolutely nothing is changing.

You will see in the log, however, that keystone throws an error first, and nova throws an error a bit later.

This seems to indicate that all stack deploy operations are disruptive, including configuration changes, node scale up/down, etc.

This is easily reproducible.
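When reproducing this, it can help to log only the failed calls so that outages stand out from the successful listings. The following is a minimal sketch, not part of the original report: `probe_loop` is a hypothetical helper name, and it assumes the probed command returns a non-zero exit status when an API call fails.

```shell
# probe_loop COUNT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND COUNT times, prints a UTC-timestamped
# line for every non-zero exit status, then prints a one-line summary.
probe_loop() {
    count=$1
    shift
    i=0
    failures=0
    while [ "$i" -lt "$count" ]; do
        # Discard the command's own output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date -u '+%Y-%m-%d %H:%M:%S') FAIL: $*"
        fi
        i=$((i + 1))
    done
    echo "failures=$failures of $count"
}
```

It could be run in one window while the deploy is rerun in another, for example `OS_CLOUD=rhosops-test-ggillies probe_loop 1000 openstack server list | tee probe.log`. The iteration count of 1000 is arbitrary.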

Regards,

Graeme

Comment 1 Graeme Gillies 2017-02-14 00:11:37 UTC

Created attachment 1250081
log of overcloud deploy command

Comment 2 Michele Baldessari 2017-02-23 15:01:18 UTC

So with Alex's patches for the norpm provider (https://review.openstack.org/#/c/435011) and the nova filter patch (https://review.openstack.org/435099), we have a definite improvement in the number of restarts:
- nova-api went from 3 to 1
- nova-* went from 2 to 1
- swift has no restarts any longer
- neutron-* and httpd stayed at 1 and 2 respectively

Here are the restarts divided by steps:
* Step1
restart ntpd
* Step2
restart ntpd
* Step3
restart ntpd
restart httpd
* Step4
restart ntpd
restart openstack-nova-conductor
restart openstack-nova-scheduler
restart openstack-nova-consoleauth
restart openstack-nova-novncproxy
restart httpd
restart openstack-nova-api
restart neutron-dhcp-agent
restart neutron-server
restart neutron-l3-agent
restart neutron-metadata-agent
* Step5
restart ntpd
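A per-service tally can be derived from a list like the one above. This is a rough sketch under stated assumptions: `count_restarts` is a hypothetical helper, and the input is assumed to be plain text with one `restart <service>` entry per line (step headers and other lines are ignored, and any trailing punctuation on the service name is stripped).

```shell
# count_restarts [FILE...]
# Hypothetical helper: counts "restart <service>" lines per service and
# prints "<count> <service>", most frequently restarted service first.
count_restarts() {
    awk '
        $1 == "restart" {
            svc = $2
            # Drop trailing characters that are not part of a unit name.
            sub(/[^A-Za-z0-9_-]+$/, "", svc)
            n[svc]++
        }
        END { for (s in n) print n[s], s }
    ' "$@" | sort -k1,1nr -k2,2
}
```

Fed the list above, it would report ntpd five times and httpd twice, with every other service appearing once.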

Emilien has a review up that will move all the wsgi configuration into a single step, which should fix at least httpd.

Comment 3 Alex Schultz 2017-02-23 22:11:15 UTC

Added BZ 1426434 to track the norpm provider issue

Comment 4 Alan Pevec 2017-03-24 00:07:10 UTC

Looks like this is going to be used as a tracking bug?
If so, you can add the Tracking keyword.

Comment 5 Alex Schultz 2017-03-28 14:42:38 UTC

*** Bug 1436728 has been marked as a duplicate of this bug. ***

Comment 6 Steven Hardy 2017-03-31 09:35:12 UTC

So we have a number of related upstream bugs, and I'm not sure we have downstream BZs associated with them (if we do, please link them here):

https://bugs.launchpad.net/tripleo/+bug/1664650
https://bugs.launchpad.net/puppet-nova/+bug/1665443
https://bugs.launchpad.net/tripleo/+bug/1665405
https://bugs.launchpad.net/tripleo/+bug/1665426

Comment 20 Gurenko Alex 2017-05-24 13:43:23 UTC

Verified on the latest build, 2017-05-23.4.

Compute: only ssh restarted.

Controller: only Apache (twice), Glance (twice) and Heat (once) restarted.

There do not appear to have been any interruptions during the deploy command re-run. I used the same monitoring loop as in the original description to watch for errors during the re-run.

Comment 22 errata-xmlrpc 2017-06-28 14:44:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1585
