Bug 1421883 - Control Plane Services go down/have outage while rerunning openstack overcloud deploy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: z3
Target Release: 10.0 (Newton)
Assignee: Emilien Macchi
QA Contact: Gurenko Alex
URL:
Whiteboard:
Duplicates: 1436728
Depends On: 1426434 1426439 1438099 1438602 1438886 1441736 1441738 1441757 1441760 1441769
Blocks:
Reported: 2017-02-14 00:10 UTC by Graeme Gillies
Modified: 2021-03-11 14:57 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-5.2.0-18.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-28 14:44:12 UTC
Target Upstream Version:
Embargoed:


Attachments
output of loop listing servers (115.27 KB, text/plain)
2017-02-14 00:10 UTC, Graeme Gillies
log of overcloud deploy command (159.41 KB, text/plain)
2017-02-14 00:11 UTC, Graeme Gillies


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1664650 0 None None None 2017-02-14 17:30:03 UTC
OpenStack gerrit 450900 0 None MERGED Only set EnableConfigPurge on major upgrades 2021-01-04 08:19:27 UTC
Red Hat Product Errata RHBA-2017:1585 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 director Bug Fix Advisory 2017-06-28 18:42:51 UTC

Description Graeme Gillies 2017-02-14 00:10:43 UTC
Created attachment 1250080 [details]
output of loop listing servers

Hi,

I have a RHOS 10 environment which is completely deployed and functional.

I then ran the following loop in a bash window:

while true;do date;OS_CLOUD=rhosops-test-ggillies openstack server list;done 2>&1 | tee server_list_loop.log

I have attached the log. I have also attached the log of the overcloud deploy.
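For anyone repeating this test, a variant of the loop above that records one timestamped OK/FAIL line per probe makes the outage window easier to read off afterwards. This is only a sketch: `probe_once` and `count_failures` are hypothetical helper names, not part of any OpenStack tooling, and the probe command is whatever you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one probe command and print a timestamped
# OK/FAIL line. The probe's own output is discarded; only its exit status
# is recorded.
probe_once() {
    local ts rc
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    if "$@" >/dev/null 2>&1; then
        echo "$ts OK"
    else
        rc=$?
        echo "$ts FAIL rc=$rc"
    fi
}

# Hypothetical helper: count FAIL lines in a probe log read from stdin.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function usable under "set -e"; it still prints 0 in that case.
count_failures() {
    grep -c ' FAIL ' || true
}

# Example use (not executed here), with the same cloud name as above:
#   export OS_CLOUD=rhosops-test-ggillies
#   while true; do probe_once openstack server list; sleep 1; done | tee probe.log
#   count_failures < probe.log
```

The one-second sleep is an arbitrary choice; the original loop has no delay, which gives finer resolution at the cost of more API load.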

While this was running, I reran my openstack overcloud deploy command with no changes. This should essentially be a no-op, or at the very least should not cause any control plane outage, since absolutely nothing is changing.

You will see in the log, however, that keystone throws an error first, and nova throws an error a bit later.

This seems to indicate that all stack deploy operations are disruptive, including configuration changes, node scale up/down, etc.

This is easily reproducible.

Regards,

Graeme

Comment 1 Graeme Gillies 2017-02-14 00:11:37 UTC
Created attachment 1250081 [details]
log of overcloud deploy command

Comment 2 Michele Baldessari 2017-02-23 15:01:18 UTC
So with Alex's patches for the norpm provider (https://review.openstack.org/#/c/435011) and the nova filter patch (https://review.openstack.org/435099) we have a definite improvement in the number of restarts:
- nova-api went from 3 to 1
- nova-* went from 2 to 1
- swift has no restarts any longer
- neutron-* and httpd stayed at 1 and 2 respectively

Here are the restarts divided by steps:
* Step1
restart ntpd
* Step2
restart ntpd
* Step3
restart ntpd
restart httpd
* Step4
restart ntpd
restart openstack-nova-conductor
restart openstack-nova-scheduler
restart openstack-nova-consoleauth
restart openstack-nova-novncproxy
restart httpd
restart openstack-nova-api
restart neutron-dhcp-agent
restart neutron-server
restart neutron-l3-agent
restart neutron-metadata-agent
* Step5
restart ntpd
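
To compare runs, the per-step list above can be reduced to a per-service count. The following is a rough sketch, assuming the list has been saved to a text file in the same "restart <service>" form (with or without a trailing quote character); `tally_restarts` is a hypothetical helper name, and how the restart lines are extracted from the deploy logs is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lines such as "restart httpd" on stdin and print
# "<count> <service>" pairs, highest count first. Step headers ("* Step1")
# do not match the pattern and are skipped; a trailing quote character
# after the service name is dropped.
tally_restarts() {
    sed -n "s/^restart \([^' ]*\).*/\1/p" \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}' \
        | sort -k1,1nr -k2,2
}

# Example use (not executed here; restarts.txt is an assumed file name):
#   tally_restarts < restarts.txt
```

For the list above this reports ntpd five times and httpd twice, with every other service appearing once.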

Emilien has a review up that will move all the wsgi configuration into a single step, which should fix at least httpd.

Comment 3 Alex Schultz 2017-02-23 22:11:15 UTC
Added BZ 1426434 to track the norpm provider issue

Comment 4 Alan Pevec 2017-03-24 00:07:10 UTC
Looks like this is going to be used as a tracking bug?
If so, you can add the Tracking keyword.

Comment 5 Alex Schultz 2017-03-28 14:42:38 UTC
*** Bug 1436728 has been marked as a duplicate of this bug. ***

Comment 6 Steven Hardy 2017-03-31 09:35:12 UTC
So we have a number of related upstream bugs, and I'm not sure we have downstream bzs associated with them (if we do please link them here):

https://bugs.launchpad.net/tripleo/+bug/1664650
https://bugs.launchpad.net/puppet-nova/+bug/1665443
https://bugs.launchpad.net/tripleo/+bug/1665405
https://bugs.launchpad.net/tripleo/+bug/1665426

Comment 20 Gurenko Alex 2017-05-24 13:43:23 UTC
Verified on latest build 2017-05-23.4.

Compute: only ssh restarted

Controller: only Apache (twice), Glance (twice) and Heat (once) restarted.

It does not look like there were any interruptions during the deploy command re-run. I used the same output as in the original comment to monitor during the re-run.

Comment 22 errata-xmlrpc 2017-06-28 14:44:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1585

