1774581 – OSP16 update fails to get br-int and br-ctlplane back after undercloud update.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	16.0 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	beta
Target Release:	16.0 (Train on RHEL 8.1)
Assignee:	Cédric Jeanneret
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-11-20 13:43 UTC by Sofer Athlan-Guyot
Modified:	2020-02-06 14:43 UTC
CC List:	10 users
Fixed In Version:	openstack-tripleo-heat-templates-11.3.1-0.20191126041653.414d4d9.el8ost
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-02-06 14:42:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	695650	'None'	MERGED	Ensure "network" service is enabled	2020-02-05 12:21:10 UTC
OpenStack gerrit	695869	'None'	MERGED	Ensure "network" service is enabled	2020-02-05 12:21:10 UTC
Red Hat Product Errata	RHEA-2020:0283	None	None	None	2020-02-06 14:43:22 UTC

Description Sofer Athlan-Guyot 2019-11-20 13:43:51 UTC

Description of problem:

Running a test OSP16 update from passed_phase1 to passed_phase1, I hit an issue after the undercloud update: br-int and br-ctlplane are down, which causes the rest of the test to fail.


5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 2e:20:7e:5d:f0:d9 brd ff:ff:ff:ff:ff:ff
6: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:c6:4c:c7:63:40 brd ff:ff:ff:ff:ff:ff
8: br-ctlplane: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:4d:6f:bd brd ff:ff:ff:ff:ff:ff
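
The interface listing above can also be checked mechanically. The following is a minimal sketch, assuming Python 3 and `ip link` style output in the layout pasted in this report. The function name `down_interfaces` and its `wanted` parameter are hypothetical and not part of any TripleO tooling.

```python
import re

# Matches the first line of each `ip link` entry, for example:
# "6: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN ..."
LINK_RE = re.compile(
    r"^\d+:\s+(?P<name>[^:@\s]+)(?:@\S+)?:\s+<[^>]*>.*?\bstate\s+(?P<state>\S+)"
)


def down_interfaces(ip_output, wanted=("br-int", "br-ctlplane")):
    """Return the names from `wanted` whose reported state is DOWN."""
    down = []
    for line in ip_output.splitlines():
        match = LINK_RE.match(line)
        if not match:
            continue  # skips the indented "link/ether ..." lines
        if match.group("name") in wanted and match.group("state") == "DOWN":
            down.append(match.group("name"))
    return down


# Sample copied from the listing in this report.
SAMPLE = """\
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 2e:20:7e:5d:f0:d9 brd ff:ff:ff:ff:ff:ff
6: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 66:c6:4c:c7:63:40 brd ff:ff:ff:ff:ff:ff
8: br-ctlplane: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:4d:6f:bd brd ff:ff:ff:ff:ff:ff
"""

if __name__ == "__main__":
    print(down_interfaces(SAMPLE))
```

By default only the two bridges named in this report are checked; pass a different `wanted` tuple to watch other interfaces.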

The undercloud logs can be seen at http://staging-jenkins2-qe-playground.usersys.redhat.com/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/20/artifact/undercloud-0.tar.gz for instance, and the jobs are here:

http://staging-jenkins2-qe-playground.usersys.redhat.com/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/


Version-Release number of selected component (if applicable): See associated jenkins job for exact puddle used.


How reproducible: Seems to happen each time.

Comment 2 Sofer Athlan-Guyot 2019-11-21 14:31:18 UTC

It looks like the sidecar container fails to start or restart (there is a reboot at the end of the undercloud procedure):

[stack@undercloud-0 ~]$ sudo podman ps --all | grep -E 'neutron-dnsmasq-qdhcp-ff859b54-ef63-43a3-b104-dd035fc08d11|ironic_neutron_agent|nova_conductor'
e775b23d63e1  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-neutron-dhcp-agent:20191115.1         dumb-init --singl...  About an hour ago  Exited (0) About an hour ago         neutron-dnsmasq-qdhcp-ff859b54-ef63-43a3-b104-dd035fc08d11
4aa767445381  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-nova-conductor:20191115.1             dumb-init --singl...  About an hour ago  Exited (0) About an hour ago         nova_conductor_init_log
fb4ca7e20fa4  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-ironic-neutron-agent:20191115.1       dumb-init --singl...  5 hours ago        Up 5 seconds ago                     ironic_neutron_agent
829f6f3c1964  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-nova-conductor:20191115.1             dumb-init --singl...  5 hours ago        Up 5 seconds ago                     nova_conductor

I'm starting to think that the container doesn't survive the reboot.
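
To pick the exited containers out of a listing like the one above, something along these lines could be used. This is only a sketch: it assumes the default `podman ps --all` table layout as pasted in this comment, with the container name in the last column, and `exited_containers` is a hypothetical helper name.

```python
import re

# Assumption: the STATUS column reads "Exited (<code>) ..." for stopped
# containers, as in the listing pasted in this comment.
EXITED_RE = re.compile(r"\bExited \((?P<code>\d+)\)")


def exited_containers(ps_output):
    """Return (name, exit_code) pairs for containers reported as Exited."""
    result = []
    for line in ps_output.splitlines():
        match = EXITED_RE.search(line)
        if match:
            name = line.split()[-1]  # the name is the last column
            result.append((name, int(match.group("code"))))
    return result


# Two lines copied from the listing in this comment.
SAMPLE = """\
e775b23d63e1  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-neutron-dhcp-agent:20191115.1         dumb-init --singl...  About an hour ago  Exited (0) About an hour ago         neutron-dnsmasq-qdhcp-ff859b54-ef63-43a3-b104-dd035fc08d11
fb4ca7e20fa4  undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-ironic-neutron-agent:20191115.1       dumb-init --singl...  5 hours ago        Up 5 seconds ago                     ironic_neutron_agent
"""

if __name__ == "__main__":
    for name, code in exited_containers(SAMPLE):
        print(name, code)
```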

Comment 4 Brent Eagles 2019-11-21 16:45:32 UTC

This looks like a repeat or variation of https://bugzilla.redhat.com/show_bug.cgi?id=1666387.

Comment 5 Sofer Athlan-Guyot 2019-11-22 09:27:36 UTC

I got confirmation that the problem is that networking doesn't come back after the undercloud reboot. Given the resolution of the previous instance of this bug, I'm passing it on to DFG:DF. It currently prevents OSP16 update testing. Adding the blocker flag as well.

Comment 6 Cédric Jeanneret 2019-11-22 09:33:45 UTC

I wonder if this might be linked to the "new" way to start/manage sidecars using systemd, introduced by Alex. Adding him as needinfo() - not 100% sure it made it to OSP16 yet.

Comment 7 Cédric Jeanneret 2019-11-22 09:37:43 UTC

Duh, nah, not related to the sidecars - sorry, misread the whole thing.

It would be interesting to know if the "network" service is present and enabled. The patches linked in Brent's BZ seem to point to that.

Comment 8 Cédric Jeanneret 2019-11-22 10:07:16 UTC

I was able to look at a live env: the "network" service is present, but it's not enabled.
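
The same check can be scripted. The sketch below wraps `systemctl is-enabled`, which prints the enablement state of a unit and exits with status 0 when the unit is enabled. The helper name `service_enabled` and its injectable `run` parameter are assumptions made for illustration; this only reports the state and is not the fix that went into the linked gerrit changes.

```python
import subprocess


def service_enabled(unit="network", run=subprocess.run):
    """Return True when `systemctl is-enabled <unit>` reports "enabled".

    `run` is injectable so the check can be exercised on a machine
    without systemd.
    """
    proc = run(
        ["systemctl", "is-enabled", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    # Other states such as "static" also exit 0, so compare the text too.
    return proc.returncode == 0 and proc.stdout.strip() == "enabled"


if __name__ == "__main__":
    state = "enabled" if service_enabled() else "NOT enabled"
    print('"network" service is ' + state)
```

On an affected undercloud this would be expected to report the service as not enabled, matching what was seen on the live environment.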

Comment 11 Alex Schultz 2019-11-22 14:23:01 UTC

This is likely because the networking service is not enabled to start on reboot, which is what we also had to do for the overcloud-full images.

Comment 12 Cédric Jeanneret 2019-11-25 11:43:37 UTC

Adding upstream backport.

Comment 13 Emilien Macchi 2019-11-26 11:56:11 UTC

Moving this one to POST; both the master and stable/train patches are merged now.

Comment 18 Jad Haj Yahya 2019-12-04 10:28:57 UTC

UC and OC updates passed:

http://staging-jenkins2-qe-playground.usersys.redhat.com/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/

Comment 21 errata-xmlrpc 2020-02-06 14:42:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283
