Created attachment 1651854 [details]
tempest logs and ovn-metadata-agent logs

Description of problem:

Info obtained from CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/28/:
- 40 failures in the second tempest run (after the update); 32 of the 40 because the VM instance could not be accessed via ssh.
- 35 failures in the third tempest run (after the overcloud reboot); 29 of the 35 because the VM instance could not be accessed via ssh.

I focused on a failing test that looks simple, test_basic_instance. The tempest logs are attached.

After the network, subnet, security group, etc. have been created, a server (instance) is created at 19:44:02:

2020-01-09 19:44:02,817 352985 INFO [tempest.lib.common.rest_client] Request (NetworkBasicTest:test_basic_instance): 202 POST http://10.0.0.117:8774/v2.1/servers 1.601s
Response... Body: b'{"server": {"id": "fd91f455-3a2f-43d6-baea-348826eb8eb4", "links": [{"rel": "self", "href": "http://10.0.0.117:8774/v2.1/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}, {"rel": "bookmark", "href": "http://10.0.0.117:8774/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}], "OS-DCF:diskConfig": "MANUAL", "security_groups": [{"name": "tempest-secgroup-1181119668"}], "adminPass": "fX3x8b9aYpJM"}}'

According to the tempest logs, the server changes from status BUILD to ACTIVE at 19:44:26 and FIP 10.0.0.214 is assigned at 19:44:28. Immediately afterwards tempest tries to connect via ssh, and all retries fail until 19:49:28.

OVN metadata agent logs at compute-1 (attached):

2020-01-09 19:44:08.937 101840 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(tunnel_key=3, parent_port=[], logical_port=cad2bcee-aaa0-4aff-9f0f-1da971ac265f, chassis=[<ovs.db.idl.Row object at 0x7f405d79ae80>], mac=['fa:16:3e:8c:4a:2c 10.100.0.6'], options={'requested-chassis': 'compute-1.redhat.local'}, ha_chassis_group=[], gateway_chassis=[], encap=[], external_ids={'neutron:cidrs': '10.100.0.6/28', 'neutron:device_id': 'fd91f455-3a2f-43d6-baea-348826eb8eb4', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-b32cf836-f6dc-4c7a-b7a6-b91351b5a2bb', 'neutron:port_name': '', 'neutron:project_id': '98143f9ea32643de999dd520d1918385', 'neutron:revision_number': '2', 'neutron:security_group_ids': '1b99205b-1b78-4623-ac29-1d690876418c'}, type=, nat_addresses=[], virtual_parent=[], datapath=1d57661f-61ce-4cc3-87ce-946427c38e6a, tag=[]) old=Port_Binding(chassis=[]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
...
2020-01-09 19:44:09.729 101840 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d57661f-61ce-4cc3-87ce-946427c38e6a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
...
Error: unable to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: unable to pull image: Error initializing source docker://undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: pinging docker registry returned: Get http://undercloud-0.ctlplane.redhat.local:8787/v2/: dial tcp 192.168.24.1:8787: connect: network is unreachable

The agent crashes at this moment (19:44:10), and there are more crashes related to these values at 19:44:25 and 19:44:29.
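That pull failure can be checked by hand from the affected compute node; the following is a minimal sketch (host, port and image tag are copied from the error above, the insecure-registry flag is an assumption based on the http URL):

# on compute-1: is the undercloud container registry reachable at all?
curl -sv http://undercloud-0.ctlplane.redhat.local:8787/v2/

# retry the exact pull the agent attempted when spawning the haproxy side-car
podman pull --tls-verify=false undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1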
There are other crashes as well, related to other values; I guess that is because other tests are running in parallel. For some reason there are no crashes on compute-0:

[root@titan07 neutron]# grep -c Traceback ../../../../../compute-1/var/log/containers/neutron/*
../../../../../compute-1/var/log/containers/neutron/kill-script.log:0
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log:159
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log.1:168
../../../../../compute-1/var/log/containers/neutron/privsep-helper.log:0
[root@titan07 neutron]# grep -c Traceback ../../../../../compute-0/var/log/containers/neutron/*
../../../../../compute-0/var/log/containers/neutron/kill-script.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log.1:0
../../../../../compute-0/var/log/containers/neutron/privsep-helper.log:0

Version-Release number of selected component (if applicable):
update from: RHOS_TRUNK-16.0-RHEL-8-20200103.n.1
to: RHOS_TRUNK-16.0-RHEL-8-20200109.n.1
(tested with RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 as well, same failures)

How reproducible:
100% (failed 3 out of 3)

Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Actual results:
Many tempest tests fail with reason:
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.214 via SSH timed out. User: cirros, Password: None

Expected results:
Tests pass.

Additional info:
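The content of those tracebacks can be pulled out with some context around each match; a minimal sketch, using the same log paths as in the grep above:

grep -A 15 Traceback ../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log | less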
The scenario has been reproduced on titan07.lab.eng.tlv2.redhat.com and will remain available for some time for debugging. It was reproduced while updating from RHOS_TRUNK-16.0-RHEL-8-20200109.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3.

The issue only happens for VMs created on compute-0; VM creation on compute-1 succeeds. According to the logs, undercloud-0.ctlplane.redhat.local is unreachable, but I checked that it is reachable from both computes.
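Since the failure appears tied to which compute node hosts the instance, it can help to confirm the hosting node for a failing VM; a minimal sketch (the server id is the one from the tempest log above, the credentials file path is an assumption):

source ~/overcloudrc
# show which compute the instance landed on, plus its status
openstack server show fd91f455-3a2f-43d6-baea-348826eb8eb4 -c OS-EXT-SRV-ATTR:host -c status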
The problem here is that the metadata agent and its side-car containers are not restarted during an update, because a restart would make the metadata API unavailable. That means the metadata agent keeps running with the old bits and tries to spawn a side-car container for a new workload in a given network using the image from the pre-update version. This is not OVN specific, and we think it will also cause issues with ml2/ovs side-car containers.
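One way to see the mismatch described here is to compare the image of the long-running metadata agent container with the images of the haproxy side-car containers it has spawned; a minimal sketch (the name patterns in the grep are assumptions, adjust to the local container naming):

# on the compute node: list metadata-agent and haproxy side-car containers with their images
podman ps -a --format '{{.Names}} {{.Image}}' | grep -Ei 'metadata|haproxy'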
Small update: the containers run on the correct images, but ovn_metadata_agent_wrapper contains the wrong image name. I still need to check why this file is not recreated correctly.
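A quick way to confirm the stale reference would be to search the wrapper script for the image name; a minimal sketch (the path and glob are assumptions, the wrapper file mentioned above may live elsewhere on the compute node):

grep -n 'metadata-agent-ovn' /var/lib/neutron/*wrapper* 2>/dev/null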
*** Bug 1783978 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0283
I think this bug has been wrongly closed. The original fix was rejected and is not available in OSP 16.0 (see target release = z1). Greg and Brent are discussing how to document it as a known issue. It makes no sense to document it as a known issue and, at the same time, to include it in the errata as a fixed bug.

The status change made by the errata tool (see comment 23) from ON_DEV to RELEASE_PENDING makes no sense to me. Maybe this happened because I moved the bug from ON_QA to ON_DEV (I should have moved it to ASSIGNED). Sorry for that.
This bug is, unfortunately, not fixed as described in comment #24. We have opened a new bug to track this; please refer to bug 1801763 for future updates.

*** This bug has been marked as a duplicate of bug 1801763 ***