Bug 1700302

Summary: [OVN migration] OVN metadata agent doesn't work after migration from ml2/ovs
Product: Red Hat OpenStack
Reporter: Jakub Libosvar <jlibosva>
Component: python-networking-ovn
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: apevec, lhh, lmartins, majopela, njohnston, scohen, twilson
Target Milestone: z4
Keywords: TestOnly, Triaged, ZStream
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-26 13:38:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1694572, 1715112
Bug Blocks: 1667920, 1672782, 1672786, 1672790

Description Jakub Libosvar 2019-04-16 09:39:22 UTC
Description of problem:
On a Port_Binding event, the OVN metadata agent fails. As a result, no haproxy processes are spawned in the namespace and newly created instances cannot reach the metadata API.

2019-04-16 09:10:48.961 182681 INFO networking_ovn.agent.metadata.agent [-] Port 52e950ed-efad-4067-b14f-01c695e441e9 in datapath 20e76bff-3093-4a30-bb8a-3d7f17bf4be2 bound to our chassis
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent [-] Configured OVN bridge br-migration cannot be found in the system.: KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent Traceback (most recent call last):
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: KeyError: u'br-migration'
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event Traceback (most recent call last):
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/ovsdbapp/event.py", line 117, in notify_loop
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     match.run(event, row, updates)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 67, in wrapped
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     return f(*args, **kwargs)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 92, in run
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.agent.update_datapath(str(row.datapath.uuid))
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 276, in update_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.provision_datapath(datapath)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 371, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     "the system.", self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.force_reraise()
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event KeyError: u'br-migration'
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event
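The traceback shows `provision_datapath()` calling `ovs_bridges.remove(self.ovn_bridge)` while the configured bridge (the stale `br-migration` left over from the migration) is no longer on the host, so `set.remove()` raises `KeyError` and kills the notify loop. The sketch below is illustrative only, not the actual networking_ovn code: it models the failure with a simplified function and shows the defensive behavior of the fix, which logs "Resyncing the agent." (see comment 1) instead of crashing.

```python
def provision_datapath(ovs_bridges, ovn_bridge):
    """Simplified model of the failing code path.

    ovs_bridges: set of bridge names currently present on the host.
    ovn_bridge: the bridge name the agent read from its configuration.

    The original code called ovs_bridges.remove(ovn_bridge) directly,
    which raises KeyError when the configured bridge was deleted
    (e.g. 'br-migration' after migrating to OVN).
    """
    if ovn_bridge not in ovs_bridges:
        # Stale configuration detected: ask for a resync so the agent
        # re-reads the current ovn-bridge value instead of crashing.
        print("Resyncing the agent.")
        return None
    remaining = set(ovs_bridges)
    remaining.remove(ovn_bridge)  # safe now; membership checked above
    return remaining

# Post-migration state: br-migration was deleted, but the agent still
# holds the old configured value.
provision_datapath({'br-int', 'br-ex'}, 'br-migration')  # resync path
provision_datapath({'br-int', 'br-ex'}, 'br-int')        # normal path
```

Using `ovs_bridges.discard(ovn_bridge)` would also avoid the `KeyError`, but silently ignoring the missing bridge would leave the agent running against a bridge that no longer exists; triggering a resync is the behavior the fix opts for.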

Version-Release number of selected component (if applicable):
python-networking-ovn-metadata-agent-5.0.2-0.20190307204430.6a774a0.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Migrate to OVN
2. Spawn an instance with keypair

Actual results:
The instance does not get its SSH key injected because the metadata service is failing.

Expected results:
The metadata agent provisions the datapath, haproxy is spawned in the namespace, and the instance receives its SSH key.

Additional info:

Comment 1 Jakub Libosvar 2019-04-24 11:50:41 UTC
A minimal reproducer outside of the migration is:

1) Have a metadata agent running over bridge br-old
2) Stop ovn-controller
3) Delete br-old
4) Set ovn-bridge in OVSDB to an existing bridge (e.g. br-int)
5) Start ovn-controller
6) Bind a port on the node, which triggers a PortBindingChassisEvent
7) Check the metadata agent logs

With the fix, you should see the "Resyncing the agent." message. Without the fix, the traceback shown in comment 0 appears.

Comment 6 Eran Kuris 2019-06-17 06:12:16 UTC
The issue still reproduces on:
OpenStack/14.0-RHEL-7/2019-06-10.3
()[neutron@controller-0 /]$   rpm -qa |grep -i ovn
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20190430191337.e673daf.el7ost.noarch


[   38.687815] cloud-init[2799]: ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
[   38.770734] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[   38.855210] cloud-init[2799]: ci-info: | Route |   Destination   | Gateway  |     Genmask     | Interface | Flags |
[   38.920242] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[   38.972816] cloud-init[2799]: ci-info: |   0   |     0.0.0.0     | 10.0.2.1 |     0.0.0.0     |    eth0   |   UG  |
[   39.035308] cloud-init[2799]: ci-info: |   1   |     10.0.2.0    | 0.0.0.0  |  255.255.255.0  |    eth0   |   U   |
[   39.088560] cloud-init[2799]: ci-info: |   2   | 169.254.169.254 | 10.0.2.2 | 255.255.255.255 |    eth0   |  UGH  |
[   39.134530] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[  OK  ] Started Postfix Mail Transport Agent.
[   99.154421] cloud-init[2799]: 2019-06-17 05:56:41,227 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5c50>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ TIME ] Timed out waiting for device dev-vi...s-org.qemu.guest_agent.0.device.
[DEPEND] Dependency failed for QEMU Guest Agent.
         Starting oVirt Guest Agent...
[  OK  ] Started oVirt Guest Agent.
[  150.181815] cloud-init[2799]: 2019-06-17 05:57:32,253 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5690>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  168.204920] cloud-init[2799]: 2019-06-17 05:57:50,278 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5590>, 'Connection to 169.254.169.254 timed out. (connect timeout=17.0)'))]
[  169.213812] cloud-init[2799]: 2019-06-17 05:57:51,280 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[  219.223338] cloud-init[2799]: 2019-06-17 05:58:41,296 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5dd0>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]
[  270.232291] cloud-init[2799]: 2019-06-17 05:59:32,305 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012eea310>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]

It looks like the ovn_controller container is unhealthy after the migration.

Comment 7 Jakub Libosvar 2019-06-17 06:41:51 UTC
To verify, you need at least the version listed in the "Fixed In Version" field; older versions do not contain the fix. I'm flipping back to ON_QA.

Comment 12 Lon Hohberger 2019-07-10 10:40:53 UTC
According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost. This build is available now.

Comment 13 Eran Kuris 2019-10-30 09:02:12 UTC
Can't verify - depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572

Comment 16 Jakub Libosvar 2019-11-26 13:38:10 UTC
Closing as bug 1694572 won't make it to OSP14 before EOL.