Bug 1700302 - [OVN migration] OVN metadata agent doesn't work after migration from ml2/ovs
Summary: [OVN migration] OVN metadata agent doesn't work after migration from ml2/ovs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z4
: 14.0 (Rocky)
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1694572 1715112
Blocks: 1667920 1672782 1672786 1672790
TreeView+ depends on / blocked
 
Reported: 2019-04-16 09:39 UTC by Jakub Libosvar
Modified: 2020-01-21 07:57 UTC (History)
7 users (show)

Fixed In Version: python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-26 13:38:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1824984 0 None None None 2019-04-16 11:43:45 UTC
OpenStack gerrit 655360 0 'None' MERGED metadata: Resync agent when misconfiguration is detected 2020-11-10 05:27:51 UTC
OpenStack gerrit 660987 0 'None' MERGED metadata: Resync agent when misconfiguration is detected 2020-11-10 05:27:51 UTC

Description Jakub Libosvar 2019-04-16 09:39:22 UTC
Description of problem:
On Port_Binding event ovn metadata agent fails, that leads to no haproxy processes spawned in the namespace and newly created instances won't get access to metadata API.

2019-04-16 09:10:48.961 182681 INFO networking_ovn.agent.metadata.agent [-] Port 52e950ed-efad-4067-b14f-01c695e441e9 in datapath 20e76bff-3093-4a30-bb8a-3d7f17bf4be2 bound to our chassis
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent [-] Configured OVN bridge br-migration cannot be found in the system.: KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent Traceback (most recent call last):
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: KeyError: u'br-migration'
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event Traceback (most recent call last):
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/ovsdbapp/event.py", line 117, in notify_loop
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     match.run(event, row, updates)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 67, in wrapped
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     return f(*args, **kwargs)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 92, in run
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.agent.update_datapath(str(row.datapath.uuid))
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 276, in update_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.provision_datapath(datapath)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 371, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     "the system.", self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.force_reraise()
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event KeyError: u'br-migration'
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event

Version-Release number of selected component (if applicable):
python-networking-ovn-metadata-agent-5.0.2-0.20190307204430.6a774a0.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Migrate to OVN
2. Spawn an instance with keypair
3.

Actual results:
Instance won't get SSH key injected because of failing metadata

Expected results:


Additional info:

Comment 1 Jakub Libosvar 2019-04-24 11:50:41 UTC
Minimal reproducer to this outside of the migration is:

1) Have a metadata agent running over bridge br-old
2) Stop ovn-controller
3) delete br-old
4) set ovn-bridge in ovsdb to an existing bridge (e.g. br-int)
5) start ovn-controller
6) bind a port on the node, that triggers PortBindingChassisEvent
7) Check metadata agent logs

With the fix, you should see "Resyncing the agent." message. Without the fix, there will be trace as shown in comment 0.

Comment 6 Eran Kuris 2019-06-17 06:12:16 UTC
The issue reproduces on : 
OpenStack/14.0-RHEL-7/2019-06-10.3
()[neutron@controller-0 /]$   rpm -qa |grep -i ovn
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20190430191337.e673daf.el7ost.noarch


[   38.687815] cloud-init[2799]: ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
[   38.770734] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[   38.855210] cloud-init[2799]: ci-info: | Route |   Destination   | Gateway  |     Genmask     | Interface | Flags |
[   38.920242] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[   38.972816] cloud-init[2799]: ci-info: |   0   |     0.0.0.0     | 10.0.2.1 |     0.0.0.0     |    eth0   |   UG  |
[   39.035308] cloud-init[2799]: ci-info: |   1   |     10.0.2.0    | 0.0.0.0  |  255.255.255.0  |    eth0   |   U   |
[   39.088560] cloud-init[2799]: ci-info: |   2   | 169.254.169.254 | 10.0.2.2 | 255.255.255.255 |    eth0   |  UGH  |
[   39.134530] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[  OK  ] Started Postfix Mail Transport Agent.
[   99.154421] cloud-init[2799]: 2019-06-17 05:56:41,227 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5c50>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ TIME ] Timed out waiting for device dev-vi...s-org.qemu.guest_agent.0.device.
[DEPEND] Dependency failed for QEMU Guest Agent.
         Starting oVirt Guest Agent...
[  OK  ] Started oVirt Guest Agent.
[  150.181815] cloud-init[2799]: 2019-06-17 05:57:32,253 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5690>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  168.204920] cloud-init[2799]: 2019-06-17 05:57:50,278 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5590>, 'Connection to 169.254.169.254 timed out. (connect timeout=17.0)'))]
[  169.213812] cloud-init[2799]: 2019-06-17 05:57:51,280 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[  219.223338] cloud-init[2799]: 2019-06-17 05:58:41,296 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5dd0>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]
[  270.232291] cloud-init[2799]: 2019-06-17 05:59:32,305 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012eea310>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]




It looks like the ovn_controller docker is unhealthy after migration

Comment 7 Jakub Libosvar 2019-06-17 06:41:51 UTC
In order to verify, you need to have at least version written in "Fixed in version" field or higher. The older versions don't contain the fix. I'm flipping back to ON_QA

Comment 12 Lon Hohberger 2019-07-10 10:40:53 UTC
According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost.  This build is available now.

Comment 13 Eran Kuris 2019-10-30 09:02:12 UTC
cant verify - depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572

Comment 16 Jakub Libosvar 2019-11-26 13:38:10 UTC
Closing as bug 1694572 won't make it to OSP14 before EOL.


Note You need to log in before you can comment on or make changes to this bug.