Description of problem:

On a Port_Binding event the OVN metadata agent fails, which means no haproxy process is spawned in the namespace and newly created instances cannot reach the metadata API.

2019-04-16 09:10:48.961 182681 INFO networking_ovn.agent.metadata.agent [-] Port 52e950ed-efad-4067-b14f-01c695e441e9 in datapath 20e76bff-3093-4a30-bb8a-3d7f17bf4be2 bound to our chassis
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent [-] Configured OVN bridge br-migration cannot be found in the system.: KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent Traceback (most recent call last):
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent KeyError: u'br-migration'
2019-04-16 09:10:49.728 182681 ERROR networking_ovn.agent.metadata.agent
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: KeyError: u'br-migration'
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event Traceback (most recent call last):
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/ovsdbapp/event.py", line 117, in notify_loop
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     match.run(event, row, updates)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 67, in wrapped
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     return f(*args, **kwargs)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 92, in run
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.agent.update_datapath(str(row.datapath.uuid))
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 276, in update_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.provision_datapath(datapath)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 371, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     "the system.", self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     self.force_reraise()
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 367, in provision_datapath
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event     ovs_bridges.remove(self.ovn_bridge)
2019-04-16 09:10:49.729 182681 ERROR ovsdbapp.event KeyError: u'br-migration'

Version-Release number of selected component (if applicable):
python-networking-ovn-metadata-agent-5.0.2-0.20190307204430.6a774a0.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Migrate to OVN
2. Spawn an instance with a keypair

Actual results:
The instance does not get its SSH key injected because the metadata service is failing.

Expected results:

Additional info:
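The immediate trigger is the ovs_bridges.remove(self.ovn_bridge) call in provision_datapath: Python's set.remove() raises KeyError when the element is absent, and since nothing catches it, the exception escapes into the ovsdbapp notify_loop and the datapath is never provisioned. A minimal standalone sketch of that failure mode (the bridge names here are illustrative, not taken from the agent):

# Sketch of the failure mode only; not the agent's actual code.
# The agent builds a set of the bridges present on the system and
# removes the configured OVN bridge from it. set.remove() raises
# KeyError for a missing element; set.discard() would not.
ovs_bridges = {"br-int", "br-ex"}   # bridges actually on the system
ovn_bridge = "br-migration"         # stale value left over from the migration

try:
    ovs_bridges.remove(ovn_bridge)  # raises KeyError: 'br-migration'
except KeyError:
    print("Configured OVN bridge %s cannot be found in the system." % ovn_bridge)

ovs_bridges.discard(ovn_bridge)     # the non-raising variant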
A minimal reproducer for this outside of the migration:

1) Have a metadata agent running over bridge br-old
2) Stop ovn-controller
3) Delete br-old
4) Set ovn-bridge in ovsdb to an existing bridge (e.g. br-int)
5) Start ovn-controller
6) Bind a port on the node, which triggers PortBindingChassisEvent
7) Check the metadata agent logs

With the fix, you should see the "Resyncing the agent." message. Without the fix, there will be a trace as shown in comment 0. A sketch of the kind of guard involved follows below.
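For reference, a hedged sketch of the guard behavior described above, i.e. resyncing instead of letting the KeyError escape the event loop. The helper names (get_ovs_bridges, resync) are placeholders for illustration, not the agent's real methods:

import logging

LOG = logging.getLogger(__name__)


class MetadataAgentSketch(object):
    """Illustrative only; mirrors the behavior, not the real class."""

    def __init__(self, ovn_bridge):
        self.ovn_bridge = ovn_bridge

    def get_ovs_bridges(self):
        # Placeholder: the real agent lists bridges through its OVSDB IDL.
        return {"br-int", "br-ex"}

    def resync(self):
        # Placeholder: re-read state and reprovision the datapaths;
        # this is where "Resyncing the agent." would be logged.
        LOG.info("Resyncing the agent.")

    def provision_datapath(self, datapath):
        ovs_bridges = set(self.get_ovs_bridges())
        if self.ovn_bridge not in ovs_bridges:
            # Instead of letting set.remove() raise out of the event
            # loop, log the problem and fall back to a full resync.
            LOG.warning("Configured OVN bridge %s cannot be found in "
                        "the system.", self.ovn_bridge)
            return self.resync()
        ovs_bridges.remove(self.ovn_bridge)
        # ... continue provisioning the metadata namespace and haproxy ...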
The issue reproduces on: OpenStack/14.0-RHEL-7/2019-06-10.3

()[neutron@controller-0 /]$ rpm -qa | grep -i ovn
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20190430191337.e673daf.el7ost.noarch

[ 38.687815] cloud-init[2799]: ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
[ 38.770734] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[ 38.855210] cloud-init[2799]: ci-info: | Route | Destination     | Gateway  | Genmask         | Interface | Flags |
[ 38.920242] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[ 38.972816] cloud-init[2799]: ci-info: | 0     | 0.0.0.0         | 10.0.2.1 | 0.0.0.0         | eth0      | UG    |
[ 39.035308] cloud-init[2799]: ci-info: | 1     | 10.0.2.0        | 0.0.0.0  | 255.255.255.0   | eth0      | U     |
[ 39.088560] cloud-init[2799]: ci-info: | 2     | 169.254.169.254 | 10.0.2.2 | 255.255.255.255 | eth0      | UGH   |
[ 39.134530] cloud-init[2799]: ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
[ OK ] Started Postfix Mail Transport Agent.
[ 99.154421] cloud-init[2799]: 2019-06-17 05:56:41,227 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5c50>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ TIME ] Timed out waiting for device dev-vi...s-org.qemu.guest_agent.0.device.
[DEPEND] Dependency failed for QEMU Guest Agent.
Starting oVirt Guest Agent...
[ OK ] Started oVirt Guest Agent.
[ 150.181815] cloud-init[2799]: 2019-06-17 05:57:32,253 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5690>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ 168.204920] cloud-init[2799]: 2019-06-17 05:57:50,278 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5590>, 'Connection to 169.254.169.254 timed out. (connect timeout=17.0)'))]
[ 169.213812] cloud-init[2799]: 2019-06-17 05:57:51,280 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[ 219.223338] cloud-init[2799]: 2019-06-17 05:58:41,296 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [50/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012ec5dd0>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]
[ 270.232291] cloud-init[2799]: 2019-06-17 05:59:32,305 - url_helper.py[WARNING]: Calling 'http://10.0.2.1/latest/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='10.0.2.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7ff012eea310>, 'Connection to 10.0.2.1 timed out. (connect timeout=50.0)'))]

It looks like the ovn_controller docker container is unhealthy after the migration.
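The console log shows cloud-init timing out against both metadata endpoints, consistent with no haproxy listening in the namespace. When checking this from inside a guest, a quick probe against the standard metadata URL helps distinguish connect timeouts (this bug) from HTTP-level failures. A sketch using the Python 2 stdlib to match the guest images of this era; any HTTP client works:

# Quick metadata-reachability probe to run inside the guest; a sketch,
# not part of any shipped tooling.
import socket
import urllib2  # Python 2, matching the guest images in question

URL = 'http://169.254.169.254/2009-04-04/meta-data/instance-id'

try:
    reply = urllib2.urlopen(URL, timeout=5)
    print('metadata OK, instance-id: %s' % reply.read())
except urllib2.HTTPError as exc:
    # The proxy answered; the problem is above the transport layer.
    print('metadata reachable but returned HTTP %s' % exc.code)
except (urllib2.URLError, socket.timeout) as exc:
    # Connect timeout: nothing is answering on 169.254.169.254.
    print('metadata unreachable: %s' % exc)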
In order to verify, you need at least the version written in the "Fixed In Version" field, or higher. Older versions don't contain the fix. I'm flipping back to ON_QA.
According to our records, this should be resolved by python-networking-ovn-5.0.2-0.20190430191338.e673daf.el7ost. This build is available now.
Can't verify - depends on https://bugzilla.redhat.com/show_bug.cgi?id=1694572
Closing, as bug 1694572 won't make it to OSP14 before EOL.