Created attachment 1428839 [details] opendaylight_osp13_non-ha-results.html

Description of problem:
Some tempest tests in downstream CI have a problem running tests that involve communication (i.e. scenario tests). When the issue happens, 95% of scenario tempest tests fail due to "Timed out waiting for <floating_ip_here> to become reachable" (see attachment: opendaylight_osp13_non-ha-results.html). The VM does _not_ even get a fixed IP. It looks like ODL is connected to OVS (I see "is connected: true" on the OVS side) on the computes. The neutron server log returns 404s, similar to [10] which was reported against HA setups.

Version-Release number of selected component (if applicable):
non-HA deployment of OSP13 with puddle 2018-04-24.1 and opendaylight-8.0.0-8

How reproducible:
25% of jobs affected

Steps to Reproduce:
1. deploy the OSP13 overcloud - it will finish successfully
2. run tempest scenario tests on the overcloud and observe the failures

Actual results:
deployment succeeds but tempest tests fail

Expected results:
deployment succeeds and tempest tests pass

Additional info:
neutron log is full of these:

() _check_response /usr/lib/python2.7/site-packages/networking_odl/common/client.py:143
2018-04-29 17:01:34.893 29 DEBUG neutron_lib.callbacks.manager [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Notify callbacks [] for network, before_complete _notify_loop /usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py:167
2018-04-29 17:01:34.905 29 DEBUG oslo_concurrency.lockutils [req-a4c72306-cee7-4a98-9870-0a0aa30fecb4 c083ac4eb94948cfa68f08f423b6d63f c0d1b22e24b64ca9b86a0e92ef568467 - - -] Lock "event-dispatch" released by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: held 0.055s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-04-29 17:01:34.906 29 INFO networking_odl.journal.journal [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Completed (Entry ID: 3073) - delete network 572453a8-b097-45c9-af1e-2972032fbbcf (Time stamp: 63660618094.9)
2018-04-29 17:01:34.912 29 INFO neutron.wsgi [req-a4c72306-cee7-4a98-9870-0a0aa30fecb4 c083ac4eb94948cfa68f08f423b6d63f c0d1b22e24b64ca9b86a0e92ef568467 - default default] 172.17.1.21 "DELETE /v2.0/networks/572453a8-b097-45c9-af1e-2972032fbbcf HTTP/1.1" status: 204 len: 168 time: 1.0424361
2018-04-29 17:01:34.917 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.21', 33162) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-04-29 17:01:34.921 29 INFO networking_odl.journal.journal [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Processing (Entry ID: 3074) - update network 572453a8-b097-45c9-af1e-2972032fbbcf (Time stamp: 63660618094.9)
2018-04-29 17:01:34.921 29 DEBUG networking_odl.common.client [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Sending METHOD (put) URL (http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf) JSON ({ "network": { "provider:physical_network": null, "ipv6_address_scope": null, "revision_number": 5, "port_security_enabled": true, "provider:network_type": null, "router:external": false, "availability_zone_hints": [], "availability_zones": [ "nova" ], "ipv4_address_scope": null, "shared": false, "description": "", "tags": [], "updated_at": "2018-04-29T16:01:34Z", "provider:segmentation_id": null, "name": "tempest-network-smoke--751380456", "admin_state_up": true, "created_at": "2018-04-29T16:00:17Z", "mtu": 1450, "vlan_transparent": null } }) request /usr/lib/python2.7/site-packages/networking_odl/common/client.py:89
2018-04-29 17:01:34.927 29 DEBUG networking_odl.common.client [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Exception from ODL: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf Network UUID does not exist. _check_response /usr/lib/python2.7/site-packages/networking_odl/common/client.py:140
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client Traceback (most recent call last):
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 136, in _check_response
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client response.raise_for_status()
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client File "/usr/lib/python2.7/site-packages/requests/models.py", line 928, in raise_for_status
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client raise HTTPError(http_error_msg, response=self)
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client HTTPError: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf
2018-04-29 17:01:34.927 29 ERROR networking_odl.common.client
2018-04-29 17:01:34.928 29 ERROR networking_odl.common.client [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] REST request ( put ) to url ( networks/572453a8-b097-45c9-af1e-2972032fbbcf ) is failed. Request body : [{u'network': {'provider:physical_network': None, 'ipv6_address_scope': None, 'revision_number': 5, 'port_security_enabled': True, 'provider:network_type': None, 'router:external': False, 'availability_zone_hints': [], 'availability_zones': [u'nova'], 'ipv4_address_scope': None, 'shared': False, 'description': u'', 'tags': [], 'updated_at': '2018-04-29T16:01:34Z', 'provider:segmentation_id': None, 'name': u'tempest-network-smoke--751380456', 'admin_state_up': True, 'created_at': '2018-04-29T16:00:17Z', 'mtu': 1450, 'vlan_transparent': None}}] service: HTTPError: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Error while processing (Entry ID: 3074) - update network 572453a8-b097-45c9-af1e-2972032fbbcf (Time stamp: 63660618094.9): HTTPError: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal Traceback (most recent call last):
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 284, in _sync_entry
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal self.client.sendjson(method, urlpath, to_send)
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 106, in sendjson
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal 'body': obj})
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal self.force_reraise()
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal six.reraise(self.type_, self.value, self.tb)
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 98, in sendjson
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal self.request(method, urlpath, data))
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 140, in _check_response
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal {'e': error, 'text': response.text}, exc_info=1)
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal self.force_reraise()
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal six.reraise(self.type_, self.value, self.tb)
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 136, in _check_response
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal response.raise_for_status()
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/models.py", line 928, in raise_for_status
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal raise HTTPError(http_error_msg, response=self)
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal HTTPError: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf
2018-04-29 17:01:34.928 29 ERROR networking_odl.journal.journal
2018-04-29 17:01:34.945 29 INFO networking_odl.journal.journal [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Processing (Entry ID: 3074) - update network 572453a8-b097-45c9-af1e-2972032fbbcf (Time stamp: 63660618094.9)
2018-04-29 17:01:34.946 29 DEBUG networking_odl.common.client [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Sending METHOD (put) URL (http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf) JSON ({ "network": { "provider:physical_network": null, "ipv6_address_scope": null, "revision_number": 5, "port_security_enabled": true, "provider:network_type": null, "router:external": false, "availability_zone_hints": [], "availability_zones": [ "nova" ], "ipv4_address_scope": null, "shared": false, "description": "", "tags": [], "updated_at": "2018-04-29T16:01:34Z", "provider:segmentation_id": null, "name": "tempest-network-smoke--751380456", "admin_state_up": true, "created_at": "2018-04-29T16:00:17Z", "mtu": 1450, "vlan_transparent": null } }) request /usr/lib/python2.7/site-packages/networking_odl/common/client.py:89
2018-04-29 17:01:34.950 29 DEBUG networking_odl.common.client [req-e5a5019d-7368-49b9-9f72-43ad888e0b66 - - - - -] Exception from ODL: 404 Client Error: Not Found for url: http://172.17.1.14:8081/controller/nb/v2/neutron/networks/572453a8-b097-45c9-af1e-2972032fbbcf Network UUID does not exist. _check_response /usr/lib/python2.7/site-packages/networking_odl/common/client.py:140
Created attachment 1428841 [details] opendaylight_api_container.log - it has the karaf and docker logs combined
See opendaylight_api_container.log for karaf.log. More data (ovs, datastore, sosreport dumps) is available on demand (too much to include here).
This seems like the same bug as BZ#1568976 - if not please reopen. *** This bug has been marked as a duplicate of bug 1568976 ***
Since this bug is reported against a non-HA setup and the symptoms are slightly different, I'm reopening.
This bug looks like https://bugzilla.redhat.com/show_bug.cgi?id=1568976, with the difference that my deployment succeeds but the vast majority of tests fail afterwards (as if ODL hadn't started correctly in the first place, as in the other bug).
What I don't understand here is how the deployment succeeded. If this is non-HA and this controller is replying with 404, then I'm not sure how it got past the netvirt:1 check. Can you try curling that and see if it responds 200 OK?

http://<odl ip>:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1

If that doesn't work now, it means ODL was working (at least temporarily, during the deployment) and then went into this 404 state.
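In case it helps, a minimal Python sketch of that check (it assumes the default ODL REST credentials admin/admin and the controller IP from the logs above - adjust both for the actual deployment):

import requests

ODL_IP = "172.17.1.14"  # controller IP taken from the neutron logs above
URL = ("http://%s:8081/restconf/operational/"
       "network-topology:network-topology/topology/netvirt:1" % ODL_IP)

# 200 means the netvirt:1 topology is being served (the same thing the
# deployment validation checks); 401 means wrong credentials, while 404 or
# a timeout means ODL is no longer serving the topology.
resp = requests.get(URL, auth=("admin", "admin"), timeout=10)
print("%s %s" % (resp.status_code, resp.reason))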
can't reproduce in 20+ subsequent runs
It looks like it happens less often, in ~10% of the non-HA deployments. Controller and compute logs are at: https://url.corp.redhat.com/444497d
Created attachment 1432472 [details] tempest tests results from DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-1cont_2comp-ipv4-vxlan-tempest #80
Can you please attach the recent neutron & odl logs relevant to this issue?
Created attachment 1434425 [details] metadta-agent_died
Created attachment 1434474 [details] neutron server.log.4
Created attachment 1434475 [details] sosreport controller-0
Mike, metadata agent died at the start of the deployment - it maybe the reason for the massive problem (I thought VMs would get IPs via DHCP rather than cloud-init but maybe that's changed) - see attached metadta-agent_died metadata-agent log is from the same build "tempest tests results from DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-1cont_2comp-ipv4-vxlan-tempest #80" attachment in case it's not metadata agent is our cause of trouble I'm attaching neutron server log from the times the first tempest test failed (test_minimum_basic_scenario) at around 2018-05-04 15:25 (or 20:25 in neutron server log), the VM in that suppose to get 10.0.0.10 fixed IP (port ID: 017b1ad1-1d50-46b8-9f95-53fe57eb93e9) for other logs, confs and captures I'm attaching sosreport captured at exact moment of the above tempest test failure, NOTE to get nova compute logs I'd have to attach sosreports/logs from computes - let me know if they're needed
(In reply to Waldemar Znoinski from comment #16)
> Mike, metadata agent died at the start of the deployment - it maybe the
> reason for the massive problem (I thought VMs would get IPs via DHCP rather
> than cloud-init but maybe that's changed) - see attached metadta-agent_died
> metadata-agent log is from the same build "tempest tests results from
> DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-1cont_2comp-ipv4-
> vxlan-tempest #80" attachment
>
> in case it's not metadata agent is our cause of trouble I'm attaching
> neutron server log from the times the first tempest test failed
> (test_minimum_basic_scenario) at around 2018-05-04 15:25 (or 20:25 in
> neutron server log), the VM in that suppose to get 10.0.0.10 fixed IP (port
> ID: 017b1ad1-1d50-46b8-9f95-53fe57eb93e9)
> for other logs, confs and captures I'm attaching sosreport captured at
> exact moment of the above tempest test failure, NOTE to get nova compute
> logs I'd have to attach sosreports/logs from computes - let me know if
> they're needed

I looked at the ODL & Neutron logs but can see nothing wrong there, they both seem to be working fine. I wasn't able to find a log of the VM console, but can you check in the VM console logs if the VM is getting an IP from DHCP or is it trying the metadata agent?
Mike, you can find the VMs' console output in the 'results.html' file attached, along with tracebacks (if the tempest test is written to display them). I can't see any DHCP (client) related messages but I do see the metadata problems.
(In reply to Waldemar Znoinski from comment #20)
> Mike you can find the VMs console output in the 'results.html' file attached
> along with tracebacks (if the tempest tests is written to display it), I
> can't see any dhcp (client) related messages but I see the metadata problems

It doesn't seem like the VM is trying to get an IP from the DHCP, so it won't get a route to the 169.254 subnet set up. Can you check a few times with one of those failing tests and a cirros image and see if this reproduces?
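A rough sketch of how the console log can be checked for that (the server ID is a placeholder, and the grep strings are the usual cirros udhcpc/metadata messages, which may differ for other images):

import subprocess

SERVER = "<server-uuid-of-the-failing-vm>"  # placeholder, take it from the failing test

console = subprocess.check_output(
    ["openstack", "console", "log", "show", SERVER]).decode()

for line in console.splitlines():
    # udhcpc's "Sending discover" / "Lease of ... obtained" lines show whether
    # DHCP was attempted and succeeded; the 169.254.169.254 lines show the
    # metadata fallback cirros tries afterwards.
    if "udhcpc" in line or "Lease of" in line or "169.254.169.254" in line:
        print(line)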
This one seems to be more of an ordering issue from n-odl. For starters, I see some entry IDs missing in the logs: 24, 25, 40. Then there is this:

2018-05-04 20:04:18.929 29 INFO networking_odl.journal.journal [req-454abce7-0c44-489b-8381-ee04c691bc69 - - - - -] Completed (Entry ID: 45) - delete network c2e3b9b3-b659-4e4f-8bcd-66a9c7f8a62a (Time stamp: 63661061058.9)
2018-05-04 20:04:18.941 29 INFO networking_odl.journal.journal [req-454abce7-0c44-489b-8381-ee04c691bc69 - - - - -] Processing (Entry ID: 46) - update network c2e3b9b3-b659-4e4f-8bcd-66a9c7f8a62a (Time stamp: 63661061058.9)
2018-05-04 20:04:18.942 29 DEBUG networking_odl.common.client [req-454abce7-0c44-489b-8381-ee04c691bc69 - - - - -] Sending METHOD (put) URL (http://172.17.1.10:8081/controller/nb/v2/neutron/networks/c2e3b9b3-b659-4e4f-8bcd-66a9c7f8a62a) JSON ({ "network": {

^ This PUT fails because we just deleted the network in the previous entry. There are similar entries for other networks:

2018-05-04 20:04:21.919 30 INFO networking_odl.journal.journal [req-9208cd05-724f-4946-ae4a-5a30e8154434 - - - - -] Completed (Entry ID: 51) - delete network 92d810a9-4a99-427e-8aea-ab2720846914 (Time stamp: 63661061061.9)
2018-05-04 20:04:21.930 30 INFO networking_odl.journal.journal [req-9208cd05-724f-4946-ae4a-5a30e8154434 - - - - -] Processing (Entry ID: 52) - update network 92d810a9-4a99-427e-8aea-ab2720846914 (Time stamp: 63661061061.9)
2018-05-04 20:04:21.931 30 DEBUG networking_odl.common.client [req-9208cd05-724f-4946-ae4a-5a30e8154434 - - - - -] Sending METHOD (put) URL (http://172.17.1.10:8081/controller/nb/v2/neutron/networks/92d810a9-4a99-427e-8aea-ab2720846914) JSON ({

There are about 708/6 = 118 such networks out of 119, and only networks are affected, so this issue is different from the HA one (that one was likely due to an unstable cluster); this one is due to network events being processed out of order.
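For reference, this is roughly how those delete-then-update pairs can be pulled out of server.log (a sketch only - the log path is the usual OSP13 containerized neutron location and the message formats are taken from the excerpts above, so adjust as needed):

import re

pattern = re.compile(
    r"Completed \(Entry ID: (\d+)\) - delete network ([0-9a-f-]+)"
    r"|Processing \(Entry ID: (\d+)\) - update network ([0-9a-f-]+)")

deleted = {}  # network uuid -> entry ID of the completed delete
with open("/var/log/containers/neutron/server.log") as log:
    for line in log:
        m = pattern.search(line)
        if not m:
            continue
        if m.group(1):                      # a delete completed for this network
            deleted[m.group(2)] = m.group(1)
        elif m.group(4) in deleted:         # an update processed after the delete -> 404
            print("network %s: delete entry %s followed by update entry %s"
                  % (m.group(4), deleted[m.group(4)], m.group(3)))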
Thanks Vishal. If I read it correctly, for these "request ordering" issues there's nothing we can do in n-odl to fix them, and https://bugzilla.redhat.com/show_bug.cgi?id=1550699 would get it sorted once and for all?
No, this reordering is a current driver issue and probably fixable in the driver. Unless OpenStack sent a delete before an update, this is a case of the driver reversing the order. https://bugzilla.redhat.com/show_bug.cgi?id=1550699 is more for dependency checks, i.e. a subnet can't be created until its network is created. I think even with 1550699 in place we will need to address this issue. ODL would take care of dependencies but would still expect events to come in the right order. Mike can probably answer this better.
Mike, would https://bugzilla.redhat.com/show_bug.cgi?id=1519783 fix these n-odl ordering issues?
(In reply to Vishal Thapar from comment #25)
> No, this reordering is a current driver issue and probably fixable in
> driver. Unless openstack sent a delete before an update this is case of
> driver reversing the order.

Actually, as you can see from the entry IDs, which are sequential, the update was received by neutron after the delete. You can cross-check in the log and see if that's indeed the case, but since the entry ID is generated sequentially there seems to be no other explanation. Either way I doubt that this is the cause of the observed behavior, since receiving an update on a deleted resource shouldn't cause the VMs not to receive an IP address.

> https://bugzilla.redhat.com/show_bug.cgi?id=1550699 is more for dependency
> checks i.e. subnet can't be created till network is created. I think even
> with 1550699 in place we will need to address this issue. ODL would take
> care of dependencies but would still expect events to come in right order.
> Mike can probably answer this better.

Yes, that bug is unrelated to this.
Mike, could it be the case that OpenStack decided to delete the port (and we see that delete message) because of the missing/late update for that port (i.e. it was supposed to change state to active but didn't become active quickly enough and was marked for deletion)?
Assigning this to Vishal to investigate. I think the ordering issue is a red herring, but let's see what he comes up with.
Looks like another case of missing table 48, with the n-odl issue being a red herring.

cookie=0x8000006, duration=457.462s, table=47, n_packets=0, n_bytes=0, priority=5,ct_state=+dnat,ip,metadata=0x30e22/0xfffffe actions=load:0->NXM_OF_IN_PORT[],resubmit(,21)
cookie=0x80513f8, duration=456.483s, table=50, n_packets=0, n_bytes=0, priority=20,metadata=0x6013f8000000/0xfffffffff000000,dl_src=fa:16:3e:e6:20:ad actions=goto_table:51
cookie=0x80313f8, duration=456.483s, table=51, n_packets=0, n_bytes=0, priority=20,metadata=0x13f8000000/0xffff000000,dl_dst=fa:16:3e:e6:20:ad actions=load:0x6000->NXM_NX_REG6[],resubmit(,220)
cookie=0x80313f8, duration=424.045s, table=51, n_packets=0, n_bytes=0, priority=20,metadata=0x13f8000000/0xffff000000,dl_dst=fa:16:3e:83:f7:44 actions=set_field:0x35->tun_id,load:0x400->NXM_NX_REG6[],resubmit(,220)
cookie=0x870138a, duration=2102.008s, table=52, n_packets=0, n_bytes=0, priority=5,metadata=0x138a000000/0xffff000001 actions=write_actions(group:210004)

Table 48 and the default flows for tables 50 and 51 are both missing.

*** This bug has been marked as a duplicate of bug 1568989 ***
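For anyone triaging a similar run later, a quick sketch to confirm which of those tables are populated on the compute (it assumes br-int is configured for OpenFlow13 and that ovs-ofctl can reach the bridge on that node):

import subprocess

flows = subprocess.check_output(
    ["ovs-ofctl", "-O", "OpenFlow13", "dump-flows", "br-int"]).decode()

# Count the flows sitting in tables 48, 50 and 51; on a healthy compute
# table 48 and the default entries for 50/51 should be present, so zero
# hits matches the symptom described above.
for table in ("table=48", "table=50", "table=51"):
    print("%s: %d flow(s)" % (table, flows.count(table)))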