When running a Browbeat+Rally run with the netcreate-boot scenario (i.e., we spawn a VM and verify that it becomes ACTIVE), with concurrency set to 5 and times set to 2000, we observed that after spawning some 1422 VMs all subsequent VMs start to fail (i.e., they go into an ERROR state). When we looked at the logs, we observed that after spawning around 1200+ VMs the errors started to creep in. The errors are mostly related to the "ODL L2" agent (i.e., neutron-odlagent-portbinding) becoming inactive, and hence port binding fails.

Error reported in Rally:

Binding failed for port 000f35ae-9db5-4a9d-9324-255fe7c668d0, please check neutron logs for more information.', u'code': 500, u'created': u'2017-11-30T14:37:43Z'}

It looks like networking-odl marks the "ODL L2" agent as down when the information is not returned as part of the following query:

http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs

After the Browbeat test completed, we could see that all the "ODL L2" agents were down/inactive. On the compute nodes, we see that the vswitchd connection to the ODL controller is flaky, and this could be the reason why the agent info is not returned from the hostconfig URL.

<SNIP>
2017-11-30T17:40:12.968Z|01049|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connected
2017-11-30T17:40:12.969Z|01050|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connection closed by peer
2017-11-30T17:40:12.969Z|01051|rconn|INFO|br-int<->tcp:172.16.0.17:6653: continuing to retry connections in the background but suppressing further logging
2017-11-30T17:40:22.778Z|01052|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connected
2017-11-30T17:40:22.778Z|01053|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connection closed by peer
<SNIP>

Setup: OSP12 image; ODL running in a non-clustered, non-containerized setup with 1 controller and 32 compute nodes. Port-status is disabled in this setup, and inactivity_probe/max_backoff are appropriately configured.

RPMs used:
python-networking-odl-11.0.0-2.el7ost.noarch
opendaylight-6.2.0-5.el7ost.noarch

Agent list after the test (neutron agent-list):
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
| id                                   | agent_type     | host                               | availability_zone | alive | admin_state_up | binary                       |
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
| 0201be96-3c80-451a-971e-dfdeb7ebbdd3 | ODL L2         | overcloud-compute-6.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 0283ed5d-b9c8-43c9-9c99-6c27b319b041 | ODL L2         | overcloud-compute-15.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 1ad29e22-cbf8-4384-b1e3-c14e3779a4ae | ODL L2         | overcloud-compute-18.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 29687ccb-aa71-4150-8059-75cdcc98b284 | ODL L2         | overcloud-compute-5.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 29b494c2-bfda-4cdc-b189-1713d66793b0 | ODL L2         | overcloud-compute-22.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 2e981caa-6d6e-4a0d-b716-d3f0d79c4968 | ODL L2         | overcloud-compute-12.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 3928ceb7-fc75-4309-a38d-2a8cad2b5666 | ODL L2         | overcloud-compute-14.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4011168b-2bda-4bbc-8b3d-0458120b4c13 | ODL L2         | overcloud-compute-1.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4404a197-73a2-41b7-a905-519a06318af0 | ODL L2         | overcloud-compute-30.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4d148674-1e2f-4bcd-986a-505b8bc8f84f | ODL L2         | overcloud-compute-19.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 5623cb1c-0dcf-4126-b4f6-f19411e6ec5f | ODL L2         | overcloud-compute-26.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 5703d801-49dc-40d7-adda-e7949a6b6771 | ODL L2         | overcloud-compute-13.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 62d09361-4222-4a0a-8b91-d03dd324f1e7 | ODL L2         | overcloud-controller-0.localdomain |                   | xxx   | True           | neutron-odlagent-portbinding |
| 66edd958-936d-48c9-99e3-4d9241315fe7 | ODL L2         | overcloud-compute-28.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 682cfb7e-9cd0-4cf8-a1c9-ea8c6450ec37 | ODL L2         | overcloud-compute-8.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 72292f55-6a4a-44f2-85a1-ac3640f1b9f0 | ODL L2         | overcloud-compute-9.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 75ae43a1-2dd8-4223-ae34-400bbd3b5ac8 | ODL L2         | overcloud-compute-11.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7816271e-9c9c-49a6-9cc4-b56c945fd92b | ODL L2         | overcloud-compute-4.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7b567e37-5ca3-4e3d-91de-31ae56191a2a | ODL L2         | overcloud-compute-25.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7c7f5637-4242-4015-ac1d-ce0932346519 | ODL L2         | overcloud-compute-21.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7f7c54d8-028e-458c-8231-c7774b86e1cf | ODL L2         | overcloud-compute-3.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9270a276-9e0c-42a9-85f5-513bcb9e717a | ODL L2         | overcloud-compute-2.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 92b3383d-0fd3-4578-94e9-3e67239b86a9 | ODL L2         | overcloud-compute-24.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 94cc6e57-0ffe-4243-ac1d-46f17cee5ed7 | ODL L2         | overcloud-compute-31.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9fc823e4-146b-417d-9c13-078b11d6e7b9 | ODL L2         | overcloud-compute-20.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9feaf78a-f2da-477d-8053-cc599b496aa8 | ODL L2         | overcloud-compute-27.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| a2b758db-06b7-490e-b6d0-d05938848a23 | ODL L2         | overcloud-compute-16.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| a68eb489-1ddd-4ab6-bbc9-931aa892282c | ODL L2         | overcloud-compute-17.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| ace04e9f-a647-471f-92c9-e3f10e036fd7 | ODL L2         | overcloud-compute-10.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| b5fc5492-edf0-4321-bf9f-41f1c541e381 | ODL L2         | overcloud-compute-0.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| bef93822-1da4-48b6-b80e-16733f02c858 | Metadata agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent       |
| c085c9cd-412c-4a28-8280-d5521de0099c | ODL L2         | overcloud-compute-29.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| f2fce8b9-7584-4255-ab7e-02475d82826f | ODL L2         | overcloud-compute-23.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| fc35e8f7-8e39-49df-9824-2447367bd59f | ODL L2         | overcloud-compute-7.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| fdf5a3be-d458-461b-8e94-e99ecdf5e435 | DHCP agent     | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent           |
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
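For context on the failure mode described above: networking-odl periodically polls that hostconfigs URL and reports each host's "ODL L2" pseudo-agent based on the result, so a missing or failed response is what flips the agent to dead. The sketch below only illustrates that polling loop; the URL and JSON layout are taken from this report, while the credentials, interval, and function name are assumptions and not the driver's actual code.

```python
# Illustrative only: a minimal poller for the ODL hostconfig endpoint referenced
# above. networking-odl's real driver (pseudo_agentdb_binding) does more work;
# credentials and interval here are hypothetical placeholders.
import time
import requests

HOSTCONFIG_URL = "http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs"

def poll_hostconfigs(auth=("admin", "admin"), interval=30):
    while True:
        try:
            resp = requests.get(HOSTCONFIG_URL, auth=auth, timeout=10)
            resp.raise_for_status()
            hostconfigs = resp.json()["hostconfigs"]["hostconfig"]
            hosts = {hc["host-id"] for hc in hostconfigs}
            print("hostconfig present for %d hosts" % len(hosts))
        except Exception as exc:
            # If this query fails (or a host's entry is missing), networking-odl
            # has nothing to report for that host and the "ODL L2" agent is
            # eventually marked dead.
            print("hostconfig query failed: %s" % exc)
        time.sleep(interval)
```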
Created attachment 1361711 [details] ovs-vswitchd log file from one of the compute nodes
Created attachment 1361712 [details] Karaf logs while tests were running
OVSDB is very chatty, so updates from the ovsdb nodes to ODL cause CPU spikes when processed. Some thoughts to test out the theory:

1. ovsdb table updates that are too frequent.
   - skip-monitoring-manager-status in the southbound.cfg
     -- This flag can be used to reduce the number of status updates. It is best used in a single-node setup, as the default value of true is used in HA setups.
     -- Manager table status entry: https://github.com/opendaylight/ovsdb/commit/5f8ca82c15d75d6c703cef2ded1072c4a588c93b#diff-493334811bc2df8974979e4168cf8fb1

2. What are the specs for the test setup? Hardware or virtual servers? CPUs and RAM? At scale this can have an impact, as more and more mdsal processing is done.

3. Think it was mentioned that the ovs nodes are the ones disconnecting. That leads me to think that the ODL side is taking too long to respond and the ovs nodes disconnect.

4. Does disabling bfd help? It may only affect tunnels so may not apply, but I wasn't sure whether ovs ties bfd status to the connections.

5. Do the vswitchd logs show why ovs disconnects? (A quick log-scanning sketch follows below.)
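On point 5, a quick way to quantify the flapping from the attached ovs-vswitchd log is to count connect/close events per peer. This is a throwaway sketch; the log path is an assumption and the pattern is based only on the rconn lines quoted in the description.

```python
# Throwaway sketch: count connect / "connection closed by peer" churn per peer
# in an ovs-vswitchd log, based on the rconn lines quoted in the description.
# The default log path is an assumption.
import re
from collections import Counter

PATTERN = re.compile(
    r"\|rconn\|INFO\|(\S+?)<->(tcp:\S+): (connected|connection closed by peer)")

def count_flaps(path="/var/log/openvswitch/ovs-vswitchd.log"):
    events = Counter()
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                bridge, peer, event = m.groups()
                events[(bridge, peer, event)] += 1
    for (bridge, peer, event), count in sorted(events.items()):
        print("%s<->%s: %s x%d" % (bridge, peer, event, count))
```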
(In reply to Sam Hague from comment #3)
> OVSDB is very chatty so updates from the ovsdb nodes to ODL cause cpu spikes
> when processed. Some thoughts to test out the theory:
>
> 1. ovsdb table updates that are too frequent.
> - skip-monitoring-manager-status in the southbound.cfg
> -- This flag can be used to reduce the number of status updates. It is best
> to use this in a single-node setup as the default value of true is used in
> HA setups.
> -- Manager table status entry:
> https://github.com/opendaylight/ovsdb/commit/5f8ca82c15d75d6c703cef2ded1072c4a588c93b#diff-493334811bc2df8974979e4168cf8fb1

Thanks for the inputs, @Sam. Yes, we did configure skip-monitoring-manager-status = true during our testing.

> 2. What are the specs for the test setup? Hardware or virtual servers? CPUs
> and RAM? At scale this can have an impact as more and more mdsal processing
> is done.

It's all hardware/bare-metal Dell R630 servers (2x E5-2683 v3, 28 total cores, 56 threads) with 128 GB RAM.

> 3. Think it was mentioned that the ovs nodes are the ones disconnecting.
> That leads me to think that the ODL side is taking too long to respond and
> the ovs nodes disconnect.

I think it's actually ODL/OVSDB which is closing the connection to the switch.

> 4. Does disabling bfd help? May just affect tunnels so may not apply, but
> wasn't sure if ovs related bfd status to the connections.
>
> 5. Do the vswitchd logs show why ovs disconnects?

This issue is for a single-controller setup, and yes, there are some connection timed out errors in the log: https://bugzilla.redhat.com/attachment.cgi?id=1361711
Sridhar, would it make sense to run these tests again with port status enabled?
(In reply to Sam Hague from comment #7)
> Sridhar, would it make sense to run these tests again with port status
> enabled?

IMHO we don't have to run the tests again with port-status enabled, since it's a completely different issue. The OVSDB connection to the switch is flaky - we have to identify why that is and fix/fine-tune it.
Victor, can you please check if this is the same bug or different from bug 1560872?
Hi Mike, I have requested help from Sai to reproduce this in the scale lab for debugging. From looking at bug 1560872, the symptoms sound similar, but from reading the bug details so far, I don't think they are the same. Bug 1560872 is reproduced by rebooting the compute node, for which there is also an upstream bug: https://jira.opendaylight.org/browse/OVSDB-444
So we are still seeing this issue when booting a bunch of VMs. The agent status keeps flapping every few seconds: https://gist.github.com/smalleni/261f70645f284fc5d7e084f4c2a4033e

I would prefer to have a developer look at the setup while it's happening. Also, what would be the logs of interest? neutron-server logs?
Sai, when you get a chance tomorrow to try this again, please enable ovsdb logs:

log:set TRACE org.opendaylight.ovsdb.southbound

From what I saw in the attached ovs logs, it appears that ODL is closing the connection (as Sridhar mentioned). So, let's start by seeing what OVSDB is doing when we get into this situation.
Sai, I would like you to enable the following debug logs on all compute nodes:

sudo ovs-appctl -t ovs-vswitchd vlog/set connmgr
sudo ovs-appctl -t ovs-vswitchd vlog/set rconn

These logs will show transmission and receipt of echo messages from OVS on the compute node to ODL.

Also, please configure the following on all compute nodes:

sudo ovs-vsctl set Manager 06c2e304-dc68-4b1c-8752-a95121d8d528 inactivity_probe=30000

(Substitute the uuid above with the actual uuid.) This will change the echo timeout from 5 seconds to 30 seconds (from OVS on the compute node).

I suspect that we are encountering a situation where ODL is not able to respond to the echo from OVS on the compute node in a timely manner, under heavy loads. So, I want to increase this timer to confirm whether the L2 agent stays up. If this ends up being the case, we will have to:

1. Tune the timer to a reasonable value for scaled deployments
2. Perform more analysis to determine why the 5 second default is not enough for echo messages. This may include:
   o Priority message queue for time-sensitive control packets
   o Monitor CPU, to see if ODL is CPU starved during this burst of VM creation
   o Look to see if messages are being dropped
   o Add additional logging to OVSDB (and JsonRpcEndpoint) to add debug logs for echo receipt/transmission, including node id information
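For convenience, here is a hedged helper along the lines of the ovs-vsctl command above: it looks up the Manager record UUID locally instead of hard-coding it, then raises inactivity_probe to 30000 ms. This is an illustrative sketch meant to be run on each compute node, not part of any shipped tooling.

```python
# Sketch only: find the local Manager record(s) and bump inactivity_probe to
# 30000 ms, instead of hard-coding the UUID as in the command above.
import re
import subprocess

def set_manager_inactivity_probe(ms=30000):
    out = subprocess.check_output(
        ["ovs-vsctl", "--columns=_uuid", "list", "Manager"]).decode()
    for uuid in re.findall(r"_uuid\s*:\s*(\S+)", out):
        subprocess.check_call(
            ["ovs-vsctl", "set", "Manager", uuid, "inactivity_probe=%d" % ms])
        print("Manager %s inactivity_probe set to %d ms" % (uuid, ms))

if __name__ == "__main__":
    set_manager_inactivity_probe()
```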
Removing needinfo since the test with the required logging was set up and a debug session was completed, leading to some findings. Need to debug further in the future for a resolution.
Is this issue fixed? I am facing a similar issue in OpenStack; kindly let me know if there is a workaround or fix for this issue. Thanks!
We need to set a proper default inactivity probe timeout in OOO, hence changing component to THT.
Created attachment 1471387 [details] Neutron server debug logs from controller-0

I've run some additional Browbeat Rally tests, with the controller inactivity_probe set to 300 secs. Even with that config, I was still seeing dead L2 agents.

I wrote a script to poll the hostconfig every 5 secs from the operational datastore on all control nodes, while also monitoring for dead L2 agents in a second script. What I could see was that "openstack network agent list" would report dead agents, while the script polling on each controller showed valid, accurate hostconfig present.

Further analysis indicates that the networking-odl periodic task to get hostconfigs from ODL is not running every 30 seconds as is configured. In several cases, this task only ran every 2 mins, 1 sec, which is way too long between updates and results in the L2 agent being marked as dead.

In networking-odl/journal/periodic_task.py, there is a check to make sure the task is executed only after a specified interval. That check is being hit, and as a result, the hostconfig task (and other tasks) are being skipped (see the sketch after the log excerpt below).
https://github.com/openstack/networking-odl/blob/stable/queens/networking_odl/journal/periodic_task.py#L96

It seems possible that the last task execution is not being updated properly, and as a result, was_periodic_task_executed_recently() is returning an incorrect value.
https://github.com/openstack/networking-odl/blob/stable/queens/networking_odl/db/db.py#L163

I've attached the neutron debug logs. For reference, here are the logs of interest, showing the hostconfig periodic task being skipped:

First healthcheck failure
==========================
2018-07-29 19:06:17.580 54 WARNING neutron.db.agents_db [req-cfeb3f33-d4eb-4b77-98a8-135ccfef0ebb - - - - -] Agent healthcheck: found 42 dead agents out of 48:
Type      Last heartbeat       host
ODL L2    2018-07-29 19:04:55  overcloud-1029pcompute-3.localdomain
ODL L2    2018-07-29 19:04:55  overcloud-1029pcompute-8.localdomain
2018-07-29 19:08:08.596 54 WARNING neutron.db.agents_db [req-cfeb3f33-d4eb-4b77-98a8-135ccfef0ebb - - - - -] Agent healthcheck: found 42 dead agents out of 48:

Ah, it is taking too long between ODL host config updates
==========================================================
2 mins, 1 sec = 121 secs... too long
2018-07-29 19:00:17.533 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2018-07-29 19:02:18.512 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2 mins... too long
2018-07-29 19:04:18.384 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2 mins 1 sec
2018-07-29 19:06:19.996 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:

Why is it taking so long between ODL hostconfig updates? The hostconfig periodic task is skipping execution because it thinks the task ran recently.

Periodic task to get host config
================================
2018-07-29 19:00:17.416 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:00:18.193 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] hostconfig task has been finished
2018-07-29 19:00:47.423 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:00:47.429 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:01:17.419 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:01:17.425 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:01:47.439 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:01:47.445 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:02:17.445 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:02:17.479 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:02:17.484 53 INFO networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] REST/GET hostconfigs from ODL
2018-07-29 19:02:19.159 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] hostconfig task has been finished
2018-07-29 19:02:47.469 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:02:47.475 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:03:17.489 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:03:17.495 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:03:47.509 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:03:47.514 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:04:17.526 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:04:17.537 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:04:47.543 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:04:47.548 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:05:17.568 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:05:17.573 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:05:47.592 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:05:47.598 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:06:17.588 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:06:17.599 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:06:17.609 53 INFO networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] REST/GET hostconfigs from ODL
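To make the skip behaviour above concrete, here is a deliberately simplified sketch of the gating logic described (not the actual networking-odl periodic_task/db code; names and storage are simplified). The point it illustrates: if the recorded last-execution timestamp is not maintained correctly, every 30-second wakeup concludes the task "executed recently" and skips, and the hostconfig fetch only happens once the record finally ages out.

```python
# Simplified illustration of the periodic-task gating described above; not the
# actual networking-odl implementation. If created_at is not updated at the
# right moment (or is set by a run that never finishes the hostconfig fetch),
# subsequent runs keep seeing "executed recently" and skip.
import datetime

class PeriodicTaskRecord(object):
    def __init__(self):
        self.created_at = datetime.datetime.min   # last execution timestamp

def was_periodic_task_executed_recently(record, interval_secs):
    age = datetime.datetime.utcnow() - record.created_at
    return age <= datetime.timedelta(seconds=interval_secs)

def run_hostconfig_task(record, interval_secs=30):
    if was_periodic_task_executed_recently(record, interval_secs):
        print("Periodic hostconfig task executed after periodic interval. Skipping execution.")
        return
    record.created_at = datetime.datetime.utcnow()
    print("Starting _get_and_update_hostconfigs phase of periodic task hostconfig.")
    # ... REST/GET hostconfigs from ODL, then update Neutron's agent DB ...
```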
Here are some additional debug logs showing hostconfig task being skipped on controller-0. Will check other controllers also in the cluster. [heat-admin@overcloud-controller-0 neutron]$ tail -F server.log | egrep 'Starting hostconfig periodic|Periodic hostconfig task executed after periodic interval Skipping execution|ODLPORTBINDING hostconfigs' 2018-07-30 13:39:50.681 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:39:50.686 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:40:20.677 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:40:20.683 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:40:50.693 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:40:50.698 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:41:20.714 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:41:20.740 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs: 2018-07-30 13:41:50.724 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:41:50.729 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
Ran the Browbeat test again, and monitored the networking-odl hostconfig task on all controllers. For this run, I had modified restconf_poll_interval to 15 secs, from the default 30 seconds.

At the time the first dead agent is detected by neutron, the hostconfig task is running on controller-2. From the logs, it is clear that the hostconfig task took way too long to finish - in this case, 1 min 46 seconds. As a result, agents are marked dead.

server.log:2018-07-30 23:23:29.662 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
server.log:2018-07-30 23:23:29.702 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] ODLPORTBINDING hostconfigs:
server.log:2018-07-30 23:23:53.390 54 DEBUG neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 48 active agents agent_health_check /usr/lib/python2.7/site-packages/neutron/db/agents_db.py:264
server.log:2018-07-30 23:24:30.445 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 10 dead agents out of 48:
server.log:2018-07-30 23:25:07.434 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 36 dead agents out of 48:
server.log:2018-07-30 23:25:15.089 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Finished _get_and_update_hostconfigs phase of hostconfig task.
server.log:2018-07-30 23:25:15.111 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] hostconfig task has been finished
server.log:2018-07-30 23:25:15.112 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Starting hostconfig periodic task.

I looked at the full neutron log during this time period. I see these db exceptions, with retries. It looks like these exceptions/retries may be the root cause of the hostconfig task taking so long to complete.

2018-07-30 23:23:42.497 35 DEBUG neutron.db.api [req-def0a09e-ca89-48c7-af20-8965a9d2689d - - - - -] Retry wrapper got retriable exception: UPDATE statement on table 'opendaylightjournal' expected to update 1 row(s); 0 were matched. wrapped /usr/lib/python2.7/site-packages/neutron/db/api.py:126
2018-07-30 23:23:42.497 35 DEBUG oslo_db.api [req-def0a09e-ca89-48c7-af20-8965a9d2689d - - - - -] Performing DB retry for function networking_odl.db.db.get_oldest_pending_db_row_with_lock wrapper /usr/lib/python2.7/site-packages/oslo_db/api.py:149
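The retriable exception above ("expected to update 1 row(s); 0 were matched") is the typical signature of several neutron-server workers racing to claim the same journal row with an optimistic, compare-and-swap style UPDATE. Below is a schematic of that pattern; it is a SQLAlchemy-flavoured sketch, not the real networking_odl.db.db implementation.

```python
# Schematic of the kind of optimistic row-claiming that produces
# "UPDATE statement ... expected to update 1 row(s); 0 were matched":
# several workers race to move the same journal row from 'pending' to
# 'processing', only one UPDATE matches, the rest raise and oslo.db retries.
# Sketch only, not the actual networking_odl.db.db code.
import sqlalchemy as sa

def claim_row(session, journal, seqnum):
    """Try to move one journal row from 'pending' to 'processing'."""
    result = session.execute(
        sa.update(journal)
        .where(sa.and_(journal.c.seqnum == seqnum,
                       journal.c.state == "pending"))
        .values(state="processing"))
    if result.rowcount != 1:
        # Another neutron-server worker claimed the row first; this is the
        # "0 were matched" case that triggers the DB retry churn in the log.
        raise RuntimeError("journal row already claimed; caller should retry")
```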
Also, in the neutron logs, the hostconfig is present for the nodes that are being marked as dead. This is just a portion of the hostconfig debug log.... 2018-07-30 23:23:29.676 53 DEBUG networking_odl.common.client [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Sending METHOD (get) URL (http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs/) JSON (None) request /usr/lib/python2.7/site-packages/networking_odl/common/client.py:89 2018-07-30 23:23:29.702 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] ODLPORTBINDING hostconfigs: [ "{\"hostconfigs\":{\"hostconfig\":[{\"host-id\":\"overcloud-6018rcompute-3.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vn", "ic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [", "\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pc", "ompute-20.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_typ", "e\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge", "_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pcompute-10.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ ", "\\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_netw", "ork_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"ov", "ercloud-1029pcompute-21.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\"", ", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre", "\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pcompute-22.localdomain\",\"host-type\":\"ODL L2\"", ",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], ", "\\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},", "{\"host-id\":\"overcloud-1029pcompute-11.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type", "\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\", "\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-r620compute-1.localdomain\",\"host-ty", "pe\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\", "\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"",
Created attachment 1471656 [details] neutron hostconfig controller-0 log
Created attachment 1471657 [details] neutron hostconfig controller-1 log
Created attachment 1471658 [details] neutron hostconfig controller-2 log
Here is the egrep filter I used to parse the neutron logs to get the relevant hostconfig entries:

egrep --line-buffered 'Starting hostconfig periodic|Periodic hostconfig task executed after periodic interval Skipping execution|ODLPORTBINDING hostconfigs|already running|hostconfig task has been finished|_get_and_update_hostconfigs|healthcheck' server.lo* > hostconfig-x.txt
Please see [1] for the proposed solution. We'll use that bug to track down the timeouts and this one to improve the networking-odl resiliency WRT the "L2 agents". [1] https://bugzilla.redhat.com/show_bug.cgi?id=1610546#c7
Summary of the investigation done last week.

n-odl specific
==============

1. networking-odl hostconfig long agent db update cycle (referred to by Vic in his original email)

* The sources of this problem can be many, but the issue arises from a chain of causality. Let me explain: if there is an increase in I/O operations, CPU load, etc. that affects the system (i.e. the DB), then the n-odl journal starts to grow. Something that evidences this issue is the RetryDBException that Vic referred to. When the journal fills up, this adds additional load to the DB server, which in turn accentuates the problem, getting us into a negative cycle that only starts easing up as the journal is able to clear. In general, the journal is a poor design for the needs we have, and that's why I am working on a solution that scales by resource instead of by operations. This solution, which will only be available in OSP15, will be the next version of the driver, and until then there is not much we can do about this issue.

2. ODL REST API timeout / exception when retrieving hostconfig

* The first scenario is where we have a timeout. Currently it won't do anything, so if there is a timeout because the REST API is having some issues, then it will affect the aliveness of the agents. I think in this case we could modify the code in n-odl so that if there is a timeout we use cached entries to update the aliveness in Neutron's agent db, as the nodes (agents in Neutron lingo) are probably not dead. If any of the nodes are dead, then scheduling a VM there will fail and it will be retried anyway, so it's not a big issue.
* The second scenario is where we have an exception, e.g. connection refused. In these cases I think the best way is to handle it as it is, as ODL is probably dead or unresponsive, so even if the nodes are up the system won't be able to process things as expected and we might as well leave the agents marked as dead.
* There is a solution for this, however, and that is using websockets instead of the REST API. That would mean that instead of polling status we would be pushing it. The code is already there but was never properly tested. As it can be enabled with a configuration change, it is a solution that can make this cycle. Therefore, we decided together with Sai, Mike and Shridhar to try a deploy with this configuration and see if we see an improvement over the REST API mechanism.

3. The "timers nightmare"

* There are 2 timers in play (there are actually more, as Vic pointed out, but for simplicity's sake), one in Neutron and one in n-odl. As it stands, these timers need to be fine-tuned together, as too low a value would add load to the server and too high a value would make the detection of dead nodes unreliable. Anyway, as we won't change the value for Neutron (it affects all mechanisms and drivers), we need to play around with the one in n-odl, which in any case depends on the one in Neutron. There isn't much to say or do about this issue, but I think it would be a good future improvement that instead of making the n-odl value configurable independently, we make the config a divider of Neutron's configured value (e.g. if we want 2 opportunities to update the agent db in Neutron before it gets checked (the current default for Neutron is now 35 seconds), we would configure a divider of 2 and then the n-odl periodic task would run every 17 seconds). A small sketch of this divider idea follows at the end of this comment.

That said, this is something that now, and in the future, will be a possible point of failure, as the mechanism to update the agents is per host node and each transaction takes some time. As we scale up in number of nodes this might become a problem, so perhaps we should request improvements in Neutron (e.g. bulk update).

4. Trivial problem: hostconfig might not be the best indication as to whether a node is up/down.

* ovsdb only has hostconfig, and that's what we are reporting through the API
* but the presence of an openflow connection is actually the right indication of node aliveness.

OpenStack in general
====================

1. OpenStack services "hogging" the system

* We have seen this in the past, where Telemetry added a huge load to the controller nodes.
* This issue is obviously related to the 1st of the n-odl specific problems.
* A possible solution for the future is to put the ODL controller on a node of its own and thus reduce the dependency on shared resources.

2. Agents are not marked/reactivated automatically again for Nova to schedule VMs

* Perhaps Sai can expand here, as he explained this and has more details he can provide.
* This could potentially mean that even if we mark a node as alive again, the impact on the system remains until an operator manually marks the nodes as cleared to operate normally.
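As a trivial illustration of the divider idea in item 3 above, using only the example numbers from that comment (the parameter names here are made up for illustration, not real configuration keys):

```python
# Sketch of the "divider" idea from item 3: derive the networking-odl poll
# interval from Neutron's agent check interval instead of configuring it
# independently. Numbers mirror the example in the comment (35 s / 2 = 17 s).
def odl_poll_interval(neutron_agent_interval_secs=35, divider=2):
    # Poll at least `divider` times within Neutron's check window, so one slow
    # or skipped cycle doesn't immediately mark every agent dead.
    return max(1, neutron_agent_interval_secs // divider)

print(odl_poll_interval())        # 17
print(odl_poll_interval(35, 3))   # 11
```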
Fix submitted as we agreed. This fix ignores agent aliveness and tries to bind the port regardless. The consequence of this is that a "neutron agent-list" call might show the agent as dead or alive, but that's not information we should depend on to decide whether we can bind to that node.
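To illustrate the shape of the change described here (this is not the merged patch, just a schematic of the binding decision with the aliveness check dropped; field names follow the hostconfig JSON shown earlier in this bug):

```python
# Illustration of the approach described in this comment, not the actual patch:
# when deciding whether to bind a port to a host, do not filter on the
# pseudo-agent's alive flag -- only require that a hostconfig entry exists and
# that the requested vnic_type is supported.
def can_bind(hostconfig, requested_vnic_type):
    if hostconfig is None:
        return False
    supported = {v["vnic_type"] for v in hostconfig.get("supported_vnic_types", [])}
    # Note: no check of the agent's "alive" status here; agent-list output may
    # still show the agent as dead, but that is no longer used for binding.
    return requested_vnic_type in supported
```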
Fix merged upstream. Cherry picked for stable/queens [1] and stable/rocky [2]. [1] https://review.openstack.org/#/c/602046/ [2] https://review.openstack.org/#/c/602048/
Fix also merged on stable/rocky. Waiting for it to merge on stable/queens.
Merged on stable/queens
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3614