When running a Browbeat+Rally run with the netcreate-boot scenario (i.e., we spawn a VM and verify that it becomes ACTIVE), with concurrency set to 5 and times set to 2000, we observed that after spawning some 1422 VMs all subsequent VMs start to fail (i.e., they go into an ERROR state). When we looked at the logs, we observed that after spawning around 1200+ VMs the errors started to creep in. The errors are mostly related to the "ODL L2" agent (i.e., neutron-odlagent-portbinding) becoming inactive, and hence port binding fails.

Error reported in Rally:

Binding failed for port 000f35ae-9db5-4a9d-9324-255fe7c668d0, please check neutron logs for more information.', u'code': 500, u'created': u'2017-11-30T14:37:43Z'}

It looks like networking-odl marks the "ODL L2" agent as down when the information is not returned as part of the following query:

http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs

After the Browbeat test completed, we could see that all the "ODL L2" agents were down/inactive. On the compute nodes, we see that the vswitchd connection to the ODL controller is flaky, and this could be the reason why the agent info is not returned from the hostconfig URL.

<SNIP>
2017-11-30T17:40:12.968Z|01049|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connected
2017-11-30T17:40:12.969Z|01050|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connection closed by peer
2017-11-30T17:40:12.969Z|01051|rconn|INFO|br-int<->tcp:172.16.0.17:6653: continuing to retry connections in the background but suppressing further logging
2017-11-30T17:40:22.778Z|01052|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connected
2017-11-30T17:40:22.778Z|01053|rconn|INFO|br-int<->tcp:172.16.0.17:6653: connection closed by peer
<SNIP>

Setup: OSP12 image; ODL running in a non-clustered, non-containerized setup with 1 controller and 32 compute nodes. Port-status is disabled in this setup, and inactivity_probe/max_backoff are appropriately configured.

RPMs used:
python-networking-odl-11.0.0-2.el7ost.noarch
opendaylight-6.2.0-5.el7ost.noarch

Agent list after the test (neutron agent-list):
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
| id                                   | agent_type     | host                               | availability_zone | alive | admin_state_up | binary                       |
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
| 0201be96-3c80-451a-971e-dfdeb7ebbdd3 | ODL L2         | overcloud-compute-6.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 0283ed5d-b9c8-43c9-9c99-6c27b319b041 | ODL L2         | overcloud-compute-15.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 1ad29e22-cbf8-4384-b1e3-c14e3779a4ae | ODL L2         | overcloud-compute-18.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 29687ccb-aa71-4150-8059-75cdcc98b284 | ODL L2         | overcloud-compute-5.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 29b494c2-bfda-4cdc-b189-1713d66793b0 | ODL L2         | overcloud-compute-22.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 2e981caa-6d6e-4a0d-b716-d3f0d79c4968 | ODL L2         | overcloud-compute-12.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 3928ceb7-fc75-4309-a38d-2a8cad2b5666 | ODL L2         | overcloud-compute-14.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4011168b-2bda-4bbc-8b3d-0458120b4c13 | ODL L2         | overcloud-compute-1.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4404a197-73a2-41b7-a905-519a06318af0 | ODL L2         | overcloud-compute-30.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 4d148674-1e2f-4bcd-986a-505b8bc8f84f | ODL L2         | overcloud-compute-19.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 5623cb1c-0dcf-4126-b4f6-f19411e6ec5f | ODL L2         | overcloud-compute-26.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 5703d801-49dc-40d7-adda-e7949a6b6771 | ODL L2         | overcloud-compute-13.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 62d09361-4222-4a0a-8b91-d03dd324f1e7 | ODL L2         | overcloud-controller-0.localdomain |                   | xxx   | True           | neutron-odlagent-portbinding |
| 66edd958-936d-48c9-99e3-4d9241315fe7 | ODL L2         | overcloud-compute-28.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 682cfb7e-9cd0-4cf8-a1c9-ea8c6450ec37 | ODL L2         | overcloud-compute-8.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 72292f55-6a4a-44f2-85a1-ac3640f1b9f0 | ODL L2         | overcloud-compute-9.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 75ae43a1-2dd8-4223-ae34-400bbd3b5ac8 | ODL L2         | overcloud-compute-11.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7816271e-9c9c-49a6-9cc4-b56c945fd92b | ODL L2         | overcloud-compute-4.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7b567e37-5ca3-4e3d-91de-31ae56191a2a | ODL L2         | overcloud-compute-25.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7c7f5637-4242-4015-ac1d-ce0932346519 | ODL L2         | overcloud-compute-21.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 7f7c54d8-028e-458c-8231-c7774b86e1cf | ODL L2         | overcloud-compute-3.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9270a276-9e0c-42a9-85f5-513bcb9e717a | ODL L2         | overcloud-compute-2.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| 92b3383d-0fd3-4578-94e9-3e67239b86a9 | ODL L2         | overcloud-compute-24.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 94cc6e57-0ffe-4243-ac1d-46f17cee5ed7 | ODL L2         | overcloud-compute-31.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9fc823e4-146b-417d-9c13-078b11d6e7b9 | ODL L2         | overcloud-compute-20.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| 9feaf78a-f2da-477d-8053-cc599b496aa8 | ODL L2         | overcloud-compute-27.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| a2b758db-06b7-490e-b6d0-d05938848a23 | ODL L2         | overcloud-compute-16.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| a68eb489-1ddd-4ab6-bbc9-931aa892282c | ODL L2         | overcloud-compute-17.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| ace04e9f-a647-471f-92c9-e3f10e036fd7 | ODL L2         | overcloud-compute-10.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| b5fc5492-edf0-4321-bf9f-41f1c541e381 | ODL L2         | overcloud-compute-0.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| bef93822-1da4-48b6-b80e-16733f02c858 | Metadata agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent       |
| c085c9cd-412c-4a28-8280-d5521de0099c | ODL L2         | overcloud-compute-29.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| f2fce8b9-7584-4255-ab7e-02475d82826f | ODL L2         | overcloud-compute-23.localdomain   |                   | xxx   | True           | neutron-odlagent-portbinding |
| fc35e8f7-8e39-49df-9824-2447367bd59f | ODL L2         | overcloud-compute-7.localdomain    |                   | xxx   | True           | neutron-odlagent-portbinding |
| fdf5a3be-d458-461b-8e94-e99ecdf5e435 | DHCP agent     | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent           |
+--------------------------------------+----------------+------------------------------------+-------------------+-------+----------------+------------------------------+
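For context on the failure mode described above: networking-odl periodically polls that hostconfigs URL and reports each host's "ODL L2" pseudo-agent based on the result, so a missing or failed response is what flips the agent to dead. The sketch below only illustrates that polling loop; the URL and JSON layout are taken from this report, while the credentials, interval, and function name are assumptions and not the driver's actual code.

```python
# Illustrative only: a minimal poller for the ODL hostconfig endpoint referenced
# above. networking-odl's real driver (pseudo_agentdb_binding) does more work;
# credentials and interval here are hypothetical placeholders.
import time
import requests

HOSTCONFIG_URL = "http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs"

def poll_hostconfigs(auth=("admin", "admin"), interval=30):
    while True:
        try:
            resp = requests.get(HOSTCONFIG_URL, auth=auth, timeout=10)
            resp.raise_for_status()
            hostconfigs = resp.json()["hostconfigs"]["hostconfig"]
            hosts = {hc["host-id"] for hc in hostconfigs}
            print("hostconfig present for %d hosts" % len(hosts))
        except Exception as exc:
            # If this query fails (or a host's entry is missing), networking-odl
            # has nothing to report for that host and the "ODL L2" agent is
            # eventually marked dead.
            print("hostconfig query failed: %s" % exc)
        time.sleep(interval)
```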
Created attachment 1361711 [details] ovs-vswitchd log file from one of the compute nodes
Created attachment 1361712 [details] Karaf logs while tests were running
OVSDB is very chatty, so updates from the ovsdb nodes to ODL cause CPU spikes when processed. Some thoughts to test out the theory:

1. ovsdb table updates that are too frequent.
   - skip-monitoring-manager-status in the southbound.cfg
     -- This flag can be used to reduce the number of status updates. It is best used in a single-node setup, as the default value of true is used in HA setups.
     -- Manager table status entry: https://github.com/opendaylight/ovsdb/commit/5f8ca82c15d75d6c703cef2ded1072c4a588c93b#diff-493334811bc2df8974979e4168cf8fb1

2. What are the specs for the test setup? Hardware or virtual servers? CPUs and RAM? At scale this can have an impact, as more and more mdsal processing is done.

3. Think it was mentioned that the ovs nodes are the ones disconnecting. That leads me to think that the ODL side is taking too long to respond and the ovs nodes disconnect.

4. Does disabling bfd help? It may only affect tunnels so may not apply, but I wasn't sure whether ovs ties bfd status to the connections.

5. Do the vswitchd logs show why ovs disconnects? (A quick log-scanning sketch follows below.)
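On point 5, a quick way to quantify the flapping from the attached ovs-vswitchd log is to count connect/close events per peer. This is a throwaway sketch; the log path is an assumption and the pattern is based only on the rconn lines quoted in the description.

```python
# Throwaway sketch: count connect / "connection closed by peer" churn per peer
# in an ovs-vswitchd log, based on the rconn lines quoted in the description.
# The default log path is an assumption.
import re
from collections import Counter

PATTERN = re.compile(
    r"\|rconn\|INFO\|(\S+?)<->(tcp:\S+): (connected|connection closed by peer)")

def count_flaps(path="/var/log/openvswitch/ovs-vswitchd.log"):
    events = Counter()
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                bridge, peer, event = m.groups()
                events[(bridge, peer, event)] += 1
    for (bridge, peer, event), count in sorted(events.items()):
        print("%s<->%s: %s x%d" % (bridge, peer, event, count))
```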
(In reply to Sam Hague from comment #3)
> OVSDB is very chatty so updates from the ovsdb nodes to ODL cause cpu spikes
> when processed. Some thoughts to test out the theory:
>
> 1. ovsdb table updates that are too frequent.
> - skip-monitoring-manager-status in the southbound.cfg
> -- This flag can be used to reduce the number of status updates. It is best
> to use this in a single-node setup as the default value of true is used in
> HA setups.
> -- Manager table status entry:
> https://github.com/opendaylight/ovsdb/commit/5f8ca82c15d75d6c703cef2ded1072c4a588c93b#diff-493334811bc2df8974979e4168cf8fb1

Thanks for the inputs, @Sam. Yes, we did configure skip-monitoring-manager-status = true during our testing.

> 2. What are the specs for the test setup? Hardware or virtual servers? CPUs
> and RAM? At scale this can have an impact as more and more mdsal processing
> is done.

It's all hardware/bare-metal Dell R630 servers (2x E5-2683 v3, 28 total cores, 56 threads) with 128 GB RAM.

> 3. Think it was mentioned that the ovs nodes are the ones disconnecting.
> That leads me to think that the ODL side is taking too long to respond and
> the ovs nodes disconnect.

I think it's actually ODL/OVSDB which is closing the connection to the switch.

> 4. Does disabling bfd help? May just affect tunnels so may not apply, but
> wasn't sure if ovs related bfd status to the connections.
>
> 5. Do the vswitchd logs show why ovs disconnects?

This issue is for a single-controller setup, and yes, there are some connection timed out errors in the log: https://bugzilla.redhat.com/attachment.cgi?id=1361711
Sridhar, would it make sense to run these tests again with port status enabled?
(In reply to Sam Hague from comment #7)
> Sridhar, would it make sense to run these tests again with port status
> enabled?

IMHO we don't have to run the tests again with port-status enabled, since it's a completely different issue. The OVSDB connection to the switch is flaky - we have to identify why that is and fix/fine-tune it.
Victor, can you please check if this is the same bug or different from bug 1560872?
Hi Mike, I have requested help from Sai to reproduce this in the scale lab for debugging. From looking at bug 1560872, the symptoms sound similar, but from reading the bug details so far, I don't think they are the same. Bug 1560872 is reproduced by rebooting the compute node, for which there is also an upstream bug: https://jira.opendaylight.org/browse/OVSDB-444
So we are still seeing this issue when booting a bunch of VMs. The agent status keeps flapping every few seconds: https://gist.github.com/smalleni/261f70645f284fc5d7e084f4c2a4033e

I would prefer to have a developer look at the setup while it's happening. Also, what would be the logs of interest? neutron-server logs?
Sai, when you get a chance tomorrow to try this again, please enable ovsdb logs:

log:set TRACE org.opendaylight.ovsdb.southbound

From what I saw in the attached ovs logs, it appears that ODL is closing the connection (as Sridhar mentioned). So, let's start by seeing what OVSDB is doing when we get into this situation.
Sai, I would like you to enable the following debug logs on all compute nodes:

sudo ovs-appctl -t ovs-vswitchd vlog/set connmgr
sudo ovs-appctl -t ovs-vswitchd vlog/set rconn

These logs will show transmission and receipt of echo messages from OVS on the compute node to ODL.

Also, please configure the following on all compute nodes:

sudo ovs-vsctl set Manager 06c2e304-dc68-4b1c-8752-a95121d8d528 inactivity_probe=30000

(Substitute the uuid above with the actual uuid.) This will change the echo timeout from 5 seconds to 30 seconds (from OVS on the compute node).

I suspect that we are encountering a situation where ODL is not able to respond to the echo from OVS on the compute node in a timely manner, under heavy loads. So, I want to increase this timer to confirm whether the L2 agent stays up. If this ends up being the case, we will have to:

1. Tune the timer to a reasonable value for scaled deployments
2. Perform more analysis to determine why the 5 second default is not enough for echo messages. This may include:
   o Priority message queue for time-sensitive control packets
   o Monitor CPU, to see if ODL is CPU starved during this burst of VM creation
   o Look to see if messages are being dropped
   o Add additional logging to OVSDB (and JsonRpcEndpoint) to add debug logs for echo receipt/transmission, including node id information
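For convenience, here is a hedged helper along the lines of the ovs-vsctl command above: it looks up the Manager record UUID locally instead of hard-coding it, then raises inactivity_probe to 30000 ms. This is an illustrative sketch meant to be run on each compute node, not part of any shipped tooling.

```python
# Sketch only: find the local Manager record(s) and bump inactivity_probe to
# 30000 ms, instead of hard-coding the UUID as in the command above.
import re
import subprocess

def set_manager_inactivity_probe(ms=30000):
    out = subprocess.check_output(
        ["ovs-vsctl", "--columns=_uuid", "list", "Manager"]).decode()
    for uuid in re.findall(r"_uuid\s*:\s*(\S+)", out):
        subprocess.check_call(
            ["ovs-vsctl", "set", "Manager", uuid, "inactivity_probe=%d" % ms])
        print("Manager %s inactivity_probe set to %d ms" % (uuid, ms))

if __name__ == "__main__":
    set_manager_inactivity_probe()
```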
Removing needinfo since the test with the required logging was set up and a debug session was completed, leading to some findings. Need to debug further in the future for a resolution.
Is this issue fixed? I am facing a similar issue in OpenStack; kindly let me know if there is a workaround or fix for this issue. Thanks!
We need to set a proper default inactivity probe timeout in OOO, hence changing component to THT.
Created attachment 1471387 [details] Neutron server debug logs from controller-0

I've run some additional Browbeat Rally tests, with the controller inactivity_probe set to 300 secs. Even with that config, I was still seeing dead L2 agents.

I wrote a script to poll the hostconfig every 5 secs from the operational datastore on all control nodes, while also monitoring for dead L2 agents in a second script. What I could see was that "openstack network agent list" would report dead agents, while the script polling on each controller showed valid, accurate hostconfig present.

Further analysis indicates that the networking-odl periodic task to get hostconfigs from ODL is not running every 30 seconds as is configured. In several cases, this task only ran every 2 mins, 1 sec, which is way too long between updates and results in the L2 agent being marked as dead.

In networking-odl/journal/periodic_task.py, there is a check to make sure the task is executed only after a specified interval. That check is being hit, and as a result, the hostconfig task (and other tasks) are being skipped (see the sketch after the log excerpt below).
https://github.com/openstack/networking-odl/blob/stable/queens/networking_odl/journal/periodic_task.py#L96

It seems possible that the last task execution is not being updated properly, and as a result, was_periodic_task_executed_recently() is returning an incorrect value.
https://github.com/openstack/networking-odl/blob/stable/queens/networking_odl/db/db.py#L163

I've attached the neutron debug logs. For reference, here are the logs of interest, showing the hostconfig periodic task being skipped:

First healthcheck failure
==========================
2018-07-29 19:06:17.580 54 WARNING neutron.db.agents_db [req-cfeb3f33-d4eb-4b77-98a8-135ccfef0ebb - - - - -] Agent healthcheck: found 42 dead agents out of 48:
Type      Last heartbeat       host
ODL L2    2018-07-29 19:04:55  overcloud-1029pcompute-3.localdomain
ODL L2    2018-07-29 19:04:55  overcloud-1029pcompute-8.localdomain
2018-07-29 19:08:08.596 54 WARNING neutron.db.agents_db [req-cfeb3f33-d4eb-4b77-98a8-135ccfef0ebb - - - - -] Agent healthcheck: found 42 dead agents out of 48:

Ah, it is taking too long between ODL host config updates
==========================================================
2 mins, 1 sec = 121 secs... too long
2018-07-29 19:00:17.533 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2018-07-29 19:02:18.512 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2 mins... too long
2018-07-29 19:04:18.384 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:
2 mins 1 sec
2018-07-29 19:06:19.996 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs:

Why is it taking so long between ODL hostconfig updates? The hostconfig periodic task is skipping execution because it thinks the task ran recently.

Periodic task to get host config
================================
2018-07-29 19:00:17.416 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:00:18.193 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] hostconfig task has been finished
2018-07-29 19:00:47.423 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:00:47.429 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:01:17.419 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:01:17.425 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:01:47.439 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:01:47.445 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:02:17.445 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:02:17.479 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:02:17.484 53 INFO networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] REST/GET hostconfigs from ODL
2018-07-29 19:02:19.159 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] hostconfig task has been finished
2018-07-29 19:02:47.469 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:02:47.475 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:03:17.489 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:03:17.495 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:03:47.509 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:03:47.514 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:04:17.526 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:04:17.537 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:04:47.543 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:04:47.548 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:05:17.568 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:05:17.573 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:05:47.592 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:05:47.598 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
2018-07-29 19:06:17.588 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task.
2018-07-29 19:06:17.599 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
2018-07-29 19:06:17.609 53 INFO networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] REST/GET hostconfigs from ODL
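To make the skip behaviour above concrete, here is a deliberately simplified sketch of the gating logic described (not the actual networking-odl periodic_task/db code; names and storage are simplified). The point it illustrates: if the recorded last-execution timestamp is not maintained correctly, every 30-second wakeup concludes the task "executed recently" and skips, and the hostconfig fetch only happens once the record finally ages out.

```python
# Simplified illustration of the periodic-task gating described above; not the
# actual networking-odl implementation. If created_at is not updated at the
# right moment (or is set by a run that never finishes the hostconfig fetch),
# subsequent runs keep seeing "executed recently" and skip.
import datetime

class PeriodicTaskRecord(object):
    def __init__(self):
        self.created_at = datetime.datetime.min   # last execution timestamp

def was_periodic_task_executed_recently(record, interval_secs):
    age = datetime.datetime.utcnow() - record.created_at
    return age <= datetime.timedelta(seconds=interval_secs)

def run_hostconfig_task(record, interval_secs=30):
    if was_periodic_task_executed_recently(record, interval_secs):
        print("Periodic hostconfig task executed after periodic interval. Skipping execution.")
        return
    record.created_at = datetime.datetime.utcnow()
    print("Starting _get_and_update_hostconfigs phase of periodic task hostconfig.")
    # ... REST/GET hostconfigs from ODL, then update Neutron's agent DB ...
```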
Here are some additional debug logs showing hostconfig task being skipped on controller-0. Will check other controllers also in the cluster. [heat-admin@overcloud-controller-0 neutron]$ tail -F server.log | egrep 'Starting hostconfig periodic|Periodic hostconfig task executed after periodic interval Skipping execution|ODLPORTBINDING hostconfigs' 2018-07-30 13:39:50.681 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:39:50.686 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:40:20.677 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:40:20.683 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:40:50.693 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:40:50.698 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution. 2018-07-30 13:41:20.714 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:41:20.740 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] ODLPORTBINDING hostconfigs: 2018-07-30 13:41:50.724 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Starting hostconfig periodic task. 2018-07-30 13:41:50.729 53 INFO networking_odl.journal.periodic_task [req-b54a6e7a-04ec-410a-ada6-d6a1cafd6322 - - - - -] Periodic hostconfig task executed after periodic interval Skipping execution.
Ran the Browbeat test again, and monitored the networking-odl hostconfig task on all controllers. For this run, I had modified restconf_poll_interval to 15 secs, from the default 30 seconds.

At the time the first dead agent is detected by neutron, the hostconfig task is running on controller-2. From the logs, it is clear that the hostconfig task took way too long to finish - in this case, 1 min 46 seconds. As a result, agents are marked dead.

server.log:2018-07-30 23:23:29.662 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Starting _get_and_update_hostconfigs phase of periodic task hostconfig.
server.log:2018-07-30 23:23:29.702 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] ODLPORTBINDING hostconfigs:
server.log:2018-07-30 23:23:53.390 54 DEBUG neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 48 active agents agent_health_check /usr/lib/python2.7/site-packages/neutron/db/agents_db.py:264
server.log:2018-07-30 23:24:30.445 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 10 dead agents out of 48:
server.log:2018-07-30 23:25:07.434 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 36 dead agents out of 48:
server.log:2018-07-30 23:25:15.089 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Finished _get_and_update_hostconfigs phase of hostconfig task.
server.log:2018-07-30 23:25:15.111 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] hostconfig task has been finished
server.log:2018-07-30 23:25:15.112 53 INFO networking_odl.journal.periodic_task [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Starting hostconfig periodic task.

I looked at the full neutron log during this time period. I see these db exceptions, with retries. It looks like these exceptions/retries may be the root cause of the hostconfig task taking so long to complete.

2018-07-30 23:23:42.497 35 DEBUG neutron.db.api [req-def0a09e-ca89-48c7-af20-8965a9d2689d - - - - -] Retry wrapper got retriable exception: UPDATE statement on table 'opendaylightjournal' expected to update 1 row(s); 0 were matched. wrapped /usr/lib/python2.7/site-packages/neutron/db/api.py:126
2018-07-30 23:23:42.497 35 DEBUG oslo_db.api [req-def0a09e-ca89-48c7-af20-8965a9d2689d - - - - -] Performing DB retry for function networking_odl.db.db.get_oldest_pending_db_row_with_lock wrapper /usr/lib/python2.7/site-packages/oslo_db/api.py:149
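The retriable exception above ("expected to update 1 row(s); 0 were matched") is the typical signature of several neutron-server workers racing to claim the same journal row with an optimistic, compare-and-swap style UPDATE. Below is a schematic of that pattern; it is a SQLAlchemy-flavoured sketch, not the real networking_odl.db.db implementation.

```python
# Schematic of the kind of optimistic row-claiming that produces
# "UPDATE statement ... expected to update 1 row(s); 0 were matched":
# several workers race to move the same journal row from 'pending' to
# 'processing', only one UPDATE matches, the rest raise and oslo.db retries.
# Sketch only, not the actual networking_odl.db.db code.
import sqlalchemy as sa

def claim_row(session, journal, seqnum):
    """Try to move one journal row from 'pending' to 'processing'."""
    result = session.execute(
        sa.update(journal)
        .where(sa.and_(journal.c.seqnum == seqnum,
                       journal.c.state == "pending"))
        .values(state="processing"))
    if result.rowcount != 1:
        # Another neutron-server worker claimed the row first; this is the
        # "0 were matched" case that triggers the DB retry churn in the log.
        raise RuntimeError("journal row already claimed; caller should retry")
```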
Also, in the neutron logs, the hostconfig is present for the nodes that are being marked as dead. This is just a portion of the hostconfig debug log.... 2018-07-30 23:23:29.676 53 DEBUG networking_odl.common.client [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] Sending METHOD (get) URL (http://172.16.0.15:8081/restconf/operational/neutron:neutron/hostconfigs/) JSON (None) request /usr/lib/python2.7/site-packages/networking_odl/common/client.py:89 2018-07-30 23:23:29.702 53 DEBUG networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] ODLPORTBINDING hostconfigs: [ "{\"hostconfigs\":{\"hostconfig\":[{\"host-id\":\"overcloud-6018rcompute-3.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vn", "ic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [", "\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pc", "ompute-20.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_typ", "e\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge", "_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pcompute-10.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ ", "\\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_netw", "ork_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"ov", "ercloud-1029pcompute-21.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\"", ", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre", "\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-1029pcompute-22.localdomain\",\"host-type\":\"ODL L2\"", ",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], ", "\\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},", "{\"host-id\":\"overcloud-1029pcompute-11.localdomain\",\"host-type\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type", "\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\", "\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"br-ex\\\"}}\"},{\"host-id\":\"overcloud-r620compute-1.localdomain\",\"host-ty", "pe\":\"ODL L2\",\"config\":\"{ \\\"supported_vnic_types\\\": [{ \\\"vnic_type\\\": \\\"normal\\\", \\\"vif_type\\\": \\\"ovs\\\", \\\"vif_details\\", "\": {} }], \\\"allowed_network_types\\\": [\\\"local\\\",\\\"flat\\\",\\\"vlan\\\",\\\"vxlan\\\",\\\"gre\\\"], \\\"bridge_mappings\\\": {\\\"datacentre\\\":\\\"",
Created attachment 1471656 [details] neutron hostconfig controller-0 log
Created attachment 1471657 [details] neutron hostconfig controller-1 log
Created attachment 1471658 [details] neutron hostconfig controller-2 log
Here is the egrep filter I used to parse the neutron logs to get the relevant hostconfig entries:

egrep --line-buffered 'Starting hostconfig periodic|Periodic hostconfig task executed after periodic interval Skipping execution|ODLPORTBINDING hostconfigs|already running|hostconfig task has been finished|_get_and_update_hostconfigs|healthcheck' server.lo* > hostconfig-x.txt
Please see [1] for the proposed solution. We'll use that bug to track down the timeouts and this one to improve the networking-odl resiliency WRT the "L2 agents". [1] https://bugzilla.redhat.com/show_bug.cgi?id=1610546#c7
Summary of the investigation done last week.

n-odl specific
==============

1. networking-odl hostconfig long agent db update cycle (referred to by Vic in his original email)

* The sources of this problem can be many, but the issue arises from a chain of causality. Let me explain: if there is an increase in I/O operations, CPU load, etc. that affects the system (i.e. the DB), then the n-odl journal starts to grow. Something that evidences this issue is the RetryDBException that Vic referred to. When the journal fills up, this adds additional load to the DB server, which in turn accentuates the problem, getting us into a negative cycle that only starts easing up as the journal is able to clear. In general, the journal is a poor design for the needs we have, and that's why I am working on a solution that scales by resource instead of by operations. This solution, which will only be available in OSP15, will be the next version of the driver, and until then there is not much we can do about this issue.

2. ODL REST API timeout / exception when retrieving hostconfig

* The first scenario is where we have a timeout. Currently it won't do anything, so if there is a timeout because the REST API is having some issues, then it will affect the aliveness of the agents. I think in this case we could modify the code in n-odl so that if there is a timeout we use cached entries to update the aliveness in Neutron's agent db, as the nodes (agents in Neutron lingo) are probably not dead. If any of the nodes are dead, then scheduling a VM there will fail and it will be retried anyway, so it's not a big issue.
* The second scenario is where we have an exception, e.g. connection refused. In these cases I think the best way is to handle it as it is, as ODL is probably dead or unresponsive, so even if the nodes are up the system won't be able to process things as expected and we might as well leave the agents marked as dead.
* There is a solution for this, however, and that is using websockets instead of the REST API. That would mean that instead of polling status we would be pushing it. The code is already there but was never properly tested. As it can be enabled with a configuration change, it is a solution that can make this cycle. Therefore, we decided together with Sai, Mike and Shridhar to try a deploy with this configuration and see if we see an improvement over the REST API mechanism.

3. The "timers nightmare"

* There are 2 timers in play (there are actually more, as Vic pointed out, but for simplicity's sake), one in Neutron and one in n-odl. As it stands, these timers need to be fine-tuned together, as too low a value would add load to the server and too high a value would make the detection of dead nodes unreliable. Anyway, as we won't change the value for Neutron (it affects all mechanisms and drivers), we need to play around with the one in n-odl, which in any case depends on the one in Neutron. There isn't much to say or do about this issue, but I think it would be a good future improvement that instead of making the n-odl value configurable independently, we make the config a divider of Neutron's configured value (e.g. if we want 2 opportunities to update the agent db in Neutron before it gets checked (the current default for Neutron is now 35 seconds), we would configure a divider of 2 and then the n-odl periodic task would run every 17 seconds). A small sketch of this divider idea follows at the end of this comment.

That said, this is something that now, and in the future, will be a possible point of failure, as the mechanism to update the agents is per host node and each transaction takes some time. As we scale up in number of nodes this might become a problem, so perhaps we should request improvements in Neutron (e.g. bulk update).

4. Trivial problem: hostconfig might not be the best indication as to whether a node is up/down.

* ovsdb only has hostconfig, and that's what we are reporting through the API
* but the presence of an openflow connection is actually the right indication of node aliveness.

OpenStack in general
====================

1. OpenStack services "hogging" the system

* We have seen this in the past, where Telemetry added a huge load to the controller nodes.
* This issue is obviously related to the 1st of the n-odl specific problems.
* A possible solution for the future is to put the ODL controller on a node of its own and thus reduce the dependency on shared resources.

2. Agents are not marked/reactivated automatically again for Nova to schedule VMs

* Perhaps Sai can expand here, as he explained this and has more details he can provide.
* This could potentially mean that even if we mark a node as alive again, the impact on the system remains until an operator manually marks the nodes as cleared to operate normally.
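As a trivial illustration of the divider idea in item 3 above, using only the example numbers from that comment (the parameter names here are made up for illustration, not real configuration keys):

```python
# Sketch of the "divider" idea from item 3: derive the networking-odl poll
# interval from Neutron's agent check interval instead of configuring it
# independently. Numbers mirror the example in the comment (35 s / 2 = 17 s).
def odl_poll_interval(neutron_agent_interval_secs=35, divider=2):
    # Poll at least `divider` times within Neutron's check window, so one slow
    # or skipped cycle doesn't immediately mark every agent dead.
    return max(1, neutron_agent_interval_secs // divider)

print(odl_poll_interval())        # 17
print(odl_poll_interval(35, 3))   # 11
```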
Fix submitted as we agreed. This fix ignores agent aliveness and tries to bind the port regardless. The consequence of this is that a "neutron agent-list" call might show the agent as dead or alive, but that's not information we should depend on to decide whether we can bind to that node.
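To illustrate the shape of the change described here (this is not the merged patch, just a schematic of the binding decision with the aliveness check dropped; field names follow the hostconfig JSON shown earlier in this bug):

```python
# Illustration of the approach described in this comment, not the actual patch:
# when deciding whether to bind a port to a host, do not filter on the
# pseudo-agent's alive flag -- only require that a hostconfig entry exists and
# that the requested vnic_type is supported.
def can_bind(hostconfig, requested_vnic_type):
    if hostconfig is None:
        return False
    supported = {v["vnic_type"] for v in hostconfig.get("supported_vnic_types", [])}
    # Note: no check of the agent's "alive" status here; agent-list output may
    # still show the agent as dead, but that is no longer used for binding.
    return requested_vnic_type in supported
```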
Fix merged upstream. Cherry picked for stable/queens [1] and stable/rocky [2]. [1] https://review.openstack.org/#/c/602046/ [2] https://review.openstack.org/#/c/602048/
Fix also merged on stable/rocky. Waiting for it to merge on stable/queens.
Merged on stable/queens
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3614