Bug 1610546 - [Scale][Netvirt] REST calls under load time out after 10 seconds
Summary: [Scale][Netvirt] REST calls under load time out after 10 seconds
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z5
Target Release: 13.0 (Queens)
Assignee: Sridhar Gaddam
QA Contact: Noam Manos
URL:
Whiteboard: Netvirt
Duplicates: 1610879 1610889
Depends On:
Blocks:
 
Reported: 2018-07-31 21:44 UTC by Victor Pickard
Modified: 2019-03-06 16:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-06 16:16:53 UTC
Target Upstream Version:
Embargoed:


Attachments
neutron log controller-0 (4.87 MB, application/x-gzip), 2018-07-31 21:56 UTC, Victor Pickard
neutron log controller-1 (4.94 MB, application/x-gzip), 2018-07-31 21:56 UTC, Victor Pickard
neutron log controller-2 (4.75 MB, application/x-gzip), 2018-07-31 21:57 UTC, Victor Pickard

Description Victor Pickard 2018-07-31 21:44:16 UTC
Description of problem:

RESTCONF requests time out (after 10 seconds) when networking-odl fetches hostconfig from ODL.


Version-Release number of selected component (if applicable):


How reproducible:

Intermittently on scale setup.


Steps to Reproduce:
1. Deploy scale lab on cloud20 as documented in Sai's doc
2. Run browbeat rally scenario
3. Monitor for dead l2 agents
4. Analyze neutron debug logs and look for dead agent logs and restconf timeouts, like this one:


2018-07-31 20:12:10.119 53 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] REST/GET odl hostconfig failed, : ReadTimeout: HTTPConnectionPool(host='172.16.0.15', port=8081): Read timed out. (read timeout=10)
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding Traceback (most recent call last):
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding   File "/usr/lib/python2.7/site-packages/networking_odl/ml2/pseudo_agentdb_binding.py", line 62, in _rest_get_hostconfigs
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding     response = self.odl_rest_client.get()




Actual results:

2018-07-31 20:12:10.119 53 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] REST/GET odl hostconfig failed, : ReadTimeout: HTTPConnectionPool(host='172.16.0.15', port=8081): Read timed out. (read timeout=10)
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding Traceback (most recent call last):
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding   File "/usr/lib/python2.7/site-packages/networking_odl/ml2/pseudo_agentdb_binding.py", line 62, in _rest_get_hostconfigs
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding     response = self.odl_rest_client.get()



2018-07-31 20:12:10.451 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 42 dead agents out of 48:
                Type       Last heartbeat host
              ODL L2  2018-07-31 20:10:28 overcloud-1029pcompute-3.localdomain
              ODL L2  2018-07-31 20:10:29 overcloud-1029pcompute-8.localdomain
              ODL L2  2018-07-31 20:10:28 overcloud-1029pcompute-9.localdomain
              ODL L2  2018-07-31 20:10:28 overcloud-1029pcompute-2.localdomain
              ODL L2  2018-07-31 20:10:28 overcloud-6018rcompute-1.localdomain
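
Step 4 of the reproduction (scanning the neutron logs for restconf timeouts and dead-agent reports) can be partly automated. The following is a minimal sketch, not part of networking-odl or neutron: the function name scan_neutron_log and both regular expressions are hypothetical and were written only against the two log formats quoted in this report, so they may need adjusting for other messages.

```python
import re

# Hypothetical patterns, derived only from the log excerpts quoted in this bug.
TIMEOUT_RE = re.compile(
    r"REST/GET odl hostconfig failed.*ReadTimeout: HTTPConnectionPool\("
    r"host='(?P<host>[^']+)', port=(?P<port>\d+)\)"
    r".*read timeout=(?P<timeout>\d+)"
)
DEAD_RE = re.compile(
    r"Agent healthcheck: found (?P<dead>\d+) dead agents out of (?P<total>\d+)"
)


def scan_neutron_log(lines):
    """Collect hostconfig read timeouts and dead-agent healthcheck reports.

    Returns two lists: (host, port, timeout_seconds) tuples and
    (dead_agents, total_agents) tuples, in the order they appear.
    """
    timeouts, healthchecks = [], []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append(
                (match.group("host"),
                 int(match.group("port")),
                 int(match.group("timeout")))
            )
            continue
        match = DEAD_RE.search(line)
        if match:
            healthchecks.append(
                (int(match.group("dead")), int(match.group("total")))
            )
    return timeouts, healthchecks


# Two lines copied from the excerpts above.
SAMPLE = [
    "2018-07-31 20:12:10.119 53 WARNING networking_odl.ml2.pseudo_agentdb_binding "
    "[req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] REST/GET odl hostconfig "
    "failed, : ReadTimeout: HTTPConnectionPool(host='172.16.0.15', port=8081): "
    "Read timed out. (read timeout=10)",
    "2018-07-31 20:12:10.451 54 WARNING neutron.db.agents_db "
    "[req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: "
    "found 42 dead agents out of 48:",
]

timeouts, healthchecks = scan_neutron_log(SAMPLE)
print(timeouts)
print(healthchecks)
```

Because the function accepts any iterable of lines, an open log file object can be passed to it directly instead of the sample list.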



Expected results:

Expect no HTTP timeouts when networking-odl is reading hostconfig from ODL.

Expect to see no dead L2 agents (openstack network agent list).


Additional info:

The timeouts occurred on controller-2 for this session.

Will attach neutron logs from each controller.

Comment 1 Victor Pickard 2018-07-31 21:56:06 UTC
Created attachment 1471941 [details]
neutron log controller-0

Comment 2 Victor Pickard 2018-07-31 21:56:57 UTC
Created attachment 1471942 [details]
neutron log controller-1

Comment 3 Victor Pickard 2018-07-31 21:57:38 UTC
Created attachment 1471943 [details]
neutron log controller-2

Comment 5 Mike Kolesnik 2018-08-06 10:25:10 UTC
This is quite a big issue for which we don't have a good solution yet. It probably won't be ready for z3, so moving to z4.

Comment 6 Mike Kolesnik 2018-08-07 09:04:07 UTC
*** Bug 1610879 has been marked as a duplicate of this bug. ***

Comment 7 Mike Kolesnik 2018-08-07 10:29:54 UTC
Currently the proposed solution is to "disable" the "aliveness" timer so that the agents list is updated based on what's reported by ODL.

Since ODL reports the "agents", it is the source of truth and can decide whether an agent is "alive" or "dead"; we simply take that information and reflect it on the Neutron side (which is the implementation today).

Hence what's needed to fix this bug is to tweak the agent_down_time value in neutron.conf to something like 999999999 (~31 years).

It also makes sense to increase restconf_poll_interval in ml2_plugin.ini to around 120 seconds to lower the amount of polling being done.
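
As a rough illustration of the two settings proposed in this comment, the sketch below renders them as INI fragments and converts the agent_down_time value from seconds to years as a sanity check. The section names are assumptions not stated in this bug: agent_down_time is taken to live under [DEFAULT] in neutron.conf and restconf_poll_interval under [ml2_odl] in the ML2 plugin configuration. The helper render_ini is hypothetical and only formats text; it does not touch any real configuration file.

```python
import configparser
import io

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Values proposed in this comment (both in seconds).
AGENT_DOWN_TIME = 999999999
RESTCONF_POLL_INTERVAL = 120


def render_ini(section, options):
    """Return an INI fragment with one section holding the given options."""
    parser = configparser.ConfigParser()
    parser[section] = {key: str(value) for key, value in options.items()}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Assumed section names; verify them against the deployed configuration.
neutron_conf = render_ini("DEFAULT", {"agent_down_time": AGENT_DOWN_TIME})
ml2_conf = render_ini(
    "ml2_odl", {"restconf_poll_interval": RESTCONF_POLL_INTERVAL}
)

# How long the "aliveness" timer effectively becomes, in years.
years = AGENT_DOWN_TIME / SECONDS_PER_YEAR

print(neutron_conf)
print(ml2_conf)
print(f"agent_down_time is about {years:.1f} years")
```

With a timer this long, neutron effectively never marks an agent dead on its own, which matches the intent of deferring to what ODL reports.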

Comment 9 Mike Kolesnik 2018-08-07 11:18:57 UTC
*** Bug 1610889 has been marked as a duplicate of this bug. ***

Comment 10 Mike Kolesnik 2018-08-09 08:55:03 UTC
Since bug 1519925 is already open for the same issue, let's use that one to tackle the resiliency of the L2 "agents" mechanism, and use this bug to track down the cause of the timeouts.

Comment 12 Sridhar Gaddam 2018-08-27 14:56:37 UTC
The REST API timeouts are seen not only for hostconfigs but also when n-odl tries to push any neutron resource update to Netvirt.

server.log.1:33657:2018-08-19 15:41:36.225 32 ERROR networking_odl.common.client [req-b62fbaad-4473-4908-b1b4-47aca11f7096 - - - - -] REST request ( post ) to url ( subnets ) is failed. Request body : [{u'subnet': {'updated_at': '2018-08-19T15:41:25Z', 'ipv6_ra_mode': None, 'allocation_pools': [{'start': '10.2.187.2', 'end': '10.2.187.254'}], 'host_routes': [], 'revision_number': 0, 'ipv6_address_mode': None, 'id': '3cbba7e3-f0c8-40e7-bc7e-a85bc6ec9386', 'dns_nameservers': [], 'gateway_ip': '10.2.187.1', 'shared': False, 'project_id': u'7f23d07152cb4714a50c33d65bd4be8f', 'description': u'', 'tags': [], 'cidr': '10.2.187.0/24', 'service_types': [], 'name': u's_rally_91155bdf_DVXudy1l', 'enable_dhcp': True, 'network_id': 'd558e136-9e46-4752-9b4c-52a811d896fe', 'tenant_id': u'7f23d07152cb4714a50c33d65bd4be8f', 'created_at': '2018-08-19T15:41:25Z', 'ip_version': 4}}] service: ReadTimeout: HTTPConnectionPool(host='172.16.0.11', port=8081): Read timed out. (read timeout=10)

Comment 18 Franck Baudin 2019-03-06 16:16:53 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality

