Description of problem:
RESTCONF requests time out (10-second read timeout) when networking-odl fetches hostconfig from ODL.

Version-Release number of selected component (if applicable):

How reproducible:
Intermittently on a scale setup.

Steps to Reproduce:
1. Deploy the scale lab on cloud20 as documented in Sai's doc.
2. Run the browbeat rally scenario.
3. Monitor for dead L2 agents.
4. Analyze the neutron debug logs and look for dead-agent messages and restconf timeouts, like this one:

2018-07-31 20:12:10.119 53 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] REST/GET odl hostconfig failed, : ReadTimeout: HTTPConnectionPool(host='172.16.0.15', port=8081): Read timed out. (read timeout=10)
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding Traceback (most recent call last):
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding   File "/usr/lib/python2.7/site-packages/networking_odl/ml2/pseudo_agentdb_binding.py", line 62, in _rest_get_hostconfigs
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding     response = self.odl_rest_client.get()

Actual results:
2018-07-31 20:12:10.119 53 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-cf3a917f-ecb3-4b1b-85d5-1bf475f184b8 - - - - -] REST/GET odl hostconfig failed, : ReadTimeout: HTTPConnectionPool(host='172.16.0.15', port=8081): Read timed out. (read timeout=10)
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding Traceback (most recent call last):
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding   File "/usr/lib/python2.7/site-packages/networking_odl/ml2/pseudo_agentdb_binding.py", line 62, in _rest_get_hostconfigs
2018-07-31 20:12:10.119 53 ERROR networking_odl.ml2.pseudo_agentdb_binding     response = self.odl_rest_client.get()
2018-07-31 20:12:10.451 54 WARNING neutron.db.agents_db [req-e051c63d-5758-483b-8419-9b8d8505844c - - - - -] Agent healthcheck: found 42 dead agents out of 48:
Type    Last heartbeat       host
ODL L2  2018-07-31 20:10:28  overcloud-1029pcompute-3.localdomain
ODL L2  2018-07-31 20:10:29  overcloud-1029pcompute-8.localdomain
ODL L2  2018-07-31 20:10:28  overcloud-1029pcompute-9.localdomain
ODL L2  2018-07-31 20:10:28  overcloud-1029pcompute-2.localdomain
ODL L2  2018-07-31 20:10:28  overcloud-6018rcompute-1.localdomain

Expected results:
No HTTP timeouts when networking-odl reads hostconfig from ODL.
No dead L2 agents (openstack network agent list).

Additional info:
The timeouts occurred on controller-2 for this session. Will attach neutron logs from each controller.
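For step 4, something along these lines could help tally the read timeouts per ODL endpoint when going through the attached neutron logs. This is only a rough sketch: the function names `count_read_timeouts` and `main` are made up for illustration, and the regex assumes the ReadTimeout fragment looks exactly like the lines quoted in this report.

```python
import re
import sys
from collections import Counter

# Assumption: the timeout fragment has exactly the shape quoted in this bug.
TIMEOUT_RE = re.compile(
    r"ReadTimeout: HTTPConnectionPool\(host='(?P<host>[^']+)', "
    r"port=(?P<port>\d+)\): Read timed out\. "
    r"\(read timeout=(?P<timeout>\d+)\)"
)


def count_read_timeouts(lines):
    """Count ReadTimeout log lines, keyed by (host, port) of the endpoint."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts


def main(paths):
    """Print a per-endpoint timeout count for each log file given."""
    for path in paths:
        with open(path, errors="replace") as handle:
            counts = count_read_timeouts(handle)
        for (host, port), hits in counts.most_common():
            print(f"{path}: {host}:{port} had {hits} read timeouts")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run it against the server logs from each controller to see which endpoint the timeouts cluster on.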
Created attachment 1471941 [details] neutron log controller-0
Created attachment 1471942 [details] neutron log controller-1
Created attachment 1471943 [details] neutron log controller-2
This is quite a big issue for which we don't have a good solution yet. It probably won't be ready for z3, so moving to z4.
*** Bug 1610879 has been marked as a duplicate of this bug. ***
Currently the proposed solution is to "disable" the "aliveness" timer so that the agents list is updated based only on what ODL reports. Since ODL is the one reporting the "agents", it is the source of truth and can decide whether an agent is "alive" or "dead"; Neutron simply takes that information and reflects it on its side (which is the implementation today). Hence what's needed to fix this bug is to set agent_down_time in neutron.conf to something like 999999999 seconds (~31.7 years). It also makes sense to increase restconf_poll_interval in ml2_plugin.ini to around 120 seconds to reduce the amount of polling.
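To make the proposed tweak concrete, here is a minimal sketch that applies both values to config text with Python's configparser. Assumptions on my side: agent_down_time lives under [DEFAULT] and restconf_poll_interval under [ml2_odl], and the helper name `apply_workaround` is hypothetical. Note that configparser drops comments on write and is not a full oslo.config parser, so treat this as an illustration rather than a deployment tool.

```python
import configparser
import io

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def _parse(text):
    """Parse INI text, keeping option names case-sensitive."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str
    parser.read_string(text)
    return parser


def _dump(parser):
    """Serialize a parser back to INI text."""
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def apply_workaround(neutron_conf_text, ml2_conf_text,
                     agent_down_time=999999999,
                     restconf_poll_interval=120):
    """Return (neutron_conf, ml2_conf) text with the two values set."""
    neutron = _parse(neutron_conf_text)
    # Assumed location: [DEFAULT] section of neutron.conf.
    neutron["DEFAULT"]["agent_down_time"] = str(agent_down_time)

    ml2 = _parse(ml2_conf_text)
    # Assumed location: [ml2_odl] section of the ML2 config file.
    if not ml2.has_section("ml2_odl"):
        ml2.add_section("ml2_odl")
    ml2["ml2_odl"]["restconf_poll_interval"] = str(restconf_poll_interval)
    return _dump(neutron), _dump(ml2)


if __name__ == "__main__":
    new_neutron, new_ml2 = apply_workaround("[DEFAULT]\n", "[ml2_odl]\n")
    print(new_neutron)
    print(new_ml2)
    print(f"agent_down_time in years: {999999999 / SECONDS_PER_YEAR:.1f}")
```

The services would still need a restart to pick up the new values.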
*** Bug 1610889 has been marked as a duplicate of this bug. ***
Since bug 1519925 is already open for the same issue, let's use that one to tackle the resiliency of the L2 "agents" mechanism, and use this bug to track down the cause of the timeouts.
The REST API timeouts are seen not only for hostconfigs but also when n-odl tries to push any neutron resource update to Netvirt:

server.log.1:33657:2018-08-19 15:41:36.225 32 ERROR networking_odl.common.client [req-b62fbaad-4473-4908-b1b4-47aca11f7096 - - - - -] REST request ( post ) to url ( subnets ) is failed. Request body : [{u'subnet': {'updated_at': '2018-08-19T15:41:25Z', 'ipv6_ra_mode': None, 'allocation_pools': [{'start': '10.2.187.2', 'end': '10.2.187.254'}], 'host_routes': [], 'revision_number': 0, 'ipv6_address_mode': None, 'id': '3cbba7e3-f0c8-40e7-bc7e-a85bc6ec9386', 'dns_nameservers': [], 'gateway_ip': '10.2.187.1', 'shared': False, 'project_id': u'7f23d07152cb4714a50c33d65bd4be8f', 'description': u'', 'tags': [], 'cidr': '10.2.187.0/24', 'service_types': [], 'name': u's_rally_91155bdf_DVXudy1l', 'enable_dhcp': True, 'network_id': 'd558e136-9e46-4752-9b4c-52a811d896fe', 'tenant_id': u'7f23d07152cb4714a50c33d65bd4be8f', 'created_at': '2018-08-19T15:41:25Z', 'ip_version': 4}}] service: ReadTimeout: HTTPConnectionPool(host='172.16.0.11', port=8081): Read timed out. (read timeout=10)
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality