Test jobs with 3 controllers, ML2/OVS (L3 HA), failed to SSH to the machines.
DHCP: ok
metadata: failed
L3 TCP connect: failed (SSH fails on connect, not on auth)

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200414.n.0
tripleo-ansible 0.5.1-0.20200413163419.943fadb.el8ost
python3-neutron-lib-1.29.1-0.20200310214054.4ef4b71.el8ost.noarch
openstack-neutron-common-15.0.3-0.20200414163432.eeabba8.el8ost.noarch

How reproducible:
Always

Similar errors are visible in: https://bugzilla.redhat.com/show_bug.cgi?id=1817926#c4
Indeed, DHCP looks good:

Starting network...
udhcpc (v1.23.2) started
Sending discover...
Sending select for 10.100.0.8...
Lease of 10.100.0.8 obtained, lease time 86400

Metadata is failing:

checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.59. request failed
failed 2/20: up 8.60. request failed
failed 3/20: up 11.60. request failed
failed 4/20: up 16.61. request failed
failed 5/20: up 19.61. request failed
failed 6/20: up 24.62. request failed
failed 7/20: up 27.62. request failed
failed 8/20: up 32.63. request failed
failed 9/20: up 35.62. request failed
failed 10/20: up 40.63. request failed
failed 11/20: up 43.63. request failed
failed 12/20: up 48.64. request failed
failed 13/20: up 51.64. request failed
failed 14/20: up 56.65. request failed
failed 15/20: up 59.65. request failed
failed 16/20: up 64.66. request failed
failed 17/20: up 67.66. request failed
failed 18/20: up 72.67. request failed
failed 19/20: up 75.67. request failed
failed 20/20: up 80.68. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource

So then, for connectivity, the test is SSH, where login will fail without metadata.
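For reference, the guest-side check above is just an HTTP GET against the metadata service. A minimal sketch of reproducing it by hand from inside a VM (plain Python, nothing neutron-specific; the URL is the one from the console log above):

import urllib.request

# Manually reproduce the cirros metadata probe: fetch the instance-id
# from the Nova metadata service via the link-local address. In the
# failing jobs this never succeeds, because the HA router (and with it
# the haproxy metadata proxy) is broken.
URL = "http://169.254.169.254/2009-04-04/instance-id"

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        print("instance-id:", resp.read().decode())
except OSError as exc:
    print("request failed:", exc)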
Created attachment 1679272 [details]
HA router update issue log

An extract from the controller-2 l3-agent logfile:

2020-04-15 14:50:43.274 123661 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 39c0977f-b8ee-4d1b-bf42-ea327edc155a _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:219
2020-04-15 14:50:43.275 123661 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/39c0977f-b8ee-4d1b-bf42-ea327edc155a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

Then there are *-kill 9/15 commands and router-deleting operations on the router update(?), with backtraces like:

TypeError: Object of type 'ValueError' is not JSON serializable
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'

In the middle, haproxy is reported a few times as not found:

No haproxy process started for 39c0977f-b8ee-4d1b-bf42-ea327edc155a disable

And in the end, the update hits the retry limit:

2020-04-15 14:50:49.699 123661 WARNING neutron.agent.l3.agent [-] Hit retry limit with router update for 39c0977f-b8ee-4d1b-bf42-ea327edc155a, action 1
2020-04-15 14:50:49.699 123661 INFO neutron.agent.l3.agent [-] Finished a router update for 39c0977f-b8ee-4d1b-bf42-ea327edc155a, update_id 6b236712-bcdd-40d6-8375-bbbb1e266d6b. Time elapsed: 0.004
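For context, the "Unable to access ... pid.haproxy" DEBUG line comes from the helper that reads the metadata proxy's PID file. A simplified sketch of how that helper behaves (just the shape of it, not the exact neutron code):

import logging

LOG = logging.getLogger(__name__)

def get_value_from_file(filename, converter=None):
    # Simplified sketch: if the PID file cannot be read (for example the
    # haproxy metadata proxy was never spawned for this router), only a
    # DEBUG message is logged and nothing is returned, so the agent
    # treats the proxy as "not running" and tries to (re)spawn it.
    try:
        with open(filename) as f:
            data = f.read()
        return converter(data) if converter else data
    except (OSError, ValueError):
        LOG.debug("Unable to access %s", filename)
        return None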
This should have something to do with:

2020-04-15 14:50:47.879 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 128, in send
    s = self.dumps(obj)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 170, in dumps
    return json.dumps(obj, cls=RpcJSONEncoder).encode('utf-8')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 43, in default
    return super(RpcJSONEncoder, self).default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ValueError' is not JSON serializable
...

2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 506, in _safe_router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self._router_removed(ri, router_id)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 542, in _router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.router_info[router_id] = ri
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.force_reraise()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     six.reraise(self.type_, self.value, self.tb)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     raise value
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 539, in _router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     ri.delete()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 491, in delete
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.disable_keepalived()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 202, in disable_keepalived
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     shutil.rmtree(conf_dir)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib64/python3.6/shutil.py", line 477, in rmtree
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     onerror(os.lstat, path, sys.exc_info())
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib64/python3.6/shutil.py", line 475, in rmtree
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     orig_st = os.lstat(path)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'

in controller-[012]/var/log/containers/neutron/l3-agent.log.1

SSH issues in the tempest logs seem to have started not much later than that, at 2020-04-15 15:04:38. There are more suspicious messages around that period:

2020-04-15T15:04:36.438 nova-api 24 ERROR oslo.messaging._drivers.impl_rabbit [-] [04aee1ed-f903-4d09-9bec-726a2d3e24af] AMQP server on controller-1.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2020-04-15T15:04:41.279 nova-api 25 ERROR oslo.messaging._drivers.impl_rabbit [-] [a3904c17-de76-4b50-bea7-6ef379d2ae67] AMQP server on controller-0.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2020-04-15T15:08:59 mysqld 13748 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
2020-04-15T15:09:53 mysqld 4196 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
2020-04-15T15:09:56 mysqld 4173 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
2020-04-15T15:10:47.156 dhcp-agent 123028 ERROR neutron.agent.linux.dhcp [req-7b111235-5353-4053-9bc0-30dc8ab35615 - - - - -] Failed to start DHCP process for network 2d214734-14e2-4611-a935-413b4bbf79eb: neutron.common.utils.WaitTimeout: Timed out after 60 seconds

and in /var/log/messages:

Apr 15 14:50:41 controller-2 Keepalived_vrrp[481602]: (VR_5) Receive advertisement timeout
Apr 15 15:12:16 controller-1 Keepalived_vrrp[613342]: (VR_4) Receive advertisement timeout
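The FileNotFoundError in the l3-agent traceback above is essentially a race on the HA config directory: the teardown path calls shutil.rmtree() on a directory that is already gone. A minimal standalone reproduction, plus the tolerant variant one could imagine (illustration only, not the actual fix being tracked here):

import os
import shutil
import tempfile

conf_dir = tempfile.mkdtemp()   # stand-in for /var/lib/neutron/ha_confs/<router-id>
os.rmdir(conf_dir)              # simulate the directory already being removed

try:
    shutil.rmtree(conf_dir)     # what disable_keepalived() does on router delete
except FileNotFoundError as exc:
    print("reproduced:", exc)   # [Errno 2] No such file or directory: ...

# A tolerant cleanup simply ignores an already-missing directory:
shutil.rmtree(conf_dir, ignore_errors=True)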
Perhaps https://review.opendev.org/#/q/I31237d21527a2909a1669cb6c80cc0fa9be798a6 addresses this?
In the L3 agent logs I see a lot of errors like:

2020-04-15 14:50:47.874 130335 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['radvd-kill', '9', '502528'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2020-04-15 14:50:47.879 130335 ERROR neutron.agent.linux.utils [-] Rootwrap error running command: ['radvd-kill', '9', '502528']: multiprocessing.managers.RemoteError:
2020-04-15 14:50:47.879 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 128, in send
    s = self.dumps(obj)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 170, in dumps
    return json.dumps(obj, cls=RpcJSONEncoder).encode('utf-8')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 43, in default
    return super(RpcJSONEncoder, self).default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ValueError' is not JSON serializable
---------------------------------------------------------------------------

This leads me to http://bugzilla.redhat.com/1823795, which has similar issues. I have just proposed the patch https://review.opendev.org/720431 to fix this issue. The patch is very simple: it just adds one line to the /etc/neutron/rootwrap.d/l3.filters file, and it should be applied inside the neutron-l3-agent containers. Can we maybe try to apply this change manually in some environment and check whether those tests pass then?
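As a side note, the "not JSON serializable" part of the traceback is plain Python behaviour: the rootwrap daemon answers over a JSON-RPC channel, and json.dumps() cannot encode an exception instance, so whatever ValueError the daemon hit while handling the 'radvd-kill' command gets masked by this TypeError. A minimal standalone reproduction (the exception message below is an arbitrary placeholder):

import json

# The rootwrap daemon serializes its reply with json.dumps(). An exception
# instance has no JSON representation, so sending the real error back fails
# with the TypeError seen in the l3-agent log. The message is a placeholder,
# not the actual ValueError raised by the daemon.
try:
    json.dumps({"error": ValueError("placeholder error message")})
except TypeError as exc:
    print(exc)  # e.g. "Object of type 'ValueError' is not JSON serializable" on Python 3.6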
Adding the dependent bugs; once both are fixed it should help here, but keeping this bug open for tracking.
Adding another recent bug; this one causes limitations on spawned dnsmasq processes, so it may also show up as VM connectivity errors.
Updating status: this will be fixed by the upstream stable/train revert https://review.opendev.org/#/c/725162/ (waiting on upstream gates).