Bug 1824397 - 16.1 HA L3 connectivity issue
Summary: 16.1 HA L3 connectivity issue
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Bernard Cafarelli
QA Contact: Alex Katz
URL:
Whiteboard:
Depends On: 1825233 1827072 1827276
Blocks:
 
Reported: 2020-04-16 06:07 UTC by Attila Fazekas
Modified: 2021-02-05 12:40 UTC
CC List: 14 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200513033425.a90c03e.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-05 12:40:57 UTC
Target Upstream Version:
Embargoed:


Attachments
HA router update issue log (46.17 KB, text/plain)
2020-04-16 08:46 UTC, Bernard Cafarelli


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 725162 0 None MERGED Revert systemd sidecars 2021-02-08 07:54:34 UTC

Description Attila Fazekas 2020-04-16 06:07:48 UTC
Test jobs with 3 controllers and ML2/OVS (L3 HA) failed to SSH to the machines.

DHCP: OK
Metadata: failed
L3 TCP connect: failed (SSH fails at the connect stage, not at auth)


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200414.n.0
tripleo-ansible    0.5.1-0.20200413163419.943fadb.el8ost
python3-neutron-lib-1.29.1-0.20200310214054.4ef4b71.el8ost.noarch
openstack-neutron-common-15.0.3-0.20200414163432.eeabba8.el8ost.noarch

How reproducible:
Always

Similar errors visible:
https://bugzilla.redhat.com/show_bug.cgi?id=1817926#c4

Comment 2 Bernard Cafarelli 2020-04-16 08:21:57 UTC
Indeed, DHCP looks good:
Starting network...
udhcpc (v1.23.2) started
Sending discover...
Sending select for 10.100.0.8...
Lease of 10.100.0.8 obtained, lease time 86400

Metadata failing:
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.59. request failed
failed 2/20: up 8.60. request failed
failed 3/20: up 11.60. request failed
failed 4/20: up 16.61. request failed
failed 5/20: up 19.61. request failed
failed 6/20: up 24.62. request failed
failed 7/20: up 27.62. request failed
failed 8/20: up 32.63. request failed
failed 9/20: up 35.62. request failed
failed 10/20: up 40.63. request failed
failed 11/20: up 43.63. request failed
failed 12/20: up 48.64. request failed
failed 13/20: up 51.64. request failed
failed 14/20: up 56.65. request failed
failed 15/20: up 59.65. request failed
failed 16/20: up 64.66. request failed
failed 17/20: up 67.66. request failed
failed 18/20: up 72.67. request failed
failed 19/20: up 75.67. request failed
failed 20/20: up 80.68. request failed
failed to read iid from metadata. tried 20
failed to get instance-id of datasource

So for connectivity: the test is SSH, and login will fail without metadata (the instance never gets its SSH key).
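
A hedged diagnostic sketch (the router ID and pid-file path are the ones that appear in the l3-agent log excerpts below; the rest is an assumption about where to look), to be run on the controller hosting the master instance of the router:

ROUTER=39c0977f-b8ee-4d1b-bf42-ea327edc155a
# Does the l3 agent still track a haproxy pid for this router's metadata proxy?
sudo cat /var/lib/neutron/external/pids/${ROUTER}.pid.haproxy
# Is anything listening inside the router namespace? A healthy HA master should
# normally show a haproxy listener for the 169.254.169.254 metadata address.
sudo ip netns exec qrouter-${ROUTER} ss -ltnp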

Comment 3 Bernard Cafarelli 2020-04-16 08:46:28 UTC
Created attachment 1679272 [details]
HA router update issue log

An extract from the controller-2 l3-agent log file:

2020-04-15 14:50:43.274 123661 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 39c0977f-b8ee-4d1b-bf42-ea327edc155a _update_metadata_proxy /usr/lib/python3.6/site-packages/neutron/agent/l3/ha.py:219
2020-04-15 14:50:43.275 123661 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/39c0977f-b8ee-4d1b-bf42-ea327edc155a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

Then there are *-kill 9/15 commands and router deletion operations (triggered by router updates?), with backtraces like:
TypeError: Object of type 'ValueError' is not JSON serializable
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'

In the middle, haproxy is reported a few times as not found:
No haproxy process started for 39c0977f-b8ee-4d1b-bf42-ea327edc155a disable

And in the end, update hits retry limit:
2020-04-15 14:50:49.699 123661 WARNING neutron.agent.l3.agent [-] Hit retry limit with router update for 39c0977f-b8ee-4d1b-bf42-ea327edc155a, action 1
2020-04-15 14:50:49.699 123661 INFO neutron.agent.l3.agent [-] Finished a router update for 39c0977f-b8ee-4d1b-bf42-ea327edc155a, update_id 6b236712-bcdd-40d6-8375-bbbb1e266d6b. Time elapsed: 0.004
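
As a hedged sketch (the path is taken from the FileNotFoundError above; the "state" file name is an assumption about how neutron tracks keepalived state), the HA role on each controller could be checked with:

ROUTER=39c0977f-b8ee-4d1b-bf42-ea327edc155a
# Does the keepalived config dir that the backtrace complains about still exist,
# and what HA role does this controller currently report for the router?
sudo ls -l /var/lib/neutron/ha_confs/${ROUTER}/
sudo cat /var/lib/neutron/ha_confs/${ROUTER}/state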

Comment 4 Bogdan Dobrelya 2020-04-16 09:10:50 UTC
This should have something to do with:

2020-04-15 14:50:47.879 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 128, in send
    s = self.dumps(obj)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 170, in dumps
    return json.dumps(obj, cls=RpcJSONEncoder).encode('utf-8')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 43, in default
    return super(RpcJSONEncoder, self).default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ValueError' is not JSON serializable
...
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/
ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 506, in _safe_router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self._router_removed(ri, router_id)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 542, in _router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.router_info[router_id] = ri
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.force_reraise()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     six.reraise(self.type_, self.value, self.tb)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     raise value
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py", line 539, in _router_removed
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     ri.delete()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 491, in delete
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     self.disable_keepalived()
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 202, in disable_keepalived
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     shutil.rmtree(conf_dir)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib64/python3.6/shutil.py", line 477, in rmtree
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     onerror(os.lstat, path, sys.exc_info())
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent   File "/usr/lib64/python3.6/shutil.py", line 475, in rmtree
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent     orig_st = os.lstat(path)
2020-04-15 14:50:48.442 130335 ERROR neutron.agent.l3.agent FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/neutron/ha_confs/39c0977f-b8ee-4d1b-bf42-ea327edc155a'

in controller-[012]/var/log/containers/neutron/l3-agent.log.1

The SSH issues in the tempest logs seem to have started not long after that, at 2020-04-15 15:04:38.

There are more suspicious messages around that period:

       2020-04-15T15:04:36.438              nova-api                              24 ERROR oslo.messaging._drivers.impl_rabbit [-] [04aee1ed-f903-4d09-9bec-726a2d3e24af] AMQP server on controller-1.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
       2020-04-15T15:04:41.279              nova-api                              25 ERROR oslo.messaging._drivers.impl_rabbit [-] [a3904c17-de76-4b50-bea7-6ef379d2ae67] AMQP server on controller-0.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
           2020-04-15T15:08:59                mysqld                              13748 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
           2020-04-15T15:09:53                mysqld                              4196 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
           2020-04-15T15:09:56                mysqld                              4173 [ERROR] InnoDB: WSREP: referenced FK check fail: Lock wait index `PRIMARY` table `ovs_neutron`.`provisioningblocks`
       2020-04-15T15:10:47.156            dhcp-agent                              123028 ERROR neutron.agent.linux.dhcp [req-7b111235-5353-4053-9bc0-30dc8ab35615 - - - - -] Failed to start DHCP process for network 2d214734-14e2-4611-a935-413b4bbf79eb: neutron.common.utils.WaitTimeout: Timed out after 60 seconds

and

Apr 15 14:50:41              messages                              controller-2 Keepalived_vrrp[481602]: (VR_5) Receive advertisement timeout
Apr 15 15:12:16              messages                              controller-1 Keepalived_vrrp[613342]: (VR_4) Receive advertisement timeout

Comment 5 Bogdan Dobrelya 2020-04-16 09:15:04 UTC
Perhaps https://review.opendev.org/#/q/I31237d21527a2909a1669cb6c80cc0fa9be798a6 addresses this?

Comment 6 Slawek Kaplonski 2020-04-16 11:44:05 UTC
In L3 agent logs I see a lot of errors like:

2020-04-15 14:50:47.874 130335 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['radvd-kill', '9', '502528'] execute_rootwrap_daemon /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:103
2020-04-15 14:50:47.879 130335 ERROR neutron.agent.linux.utils [-] Rootwrap error running command: ['radvd-kill', '9', '502528']: multiprocessing.managers.RemoteError:
2020-04-15 14:50:47.879 130335 ERROR neutron.agent.l3.agent [-] Error while deleting router 39c0977f-b8ee-4d1b-bf42-ea327edc155a: multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 128, in send
    s = self.dumps(obj)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 170, in dumps
    return json.dumps(obj, cls=RpcJSONEncoder).encode('utf-8')
  File "/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/site-packages/oslo_rootwrap/jsonrpc.py", line 43, in default
    return super(RpcJSONEncoder, self).default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'ValueError' is not JSON serializable

---------------------------------------------------------------------------

This leads me to http://bugzilla.redhat.com/1823795, which has similar issues. I just proposed patch https://review.opendev.org/720431 to fix this issue.
The patch is very simple: it just adds one line to the /etc/neutron/rootwrap.d/l3.filters file, and it should be applied inside the neutron-l3-agent containers. Could we try applying this change manually in some environment and check whether those tests pass?
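
As an aside on why the real error is hidden: json.dumps() cannot serialize exception objects, so the rootwrap daemon's actual error (a ValueError here) is lost and only the TypeError above is logged. Below is a minimal sketch of the manual test; the container name and the filter line are assumptions for illustration, not the merged patch text.

# Reproduce the masking: serializing an exception object raises the same TypeError.
python3 -c 'import json; json.dumps(ValueError("example"))'

# Hypothetical manual test of the proposed fix: append a CommandFilter entry for the
# *-kill helper inside the l3-agent container on each controller, then restart the
# agent so its rootwrap daemon reloads the filter list.
sudo podman exec -u root neutron_l3_agent sh -c \
    'echo "radvd-kill: CommandFilter, radvd-kill, root" >> /etc/neutron/rootwrap.d/l3.filters'
sudo podman restart neutron_l3_agent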

Comment 8 Bernard Cafarelli 2020-04-20 13:29:21 UTC
Adding the dependent bugs; once both are fixed it should help here, but keeping this one open for tracking.

Comment 10 Bernard Cafarelli 2020-04-23 14:29:09 UTC
Adding another recent bug; this one imposes limitations on spawned dnsmasq processes, so it may also show up as VM connectivity errors.

Comment 13 Bernard Cafarelli 2020-05-06 14:13:04 UTC
Updating the status: this will be fixed by the upstream stable/train revert https://review.opendev.org/#/c/725162/ (waiting on upstream gates).
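
Once a compose carrying the revert is deployed, a quick check (sketch) is that the build noted in "Fixed In Version" above, or newer, is installed:

rpm -q openstack-tripleo-heat-templates
# expect openstack-tripleo-heat-templates-11.3.2-0.20200513033425.a90c03e.el8ost or later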

