Description of problem:

OSP10 with DVR + OVS, production environment. The problem started Thursday last week, after a reboot of the controller nodes. The client is experiencing floating IP flapping. When we tcpdump from the qrouter namespace we see our pings failing because the router does not have the MAC address of its neighbor (the fpr interface, i.e. the peer in the fip namespace). It learns the MAC after a few seconds, but only keeps it for a few seconds before losing it again. If we migrate the instance to another compute node the problem is gone, no more flapping; if we bring it back to the original compute, the problem starts again. They then tried to spawn a new instance on that "good" compute, but it failed with the following:

2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server [req-e4f37826-dc11-4daf-8cf5-7e618ab0b004 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] Exception during message handling
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 71, in del_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     self._update_arp_entry(context, payload, 'delete')
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 63, in _update_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     ri._update_arp_entry(ip, mac, subnet_id, action)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 246, in _update_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     LOG.exception(_LE("DVR: Failed updating arp entry"))
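For reference, this is roughly how the ARP flapping described above was observed; the namespace and interface names below are placeholders and have to be replaced with the real ones from the affected compute:

  # find the qrouter namespace for the affected router
  ip netns list | grep qrouter

  # watch ARP and ICMP on the rfp side of the qrouter<->fip veth pair
  ip netns exec qrouter-<router-uuid> tcpdump -lnei rfp-<router-id-prefix> 'arp or icmp'

  # check whether the neighbor entry for the fpr peer is present, STALE or FAILED
  ip netns exec qrouter-<router-uuid> ip neigh show dev rfp-<router-id-prefix>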
On that "good" compute, the l3-agent log fills up with the following (every 5 minutes or so):

2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router [req-4224ef08-6c3e-4b7b-b179-0b00313c7d4b 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] DVR: Failed updating arp entry
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router Traceback (most recent call last):
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 233, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     device.neigh.delete(ip, mac)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 859, in delete
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     'dev', self.name))
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 387, in _as_root
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     use_root_namespace=use_root_namespace)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 97, in _as_root
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     log_fail_as_error=self.log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 106, in _execute
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     log_fail_as_error=log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 144, in execute
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     raise ProcessExecutionError(msg, returncode=returncode)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: No such file or directory
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server [req-4224ef08-6c3e-4b7b-b179-0b00313c7d4b 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] Exception during message handling
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 71, in del_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     self._update_arp_entry(context, payload, 'delete')
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 63, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     ri._update_arp_entry(ip, mac, subnet_id, action)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 246, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     LOG.exception(_LE("DVR: Failed updating arp entry"))
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 233, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     device.neigh.delete(ip, mac)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 859, in delete
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     'dev', self.name))
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 387, in _as_root
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     use_root_namespace=use_root_namespace)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 97, in _as_root
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     log_fail_as_error=self.log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 106, in _execute
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     log_fail_as_error=log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 144, in execute
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     raise ProcessExecutionError(msg, returncode=returncode)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: No such file or directory

So far we have checked the kernel neighbor table gc_thresh limits (currently 1K / 2K / 4K) and the OVS MAC learning table size with ovs-appctl fdb/show, which shows only about 330 entries.

Version-Release number of selected component (if applicable):

openstack-neutron-openvswitch-9.1.0-8.el7ost.noarch  Mon Dec 12 18:32:54 2016
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64  Mon Dec 12 18:25:16 2016
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch  Mon Dec 12 18:14:04 2016
openstack-neutron-9.1.0-8.el7ost.noarch  Mon Dec 12 18:32:46 2016
openstack-neutron-bigswitch-agent-9.40.0-1.1.el7ost.noarch  Mon Dec 12 18:01:15 2016
openstack-neutron-bigswitch-lldp-9.40.0-1.1.el7ost.noarch  Mon Dec 12 18:00:37 2016
openstack-neutron-common-9.1.0-8.el7ost.noarch  Mon Dec 12 18:00:09 2016
openstack-neutron-lbaas-9.1.0-1.el7ost.noarch  Mon Dec 12 18:32:47 2016
openstack-neutron-lbaas-ui-1.0.0-1.el7ost.noarch  Thu Apr 20 15:41:21 2017
openstack-neutron-metering-agent-9.1.0-8.el7ost.noarch  Mon Dec 12 18:34:18 2016
openstack-neutron-ml2-9.1.0-8.el7ost.noarch  Mon Dec 12 18:14:03 2016
openstack-neutron-sriov-nic-agent-9.1.0-8.el7ost.noarch  Mon Dec 12 18:35:04 2016

How reproducible:
Happening right now, consistently, on the affected compute node.

Steps to Reproduce:
1.
2.
3.

Actual results:
Some floating IPs are flapping; others are unreachable.

Expected results:
No flapping; floating IPs work.

Additional info:
See next comment.
Marshalling some resources to take a look at this. Since this environment has been around for a while, was there a precipitating event after which these issues started to happen? Or has it been happening all along and has steadily gotten worse?
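For reference, the failing device.neigh.delete(ip, mac) in the tracebacks boils down to an "ip neigh del ... dev ..." run inside the qrouter namespace, so the RTNETLINK error can be narrowed down by running the equivalent command by hand and checking what the namespace actually contains. The namespace, interface, IP and MAC below are placeholders, not values from this environment:

  # rough manual equivalent of the delete the l3-agent is attempting
  ip netns exec qrouter-<router-uuid> \
      ip -4 neigh del <fixed-ip> lladdr <mac-address> dev qr-<port-id-prefix>

  # see whether it is the device or the neighbor entry that is missing
  ip netns exec qrouter-<router-uuid> ip link show
  ip netns exec qrouter-<router-uuid> ip neigh show

The neighbor table and FDB checks mentioned in the description map to commands roughly along these lines (br-int is assumed to be the integration bridge; adjust for this deployment):

  # kernel neighbor table limits (the 1K / 2K / 4K values mentioned above)
  sysctl net.ipv4.neigh.default.gc_thresh1 \
         net.ipv4.neigh.default.gc_thresh2 \
         net.ipv4.neigh.default.gc_thresh3

  # current number of IPv4 neighbor entries on the host, for comparison
  ip -4 neigh show | wc -l

  # OVS MAC learning table on the integration bridge (~330 entries reported);
  # the count includes the one-line header that fdb/show prints
  ovs-appctl fdb/show br-int | wc -l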
The problem started last week, after all 3 controller nodes were rebooted. That's what I have.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1721