Bug 1692772 - [OSP10] Flapping of floating IPs
Summary: [OSP10] Flapping of floating IPs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: z12
Target Release: 10.0 (Newton)
Assignee: Brian Haley
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-26 12:01 UTC by ggrimaux
Modified: 2019-11-12 13:07 UTC
CC List: 11 users

Fixed In Version: openstack-neutron-9.4.1-41.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-10 09:18:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Status Summary Last Updated
OpenStack gerrit 425862 MERGED Do not raise an error deleting neighbour entry 2020-09-30 09:46:15 UTC
OpenStack gerrit 613536 MERGED Add permanent ARP entries for DVR fip/qrouter veth pair 2020-09-30 09:46:15 UTC
Red Hat Product Errata RHBA-2019:1721 2019-07-10 09:18:45 UTC

Description ggrimaux 2019-03-26 12:01:28 UTC
Description of problem:
OSP10 with DVR + OVS. The problem started Thursday of last week, after a reboot of the controller nodes. This is happening in a production environment.

The client is experiencing floating IP flapping.

When we run tcpdump from the qrouter namespace, we see our pings fail because the qrouter does not have the MAC address of its neighbor (the fpr interface). After a few seconds it learns the address, but it only keeps it for a short time (seconds).

If we migrate the instance to another compute node, the problem is gone: no more flapping. If we bring it back to the original compute node, the problem starts again.
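
For observing the flap, here is a minimal polling sketch (not from the original report): it only shells out to standard ip commands and logs every change in the neighbour entry for the qrouter end of the rfp/fpr veth pair. The namespace and device names are placeholders to be filled in from ip netns list. Note that the second upstream change linked above (gerrit 613536) addresses exactly this symptom by making the ARP entries for the DVR fip/qrouter veth pair permanent, so they can no longer age out.

#!/usr/bin/env python
# Diagnostic sketch -- NAMESPACE and DEVICE are placeholders, not values
# from this bug. Polls the neighbour table inside the qrouter namespace
# and prints a line whenever the entry changes (a flap shows up as the
# entry cycling through REACHABLE/STALE/FAILED or vanishing entirely).
import subprocess
import time

NAMESPACE = "qrouter-<router-uuid>"  # placeholder: take from `ip netns list`
DEVICE = "rfp-<router-uuid>"         # placeholder: qrouter end of the veth pair

def neigh_state():
    cmd = ["ip", "netns", "exec", NAMESPACE,
           "ip", "neigh", "show", "dev", DEVICE]
    try:
        return subprocess.check_output(cmd).decode().strip()
    except subprocess.CalledProcessError:
        return "<command failed>"

last = None
while True:
    current = neigh_state()
    if current != last:
        print("%s  %s" % (time.strftime("%H:%M:%S"), current or "<no entry>"))
        last = current
    time.sleep(1)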

They then tried to spawn a new instance on that "good" compute node, but it failed with the following error:

2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server [req-e4f37826-dc11-4daf-8cf5-7e618ab0b004 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] Exception during message handling
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 71, in del_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     self._update_arp_entry(context, payload, 'delete')
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 63, in _update_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     ri._update_arp_entry(ip, mac, subnet_id, action)
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 246, in _update_arp_entry
2019-03-25 13:45:28.135 589794 ERROR oslo_messaging.rpc.server     LOG.exception(_LE("DVR: Failed updating arp entry"))

On that "good" compute node, the l3-agent log is filled with the following (every 5 minutes or so):

2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router [req-4224ef08-6c3e-4b7b-b179-0b00313c7d4b 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] DVR: Failed updating arp entry
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router Traceback (most recent call last):
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 233, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     device.neigh.delete(ip, mac)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 859, in delete
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     'dev', self.name))
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 387, in _as_root
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     use_root_namespace=use_root_namespace)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 97, in _as_root
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     log_fail_as_error=self.log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 106, in _execute
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     log_fail_as_error=log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 144, in execute
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router     raise ProcessExecutionError(msg, returncode=returncode)
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: No such file or directory
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router
2019-03-25 15:50:26.996 502682 ERROR neutron.agent.l3.dvr_local_router
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server [req-4224ef08-6c3e-4b7b-b179-0b00313c7d4b 038c55174d4fe083a08ab99a19ecfb99999aaed90ccf5e002b0317f1d96ca981 6b385ea6687e4377a0f0146b1274d65d - - -] Exception during message handling
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 71, in del_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     self._update_arp_entry(context, payload, 'delete')
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr.py", line 63, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     ri._update_arp_entry(ip, mac, subnet_id, action)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 246, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     LOG.exception(_LE("DVR: Failed updating arp entry"))
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 233, in _update_arp_entry
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     device.neigh.delete(ip, mac)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 859, in delete
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     'dev', self.name))
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 387, in _as_root
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     use_root_namespace=use_root_namespace)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 97, in _as_root
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     log_fail_as_error=self.log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 106, in _execute
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     log_fail_as_error=log_fail_as_error)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 144, in execute
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server     raise ProcessExecutionError(msg, returncode=returncode)
2019-03-25 15:50:26.996 502682 ERROR oslo_messaging.rpc.server ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: No such file or directory
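
This is the failure mode addressed by the first upstream change linked above (gerrit 425862, "Do not raise an error deleting neighbour entry"): deleting a neighbour entry that is already gone makes ip neigh del exit with code 2 ("RTNETLINK answers: No such file or directory"), and the agent escalates that into a ProcessExecutionError instead of treating the delete as a no-op. A minimal sketch of that idea, assuming (as the traceback shows) that the error object carries a returncode attribute; the helper name is illustrative and not the actual patch:

# Sketch only -- simplified from the call path in the traceback above;
# the real fix lives in neutron's DVR router code (gerrit 425862).
from neutron.agent.linux.utils import ProcessExecutionError

def safe_neigh_delete(device, ip, mac):
    try:
        # wraps: ip neigh del <ip> lladdr <mac> dev <device>
        device.neigh.delete(ip, mac)
    except ProcessExecutionError as e:
        if e.returncode == 2:
            # "RTNETLINK answers: No such file or directory": the entry was
            # never there or is already gone -- deleting it is a no-op.
            return
        raise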

So far we have checked the kernel neighbor table garbage-collection thresholds (currently 1K, 2K, 4K).
We also checked the OVS FDB table size with ovs-appctl fdb/show; it is only about 330 entries.
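
For completeness, those thresholds can be read directly from /proc; a small sketch using only standard sysctl paths (nothing OpenStack-specific assumed):

# Print the kernel neighbour-table garbage-collection thresholds. If the
# table size approached gc_thresh3, the kernel itself would start evicting
# entries, which can also look like ARP flapping; the values in this bug
# (1K/2K/4K against only ~330 FDB entries) rule that out.
for n in (1, 2, 3):
    path = "/proc/sys/net/ipv4/neigh/default/gc_thresh%d" % n
    with open(path) as f:
        print("%s = %s" % (path, f.read().strip()))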

Version-Release number of selected component (if applicable):

openstack-neutron-openvswitch-9.1.0-8.el7ost.noarch         Mon Dec 12 18:32:54 2016
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64              Mon Dec 12 18:25:16 2016
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch       Mon Dec 12 18:14:04 2016

openstack-neutron-9.1.0-8.el7ost.noarch                     Mon Dec 12 18:32:46 2016
openstack-neutron-bigswitch-agent-9.40.0-1.1.el7ost.noarch  Mon Dec 12 18:01:15 2016
openstack-neutron-bigswitch-lldp-9.40.0-1.1.el7ost.noarch   Mon Dec 12 18:00:37 2016
openstack-neutron-common-9.1.0-8.el7ost.noarch              Mon Dec 12 18:00:09 2016
openstack-neutron-lbaas-9.1.0-1.el7ost.noarch               Mon Dec 12 18:32:47 2016
openstack-neutron-lbaas-ui-1.0.0-1.el7ost.noarch            Thu Apr 20 15:41:21 2017
openstack-neutron-metering-agent-9.1.0-8.el7ost.noarch      Mon Dec 12 18:34:18 2016
openstack-neutron-ml2-9.1.0-8.el7ost.noarch                 Mon Dec 12 18:14:03 2016
openstack-neutron-openvswitch-9.1.0-8.el7ost.noarch         Mon Dec 12 18:32:54 2016
openstack-neutron-sriov-nic-agent-9.1.0-8.el7ost.noarch     Mon Dec 12 18:35:04 2016


How reproducible:
Happening right now

Steps to Reproduce:
1. 
2.
3.

Actual results:
Some floating IPs flap.
Others are unreachable.

Expected results:
No flapping; floating IPs work.

Additional info:
See next comment

Comment 4 Nate Johnston 2019-03-26 13:35:49 UTC
Marshalling some resources to take a look at this. Since this environment has been around for a while, was there a precipitating event after which these issues started to happen? Or has it been happening all along and steadily gotten worse?

Comment 5 ggrimaux 2019-03-26 13:37:50 UTC
Problem started last week after rebooting all 3 controller nodes.
That's what I have.

Comment 30 errata-xmlrpc 2019-07-10 09:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1721

