Bug 1404534

Summary: L3 agent failing with "Failed to process compatible router" errors
Product: Red Hat OpenStack
Reporter: PURANDHAR SAIRAM MANNIDI <pmannidi>
Component: openstack-neutron
Assignee: anil venkata <vkommadi>
Status: CLOSED ERRATA
QA Contact: Alexander Stafeyev <astafeye>
Severity: urgent
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: amuller, chrisw, cmedeiro, dalvarez, ggillies, ihrachys, jmelvin, jschwarz, ljozsa, nyechiel, oblaut, pmannidi, srevivo, tfreger, vkommadi
Target Milestone: zstream
Keywords: Triaged, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-2015.1.4-15.el7ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1437813 (view as bug list)
Environment:
Last Closed: 2017-07-12 13:15:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1437813, 1437818, 1437820

Description PURANDHAR SAIRAM MANNIDI 2016-12-14 03:56:53 UTC
Description of problem:
L3 agent failing with "Failed to process compatible router" errors

2016-12-13 21:54:32.142 119233 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'get_process'
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info Traceback (most recent call last):
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 345, in call
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     return func(*args, **kwargs)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 629, in process
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self._process_internal_ports()
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 382, in _process_internal_ports
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self.internal_network_added(p)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self._disable_ipv6_addressing_on_interface(interface_name)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     if manager.get_process().active:
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info AttributeError: 'NoneType' object has no attribute 'get_process'
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info
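
The 'NoneType' failure appears to be the HA router's keepalived process manager not being set when the internal port is processed (for example because an earlier router initialization failed). A minimal sketch of the failure mode follows; this is not the shipped Neutron code, and the guard shown is a hypothetical illustration rather than the actual fix:

    # Sketch of the pattern behind the traceback: self.keepalived_manager is
    # still None when the router has not finished HA initialization, so
    # manager.get_process() raises the AttributeError seen above.

    class HaRouterSketch(object):
        def __init__(self):
            # Populated later, when keepalived is spawned for the router.
            self.keepalived_manager = None

        def _should_delete_ipv6_lladdr(self, ipv6_lladdr):
            manager = self.keepalived_manager
            if manager is None:
                # Hypothetical guard: treat "keepalived not running yet" as
                # safe to delete, instead of crashing on NoneType.
                return True
            return manager.get_process().active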

Version-Release number of selected component (if applicable):
RH OSP 7

How reproducible:
always

Steps to Reproduce:
1. Restart the l3-agents via Pacemaker

Actual results:
L3 agents are unable to start, and there is packet loss when pinging the floating IP. New floating IPs cannot be created either.

Expected results:
The L3 agent starts normally and there is no packet loss when reaching the floating IP.

Comment 2 Caetano Medeiros 2016-12-14 05:34:12 UTC
The Neutron network/router setup for IPv6 that was causing the issue has been identified and removed.

Once these were removed and the services restarted, the instability and packet loss ceased. However, the issue could return at any time if a tenant creates a network using IPv6.

Comment 3 Ihar Hrachyshka 2016-12-14 09:39:59 UTC
In the logs, we see that the router failed to initialize because of a NoFilterMatched error from rootwrap when executing the following:

        wrapper.netns.execute(['sysctl', '-w',
                               'net.ipv4.conf.all.promote_secondaries=1'])

Sadly, rootwrap does not distinguish between a command failure and a filter matching failure. The l3.filters file contains the needed filter, so I assume that it's the command that failed.

audit.log does not contain any selinux denials.

Since you mentioned IPv6, I wonder if the sysctl knob is available in namespaces that belong to HA routers that are IPv6-only. I guess it's something to validate on a test setup.
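
For reference, a quick way to check both the filter and the command is to run the same sysctl through rootwrap by hand inside a router namespace. The filter entry and the command below are a sketch based on the stock l3.filters layout; the paths and the router ID are assumptions to adapt to the deployment:

    # Expected entry in /usr/share/neutron/rootwrap/l3.filters (exact form may
    # differ per release):
    #   sysctl: CommandFilter, sysctl, root
    #
    # Manual check through rootwrap (run as root), substituting a real
    # qrouter namespace for <router-id>:
    neutron-rootwrap /etc/neutron/rootwrap.conf \
        ip netns exec qrouter-<router-id> \
        sysctl -w net.ipv4.conf.all.promote_secondaries=1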

Comment 4 Ihar Hrachyshka 2016-12-14 09:47:55 UTC
On second look, I see probably related errors in the DHCP agent log from long before:

2016-01-20 12:35:10.401 82301 TRACE neutron.agent.dhcp.agent Unserializable message: ('#ERROR', FilterMatchNotExecutable())

I wonder if rootwrap is somehow broken, or the filters are not deployed correctly. Now it seems like an issue that is not specific to HA routers.

Comment 5 Ihar Hrachyshka 2016-12-14 09:58:54 UTC
Speaking of SELinux, I see lots of duplicate messages in the journal log around the time of the error, like:

Dec 13 18:36:41 rhqe-bare-ctrl-1.localdomain kernel: SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts

Why does SELinux initialize something 10+ times per second?

Comment 6 John Schwarz 2016-12-14 12:00:24 UTC
I concur with Ihar: the error first appeared as a NoFilterMatched issue, as mentioned. Since the l3-agent handles this error by retrying the request, it will retry the entire processing request again and again (even though nothing has changed), resulting in [1] (which is the result of the initial request being only "partially" committed and the floating IP already being gone).

[1]: http://pastebin.test.redhat.com/439169

Comment 7 Graeme Gillies 2016-12-15 02:52:57 UTC
(In reply to Ihar Hrachyshka from comment #3)
> In the logs, we see that the router failed to initialize because of
> NoFilterMatched error from rootwrap on executing the following:
> 
>         wrapper.netns.execute(['sysctl', '-w',
>                                'net.ipv4.conf.all.promote_secondaries=1'])
> 
> Sadly, rootwrap does not distinguish between command failure and filter
> matching failures. l3.filters file contains the needed filter, so I assume
> that it's the command that failed.
> 
> audit.log does not contain any selinux denials.
> 
> Since you mentioned ipv6, I wonder if the sysctl knob is available in
> namespaces that belong to HA routers that are ipv6 only. I guess it's
> something to validate on a test setup.

So I used the following commands to check whether this command completes successfully on the controllers:

ansible -m shell -a 'ip netns add ggillies' '*ctrl*'
ansible -m shell -a 'cmd="ip netns exec ggillies sysctl -w net.ipv4.conf.all.promote_secondaries=1"' '*ctrl*'

They all returned:

10.9.38.32 | SUCCESS | rc=0 >>
net.ipv4.conf.all.promote_secondaries = 1

So it seems, at least in production's current state, that this command works without issue (it might have been broken before).

Comment 8 anil venkata 2016-12-21 15:20:46 UTC
@Graeme Gillies, @Caetano Medeiros

Can you please add the files under "etc/neutron/rootwrap.d" to the sosreports?
I am not seeing them in http://collab-shell.usersys.redhat.com/01757392/

Thanks
Anil

Comment 18 anil venkata 2017-01-13 04:21:30 UTC
The rootwrap filter files are fine, so I think the issue is not related to rootwrap filters. I see many broken RabbitMQ connection issues in the neutron server, OVS agent and L3 agent logs around the time we see the neutron errors. Maybe we see the neutron errors because of broken RabbitMQ and SQL connections. Can the customer make sure that these connections are stable before creating neutron or nova resources?
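
A rough way to spot-check the messaging side from a controller; these are generic example commands, and the log file names and paths are assumptions that may differ per deployment:

    # Look for AMQP reconnect noise around the failure window (run as root):
    grep -i "AMQP server" /var/log/neutron/server.log /var/log/neutron/l3-agent.log | tail -n 20

    # Confirm the RabbitMQ cluster is healthy and agents are reporting in:
    rabbitmqctl cluster_status
    neutron agent-list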

Comment 19 Assaf Muller 2017-01-24 15:13:01 UTC
Note comment 18.

Comment 22 Alexander Stafeyev 2017-07-05 07:41:27 UTC
Hi Purandhar, 
Could you please confirm that the fix was tested?
If it was, I will perform a code-existence check and verify the bug.

tnx

Comment 26 errata-xmlrpc 2017-07-12 13:15:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1747

Comment 27 Toni Freger 2018-02-05 11:23:48 UTC
A unit test will validate that this behavior doesn't occur - https://code.engineering.redhat.com/gerrit/#/c/102096/