Description of problem:

The L3 agent is failing with "Failed to process compatible router" errors:

2016-12-13 21:54:32.142 119233 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'get_process'
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info Traceback (most recent call last):
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 345, in call
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     return func(*args, **kwargs)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 629, in process
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self._process_internal_ports()
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 382, in _process_internal_ports
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self.internal_network_added(p)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     self._disable_ipv6_addressing_on_interface(interface_name)
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info     if manager.get_process().active:
2016-12-13 21:54:32.142 119233 TRACE neutron.agent.l3.router_info AttributeError: 'NoneType' object has no attribute 'get_process'

Version-Release number of selected component (if applicable):
RH OSP 7

How reproducible:
always

Steps to Reproduce:
1. Restart the l3-agents from pacemaker.

Actual results:
L3 agents are not able to start and there is packet loss when pinging the Floating IP. New Floating IPs cannot be created either.

Expected results:
The L3 agent starts normally and there is no packet loss when reaching the Floating IP.
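To illustrate the crash, here is a minimal hedged sketch (simplified stand-in classes, not neutron's actual implementation): the traceback suggests the HA router's keepalived manager can still be None when `_should_delete_ipv6_lladdr` runs, so calling `get_process()` on it raises the AttributeError above.

```python
# Minimal sketch of the failing code path (hypothetical simplified
# classes; the real logic lives in neutron/agent/l3/ha_router.py).

class KeepalivedManager:
    """Stands in for neutron's keepalived process manager."""
    def get_process(self):
        return object()

class HaRouter:
    def __init__(self):
        # In the failing scenario the manager has not been set up yet,
        # so it is still None when router processing starts.
        self.keepalived_manager = None

    def _should_delete_ipv6_lladdr(self):
        manager = self.keepalived_manager
        # Guarding against None avoids the AttributeError seen in the
        # traceback ('NoneType' object has no attribute 'get_process').
        if manager is None:
            return False
        return manager.get_process() is not None

router = HaRouter()
print(router._should_delete_ipv6_lladdr())  # → False, no crash
```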
The neutron network/router setup for IPv6 that was causing the issue has been identified and removed. Once it was removed and the services restarted, the instability and packet loss ceased. However, the issue could return at any time if a tenant creates a network using IPv6.
In the logs, we see that the router failed to initialize because of a NoFilterMatched error from rootwrap when executing the following:

wrapper.netns.execute(['sysctl', '-w', 'net.ipv4.conf.all.promote_secondaries=1'])

Sadly, rootwrap does not distinguish between command failures and filter-matching failures. The l3.filters file contains the needed filter, so I assume it is the command itself that failed.

audit.log does not contain any SELinux denials.

Since you mentioned IPv6, I wonder whether the sysctl knob is available in namespaces that belong to IPv6-only HA routers. That is something to validate on a test setup.
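For context, here is a toy model of how a rootwrap CommandFilter entry matches a command (hypothetical code, far simpler than oslo.rootwrap's real matching), showing that a correct "sysctl" line in l3.filters should match the call above by executable name:

```python
# Toy model of a rootwrap CommandFilter match (hypothetical,
# much simplified vs. oslo.rootwrap's real implementation).

def command_matches(filter_line, command):
    # A filter line looks like: "sysctl: CommandFilter, sysctl, root"
    _name, rest = filter_line.split(":", 1)
    parts = [p.strip() for p in rest.split(",")]
    filter_type, executable = parts[0], parts[1]
    # CommandFilter only compares the executable name.
    return filter_type == "CommandFilter" and command[0] == executable

filters = ["sysctl: CommandFilter, sysctl, root"]
cmd = ["sysctl", "-w", "net.ipv4.conf.all.promote_secondaries=1"]
print(any(command_matches(f, cmd) for f in filters))  # → True
```

If no configured filter matches, rootwrap reports NoFilterMatched, which is indistinguishable (from the agent's side) from the command itself failing.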
On second look, I see probably related errors in the DHCP agent log long before:

2016-01-20 12:35:10.401 82301 TRACE neutron.agent.dhcp.agent Unserializable message: ('#ERROR', FilterMatchNotExecutable())

I wonder if rootwrap is somehow broken, or the filters are not deployed correctly. It now looks like an issue that is not specific to HA routers.
Speaking of SELinux, I see lots of duplicate messages in the journal log around the time of the error, like:

Dec 13 18:36:41 rhqe-bare-ctrl-1.localdomain kernel: SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts

Why does SELinux initialize something 10+ times per second?
I concur with Ihar: the error first appeared as a NoFiltersMatch issue, as mentioned. Since the l3-agent handles this error by re-queuing the request, the entire processing request is retried again and again (even though nothing has changed), resulting in [1]. This is the result of the initial request being "partially" committed, with the floating IP already gone.

[1]: http://pastebin.test.redhat.com/439169
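The retry behavior described above can be sketched as follows (a hedged simplified model, not neutron's actual update queue): because the failure is deterministic and the input never changes, every retry hits the same error.

```python
# Simplified model of the l3-agent re-queuing a failed router update:
# the identical request is reprocessed and fails the same way each time.

def process_router(update):
    # Stands in for the processing step that always fails for this router.
    raise RuntimeError("NoFilterMatched")

def agent_loop(update, max_attempts=3):
    failures = 0
    for _ in range(max_attempts):
        try:
            process_router(update)
            return failures
        except RuntimeError:
            failures += 1  # update is re-queued unchanged and retried
    return failures

print(agent_loop({"router_id": "r1"}))  # → 3
```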
(In reply to Ihar Hrachyshka from comment #3)
> In the logs, we see that the router failed to initialize because of
> NoFilterMatched error from rootwrap on executing the following:
>
> wrapper.netns.execute(['sysctl', '-w',
> 'net.ipv4.conf.all.promote_secondaries=1'])
>
> Sadly, rootwrap does not distinguish between command failure and filter
> matching failures. l3.filters file contains the needed filter, so I assume
> that it's the command that failed.
>
> audit.log does not contain any selinux denials.
>
> Since you mentioned ipv6, I wonder if the sysctl knob is available in
> namespaces that belong to HA routers that are ipv6 only. I guess it's
> something to validate on a test setup.

So I used the following commands to check whether this command completes successfully on the controllers:

ansible -m shell -a 'ip netns add ggillies' '*ctrl*'
ansible -m shell -a 'cmd="ip netns exec ggillies sysctl -w net.ipv4.conf.all.promote_secondaries=1"' '*ctrl*'

They all returned:

10.9.38.32 | SUCCESS | rc=0 >>
net.ipv4.conf.all.promote_secondaries = 1

So it seems, at least in production's current state, that this command works without issue (it might have been broken before).
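A lighter-weight way to check whether the sysctl knob is exposed at all is to look for its /proc/sys path directly (a hedged sketch: it uses the standard /proc/sys layout and only checks the current namespace; checking inside the router's namespace would still require `ip netns exec`):

```python
import os

def sysctl_knob_exists(knob):
    # net.ipv4.conf.all.promote_secondaries maps to
    # /proc/sys/net/ipv4/conf/all/promote_secondaries
    return os.path.exists(os.path.join("/proc/sys", *knob.split(".")))

print(sysctl_knob_exists("net.ipv4.conf.all.promote_secondaries"))
```

An IPv6-only namespace that does not expose the IPv4 tree would make this return False, which would explain the failing sysctl call.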
@Graeme Gillies, @Caetano Medeiros: can you please add the files under "etc/neutron/rootwrap.d" to the sosreports? I am not seeing them in http://collab-shell.usersys.redhat.com/01757392/

Thanks,
Anil
The rootwrap filter files are fine; I think the issue is not related to rootwrap filters. I see many broken RabbitMQ connections in the neutron server, OVS agent, and L3 agent logs at the times we see the neutron errors. Maybe we see the neutron errors because of broken RabbitMQ and SQL connections. Can the customer make sure that these connections are stable before creating neutron or nova resources?
Note comment 18.
Hi Purandhar, could you please confirm that the fix was tested? If it was OK, I will perform a code-existence check and verify the bug. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1747
A unit test will validate that this behavior does not recur: https://code.engineering.redhat.com/gerrit/#/c/102096/
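The actual test lives in the linked Gerrit review; as a hedged approximation, a minimal unittest of the None-manager guard might look like this (the function and names here are illustrative, not neutron's real code):

```python
import unittest
from unittest import mock

def should_delete_ipv6_lladdr(router):
    # Illustrative reimplementation of the guarded check.
    manager = router.keepalived_manager
    if manager is None:
        return False
    return manager.get_process().active

class TestNoneManagerGuard(unittest.TestCase):
    def test_none_manager_returns_false(self):
        router = mock.Mock()
        router.keepalived_manager = None
        # Must not raise AttributeError as in the original bug.
        self.assertFalse(should_delete_ipv6_lladdr(router))

    def test_active_process_returns_true(self):
        router = mock.Mock()
        router.keepalived_manager.get_process.return_value.active = True
        self.assertTrue(should_delete_ipv6_lladdr(router))

if __name__ == "__main__":
    unittest.main()
```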