Bug 1496916

Summary: Error while deleting router - infinite log loop
Product: Red Hat OpenStack Reporter: Andreas Karis <akaris>
Component: openstack-neutronAssignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA QA Contact: Toni Freger <tfreger>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo)CC: amuller, bhaley, chrisw, jlibosva, nyechiel, sclewis, srevivo, stphilli, tfreger
Target Milestone: zstreamKeywords: Triaged, ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-2015.1.4-24.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-05 10:47:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
l3-agent.log none

Description Andreas Karis 2017-09-28 18:35:24 UTC
Description of problem:

Running into:

~~~
l3-agent.log:2017-09-26 15:01:36.318 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 12f8b6b7-ce4d-492c-93b6-06c6f2e3a894
l3-agent.log:2017-09-26 15:01:36.323 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 0c780871-ad14-40f8-8e09-7ee786abf532
~~~

~~~
l3-agent.log:2017-09-26 15:01:36.323 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 0c780871-ad14-40f8-8e09-7ee786abf532
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent Traceback (most recent call last):
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 355, in _safe_router_removed
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     self._router_removed(router_id)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 372, in _router_removed
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     ri.delete(self)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 355, in delete
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     self.destroy_state_change_monitor(self.process_monitor)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 320, in destroy_state_change_monitor
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     pm = self._get_state_change_monitor_process_manager()
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 291, in _get_state_change_monitor_process_manager
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     default_cmd_callback=self._get_state_change_monitor_callback())
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 294, in _get_state_change_monitor_callback
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     ha_device = self.get_ha_device_name()
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 138, in get_ha_device_name
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     return (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent TypeError: 'NoneType' object has no attribute '__getitem__'
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent 
l3-agent.log-2017-09-26 15:01:36.324 7399 INFO neutron.agent.l3.agent [-] Finished a router update for 0c780871-ad14-40f8-8e09-7ee786abf532
~~~

These error messages are very annoying, and very likely are related to the absence of:

https://review.openstack.org/#/c/354406/ 
https://bugs.launchpad.net/neutron/+bug/1606844

Can we backport the above? Or find a workaround (that does not include restarting services?

Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.4-16.el7ost.noarch                 Mon Jul 31 20:49:33 2017
openstack-neutron-common-2015.1.4-16.el7ost.noarch          Mon Jul 31 20:49:32 2017
openstack-neutron-fwaas-2015.1.2-1.el7ost.noarch            Thu Feb 23 15:54:20 2017
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch            Wed Jul 26 17:37:42 2017
openstack-neutron-ml2-2015.1.4-16.el7ost.noarch             Mon Jul 31 20:49:34 2017
openstack-neutron-openvswitch-2015.1.4-16.el7ost.noarch     Mon Jul 31 20:49:34 2017
python-neutron-2015.1.4-16.el7ost.noarch                    Mon Jul 31 20:49:31 2017
python-neutron-fwaas-2015.1.2-1.el7ost.noarch               Thu Feb 23 15:54:19 2017
python-neutron-lbaas-2015.1.4-1.el7ost.noarch               Wed Jul 26 17:35:47 2017
python-neutronclient-2.4.0-2.el7ost.noarch                  Thu Feb 23 15:52:56 2017

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
A reboot of servers fixes the issue
Probably, a reboot of the L3 agent does fix it as well (not yet confirmed)

Thanks!

Comment 1 Andreas Karis 2017-09-28 18:40:21 UTC
About the L3 agent router deletion issue:
~~~
l3-agent.log:2017-09-26 15:01:36.318 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 12f8b6b7-ce4d-492c-93b6-06c6f2e3a894
l3-agent.log:2017-09-26 15:01:36.323 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 0c780871-ad14-40f8-8e09-7ee786abf532
~~~

~~~
l3-agent.log:2017-09-26 15:01:36.323 7399 ERROR neutron.agent.l3.agent [-] Error while deleting router 0c780871-ad14-40f8-8e09-7ee786abf532
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent Traceback (most recent call last):
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 355, in _safe_router_removed
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     self._router_removed(router_id)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 372, in _router_removed
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     ri.delete(self)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 355, in delete
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     self.destroy_state_change_monitor(self.process_monitor)
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 320, in destroy_state_change_monitor
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     pm = self._get_state_change_monitor_process_manager()
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 291, in _get_state_change_monitor_process_manager
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     default_cmd_callback=self._get_state_change_monitor_callback())
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 294, in _get_state_change_monitor_callback
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     ha_device = self.get_ha_device_name()
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 138, in get_ha_device_name
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent     return (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent TypeError: 'NoneType' object has no attribute '__getitem__'
l3-agent.log-2017-09-26 15:01:36.323 7399 TRACE neutron.agent.l3.agent 
l3-agent.log-2017-09-26 15:01:36.324 7399 INFO neutron.agent.l3.agent [-] Finished a router update for 0c780871-ad14-40f8-8e09-7ee786abf532
~~~

This is failing here:
~~~
 30 HA_DEV_PREFIX = 'ha-'
(...)
135 
136     def get_ha_device_name(self):
137         return (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
138 
~~~

So this is getting a substring from ha-<port ID>, and give the error message. self.ha_port['id'] returns None, and the substring operation fails:
~~~
>>> a = None
>>> a[:2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'
~~~

The installed RPMs are:
~~~
 cat installed-rpms | grep neutron
openstack-neutron-2015.1.4-16.el7ost.noarch                 Mon Jul 31 20:49:33 2017
openstack-neutron-common-2015.1.4-16.el7ost.noarch          Mon Jul 31 20:49:32 2017
openstack-neutron-fwaas-2015.1.2-1.el7ost.noarch            Thu Feb 23 15:54:20 2017
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch            Wed Jul 26 17:37:42 2017
openstack-neutron-ml2-2015.1.4-16.el7ost.noarch             Mon Jul 31 20:49:34 2017
openstack-neutron-openvswitch-2015.1.4-16.el7ost.noarch     Mon Jul 31 20:49:34 2017
python-neutron-2015.1.4-16.el7ost.noarch                    Mon Jul 31 20:49:31 2017
python-neutron-fwaas-2015.1.2-1.el7ost.noarch               Thu Feb 23 15:54:19 2017
python-neutron-lbaas-2015.1.4-1.el7ost.noarch               Wed Jul 26 17:35:47 2017
python-neutronclient-2.4.0-2.el7ost.noarch                  Thu Feb 23 15:52:56 2017
~~~

Compared to a lab:
~~~
---> Package openstack-neutron.noarch 0:2015.1.4-14.el7ost will be updated
---> Package openstack-neutron.noarch 0:2015.1.4-21.el7ost will be an update
---> Package openstack-neutron-common.noarch 0:2015.1.4-14.el7ost will be updated
---> Package openstack-neutron-common.noarch 0:2015.1.4-21.el7ost will be an update
---> Package openstack-neutron-ml2.noarch 0:2015.1.4-14.el7ost will be updated
---> Package openstack-neutron-ml2.noarch 0:2015.1.4-21.el7ost will be an update
---> Package openstack-neutron-openvswitch.noarch 0:2015.1.4-14.el7ost will be updated
---> Package openstack-neutron-openvswitch.noarch 0:2015.1.4-21.el7ost will be an update
~~~


There are several bugzillas and upstream bugs about this:
https://bugzilla.redhat.com/show_bug.cgi?id=1285882
https://bugs.launchpad.net/neutron/+bug/1606844
https://bugs.launchpad.net/neutron/+bug/1605546
https://bugs.launchpad.net/neutron/+bug/1533441


Additionally to the above, there is this warning message (right after the stack trace):
~~~
2017-09-26 15:01:36.325 7399 WARNING neutron.agent.linux.iptables_manager [-] Tried to remove rule that was not there: 'INPUT' '-m mar
k --mark 0x1 -j ACCEPT' True False
2017-09-26 15:01:36.325 7399 WARNING neutron.agent.linux.iptables_manager [-] Tried to remove rule that was not there: 'INPUT' '-p tcp
 -m tcp --dport 9697 -j DROP' True False
2017-09-26 15:01:36.325 7399 WARNING neutron.agent.linux.iptables_manager [-] Tried to remove rule that was not there: 'PREROUTING' '-
d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0xffffffff' True False
2017-09-26 15:01:36.326 7399 WARNING neutron.agent.linux.iptables_manager [-] Tried to remove rule that was not there: 'PREROUTING' '-
d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-port 9697' True False
~~~

There is https://bugs.launchpad.net/neutron/+bug/1606844
Which was fixed with https://review.openstack.org/#/c/354406/ on Aug 11, 2016

I picked one example file which was changed to see if this was already in OSP 7:
https://review.openstack.org/#/c/354406/1/neutron/agent/l3/namespaces.py

But it doesn't seem so (logical, OSP 7 is kilo, and upstream only fixes the bug until OSP 8):
~~~
[root@undercloud-7 ~]# rpm -qf /usr/lib/python2.7/site-packages/neutron/agent/l3/namespaces.py
python-neutron-2015.1.4-21.el7ost.noarch
[root@undercloud-7 ~]# egrep -v '^#' /usr/lib/python2.7/site-packages/neutron/agent/l3/namespaces.py | head -n 10

from oslo_log import log as logging

from neutron.agent.linux import ip_lib
from neutron.i18n import _LE

LOG = logging.getLogger(__name__)

NS_PREFIX = 'qrouter-'
INTERNAL_DEV_PREFIX = 'qr-'
[root@undercloud-7 ~]# 
~~~

~~~
[root@undercloud-7 ~]# grep check_ns_exi /usr/lib/python2.7/site-packages/neutron/agent/l3/namespaces.py
[root@undercloud-7 ~]# 
~~~

Thanks,

Andreas

Comment 4 Ihar Hrachyshka 2017-10-23 18:18:39 UTC
Easiest steps to reproduce would be:

- create a router
- wait until l3 agent creates its namespace
- remove the namespace using 'ip netns' tool
- remove the router via neutron api
- observe that l3 agent gets into the error loop trying to remove the namespace for the router that no longer exists

Comment 7 Toni Freger 2017-11-26 10:59:00 UTC
Created attachment 1359133 [details]
l3-agent.log

Comment 8 Brian Haley 2017-11-27 20:29:47 UTC
Hi Toni,

The traceback here seems similar to that from my comment in https://bugzilla.redhat.com/show_bug.cgi?id=1510162

My comment from there:

"The backtrace looks like another bug in this code path that would be present in all releases.  self.process_monitor is only initialized in a super() call from the HA router initialize code.  In this case initialize() failed early and super() was never called.  I need to open an upstream bug and propose a change there.  This would have been triggered even without the new code from what I can tell and was just a race condition waiting to happen."

For that reason I feel this is a new problem.

Comment 10 Brian Haley 2017-11-28 16:38:44 UTC
The failure is related to the changes, but is actually a new bug - i.e. fixing one bug uncovered another.  I am fine with this small change and the one for https://bugzilla.redhat.com/show_bug.cgi?id=1510162 merging which are related since they make the original failure more recoverable and do not fill the log files unnecessarily.

Hopefully Toni will agree.

Comment 12 Toni Freger 2017-12-04 16:23:34 UTC
Since functionality wasn't damaged it reasonable to work on new bugs and to move this one to verified.

Comment 15 errata-xmlrpc 2017-12-05 10:47:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3381