Bug 1393097

Summary: Routers stop adding floating IPs to their namespaces; floating IPs are then no longer reachable
Product:            Red Hat OpenStack
Reporter:           Jeremy <jmelvin>
Component:          openstack-neutron
Assignee:           anil venkata <vkommadi>
Status:             CLOSED ERRATA
QA Contact:         GenadiC <gcheresh>
Severity:           high
Priority:           high
Version:            8.0 (Liberty)
CC:                 amuller, chrisw, dalvarez, ebarrera, jmelvin, jschwarz, majopela, nyechiel, oblaut, rcernin, srevivo, vkommadi
Target Milestone:   zstream
Keywords:           Triaged, Unconfirmed, ZStream
Target Release:     8.0 (Liberty)
Hardware:           Unspecified
OS:                 Unspecified
Fixed In Version:   openstack-neutron-7.2.0-12.el7ost
Type:               Bug
Last Closed:        2017-07-12 13:17:34 UTC

Description Jeremy 2016-11-08 20:52:15 UTC
Description of problem: After some time, with an unknown trigger, router namespaces stop removing old floating IPs, and newly attached floating IPs are not added to the namespace. As a result, the floating IPs for VMs are not pingable.


Version-Release number of selected component (if applicable):
openstack-neutron-7.0.4-3.el7ost.noarch

How reproducible:
unknown

Steps to Reproduce:
1. unknown

Actual results:
floating IPs are not pingable

Expected results:
floating IPs are pingable

Additional info:

[stack@osstage-undercloud ~]$ nova list --all | grep cirros5
| 35381639-fd44-4767-a2d2-13426ea1dcac | cirros5                                                 | af273396f30b45cdae05873f90867ceb | ACTIVE | -          | Running     | sr-kanata_net=192.168.110.33, 135.121.90.212                                                                                                                                                          |

| 114a72fe-02ae-4b87-a2e7-70f962df0951 | sr-kanata_router            | {"network_id": "ce4b20aa-5c61-49d0-80ab-03697c818822", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "1ff39be2-ab26-4a17-8598-26f632dc6caf", "ip_address": "135.121.90.221"}]} | False       | True |


### The namespace isn't clearing out old floating IPs, and new IPs are not added.

[heat-admin@overcloud-controller-1 ~]$ sudo ip netns exec qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
55: ha-086c409c-62: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:fd:19:b4 brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.2/18 brd 169.254.255.255 scope global ha-086c409c-62
       valid_lft forever preferred_lft forever
    inet 169.254.0.1/24 scope global ha-086c409c-62
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fefd:19b4/64 scope link 
       valid_lft forever preferred_lft forever
56: qg-63602c60-a2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:d7:53:b0 brd ff:ff:ff:ff:ff:ff
    inet 135.121.90.221/23 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.241/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.242/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.207/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.187/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.190/32 brd 135.121.90.190 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed7:53b0/64 scope link nodad 
       valid_lft forever preferred_lft forever
58: qr-e63406e1-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:72:4e:a5 brd ff:ff:ff:ff:ff:ff
    inet 192.168.110.1/24 scope global qr-e63406e1-e7
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe72:4ea5/64 scope link nodad 
       valid_lft forever preferred_lft forever
[heat-admin@overcloud-controller-1 ~]$ 

### Service restart clears the issue
### Namespace after l3-agent and neutron-server restart


[heat-admin@overcloud-controller-1 ~]$ sudo ip netns exec qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
55: ha-086c409c-62: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:fd:19:b4 brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.2/18 brd 169.254.255.255 scope global ha-086c409c-62
       valid_lft forever preferred_lft forever
    inet 169.254.0.1/24 scope global ha-086c409c-62
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fefd:19b4/64 scope link 
       valid_lft forever preferred_lft forever
56: qg-63602c60-a2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:d7:53:b0 brd ff:ff:ff:ff:ff:ff
    inet 135.121.90.221/23 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.241/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.242/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.207/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.187/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.190/32 brd 135.121.90.190 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet 135.121.90.213/32 scope global qg-63602c60-a2
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed7:53b0/64 scope link nodad 
       valid_lft forever preferred_lft forever
58: qr-e63406e1-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:72:4e:a5 brd ff:ff:ff:ff:ff:ff
    inet 192.168.110.1/24 scope global qr-e63406e1-e7
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe72:4ea5/64 scope link nodad 
       valid_lft forever preferred_lft forever
[heat-admin@overcloud-controller-1 ~]$


/var/log/neutron/l3-agent.log"

2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '114a72fe-02ae-4b87-a2e7-70f962df0951'
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 504, in _process_router_update
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 443, in _process_router_if_compatible
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self._process_updated_router(router)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 457, in _process_updated_router
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     ri.process(self)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 376, in process
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     super(HaRouter, self).process(agent)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 359, in call
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self.logger(e)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     six.reraise(self.type_, self.value, self.tb)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 356, in call
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     return func(*args, **kwargs)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 702, in process
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self._process_internal_ports(agent.pd)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 390, in _process_internal_ports
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self.internal_network_added(p)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 274, in internal_network_added
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     self._disable_ipv6_addressing_on_interface(interface_name)
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 234, in _disable_ipv6_addressing_on_interface
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 216, in _should_delete_ipv6_lladdr
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent     if manager.get_process().active:
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_process'
2016-11-08 16:20:19.530 13191 ERROR neutron.agent.l3.agent
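
The final frame shows `manager` (the router's keepalived process manager) being None when `_should_delete_ipv6_lladdr` queries it, which happens when keepalived was never spawned for this router. A minimal runnable sketch of the kind of None-guard that avoids the crash loop (a simplified stand-in for neutron's HaRouter, not the actual upstream patch):

# Illustrative sketch only: a simplified stand-in for neutron's HaRouter,
# showing the kind of None-guard that prevents this AttributeError. Not
# the actual upstream patch.

class HaRouterSketch(object):
    def __init__(self):
        # Stays None until keepalived is spawned, which only happens after
        # the agent receives a complete router payload with an HA port.
        self.keepalived_manager = None

    def _should_delete_ipv6_lladdr(self, ipv6_lladdr):
        manager = self.keepalived_manager
        if manager is None:
            # No keepalived process to query (partial sync, agent restart):
            # leave the address alone instead of crashing with AttributeError.
            return False
        return not manager.get_process().active

router = HaRouterSketch()
# Before the guard, this call raised:
#   AttributeError: 'NoneType' object has no attribute 'get_process'
print(router._should_delete_ipv6_lladdr('fe80::f816:3eff:fefd:19b4/64'))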

Comment 2 Jeremy 2016-11-09 23:16:49 UTC
Hello,

Enabled debug logs, reproduced the issue, and now see MySQL deadlocks:


2016-11-09 19:15:43.968 17574 ERROR neutron.agent.l3.router_info [-] Remote error: DBDeadlock (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'UPDATE floatingips SET status=%s WHERE floatingips.id = %s'] [parameters: (u'DOWN', u'f73c6ee8-1c43-483b-9b4c-ea9087e87db9')]

2016-11-09 19:15:43.880 17574 DEBUG neutron.agent.l3.agent [-] Sending floating ip statuses: {u'57538b3a-48a5-497e-94e6-fc36ee31720d': 'DOWN', u'f2d3d46c-faee-460c-b858-eaf23253324e': 'DOWN', u'74608000-7e41-4a3f-8b06-91cdb54a6e37': 'DOWN', u'f73c6ee8-1c43-483b-9b4c-ea9087e87db9': 'DOWN'} update_fip_statuses /usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py:377
2016-11-09 19:15:43.880 17574 DEBUG oslo_messaging._drivers.amqpdriver [-] MSG_ID is cbb6228848dc43d79e54c377753e5ef3 _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:392
2016-11-09 19:15:43.968 17574 ERROR neutron.agent.l3.router_info [-] Remote error: DBDeadlock (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'UPDATE floatingips SET status=%s WHERE floatingips.id = %s'] [parameters: (u'DOWN', u'f73c6ee8-1c43-483b-9b4c-ea9087e87db9')]
[u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n    executor_callback))\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n    executor_callback)\n', u'  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch\n    result = func(ctxt, **new_args)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 219, in update_floatingip_statuses\n    status)\n', u'  File "/usr/lib/python2.7/site-packages/neutron/db/l3_db.py", line 1045, in update_floatingip_status\n    fip_query.update({\'status\': status}, synchronize_session=False)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2950, in update\n    update_op.exec_()\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1106, in exec_\n    self._do_exec()\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1255, in _do_exec\n    mapper=self.mapper)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1023, in execute\n    bind, close_with_result=True).execute(clause, params or {})\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute\n    return meth(self, multiparams, params)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection\n    return connection._execute_clauseelement(self, multiparams, params)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement\n    compiled_sql, distilled_params\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context\n    context)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception\n    util.raise_from_cause(newraise, exc_info)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause\n    reraise(type(exception), exception, tb=exc_tb)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context\n    context)\n', u'  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute\n    cursor.execute(statement, parameters)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute\n    result = self._query(query)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query\n    conn.query(q)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query\n    self._affected_rows = self._read_query_result(unbuffered=unbuffered)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result\n    result.read()\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read\n    first_packet = self.connection._read_packet()\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet\n    packet.check_error()\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error\n    err.raise_mysql_exception(self._data)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception\n    
_check_mysql_exception(errinfo)\n', u'  File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception\n    raise InternalError(errno, errorvalue)\n', u"DBDeadlock: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'UPDATE floatingips SET status=%s WHERE floatingips.id = %s'] [parameters: (u'DOWN', u'f73c6ee8-1c43-483b-9b4c-ea9087e87db9')]\n"].
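
The DBDeadlock here is a transient MySQL/Galera error (1213) that the server side is expected to absorb by retrying the transaction rather than returning it to the agent over RPC. A minimal sketch of that retry pattern using oslo.db's decorator (the decorator and its parameters are real oslo.db API, but the failing UPDATE is simulated so the example is self-contained; this is not neutron's actual update_floatingip_status):

# Sketch of the retry-on-deadlock pattern. oslo.db re-runs the decorated
# function when the DB reports a deadlock instead of propagating
# DBDeadlock to the caller.
from oslo_db import api as oslo_db_api
from oslo_db import exception as db_exc

attempts = {'count': 0}

@oslo_db_api.wrap_db_retry(max_retries=10, retry_interval=0.1,
                           inc_retry_interval=True, retry_on_deadlock=True)
def update_floatingip_status(fip_id, status):
    attempts['count'] += 1
    if attempts['count'] == 1:
        # Simulate the deadlock seen in the log; the decorator retries.
        raise db_exc.DBDeadlock()
    return '%s set to %s' % (fip_id, status)

print(update_floatingip_status('f73c6ee8-1c43-483b-9b4c-ea9087e87db9', 'DOWN'))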


Comment 3 John Schwarz 2016-12-01 09:14:38 UTC
From a quick glance, this sounds like some kind of race condition in the neutron-server causing incomplete data to be sent to the agents. This puts the agents into endless retry loops in which they cannot make progress (for example, adding new FIPs).

I'll look into the sosreport and report back with more findings.

Comment 4 John Schwarz 2016-12-04 07:46:01 UTC
It would seem that, for example for router_id=1e52bb8d-39d6-4057-9f4e-8d95a5734562, a partial sync occurred in such a way that the l3-agent on overcloud-controller-1 received an HA router with no HA interface for that agent. In short, there was a race condition somewhere and the agent received partial data. In such cases the agent enters an endless loop trying to recover, but never succeeds, because the server never resends the complete data.
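
Purely as an illustration of the class of guard this implies: before processing an HA router, the agent could verify that the payload actually contains its HA interface, and requeue it for resync otherwise, instead of letting HaRouter crash-loop. A minimal sketch (the helper name is hypothetical; '_ha_interface' follows neutron's router-dict convention, but this is not the upstream patch):

# Purely illustrative guard against the partial payload described above.

def should_process_ha_router(router):
    """Reject an HA router whose payload lacks this agent's HA interface,
    so the caller can requeue it instead of crash-looping."""
    if router.get('ha') and not router.get('_ha_interface'):
        return False
    return True

# A partial payload like the one described in this comment:
partial = {'id': '1e52bb8d-39d6-4057-9f4e-8d95a5734562',
           'ha': True, '_ha_interface': None}
assert not should_process_ha_router(partial)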

There are a few patches that should deal with this error. I'm going to hunt them down and make sure they are in RHOS 8, and if they already are, find out in which version.

Comment 5 John Schwarz 2016-12-04 08:13:54 UTC
Hi,

These bugs have been fixed in openstack-neutron-7.1.1-4 (as part of bz#1360352).
Can we upgrade the customer to a newer version and see if this solves their issue?

John.

Comment 6 Assaf Muller 2016-12-09 21:53:53 UTC
Please see comment 5.

Comment 14 anil venkata 2016-12-22 10:47:09 UTC
@Jeremy
Can you please check the path of keepalived? Is it /usr/sbin/keepalived?

Comment 15 Jeremy 2017-01-09 15:02:36 UTC
yes
[root@overcloud-controller-0 ~]# ls -l /usr/sbin/keepalived
-rwxr-xr-x. 1 root root 317992 Jul  1  2016 /usr/sbin/keepalived
[root@overcloud-controller-1 ~]# ls -l /usr/sbin/keepalived
-rwxr-xr-x. 1 root root 317992 Jul  1  2016 /usr/sbin/keepalived
[root@overcloud-controller-2 ~]# ls -l /usr/sbin/keepalived
-rwxr-xr-x. 1 root root 317992 Jul  1  2016 /usr/sbin/keepalived

Comment 16 Jeremy 2017-02-07 18:49:57 UTC
Customer just saw a recurrence of this bug in the wild.  It looks like it was triggered by clustered services restarting rabbitmq:

Feb  6 18:33:41 overcloud-controller-1 lrmd[13038]: warning: rabbitmq_monitor_10000 process (PID 23576) timed out
Feb  6 18:33:41 overcloud-controller-1 lrmd[13038]: warning: rabbitmq_monitor_10000:23576 - timed out after 40000ms
Feb  6 18:33:41 overcloud-controller-1 crmd[13041]:   error: Operation rabbitmq_monitor_10000: Timed Out (node=overcloud-controller-1, call=2506, timeout=40000ms)
Feb  6 18:33:41 overcloud-controller-1 crmd[13041]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]: warning: Processing failed op monitor for rabbitmq:2 on overcloud-controller-1: unknown error (1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:1 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:   error: Internal error: No action found for demote in redis:2 (then)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Recover rabbitmq:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-nova-scheduler:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart neutron-l3-agent:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-alarm-notifier:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-heat-engine:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-api:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart neutron-metadata-agent:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-heat-api:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-cinder-scheduler:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-nova-api:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-heat-api-cloudwatch:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-collector:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-keystone:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-nova-consoleauth:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-glance-registry:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-notification:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-cinder-api:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart neutron-dhcp-agent:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-glance-api:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart neutron-openvswitch-agent:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-nova-novncproxy:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart delay:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart neutron-server:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-central:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-ceilometer-alarm-evaluator:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-heat-api-cfn:2#011(Started overcloud-controller-1)
Feb  6 18:33:42 overcloud-controller-1 pengine[13040]:  notice: Restart openstack-nova-conductor:2#011(Started overcloud-controller-1)

The current error trace for the l3-agent is

2017-02-06 18:34:18.539 26120 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951', 'ip', '-o', 'link', 'show', 'qr-e63406e1-e7'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:101
2017-02-06 18:34:18.544 26120 DEBUG neutron.agent.linux.utils [-]
Command: ['ip', 'netns', 'exec', u'qrouter-114a72fe-02ae-4b87-a2e7-70f962df0951', 'ip', '-o', 'link', 'show', u'qr-e63406e1-e7']
Exit code: 0
 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:156
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'get_process'
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 359, in call
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     return func(*args, **kwargs)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 744, in process
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     self._process_internal_ports(agent.pd)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 394, in _process_internal_ports
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     self.internal_network_added(p)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     self._disable_ipv6_addressing_on_interface(interface_name)
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info     if manager.get_process().active:
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info AttributeError: 'NoneType' object has no attribute 'get_process'
2017-02-06 18:34:18.544 26120 ERROR neutron.agent.l3.router_info
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '114a72fe-02ae-4b87-a2e7-70f962df0951'
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 506, in _process_router_update
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 445, in _process_router_if_compatible
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self._process_updated_router(router)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 459, in _process_updated_router
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     ri.process(self)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 377, in process
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     super(HaRouter, self).process(agent)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 362, in call
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self.logger(e)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     six.reraise(self.type_, self.value, self.tb)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 359, in call
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     return func(*args, **kwargs)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 744, in process
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self._process_internal_ports(agent.pd)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 394, in _process_internal_ports
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self.internal_network_added(p)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 275, in internal_network_added
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     self._disable_ipv6_addressing_on_interface(interface_name)
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 235, in _disable_ipv6_addressing_on_interface
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     if self._should_delete_ipv6_lladdr(ipv6_lladdr):
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 217, in _should_delete_ipv6_lladdr
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent     if manager.get_process().active:
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_process'
2017-02-06 18:34:18.549 26120 ERROR neutron.agent.l3.agent
2017-02-06 18:34:18.565 26120 DEBUG neutron.agent.l3.agent [-] Starting router update for 9a54a769-4a0e-487b-9eb1-3d9a1923f3fb, action None, priority 1 _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py:472

Comment 17 Jeremy 2017-02-07 20:20:35 UTC
Customer ran pcs resource restart neutron-server; this did not solve the problem.

I then ran pcs resource restart neutron-netns-cleanup (which also restarts neutron-ovs-cleanup); now the only neutron processes still running from 2016 are:

root      7125     1  0  2016 ?        00:00:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root      7126  7125  0  2016 ?        00:00:00 /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root     24727     1  0  2016 ?        00:00:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root     24729 24727  0  2016 ?        00:00:00 /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

This appears to have resolved our issues.

Comment 18 anil venkata 2017-02-08 07:21:52 UTC
I will report this bug upstream and provide the fix. Meanwhile, can you please provide the complete logs (OVS agent, L3 agent, and server)?

Comment 19 anil venkata 2017-02-08 15:53:22 UTC
Upstream change https://review.openstack.org/#/c/431026/ has been submitted.

Comment 23 GenadiC 2017-07-05 08:32:49 UTC
How should I verify this bz?

Comment 24 GenadiC 2017-07-05 08:42:00 UTC
Is it solved on the customer site?

Comment 25 Jeremy 2017-07-05 13:12:36 UTC
Customer is still waiting for a fix

Comment 27 anil venkata 2017-07-05 13:19:35 UTC
Build openstack-neutron-7.2.0-12.el7ost has the fix.

Comment 28 Jeremy 2017-07-05 15:20:05 UTC
OK, thanks. When that package is shipped I will notify the customer to apply the fix.

Comment 32 errata-xmlrpc 2017-07-12 13:17:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1742