Bug 1086077 - RPC error in neutron-server leads to cascading failure
Summary: RPC error in neutron-server leads to cascading failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 4.0
Assignee: Miguel Angel Ajo
QA Contact: Nir Magnezi
URL:
Whiteboard:
Duplicates: 1085997
Depends On:
Blocks: 1086011
 
Reported: 2014-04-10 04:33 UTC by Ian Wienand
Modified: 2022-07-09 06:16 UTC
CC List: 11 users

Fixed In Version: openstack-neutron-2013.2.3-6.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: Under heavy AMQP load, an RPC error occurs in neutron-server. Consequence: Virtual machines lose their network connection. Neutron-server stops responding to RPC calls, which causes a cascading failure in all of the compute nodes' dhcp-agents, openvswitch-agents, and so on. Fix: The Qpid reconnect methods have been fixed to catch the more generic qpid_exceptions.MessagingError instead of only qpid_exceptions.ConnectError. Result: Message consumers now properly reconnect to Qpid after certain reconnection failures.
Clone Of:
Environment:
Last Closed: 2014-05-29 20:19:43 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Launchpad 1303890 0 None None None Never
OpenStack gerrit 86502 0 None None None Never
Red Hat Knowledge Base (Solution) 1147053 0 None None None Never
Red Hat Product Errata RHSA-2014:0516 0 normal SHIPPED_LIVE Moderate: openstack-neutron security, bug fix, and enhancement update 2014-05-30 00:15:59 UTC

Internal Links: 1086004

Description Ian Wienand 2014-04-10 04:33:17 UTC
Description of problem:

We hit a problem in oslab that manifests as hosts not getting any network.  I eventually traced it down to neutron-server not responding to RPC calls; this causes a cascading failure in all of the compute nodes' dhcp-agents, openvswitch-agents, etc.

Going right back to the root cause, neutron-server first logged:

---
2014-04-09 19:02:10.901 7583 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
2014-04-09 19:02:10.903 7583 WARNING neutron.db.agentschedulers_db [-] Fail scheduling network {'status': u'ACTIVE', 'subnets': [u'fc0eea49-5f08-4029-b806-ba1b61f1e7ba'], 'name': u'network1-net', 'provider:physical_network': None, 'admin_state_up': True, 'tenant_id':
+u'210a99a1e68f43218f4cab705c908d45', 'provider:network_type': u'gre', 'router:external': False, 'shared': False, 'id': u'1c8f57e6-fa66-4bba-90e3-1a735cab79e2', 'provider:segmentation_id': 46L}
2014-04-09 19:02:10.908 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:15.165 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:19.356 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:21.220 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:25.363 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:26.684 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:28.296 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:31.062 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:32.163 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:33.223 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:39.974 7583 ERROR root [-] Unexpected exception occurred 61 time(s)... retrying.
2014-04-09 19:02:39.974 7583 TRACE root Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-09 19:02:39.974 7583 TRACE root     return infunc(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 686, in _consumer_thread
2014-04-09 19:02:39.974 7583 TRACE root     self.consume()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 677, in consume
2014-04-09 19:02:39.974 7583 TRACE root     it.next()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 594, in iterconsume
2014-04-09 19:02:39.974 7583 TRACE root     yield self.ensure(_error_callback, _consume)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 528, in ensure
2014-04-09 19:02:39.974 7583 TRACE root     return method(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 585, in _consume
2014-04-09 19:02:39.974 7583 TRACE root     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "<string>", line 6, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root     if self._ecwait(lambda: self.incoming, timeout):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-09 19:02:39.974 7583 TRACE root     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     self.check_error()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2014-04-09 19:02:39.974 7583 TRACE root     raise self.error
2014-04-09 19:02:39.974 7583 TRACE root InternalError: Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 651, in write
2014-04-09 19:02:39.974 7583 TRACE root     self._op_dec.write(*self._seg_dec.read())
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
2014-04-09 19:02:39.974 7583 TRACE root     if self.op.headers is None:
2014-04-09 19:02:39.974 7583 TRACE root AttributeError: 'NoneType' object has no attribute 'headers'
2014-04-09 19:02:39.974 7583 TRACE root
2014-04-09 19:02:39.974 7583 TRACE root
---
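
The important detail in the trace above is that the qpid session re-raises its stored error ("raise self.error" in check_error()) on every subsequent call, so the blanket retry wrapper in excutils keeps hitting the same dead session instead of reconnecting.  A toy illustration of that behaviour (hypothetical StickySession class, not the real qpid client):

---
# Toy illustration only: the session stores its error and re-raises it on
# every call, so retrying the consume without reconnecting can never recover.
class StickySession(object):
    def __init__(self):
        self.error = None

    def next_receiver(self, timeout=None):
        if self.error:
            # mirrors check_error() / "raise self.error" in endpoints.py
            raise self.error
        return None


session = StickySession()
session.error = AttributeError("'NoneType' object has no attribute 'headers'")

for attempt in range(1, 4):
    try:
        session.next_receiver()
    except AttributeError:
        print("Unexpected exception occurred %d time(s)... retrying." % attempt)
---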

neutron-server kept logging similar output until it was killed by an admin.  From this point on it appears to have stopped processing RPC calls, with predictably bad consequences:

---
[root@host02 log]# qpid-stat -q -S msg
Queues
  queue                                     dur  autoDel  excl  msg    msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==========================================================================================================================
  q-plugin                                                      59.6k  2.38m  2.32m   43.6m  1.73g    1.69g        0     2
  q-l3-plugin                                                     65   12.8k  12.7k   35.0k  7.05m    7.02m        0     2
---

It seems this has already been noticed upstream [1], which describes the problem.  As mentioned there, a review is out [2].

[1] https://bugs.launchpad.net/oslo/+bug/1303890
[2] https://review.openstack.org/85750
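
For reference, the gist of the change in [2] (per the doc text above: catch the more generic qpid_exceptions.MessagingError instead of only qpid_exceptions.ConnectError) can be sketched roughly as below.  This is an illustrative reconstruction, not the verbatim patch to impl_qpid.py; the reconnect/consume_once helpers are hypothetical names:

---
# Illustrative sketch only, not the verbatim upstream patch.  The point of
# the fix is that the consumer retry loop should reconnect on any
# qpid_exceptions.MessagingError, not just ConnectError, so errors like the
# InternalError above also trigger a reconnect instead of wedging the
# consumer.  reconnect/consume_once are hypothetical helpers.
from qpid.messaging import exceptions as qpid_exceptions


def consume_forever(reconnect, consume_once, log):
    session = reconnect()
    while True:
        try:
            consume_once(session)
        except qpid_exceptions.Empty:
            continue  # receive timed out; just poll again
        # before the fix: except qpid_exceptions.ConnectError as e:
        except qpid_exceptions.MessagingError as e:
            log.warning("Qpid error, reconnecting: %s", e)
            session = reconnect()
---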

Comment 4 Alan Pevec 2014-04-22 10:40:34 UTC
*** Bug 1085997 has been marked as a duplicate of this bug. ***

Comment 12 Nir Magnezi 2014-05-19 09:21:27 UTC
Verified NVR: openstack-neutron-2013.2.3-7.el6ost

Verified with neutron sanity tests.

Comment 15 errata-xmlrpc 2014-05-29 20:19:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html

