Bug 1086077

Summary: RPC error in neutron-server leads to cascading failure
Product: Red Hat OpenStack Reporter: Ian Wienand <iwienand>
Component: openstack-neutronAssignee: Miguel Angel Ajo <majopela>
Status: CLOSED ERRATA QA Contact: Nir Magnezi <nmagnezi>
Severity: high Docs Contact:
Priority: high    
Version: 4.0CC: apevec, breeler, chrisw, iwienand, jtrowbri, lhh, majopela, nyechiel, tdunnon, tiwillia, yeylon
Target Milestone: z4Keywords: ZStream
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-2013.2.3-6.el6ost Doc Type: Bug Fix
Doc Text:
Cause: Under heavy AMQP load, an RPC error occurs in neutron-server. Consequence: Virtual machines lose network connection. Neutron-server doesn't respond to any RPC call, which causes a cascading failure in all of the compute-node's dhcp-agents, openvswitch-agent, etc. Fix: The Qpid reconnect methods have been fixed to capture more generic qpid_exceptions.MessagingError instead of qpid_exceptions.ConnectError. Result: This causes a message consumer to properly reconnect to Qpid after certain reconnection failures.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-05-29 20:19:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1086011    

Description Ian Wienand 2014-04-10 04:33:17 UTC
Description of problem:

We hit a problem in oslab which manifests as hosts not getting any network.  Eventually I traced it down to neutron-server not responding to RPC calls; this causes a cascading failure in all of the compute-node's dhcp-agents, openvswitch-agents, etc.

Going right back to the root cause, neutron-server first logged 

---
2014-04-09 19:02:10.901 7583 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
2014-04-09 19:02:10.903 7583 WARNING neutron.db.agentschedulers_db [-] Fail scheduling network {'status': u'ACTIVE', 'subnets': [u'fc0eea49-5f08-4029-b806-ba1b61f1e7ba'], 'name': u'network1-net', 'provider:physical_network': None, 'admin_state_up': True, 'tenant_id':
+u'210a99a1e68f43218f4cab705c908d45', 'provider:network_type': u'gre', 'router:external': False, 'shared': False, 'id': u'1c8f57e6-fa66-4bba-90e3-1a735cab79e2', 'provider:segmentation_id': 46L}
2014-04-09 19:02:10.908 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:15.165 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:19.356 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:21.220 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:25.363 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:26.684 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:28.296 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:31.062 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:32.163 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:33.223 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:39.974 7583 ERROR root [-] Unexpected exception occurred 61 time(s)... retrying.
2014-04-09 19:02:39.974 7583 TRACE root Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-09 19:02:39.974 7583 TRACE root     return infunc(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 686, in _consumer_thread
2014-04-09 19:02:39.974 7583 TRACE root     self.consume()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 677, in consume
2014-04-09 19:02:39.974 7583 TRACE root     it.next()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 594, in iterconsume
2014-04-09 19:02:39.974 7583 TRACE root     yield self.ensure(_error_callback, _consume)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 528, in ensure
2014-04-09 19:02:39.974 7583 TRACE root     return method(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 585, in _consume
2014-04-09 19:02:39.974 7583 TRACE root     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "<string>", line 6, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root     if self._ecwait(lambda: self.incoming, timeout):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-09 19:02:39.974 7583 TRACE root     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     self.check_error()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2014-04-09 19:02:39.974 7583 TRACE root     raise self.error
2014-04-09 19:02:39.974 7583 TRACE root InternalError: Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 651, in write
2014-04-09 19:02:39.974 7583 TRACE root     self._op_dec.write(*self._seg_dec.read())
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
2014-04-09 19:02:39.974 7583 TRACE root     if self.op.headers is None:
2014-04-09 19:02:39.974 7583 TRACE root AttributeError: 'NoneType' object has no attribute 'headers'
2014-04-09 19:02:39.974 7583 TRACE root
2014-04-09 19:02:39.974 7583 TRACE root
---

it kept logging similar output until it was killed by an admin.  From this point on it appears to have stopped processing rpc calls, with predictably bad consequences 

---
[root@host02 log]# qpid-stat -q -S msg
Queues
  queue                                     dur  autoDel  excl  msg    msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==========================================================================================================================
  q-plugin                                                      59.6k  2.38m  2.32m   43.6m  1.73g    1.69g        0     2
  q-l3-plugin                                                     65   12.8k  12.7k   35.0k  7.05m    7.02m        0     2
---

Seems this has been noticed [1] and describes the problem.  As mentioned there is a review out

[1] https://bugs.launchpad.net/oslo/+bug/1303890
[2] https://review.openstack.org/85750

Comment 4 Alan Pevec 2014-04-22 10:40:34 UTC
*** Bug 1085997 has been marked as a duplicate of this bug. ***

Comment 12 Nir Magnezi 2014-05-19 09:21:27 UTC
Verified NVR: openstack-neutron-2013.2.3-7.el6ost

verified with neutron sanity tests.

Comment 15 errata-xmlrpc 2014-05-29 20:19:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html