Bug 1086077 - RPC error in neutron-server leads to cascading failure
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z4
Target Release: 4.0
Assigned To: Miguel Angel Ajo
QA Contact: Nir Magnezi
Keywords: ZStream
Duplicates: 1085997
Depends On:
Blocks: 1086011
Reported: 2014-04-10 00:33 EDT by Ian Wienand
Modified: 2016-04-26 12:16 EDT
CC: 13 users

See Also:
Fixed In Version: openstack-neutron-2013.2.3-6.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: Under heavy AMQP load, an RPC error occurs in neutron-server.
Consequence: Virtual machines lose network connectivity. neutron-server stops responding to RPC calls, which causes a cascading failure in all of the compute nodes' dhcp-agents, openvswitch-agents, and other agents.
Fix: The Qpid reconnect methods have been fixed to catch the more generic qpid_exceptions.MessagingError instead of qpid_exceptions.ConnectError.
Result: A message consumer properly reconnects to Qpid after such reconnection failures.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-29 16:19:43 EDT
Type: Bug
Regression: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1303890 None None None Never
Red Hat Knowledge Base (Solution) 1147053 None None None Never
OpenStack gerrit 86502 None None None Never
Red Hat Product Errata RHSA-2014:0516 normal SHIPPED_LIVE Moderate: openstack-neutron security, bug fix, and enhancement update 2014-05-29 20:15:59 EDT

Description Ian Wienand 2014-04-10 00:33:17 EDT
Description of problem:

We hit a problem in oslab that manifested as hosts not getting any network connectivity.  I eventually traced it down to neutron-server not responding to RPC calls; this causes a cascading failure in all of the compute nodes' dhcp-agents, openvswitch-agents, and other agents.

Going right back to the root cause, neutron-server first logged:

---
2014-04-09 19:02:10.901 7583 WARNING neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
2014-04-09 19:02:10.903 7583 WARNING neutron.db.agentschedulers_db [-] Fail scheduling network {'status': u'ACTIVE', 'subnets': [u'fc0eea49-5f08-4029-b806-ba1b61f1e7ba'], 'name': u'network1-net', 'provider:physical_network': None, 'admin_state_up': True, 'tenant_id': u'210a99a1e68f43218f4cab705c908d45', 'provider:network_type': u'gre', 'router:external': False, 'shared': False, 'id': u'1c8f57e6-fa66-4bba-90e3-1a735cab79e2', 'provider:segmentation_id': 46L}
2014-04-09 19:02:10.908 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:15.165 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:19.356 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:21.220 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:25.363 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:26.684 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:28.296 7583 WARNING neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api [-] Only 0 of 1 DHCP agents associated with network '1c8f57e6-fa66-4bba-90e3-1a735cab79e2' are marked as active, so notifications may be sent to inactive agents.
2014-04-09 19:02:31.062 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:32.163 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:33.223 7583 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
2014-04-09 19:02:39.974 7583 ERROR root [-] Unexpected exception occurred 61 time(s)... retrying.
2014-04-09 19:02:39.974 7583 TRACE root Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
2014-04-09 19:02:39.974 7583 TRACE root     return infunc(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 686, in _consumer_thread
2014-04-09 19:02:39.974 7583 TRACE root     self.consume()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 677, in consume
2014-04-09 19:02:39.974 7583 TRACE root     it.next()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 594, in iterconsume
2014-04-09 19:02:39.974 7583 TRACE root     yield self.ensure(_error_callback, _consume)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 528, in ensure
2014-04-09 19:02:39.974 7583 TRACE root     return method(*args, **kwargs)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 585, in _consume
2014-04-09 19:02:39.974 7583 TRACE root     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "<string>", line 6, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2014-04-09 19:02:39.974 7583 TRACE root     if self._ecwait(lambda: self.incoming, timeout):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-04-09 19:02:39.974 7583 TRACE root     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2014-04-09 19:02:39.974 7583 TRACE root     self.check_error()
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2014-04-09 19:02:39.974 7583 TRACE root     raise self.error
2014-04-09 19:02:39.974 7583 TRACE root InternalError: Traceback (most recent call last):
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 651, in write
2014-04-09 19:02:39.974 7583 TRACE root     self._op_dec.write(*self._seg_dec.read())
2014-04-09 19:02:39.974 7583 TRACE root   File "/usr/lib/python2.6/site-packages/qpid/framing.py", line 269, in write
2014-04-09 19:02:39.974 7583 TRACE root     if self.op.headers is None:
2014-04-09 19:02:39.974 7583 TRACE root AttributeError: 'NoneType' object has no attribute 'headers'
2014-04-09 19:02:39.974 7583 TRACE root
2014-04-09 19:02:39.974 7583 TRACE root
---

It kept logging similar output until it was killed by an admin.  From this point on, it appears to have stopped processing RPC calls, with predictably bad consequences:

---
[root@host02 log]# qpid-stat -q -S msg
Queues
  queue                                     dur  autoDel  excl  msg    msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==========================================================================================================================
  q-plugin                                                      59.6k  2.38m  2.32m   43.6m  1.73g    1.69g        0     2
  q-l3-plugin                                                     65   12.8k  12.7k   35.0k  7.05m    7.02m        0     2
---
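
For what it's worth, the traceback above suggests why the built-in reconnect logic never kicks in: the qpid framing error surfaces as an InternalError, and, as far as I can tell from the python-qpid exception hierarchy (sketched below from my reading of the client, not quoted from it), InternalError derives from MessagingError but not from ConnectError, which is what the reconnect path in impl_qpid.py catches.  The escaping exception just lands in the generic retry wrapper, hence the "retrying" messages with no recovery.

---
# Quick hedged check of the python-qpid exception hierarchy as I understand
# it (not a verbatim excerpt of the client library):
from qpid.messaging import exceptions as qpid_exceptions

# True: a handler that catches MessagingError would cover this failure.
print(issubclass(qpid_exceptions.InternalError, qpid_exceptions.MessagingError))
# False: the pre-fix handler that catches only ConnectError never sees it.
print(issubclass(qpid_exceptions.InternalError, qpid_exceptions.ConnectError))
---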

This seems to have already been noticed upstream [1], which describes the problem.  As mentioned there, a review is out [2].

[1] https://bugs.launchpad.net/oslo/+bug/1303890
[2] https://review.openstack.org/85750
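
For reference, my understanding of the upstream change in [2] is that it broadens the exception handling in the Qpid reconnect path.  The snippet below is a simplified sketch of the ensure() retry loop in neutron/openstack/common/rpc/impl_qpid.py as I read it, not the actual diff; the class is a minimal stand-in rather than the real oslo Connection class.

---
# Simplified sketch (not the actual patch) of the consumer retry path in
# neutron/openstack/common/rpc/impl_qpid.py and how the fix changes it.
from qpid.messaging import exceptions as qpid_exceptions


class Connection(object):
    """Minimal stand-in for the oslo Qpid Connection class (sketch only)."""

    def reconnect(self):
        # The real method tears down the broken Qpid connection, establishes
        # a new one, and re-declares all consumers so RPC processing resumes.
        raise NotImplementedError("sketch only")

    def ensure(self, error_callback, method, *args, **kwargs):
        while True:
            try:
                return method(*args, **kwargs)
            # Before the fix this caught qpid_exceptions.ConnectError (plus
            # Empty), so a session-level failure like the InternalError in
            # the traceback above escaped and no reconnect was attempted.
            # Catching the more generic MessagingError covers those cases.
            except (qpid_exceptions.Empty,
                    qpid_exceptions.MessagingError) as e:
                if error_callback:
                    error_callback(e)
                self.reconnect()
---

Since InternalError and the other session-level errors derive from MessagingError, the consumer thread should now fall into reconnect() instead of spinning on a dead session.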
Comment 4 Alan Pevec 2014-04-22 06:40:34 EDT
*** Bug 1085997 has been marked as a duplicate of this bug. ***
Comment 12 Nir Magnezi 2014-05-19 05:21:27 EDT
Verified NVR: openstack-neutron-2013.2.3-7.el6ost

Verified with Neutron sanity tests.
Comment 15 errata-xmlrpc 2014-05-29 16:19:43 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html
