Bug 1127843 - can't create consumer connection to qpid after HeartbeatTimeout in heavy workload
Summary: can't create consumer connection to qpid after HeartbeatTimeout in heavy work...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 4.0
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
: 5.0 (RHEL 7)
Assignee: Russell Bryant
QA Contact: Ami Jeain
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-08-07 16:40 UTC by Jon Thomas
Modified: 2019-09-09 17:16 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-08-13 15:07:49 UTC
jthomas: needinfo-


Attachments (Terms of Use)

Description Jon Thomas 2014-08-07 16:40:18 UTC
https://bugs.launchpad.net/nova/+bug/1247603

VMs are intermittently failing to launch. Daemons (scheduler, conductor, etc) are showing:

2014-08-07 10:02:08.234 14436 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: heartbeat timeout
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid HeartbeatTimeout: heartbeat timeout
2014-08-07 10:02:08.234 14436 TRACE nova.openstack.common.rpc.impl_qpid 
2014-08-07 10:02:21.064 14436 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 2.817648 sec


and
2014-07-29 14:01:55.573 ERROR nova.openstack.common.rpc.impl_qpid [req-ad1bc7a2-d986-41ce-a1ec-30b9a380c20b None None] Failed to publish message to topic 'reply_ef197e6b510e4f3e9d742761ac9c0457': heartbeat timeout
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 424, in ensure
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 516, in _publisher_send
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     publisher = cls(self.conf, self.session, topic)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 289, in __init__
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     {"type": "direct"})
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 246, in __init__
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     self.reconnect(session)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 250, in reconnect
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     self.sender = session.sender(self.address)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in sender
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 597, in sender
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     sender._ewait(lambda: sender.linked)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 804, in _ewait
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     result = self.session._ewait(lambda: self.error or predicate(), timeout)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid HeartbeatTimeout: heartbeat timeout
2014-07-29 14:01:55.573 1663 TRACE nova.openstack.common.rpc.impl_qpid 

and

2014-08-06 10:02:27.623 21332 ERROR nova.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 540, in ensure
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 597, in _consume
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 665, in next_receiver
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 571, in _ewait
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 214, in _ewait
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     self.check_error()
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 207, in check_error
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid     raise self.error
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2014-08-06 10:02:27.623 21332 TRACE nova.openstack.common.rpc.impl_qpid

Comment 2 Russell Bryant 2014-08-11 13:53:36 UTC
What is the qpid_heartbeat configuration option set to?

Comment 3 Jon Thomas 2014-08-11 13:56:43 UTC
It's the default


Here's the contents of qpidd.conf

cluster-mechanism=ANONYMOUS
auth=no
port=5672
max-connections=4096
worker-threads=500
connection-backlog=4096
max-negotiate-time=30000

Comment 6 Russell Bryant 2014-08-11 15:06:05 UTC
OK, so this is likely 1 of 2 things:

1) The nova service that produced these errors was under very high load.  In this case, there's nothing we can really do about that in code.  It's a deployment issue (need to scale out).

2) The qpid broker was choking for some reason.  Do you have any info on what kind of load qpidd is under?  One recommendation I provided elsewhere for OS1 is to set qpid_topology_version=2 across the deployment.  Without that, there is a bad leak that nova is causing in qpidd, which will cause it to fall over after some period of time.  I suggest deploying this update and then seeing if this happens again.

Comment 7 Russell Bryant 2014-08-13 15:07:49 UTC
(In reply to Russell Bryant from comment #6)
> OK, so this is likely 1 of 2 things:
> 
> 1) The nova service that produced these errors was under very high load.  In
> this case, there's nothing we can really do about that in code.  It's a
> deployment issue (need to scale out).
> 
> 2) The qpid broker was choking for some reason.  Do you have any info on
> what kind of load qpidd is under?  One recommendation I provided elsewhere
> for OS1 is to set qpid_topology_version=2 across the deployment.  Without
> that, there is a bad leak that nova is causing in qpidd, which will cause it
> to fall over after some period of time.  I suggest deploying this update and
> then seeing if this happens again.

With that said, I don't think there is much else we can do here.  Let me know if we can help further.


Note You need to log in before you can comment on or make changes to this bug.