Bug 1036523

Summary: neutron-server connection to multiple qpidd instances is broken
Product: Red Hat OpenStack
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: openstack-neutron
Assignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA
QA Contact: yfried
Severity: high
Priority: high
Version: 4.0
CC: breeler, chrisw, fdinitto, fpercoco, gsim, kgiusti, lpeer, majopela, markmc, oblaut, srevivo, yeylon
Target Milestone: z4
Keywords: OtherQA, Rebase, ZStream
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-2013.2.3-4.el6ost
Doc Type: Bug Fix
Doc Text:
Cause: On a failed connection, openstack-neutron kept retrying without moving through the configured list of message brokers. Consequence: The service kept trying to reconnect to the same broken message broker, even if several hosts were configured. Fix: The reconnect() implementation now selects the next broker in the list. Result: When several broker hosts are provided, the service tries the next one in the list on every connection attempt. Note that reconnect attempts not caused by a failure also switch the current broker; users should not rely on any particular broker order when a list is used, so this change should not cause problems.
Clone Of:
: 1082661 (view as bug list)
Last Closed: 2014-05-29 20:17:44 UTC
Type: Bug
Bug Blocks: 1080561, 1082661, 1082664, 1082665, 1082666, 1082668, 1082669, 1082670, 1082672, 1123376

Description Fabio Massimo Di Nitto 2013-12-02 09:06:25 UTC
This is related to #1036518

openstack-neutron-2013.2-10.el6ost.noarch

Since it appears that neutron cannot currently work with qpidd+haproxy, I configured neutron to use qpid_hosts for direct access to the qpidd cluster:

qpid_hosts = 192.168.2.179:5672,192.168.2.180:5672

This configuration has at least 2 problems, depending on the sequence of events.

1) both qpidd hosts are up and running:

2013-12-02 09:52:21.292 16524 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.2.179:5672

Neutron does not establish a connection to the second server. This means that neutron will not load-balance, and it will take longer to fail over if .179 goes down.

2) qpidd on .179 is unreachable (simulated with iptables)

2013-12-02 09:58:09.480 16580 INFO neutron.plugins.openvswitch.ovs_neutron_plugin [-] Network VLAN ranges: {'eth1': [(1000, 2000)]}

It takes over one minute to detect the timeout.

2013-12-02 09:59:12.678 16580 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 1 seconds
2013-12-02 09:59:13.732 16580 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.2.180:5672

flush iptables on .179 and block .180

The iptables block was issued at:
Mon Dec  2 10:03:22 CET 2013

2013-12-02 10:05:13.736 16580 ERROR neutron.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 526, in ensure
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 583, in _consume
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     self.check_error()
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid     raise self.error
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2013-12-02 10:05:13.736 16580 TRACE neutron.openstack.common.rpc.impl_qpid 
2013-12-02 10:05:13.744 16580 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.2.179:5672

So it takes almost 2 minutes to detect the failure.

Comment 1 Mark McLoughlin 2013-12-02 09:11:07 UTC
Would be good to understand a couple of things here - (a) whether this is specific to Neutron (maybe it has an older copy of the RPC code, or maybe this is an issue for all services) and (b) whether the RabbitMQ support is any better

Comment 2 Fabio Massimo Di Nitto 2013-12-02 09:16:50 UTC
(In reply to Mark McLoughlin from comment #1)
> Would be good to understand a couple of things here - (a) whether this is
> specific to Neutron (maybe it has an older copy of the RPC code, or maybe
> this is an issue for all services) and (b) whether the RabbitMQ support is
> any better


http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to

and https://bugzilla.redhat.com/show_bug.cgi?id=1036518

have all the info to reproduce.

You will need to unleash somebody other than me to test this :)

Comment 3 Maru Newby 2013-12-09 08:23:20 UTC
(In reply to Mark McLoughlin from comment #1)
> Would be good to understand a couple of things here - (a) whether this is
> specific to Neutron (maybe it has an older copy of the RPC code, or maybe
> this is an issue for all services) and (b) whether the RabbitMQ support is
> any better

a) All services are affected.

I've reviewed neutron's copy of openstack.common, the oslo-incubator repo, and the oslo.messaging repo. In all cases, both the qpid and rabbit drivers support only a single concurrent connection.  If provided, multiple hosts will be used to perform round-robin connection attempts (either at startup or on connection failure) until a successful connection is made.
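
As a rough illustration of that behaviour (a simplified sketch only, not the actual driver code; the function and parameter names here are made up), the connection handling amounts to cycling through the configured hosts, with an increasing delay, until one attempt succeeds, while only ever holding a single connection:

   import itertools
   import time

   def connect_to_first_available(brokers, try_connect, max_delay=60):
       """Try each broker in turn until one connection attempt succeeds."""
       delay = 1
       for broker in itertools.cycle(brokers):
           try:
               return try_connect(broker)   # only a single connection is ever held
           except Exception:
               time.sleep(delay)            # back off between failed attempts
               delay = min(delay * 2, max_delay)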


b) RabbitMQ isn't better

- Both drivers immediately attempt to reconnect if their connection to the messaging service drops due to service failure.  
- Both drivers use an increasing reconnection interval to limit the number of client connection attempts in the event of a service disruption.
- Both drivers will detect communication failure on send and attempt to reconnect.

Only the qpid driver supports a connection liveness check. The interval is configured via qpid_heartbeat (default=60). The ~2 minutes it was taking to fail over was likely related to this interval. Care is advised in reducing this interval, though, since it will likely increase the idle load on the qpidd service.
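
For reference, the relevant settings would look something like this in neutron.conf (a sketch only: the qpid_hosts value is the reporter's, the heartbeat value is purely illustrative rather than a recommendation, and the [DEFAULT] placement is an assumption):

   [DEFAULT]
   qpid_hosts = 192.168.2.179:5672,192.168.2.180:5672
   # liveness-check interval in seconds (default: 60); lowering it speeds up
   # failure detection but increases the idle load on qpidd
   qpid_heartbeat = 30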

Comment 4 Maru Newby 2013-12-11 06:30:47 UTC
Mark: Can you please advise as to whether it should be considered a bug that OpenStack services are not capable of maintaining multiple connections to an AMQP service? I am thinking it should be the responsibility of an HA implementation rather than something built into oslo.

Comment 5 Maru Newby 2013-12-11 06:34:11 UTC
In my testing, the Neutron service immediately attempts reconnection if the connection to qpid is broken by stopping qpidd.  Can you please verify this result (which would render symptom #2 invalid)?  I suspect that the use of iptables filtering was not accurately simulating service failure and leaving some open connections, requiring a heartbeat failure (default interval -> 60s) to discover the service disruption.

Comment 6 Fabio Massimo Di Nitto 2013-12-11 08:38:27 UTC
(In reply to Maru Newby from comment #4)
> Mark: Can you please advise as to whether it should be considered a bug that
> openstack services are not capable of maintaining multiple connections to an
> AMQP service?  I am thinking it should be the responsibility of an HA
> implementation rather than something builtin to oslo.

https://bugzilla.redhat.com/show_bug.cgi?id=1036518

This was indeed the original idea, but it appears that putting haproxy in front of the qpid instances is also broken.

Comment 7 Fabio Massimo Di Nitto 2013-12-11 08:44:08 UTC
(In reply to Maru Newby from comment #5)
> In my testing, the Neutron service immediately attempts reconnection if the
> connection to qpid is broken by stopping qpidd.

Stopping qpidd closes the TCP connection cleanly instead of triggering a timeout.

>  Can you please verify this
> result (which would render symptom #2 invalid)?  I suspect that the use of
> iptables filtering was not accurately simulating service failure and leaving
> some open connections, requiring a heartbeat failure (default interval ->
> 60s) to discover the service disruption.

Nope, you are comparing apples and oranges :)

You can't predict that the qpid service will always shut down properly and terminate its TCP connections correctly.

A network disconnection will trigger a timeout just as iptables does (for example, try unplugging the network cable; in a virtualized environment this can be simulated by manually disconnecting the VM's virtual interface from the configured bridge).

I am 100% sure that my iptables filtering is correct. I use the same rules to perform lots of other tests.

qpid is on 192.168.1.1

neutron on 192.168.1.2

On the qpid host, make sure there are no iptables rules installed:

for chain in INPUT FORWARD OUTPUT; do
 iptables -F "$chain"   # flush any pre-existing rules from the chain
done

# drop all traffic to and from the neutron host to simulate a dead broker
# (flush the chains again afterwards to restore connectivity)
iptables -A INPUT -j DROP -s 192.168.1.2
iptables -A OUTPUT -j DROP -d 192.168.1.2

Comment 8 Mark McLoughlin 2013-12-11 18:27:13 UTC
This reminds me of this rabbitmq driver bug - https://bugs.launchpad.net/oslo/+bug/856764

I've lost track of the details of the debate, but the requirement is simple: you need some form of heartbeat or keepalive so that you quickly notice the connection has stalled and that you need to fail over.

If qpid_heartbeat=60 and we're taking 2 minutes to fail over, that suggests we're attempting to reconnect to the same host as our failed connection. It could be an easy fix to cut the failover time in half.

And if 60 seconds is too slow a failover time, we should reduce qpid_heartbeat to whatever is acceptable. There's no magic here: you can only know that the other side has gone away if it fails to respond to you.

Oh, please do open upstream oslo.messaging bugs for this as appropriate

Hope that helps ...

Comment 10 Perry Myers 2013-12-15 12:48:35 UTC
lpeer mentioned over email that this might be an HAProxy bug. The comments on this bug don't convince me that this is an HAProxy bug...

It should be possible for a service to have a list of qpid brokers, perform liveness checks, and fail over in a round-robin fashion. At least, this is my understanding from multiple conversations with the qpid folks. Requiring a load balancer in between to perform this round-robin failover seems a bit heavyweight.

If the qpid driver in Oslo (copied to all of the services) can't properly handle broker failover this way, then I think that needs to be fixed.

Would like some input here from folks on the qpid team and/or folks familiar with the qpid drivers in OpenStack.

As a completely separate issue, as comment #6 indicates, we were not able to get HAProxy to load-balance qpid connections as a workaround for the fact that the broker round-robin mechanism in the qpid driver itself was not working. I think these two bugs are separate, and I don't think putting HAProxy between each OpenStack qpid client and qpid broker should be necessary, unless the qpid experts tell me that this is the only way to provide round-robin failover for qpid.

Comment 11 Fabio Massimo Di Nitto 2013-12-15 15:00:26 UTC
(In reply to Perry Myers from comment #10)
> lpeer mentioned over email that this might be an HAProxy bug.  The comments
> on this bug don't convince me that this is a HA Proxy bug... 

Neither bug is really an haproxy bug.

The bug mentioned in comment #6 is specifically about combining the qpid driver used by nova/neutron with haproxy in front of qpid.

This bug is about direct connections from neutron/nova to the qpidd instances.

We have to make sure NOT to mix the two issues here (in fact, you mentioned the same thing below... :))

> 
> It should be possible for a service to have a list of qpid brokers, perform
> liveness checks and failover in a round robin fashion.  At least, this is my
> understanding from multiple conversations with the qpid folks.  Requiring a
> load balancer in between to perform this round robin failover seems a bit
> heavyweight.

In fact, it should not be required. The problem with the current implementation, from an HA perspective, is the time it takes for the qpid driver in Oslo to recognize that a server has gone away and reconnect to the next one; from an LB perspective, it is the fact that the connections are not load-balanced as expected.

> 
> If the qpid driver in Oslo (copied to all of the services) can't properly
> handle broker failover this way, then I think that needs to be fixed.
> 
> Would like some input here from folks on the qpid team and/or folks familiar
> with the qpid drivers in OpenStack.
> 
> As a completely separate issues, as Comment #6 indicates, we were not able
> to get HAProxy to load balance qpid connections to work around the fact that
> the qpid broker round robin mechanism in the qpid driver itself was not
> working.  I think these two bugs are separate, and I don't think having to
> put HAProxy in between each OpenStack qpid client and qpid broker should be
> necessary, unless the qpid experts tell me that this is the only way to
> provide round-robin failover for qpid.

OK, we agree: they are two separate issues.

The use of haproxy in between makes it easier to fail over and to configure (and reconfigure) most of the services. Since we need to set up haproxy anyway for another bug involving glance <-> qpid, reusing the same channel for other services simplifies the overall configuration steps and keeps them consistent across all systems/subsystems.

Comment 13 Gordon Sim 2013-12-16 12:01:25 UTC
(In reply to Mark McLoughlin from comment #8)
> This reminds me of this rabbitmq driver bug -
> https://bugs.launchpad.net/oslo/+bug/856764
> 
> I've lost track of the details of the debate, but the requirement is simple
> - you need some form of heatbeat or keepalive so that you quickly notice the
> connection has stalled and that you need to failover
> 
> If qpid_heartbeat=60 and we're taking 2 minutes to fail over, that suggests
> we're attempting to reconnect to the same host as our failed connection - it
> could be an easy fix to cut the failover time in half

The way heartbeat is defined for AMQP 0-10 is that the connection should be considered failed if *two* heartbeat intervals pass without any heartbeats. So if you want to detect a failure within 60 seconds, the heartbeat interval should be 30 seconds.

(Note also that the heartbeats are only sent if there is no other traffic on the connection).
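
In other words, as a back-of-the-envelope restatement of the above and of the behaviour seen in the description:

   detection time ~= 2 x qpid_heartbeat
   qpid_heartbeat = 30            ->  failure detected within ~60 seconds
   qpid_heartbeat = 60 (default)  ->  ~120 seconds, i.e. the ~2 minutes observed above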

Comment 14 Gordon Sim 2013-12-16 12:42:55 UTC
(In reply to Fabio Massimo Di Nitto from comment #11)
> (In reply to Perry Myers from comment #10)
> > It should be possible for a service to have a list of qpid brokers, perform
> > liveness checks and failover in a round robin fashion.  At least, this is my
> > understanding from multiple conversations with the qpid folks.  Requiring a
> > load balancer in between to perform this round robin failover seems a bit
> > heavyweight.
> 
> In fact, it should not be required. The problem with the current
> implementation, from a HA perspective, is the time it takes for the qpid
> driver in Oslo to recognize a server has gone away and reconnect to the
> next, and from a LB perspective, the fact that the connections are not
> load-balanced as expected.

The time to detect a failure is controlled by the heartbeat interval, with two intervals passing before the connection is treated as failed.

In terms of load balancing, that would only work if a clustered qpidd is used. In that case, although load balancing will have some impact, the overall throughput of a cluster is generally noticeably reduced compared to a single standalone instance anyway. If instead pacemaker is used to start a backup unclustered qpidd, then load balancing isn't relevant.

Comment 15 Fabio Massimo Di Nitto 2013-12-16 13:30:16 UTC
(In reply to Gordon Sim from comment #14)
> (In reply to Fabio Massimo Di Nitto from comment #11)
> > (In reply to Perry Myers from comment #10)
> > > It should be possible for a service to have a list of qpid brokers, perform
> > > liveness checks and failover in a round robin fashion.  At least, this is my
> > > understanding from multiple conversations with the qpid folks.  Requiring a
> > > load balancer in between to perform this round robin failover seems a bit
> > > heavyweight.
> > 
> > In fact, it should not be required. The problem with the current
> > implementation, from a HA perspective, is the time it takes for the qpid
> > driver in Oslo to recognize a server has gone away and reconnect to the
> > next, and from a LB perspective, the fact that the connections are not
> > load-balanced as expected.
> 
> The time to detect failover is controlled using the heartbeat interval, with
> two intervals passing before the connection is treated as failed.
> 
> In terms of load-balancing, that would only work if a clustered qpidd is
> used.

That is the case.

> In that case, though load balancing will have some impact, the overall
> throughput is generally noticeably reduced as compared to a single
> standalone instance anyway.

Right now I am more worried about things working at all than about performance.

> If instead pacemaker is used to start a backup
> unclustered qpidd, then load balancing isn't relevant.

Yes, I am filing those bugs as I find issues deploying RHOS on top of RHEL-HA/LB.

Comment 16 Maru Newby 2013-12-17 06:33:06 UTC
(In reply to Mark McLoughlin from comment #8)
> This reminds me of this rabbitmq driver bug -
> https://bugs.launchpad.net/oslo/+bug/856764
> 
> I've lost track of the details of the debate, but the requirement is simple
> - you need some form of heatbeat or keepalive so that you quickly notice the
> connection has stalled and that you need to failover
> 
> If qpid_heartbeat=60 and we're taking 2 minutes to fail over, that suggests
> we're attempting to reconnect to the same host as our failed connection - it
> could be an easy fix to cut the failover time in half
> 
> And if 60 seconds is too slow a fail over time, we should reduce
> qpid_heatbeat to whatever is acceptable. There's no magic here - you can
> only know that the other side has gone away if it fails to respond to you.
> 
> Oh, please do open upstream oslo.messaging bugs for this as appropriate
> 
> Hope that helps ...

Your assumption about the failover time was correct. Both the rabbit and qpid drivers reconnect in the order in which the AMQP servers are defined, rather than connecting to the next server that did not fail.
 
I've filed a launchpad bug as requested and added a reference to this bz.

Comment 17 Maru Newby 2013-12-17 06:46:53 UTC
(In reply to Gordon Sim from comment #13)
> (In reply to Mark McLoughlin from comment #8)
> > This reminds me of this rabbitmq driver bug -
> > https://bugs.launchpad.net/oslo/+bug/856764
> > 
> > I've lost track of the details of the debate, but the requirement is simple
> > - you need some form of heatbeat or keepalive so that you quickly notice the
> > connection has stalled and that you need to failover
> > 
> > If qpid_heartbeat=60 and we're taking 2 minutes to fail over, that suggests
> > we're attempting to reconnect to the same host as our failed connection - it
> > could be an easy fix to cut the failover time in half
> 
> The way heartbeat is defined for AMQP 0-10 is that the connection should be
> considered failed if *two* heartbeat intervals pass without any  heartbeats.
> SO if you want to detect a failure within 60 seconds, the heartbeat interval
> should be 30secs.

Ah, that makes more sense (it pays to RTFM, I guess). So the 2 minutes was only due to the timeout interval and had nothing to do with attempting to connect to the failed server (since that failure would be reported immediately)? If so, the upstream bug is invalid and should be marked as such.


> 
> (Note also that the heartbeats are only sent if there is no other traffic on
> the connection).

Comment 18 Gordon Sim 2013-12-17 11:27:09 UTC
(In reply to Maru Newby from comment #17)
> Ah, that makes more sense (it pays to rtfm I guess).  So the 2m was only due
> to the timeout interval and had nothing to do with attempting to connect to
> the failed server (since the failure would be reported immediately)?  If so,
> the upstream bug is invalid and should be marked as such.

With a heartbeat interval of 60 seconds, qpid will take 2 minutes to give up on a connection over which it receives no heartbeats.

I'm not familiar enough with all the various codebases to say authoritatively whether the reconnect logic in the driver is correct. However, looking at Connection.reconnect() in oslo.messaging, it appears that it will always try each of the defined brokers, always starting with the first in the list[1], and waiting for an increasing delay between failures (the delay starts at 1 second and doubles up to a maximum of 60 seconds).

nova.openstack.common.rpc.impl_qpid appears to contain the same logic.

So to me, https://bugs.launchpad.net/oslo.messaging/+bug/1261631 appears to accurately describe an issue.

In the original description of this bz above, I'm not clear from the log snippets what exactly happens for the first failure. It looks like on reconnect the first broker is tried and fails, so (following a 1 second sleep) the driver tries the second and succeeds. The second failure is then detected in 2 minutes, which is the time expected for a failure detected due to lack of heartbeats with a heartbeat interval of 60; however, the error is 'connection aborted' whereas I would have expected 'heartbeat timeout'. The next reconnect starts (as always) from the first broker (which is now available again, and happens to be correct for this test).

Note also that I discovered a defect in the qpid python client, where the heartbeat only works for established connections and would not ensure that a connection attempt would timeout: https://issues.apache.org/jira/browse/QPID-5428.

[1]
   attempt = 0    # reset on every call to reconnect()
   ...
   while True:
       ...
       # because 'attempt' starts at 0 on each call, the first broker in the
       # list is always tried first, regardless of which broker just failed
       broker = self.brokers[attempt % len(self.brokers)]

Comment 19 Ihar Hrachyshka 2014-01-16 18:00:15 UTC
There are several problems mentioned here:
1. The oslo.messaging qpid implementation does not do any load balancing (there is only one connection at any moment in time, so messages are not balanced across the qpid cluster). Whether we really want to load-balance (and not just fail over) qpid messages should be discussed.
2. 'connection aborted' is reported instead of the expected 'timeout'. This is hidden inside the qpid python bindings; see /usr/lib/python2.7/site-packages/qpid/connection.py. I should investigate further why we don't get a timeout error when the heartbeat timeout is triggered.
3. On failover, we always restart from the start of the broker list. Same as: https://bugs.launchpad.net/oslo.messaging/+bug/1261631. I have a fix for oslo-incubator for this (qpid part only) and will handle the review process. I'm curious whether we also want to update the rabbitmq implementation. (A sketch of the approach follows this list.)
4. Timeout detection takes 2 minutes, which is too long. This should be fixable by setting a proper value for qpid_heartbeat and probably does not require any change to the default configuration file (?)
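
For (3), here is a minimal sketch of the idea behind the fix (an illustration only, not the shipped patch; the class shape, the try_connect parameter and the attribute names are assumptions): keep a persistent iterator over broker indices so that each reconnection attempt starts from the broker after the one tried last, instead of always restarting from the first entry in the list.

   import itertools
   import time

   class Connection(object):
       def __init__(self, brokers):
           self.brokers = brokers
           # persists across reconnect() calls, so successive attempts
           # rotate through the configured brokers
           self.next_broker_indices = itertools.cycle(range(len(brokers)))

       def reconnect(self, try_connect, max_delay=60):
           delay = 1
           while True:
               broker = self.brokers[next(self.next_broker_indices)]
               try:
                   return try_connect(broker)
               except Exception:
                   time.sleep(delay)                 # same increasing delay as before
                   delay = min(delay * 2, max_delay)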

Comment 20 Ihar Hrachyshka 2014-01-21 12:15:50 UTC
Regarding problem 2 from comment #19: I actually get the following traceback when the heartbeat fails, so there is probably no problem here:

2014-01-21 12:40:43.845 14523 ERROR neutron.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: heartbeat timeout
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 527, in ensure
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 584, in _consume
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     self.check_error()
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid     raise self.error
2014-01-21 12:40:43.845 14523 TRACE neutron.openstack.common.rpc.impl_qpid HeartbeatTimeout: heartbeat timeout

Comment 21 lpeer 2014-02-10 09:13:22 UTC
Pushing for A3 as this is HA related and HA will be delivered in A3. In addition, we have upstream issues with getting the patch merged -
http://post-office.corp.redhat.com/archives/rh-openstack-dev/2014-January/msg00675.html

Comment 24 errata-xmlrpc 2014-05-29 20:17:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html