Bug 1036518 - [Docs] [HA] OpenStack mysql/qpid driver and haproxy timers need to be carefully documented
Status: CLOSED DEFERRED
Product: Red Hat OpenStack
Classification: Red Hat
Component: doc-Installation_and_Configuration_Guide
Version: 4.0
Hardware/OS: Unspecified / Unspecified
Priority: high
Severity: high
Target Release: 5.0 (RHEL 7)
Assigned To: Martin Lopes
QA Contact: ecs-bugs
Keywords: Documentation, Triaged
Blocks: 1220653
Reported: 2013-12-02 03:29 EST by Fabio Massimo Di Nitto
Modified: 2016-04-26 23:22 EDT
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-04-28 21:47:30 EDT

Attachments:
haproxy configuration which works ok for my testing (3.29 KB, text/plain), 2014-01-16 08:09 EST, Ihar Hrachyshka
Description Fabio Massimo Di Nitto 2013-12-02 03:29:36 EST
The setup is the following:

2 cluster nodes running haproxy
2 cluster nodes running mysql/qpid/keystone/glance/cinder
2 cluster nodes running neutron (or at least an attempt to run ;))

configuration details can be found here:
http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to

haproxy has been tested in both tcp and http mode with similar results.

The logical setup is:

1 machine running neutron-server -> virtual IP from the loadbalancer -> connection forwarded to one of the 2 nodes running qpid.

from neutron.conf:
qpid_hostname = rhos4-qpidd-vip

[root@rhos4-node6 ~]# /etc/init.d/neutron-server start
Starting neutron:                                          [  OK  ]

2013-12-02 09:27:25.477 16471 ERROR neutron.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid Traceback (most recent call last):
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 526, in ensure
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     return method(*args, **kwargs)
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/impl_qpid.py", line 583, in _consume
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     nxt_receiver = self.session.next_receiver(timeout=timeout)
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "<string>", line 6, in next_receiver
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     if self._ecwait(lambda: self.incoming, timeout):
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     result = self._ewait(lambda: self.closed or predicate(), timeout)
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     self.check_error()
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid   File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid     raise self.error
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid ConnectionError: connection aborted
2013-12-02 09:27:25.477 16471 TRACE neutron.openstack.common.rpc.impl_qpid 
2013-12-02 09:27:25.571 16471 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on rhos4-qpidd-vip:5672
2013-12-02 09:27:35.630 16471 ERROR neutron.openstack.common.rpc.impl_qpid [-] Failed to consume message from queue: connection aborted

Using a direct connection to the qpidd server allows neutron-server to start and we don't see the tracebacks, but it has other issues (I'll file them shortly after this one and reference them).
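
For reference, the only difference between the failing and the working setup is the broker host neutron points at (the direct node name below is a hypothetical example, not the actual hostname from this setup):

# /etc/neutron/neutron.conf
# via the load-balancer VIP (produces the tracebacks above):
qpid_hostname = rhos4-qpidd-vip
# direct to a single qpidd node (neutron-server starts cleanly):
#qpid_hostname = rhos4-qpidd-node1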
Comment 1 Fabio Massimo Di Nitto 2013-12-02 03:31:40 EST
openstack-neutron-2013.2-10.el6ost.noarch
Comment 2 Fabio Massimo Di Nitto 2013-12-02 04:15:51 EST
(In reply to Fabio Massimo Di Nitto from comment #0)
> queue: connection aborted
> 
> Using direct connection to the qpidd server allows -server to start and we
> don´t see the tracebacks, but it has other issues (i´ll file them shortly
> after this one and reference)

https://bugzilla.redhat.com/show_bug.cgi?id=1036523
Comment 3 Maru Newby 2013-12-09 03:40:56 EST
Can you please confirm whether other services (nova, for example) exhibit this same issue?  Given that the underlying RPC mechanism is shared, I think it unlikely that this issue is specific to neutron.
Comment 4 Fabio Massimo Di Nitto 2013-12-09 03:45:44 EST
(In reply to Maru Newby from comment #3)
> Can you please confirm whether other services (nova, for example) exhibit
> this same issue?  Given that the underlying RPC mechanism is shared, I think
> it unlikely that this issue is specific to neutron.

I am not done clustering nova yet, but Steven Reichard had similar issues so I suspect the common code is all affected. I might be able to finish by today or tomorrow unless I find other blockers down the road.
Comment 5 Fabio Massimo Di Nitto 2013-12-09 05:50:28 EST
(In reply to Fabio Massimo Di Nitto from comment #4)
> (In reply to Maru Newby from comment #3)
> > Can you please confirm whether other services (nova, for example) exhibit
> > this same issue?  Given that the underlying RPC mechanism is shared, I think
> > it unlikely that this issue is specific to neutron.
> 
> I am not done clustering nova yet, but Steven Reichard had similar issues so
> i suspect the common code is all affected. I might be able to finish by
> today or tomorrow unless i find other blockers down the road.

Confirmed, same problem happens in nova.
Comment 6 Maru Newby 2013-12-11 01:01:13 EST
Thank you for the confirmation.  Would it then make sense to retarget the bug so that someone more familiar with QPID and OpenStack's use of it could be tasked with finding a fix?  I'm afraid there is nobody on the RHOS Networking team with a specialization in HA QPID.
Comment 7 Fabio Massimo Di Nitto 2013-12-11 01:19:37 EST
(In reply to Maru Newby from comment #6)
> Thank you for the confirmation.  Would it then make sense to retarget the
> bug so that someone more familiar with QPID and OpenStack's use of it could
> be tasked with finding a fix?  I'm afraid there is nobody on the RHOS
> Networking team with a specialization in HA QPID.

The root problem is not related to HA QPID itself but to passing through haproxy.

You can reproduce the issue without any HA involvement:

neutron-server -> haproxy -> qpid

vs

neutron-server -> qpid

Note that haproxy is configured in tcp mode, meaning that there is no tampering with HTTP headers in the requests.

If you use 3 VMs and set up a service as above, you can easily see the same results, without involving cluster complexity.
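
For illustration, a minimal haproxy.cfg for that three-VM reproducer could look like the following (tcp mode, one backend; names and addresses are placeholders, not from our setup):

global
    daemon
defaults
    mode tcp
    maxconn 4096
    timeout connect 10s
    timeout client 60s
    timeout server 60s
listen qpid 192.0.2.10:5672
    balance roundrobin
    server qpid1 192.0.2.20:5672 check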
Comment 8 Ryan O'Hara 2013-12-15 18:39:46 EST
I'd like to see haproxy.cfg as well as some haproxy logs. Is haproxy able to perform successful health-checks for the backend qpid servers?
Comment 9 Fabio Massimo Di Nitto 2013-12-16 00:12:17 EST
(In reply to Ryan O'Hara from comment #8)
> I'd like to see haproxy.cfg as well as some haproxy logs. Is haproxy able to
> perform successful health-checks for the backend qpid servers?

It's the same haproxy config I am using for the HA setup. Yes, haproxy is working fine. Glance is using that same VIP/session and it works without a glitch.

http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to

The config bit is there.

I strongly doubt it is an haproxy issue here, though. The qpid-python-test suite, executed against the VIP from the same host where neutron is running, claims a 100% test pass.
Comment 10 Fabio Massimo Di Nitto 2013-12-16 08:56:27 EST
So I retested today with the exact same config (besides moving qpid_hosts to qpid_hostname) and it started working.

Though I noticed other issues, such as duplicated messages and messages that could not be consumed.

I would recommend merging this bug with https://bugzilla.redhat.com/show_bug.cgi?id=1036523, having a more concrete discussion on the various parts that are working/not working, and planning actions from there.

My personal feeling, given the above experience (note that haproxy was not working for other people either), is that the code is not stable or very reliable. It needs some serious review/adjustment for production.
Comment 11 Ryan O'Hara 2013-12-16 09:56:35 EST
(In reply to Fabio Massimo Di Nitto from comment #9)
> (In reply to Ryan O'Hara from comment #8)
> > I'd like to see haproxy.cfg as well as some haproxy logs. Is haproxy able to
> > perform successful health-checks for the backend qpid servers?
> 
> It's the same haproxy config I am using for HA setup. Yes haproxy is working
> fine. Glance is using that same VIP/session and it works without a glitch.
> 
> http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to
> 
> config bit is there.
> 
> I strongly doubt it is a haproxy issue here tho. the qpid-python-test suite
> executed against the VIP, from the same host where neutron is running,
> claims 100% test pass.

OK. This is good to know. I agree that this does not appear to be an haproxy config issue.
Comment 12 Gordon Sim 2013-12-17 06:35:10 EST
Were there any errors for qpidd in syslog? I believe the 'connection aborted' implies the client hit the end of stream for the socket. That could be because the broker dropped the connection for some reason (which should then be logged).
Comment 13 Maru Newby 2013-12-17 13:00:44 EST
(In reply to Fabio Massimo Di Nitto from comment #10)
> So I retested today, with the exact same config (beside moving qpid_hosts to
> qpid_hostname) and it start working.
> 
> Tho I noticed other issues, such as duplicated messages and messages that
> could not be consumed.
> 
> I would recommend to merge this bug with
> https://bugzilla.redhat.com/show_bug.cgi?id=1036523 and have a more concrete
> discussion on the various parts that are working/not working and plan
> actions from there.

Can you please close this bug or the other one?  It's not clear from your message which you prefer.
Comment 14 Fabio Massimo Di Nitto 2013-12-17 13:17:09 EST
Gordon: yes, I saw that at some point qpid reported that the client connection had timed out and it was closing the connection. Sometimes with and sometimes without haproxy.

Maru: I am still investigating some cases here. It's not completely black and white what's happening.

Truth be told, it's very confusing as there is no specific reproducer for one problem or another.

<speculation>
 It feels as if there are some odd race conditions between the qpidd message TTL and clients consuming messages from the queues.
</speculation>

I'd like a few more days to try and consolidate a fact-matrix, and then we can agree which bug to close.
Comment 15 Fabio Massimo Di Nitto 2013-12-24 10:35:05 EST
https://bugzilla.redhat.com/show_bug.cgi?id=1036748#c5

cross referencing.

We will need a small squad from both the qpid and RHOS teams to debug this issue. It's a lot more complex than it appears, and so far we have been able to obtain 4 different results (tracebacks in python) depending on direct connections to qpid or via haproxy, and on qpidd timers.
Comment 16 Graeme Gillies 2014-01-13 22:34:48 EST
Hi,

We are hitting this issue as well, seeing the error

2014-01-14 03:04:52.393 9816 TRACE neutron.openstack.common.rpc.impl_qpid ConnectionError: connection aborted

For all our neutron agents on our compute/server/network nodes. Our QPID instance is not clustered at all, and not behind haproxy. It is, however, listening on a VIP.

Setting the qpid server to debug gives little to no information:

2014-01-14 02:56:47 [Broker] error Connection 10.8.56.20:5672-10.8.57.1:39914 timed out: closing
2014-01-14 02:56:47 [System] debug DISCONNECTED [10.8.56.20:5672-10.8.57.1:39914]
2014-01-14 02:56:47 [Broker] debug openstack@openstack.84c8f187-adf9-40b0-8734-ce7f06e18da8:0: detached on broker.
2014-01-14 02:56:47 [Management] debug SEND raiseEvent (v1) class=org.apache.qpid.broker.clientDisconnect
2014-01-14 02:56:47 [Management] debug SEND raiseEvent (v2) class=org.apache.qpid.broker.clientDisconnect
2014-01-14 02:56:47 [Model] debug Delete connection. user:openstack@openstack rhost:10.8.56.20:5672-10.8.57.1:39914

This affects all parts of our infrastructure and severely impacts the usability of the environment.

Regards,

Graeme
Comment 17 Ihar Hrachyshka 2014-01-15 10:35:33 EST
My understanding is that the 'connection aborted' error seen may indicate different issues in different contexts:
- in the direct neutron-to-qpid connection case, it may be a firewall configuration problem.
- in the haproxy-proxied neutron-to-qpid connection, it may be (again) a firewall configuration problem on both the haproxy and qpid side (it should allow connections from the haproxy node), or qpid is down/malfunctioning.

So putting all cases of the error message into one single bug does not help investigate specific cases. Instead, I discuss only the haproxy matter below, leaving setups with no haproxy out of scope.

===========

I've successfully set up two nodes of qpid accessed by neutron through haproxy as follows:
- created allinone packstack installation on Node1;
- installed haproxy and keepalived on Node2;
- installed Qpid on Node3;
- configured haproxy on Node2 to switch (round-robin) Qpid nodes and serve on VIP;
- configured neutron on Node1 to use VIP;
- updated firewall rules on Node2 and Node3 to allow incoming connections on the path to and from neutron.

More info on setup at: http://openstack.redhat.com/Load_Balance_OpenStack_API

===========

Using the default timeout values for the haproxy watchdog (10 secs) and the oslo Qpid implementation (60 secs, which is multiplied by two) seems too slow to quickly remove/restore Qpid nodes when they go offline or come back online.

With haproxy in 'round-robin' mode, a failed Qpid node will keep being used until haproxy detects the failure, which will take some time. In the meantime, we'll see 'connection aborted' for some of our Qpid connections. We can speed up failed-node detection by changing the haproxy timeout to e.g. 1 sec.

The default value for the oslo qpid implementation timeout (60 secs) does not help either. We may need to wait an extended time before it retries the connection (it waits 1 sec before retrying, then 2, then 4... up to 60 secs, several times, and only then tries another node; in the case of a single qpid VIP, the same one).

This issue may be reduced by lowering those timeouts significantly. Setting the values to 1 or 2 secs makes haproxy and the oslo qpid implementation switch nodes almost immediately, so that we see few or no error messages in the log.


===========

So at first sight, it may look like a simple configuration problem. But let's look at the issue from a fundamental point of view.

Putting several Qpid nodes behind haproxy does not magically make them interact and maintain common knowledge about queues, exchanges, and messages. If we use 'round-robin' haproxy mode, it will switch nodes each time a new connection arrives. This will eventually result in a situation where a sender sent a message to one node while a consumer expected it on another node, so the message will be delayed or even dropped (if no one has subscribed to the exchange on that node yet). Not a good thing.

Any other haproxy balancing mode will produce similar results. Even if haproxy somehow handles all the incoming connections with the same node, once it fails and another node starts serving requests, we lose messages and other Qpid state from the first node.

My understanding is that generic approaches to clustering are not applicable to message queues. To get consistent results with high availability, we should keep message queue state synchronized. This is exactly what Qpid clusters do: http://qpid.apache.org/books/0.7/AMQP-Messaging-Broker-CPP-Book/html/ch01s08.html Of course, this is not cheap. Quoting, "High Availability Clustering has a cost: in order to allow each broker in a cluster to continue the work of any other broker, a cluster must replicate state for all brokers in the cluster." But this is exactly the approach we should take to cluster our broker.

Meaning, even changes to the timeout values in Qpid and haproxy only hide the real problem, which is the lack of state replication between independent qpid instances behind haproxy.
Comment 18 Fabio Massimo Di Nitto 2014-01-16 03:38:35 EST
(In reply to Ihar Hrachyshka from comment #17)
> My understanding is that 'connection aborted' error seen may indicate
> different issues in different contexts:
> - in direct neutron-to-qpid connection case, it may be firewall
> configuration problem.
> - in haproxy-proxied neutron-to-qpid connection, it may be (again) firewall
> configuration problem on both haproxy and qpid side (it should allow
> connections from haproxy node), or qpid is down/malfunctioning.

In my setup (with haproxy) all firewalls/iptables are off.

> 
> So putting all cases of the error message into one single bug does not help
> investigating specific cases. Instead, I discuss only haproxy matter below,
> leaving setups with no haproxy behind the scope.
> 
> ===========
> 
> I've successfully set two nodes of qpid accessed by neutron thru haproxy as
> following:
> - created allinone packstack installation on Node1;
> - installed haproxy and keepalived on Node2;
> - installed Qpid on Node3;
> - configured haproxy on Node2 to switch (round-robin) Qpid nodes and serve
> on VIP;
> - configured neutron on Node1 to use VIP;
> - updated firewall rules on Node2 and Node3 to allow incoming connections on
> the path to and from neutron.
> 
> More info on setup at: http://openstack.redhat.com/Load_Balance_OpenStack_API
> 

We followed the same tutorial; in fact, Ryan, who wrote the tutorial, is following our daily calls where we work on RHOS+RHEL-HA/LB.

In some cases it works, in others it doesn't, and it appears to be somewhat different from case to case. As I mentioned in comment #15, we have seen everything from working setups to 4 different kinds of tracebacks so far.

I am actually surprised that the tutorial worked out of the box for you, because it has a fundamental error in the configuration by enforcing http mode in the proxy section, which just doesn't work with qpid. tcp mode has to be used, otherwise some of the header rewriting will break. Perhaps you want to send me your final haproxy.cfg so I can compare and test? It's entirely possible that you have that tiny little change that makes everything work like a champ :)

> ===========
> 
> Using default timeout values for haproxy watchdog (10 secs) and oslo Qpid
> implementation (60 secs, which are multiplied by two) seem too much to
> quickly remove/restore Qpid nodes when they become offline or back online.
> 
> Using 'round-robin' mode for haproxy, it means that failed Qpid node will be
> used until haproxy detects the fact, which will take some time. At the
> moment, we'll see 'connection aborted' for some of our Qpid connections. We
> can quicken failed node detection by changing haproxy timeout to e.g. 1 sec.
> 
> Default value for oslo qpid implementation timeout (60 secs) does not help
> either. We may need to wait extended time before it will retry connection
> (it wait for 1 sec before retrying, then 2, then 4... till 60 secs for
> several times, and only then repeating connection to other node; in case of
> one qpid VIP - the same one).
> 
> This issue may be reduced by lowering those timeouts significantly. Setting
> the values to 1 or 2 secs make haproxy and oslo qpid implementation to
> switch node almost immediately, so that we see no or few error messages in
> the log.
> 

ACK, that's one of the next steps in testing failover timers.

> 
> ===========
> 
> So on first sight, it may look like a simple configuration problem. But
> let's look at the issue from fundamental point of view.
> 
> Putting several Qpid nodes behind haproxy does not magically make them
> interact and maintain common knowledge about queues and exchanges and
> messages. If we use 'round-robin' haproxy mode, it will switch the node each
> time a new connection arrives. This will eventually result in a situation
> when a sender sent a message to one node, while a consumer expected it on
> another node, so message will be transmitted delayed or even dropped [if no
> one subscribed to the exchange on this node yet). Not a good thing.

qpid was configured in cluster mode. All messages are dispatched to/from all nodes.

> 
> Any other haproxy balancing mode will produce similar results. Even if
> haproxy somehow handle all the incoming connections by the same node, once
> it fails, and another node starts serving requests, we loose messages and
> other Qpid state from the first node.
> 
> My understanding is that generic approaches to clusterizing are not
> applicable to message queues. To get consistent results with high
> availability, we should maintained message queues state synchronized. This
> is exactly the thing that is done by Qpid clusters:
> http://qpid.apache.org/books/0.7/AMQP-Messaging-Broker-CPP-Book/html/ch01s08.
> html Of course, this is not cheep. Quoting, "High Availability Clustering
> has a cost: in order to allow each broker in a cluster to continue the work
> of any other broker, a cluster must replicate state for all brokers in the
> cluster." But this is exactly the approach we should take to clusterize our
> broker.
> 
> Meaning, even changes to timeout values in Qpid and haproxy only hide the
> real problem which is lack of state replication between independent qpid
> instances behind haproxy.

This is the same setup we have. The cost is hardly a concern right now.

There are other threads on rhos related mailing lists discussing how qpid should be deployed.
Comment 19 Ihar Hrachyshka 2014-01-16 08:08:22 EST
Initially, I didn't realize that you had actually clustered the qpidd instances. If so, then we are looking just at haproxy as a load balancer between two qpidd servers with synchronized state, and that is a completely valid case.

Indeed, I use 'tcp' haproxy mode to load balance between qpidd instances; 'http' mode does not work.

Since in this case there is only one qpid_hostname used by neutron, qpid_timeout may only influence the case where the qpidd instance is killed in a hard way, not just by stopping the service (which closes the connection correctly and informs neutron about the fact immediately). For example, blocking qpidd with iptables is one such intrusive method.

When a qpidd node fails, we need to make sure that:
1. haproxy quickly detects the fact and removes the node from the list of active candidates for load balancing.
2. neutron (or any other openstack service using oslo.messaging qpid implementation) quickly detects the fact and tries to reconnect.

The 1st point is achieved by lowering the watchdog timeout in haproxy.conf:

backend qpid
    balance roundrobin
    server allinone 10.34.60.119:5672 check inter 2s
    server qpidd 10.34.61.102:5672 check inter 2s

The 2nd point is handled by the qpid_heartbeat value. Setting it to 2 secs makes the switch almost instant.

So I suggest testing with other heartbeat values on both the neutron qpid and haproxy sides and seeing whether it helps.
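
On the neutron side that is a single setting; a sketch of the relevant neutron.conf lines (the VIP hostname is a placeholder):

# /etc/neutron/neutron.conf, [DEFAULT] section
qpid_hostname = qpid-vip.example.com
qpid_heartbeat = 2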

Attaching my haproxy.conf for your reference.
Comment 20 Ihar Hrachyshka 2014-01-16 08:09:50 EST
Created attachment 851047 [details]
haproxy configuration which works ok for my testing
Comment 21 Ryan O'Hara 2014-01-16 10:04:58 EST
(In reply to Ihar Hrachyshka from comment #19)
> Initially, I didn't realize that you actually clustered qpidd instances. If
> so, then we look just at haproxy as a load balancer between two qpidd
> servers with synchronized state, and this states for completely valid case.
> 
> Indeed, I use 'tcp' haproxy mode to load balance between qpidd instances.
> 'http' mode indeed does not work.
> 
> Since in this case, there is only one qpid_hostname used by neutron,
> qpid_timeout may influence only the case when qpidd instance is killed in a
> hard way, not just by stopping the service (which closes connection
> correctly and informs neutron immediately about the fact). F.e. blocking
> qpidd by iptables stands for such intrusive way.
> 
> When a qpidd node fails, we need to make sure that:
> 1. haproxy quickly detects the fact and removes the node from the list of
> active candidates for load balancing.
> 2. neutron (or any other openstack service using oslo.messaging qpid
> implementation) quickly detects the fact and tries to reconnect.
> 
> The 1st point is achieves by lowering watchdog timeout in haproxy.conf:
> 
> backend qpid
>     balance roundrobin
>     server allinone 10.34.60.119:5672 check inter 2s
>     server qpidd 10.34.61.102:5672 check inter 2s
> 
> The 2nd point is handled by qpid_heartbeat value. Setting it to 2 secs makes
> switch almost instant.

Your health check is a simple TCP connect that occurs every 2 seconds and has a 1 sec timeout. It should retry 3 times before marking the backend server as unavailable. I'm not sure if the redispatch option has any effect.
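
For clarity, the same server lines with the implied check parameters spelled out would be (fall 3 / rise 2 are haproxy's defaults, shown only for illustration):

backend qpid
    balance roundrobin
    server allinone 10.34.60.119:5672 check inter 2s fall 3 rise 2
    server qpidd 10.34.61.102:5672 check inter 2s fall 3 rise 2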
Comment 22 Ryan O'Hara 2014-01-16 10:08:06 EST
(In reply to Ihar Hrachyshka from comment #17)

> I've successfully set two nodes of qpid accessed by neutron thru haproxy as
> following:
> - created allinone packstack installation on Node1;
> - installed haproxy and keepalived on Node2;
> - installed Qpid on Node3;
> - configured haproxy on Node2 to switch (round-robin) Qpid nodes and serve
> on VIP;
> - configured neutron on Node1 to use VIP;

I'm not sure I understand what you mean by "configured neutron on Node1 to use VIP". Can you explain?
Comment 23 Ihar Hrachyshka 2014-01-16 10:21:45 EST
(In reply to Ryan O'Hara from comment #22)
> I'm not sure I understand what you mean by "configured neutron on Node1 to
> use VIP". Can you explain?

Meaning, I've set qpid_hostname in neutron.conf to the haproxy VIP, which hides the qpidd instances behind it.
Comment 24 Ryan O'Hara 2014-01-16 10:44:45 EST
(In reply to Ihar Hrachyshka from comment #23)
> (In reply to Ryan O'Hara from comment #22)
> > I'm not sure I understand what you mean by "configured neutron on Node1 to
> > use VIP". Can you explain?
> 
> Meaning, I've set qpid_hostname in neutron.conf to use haproxy VIP which
> hides qpidd instances behind it.

OK. I was concerned that you had neutron bind to the VIP.
Comment 25 Javier Peña 2014-01-17 08:46:51 EST
This might be a long shot, but did you try changing the 'timeout server' and 'timeout client' options for the qpid entry in the HAProxy configuration? 

I saw a similar behaviour in a test setup, and the issue was caused by the fact that HAProxy was closing idle connections after 30 seconds, while the default QPID heartbeat interval in the OpenStack config files was 60 seconds. Then I was seeing frequent disconnections and reconnections, and fixed them by increasing the timeouts.
Comment 26 Fabio Massimo Di Nitto 2014-01-17 08:50:25 EST
The traceback you see is immediate at startup of the service, and if the problem is in haproxy, then it doesn't explain why glance and cinder (which are still not using oslo) work just fine.

Still worth testing, though.
Comment 27 Javier Peña 2014-01-17 10:25:32 EST
Actually, the last two messages in the traceback show a connection and then a disconnection 10 seconds later, which matches the timeout set in the haproxy configuration described in the etherpad. I think it might work.
Comment 28 Ryan O'Hara 2014-01-23 15:57:09 EST
I'd be interested to know if adding 'option tcpka' has any effect. It might be enough to do this only on the server side.
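
For reference, a server-side-only keepalive in the qpid backend would be a one-line change (a sketch; 'option srvtcpka' is the server-side variant of 'option tcpka'):

backend qpid
    mode tcp
    option srvtcpka
    server qpidd 10.34.61.102:5672 check inter 2s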
Comment 29 Ihar Hrachyshka 2014-01-27 06:08:58 EST
As was pointed out above, this is expected to be just a matter of haproxy configuration. QA should try reducing the haproxy failover timeouts to avoid the issues mentioned above. Closing the bug.
Comment 30 Fabio Massimo Di Nitto 2014-01-27 06:12:07 EST
Please do not close the bug until it's confirmed that it is actually a configuration problem.
Comment 31 Ryan O'Hara 2014-01-27 10:13:16 EST
(In reply to Ihar Hrachyshka from comment #29)
> As it was pointed out above, this is expected to be just a matter of haproxy
> configuration. QA should try reducing haproxy failover timeouts to avoid the
> issues mentioned above. Closing the bug.

These are connection timeouts, not failover timeouts. If a connection is inactive for a period of time, it will be closed.
Comment 32 Fabio Massimo Di Nitto 2014-01-29 03:09:03 EST
Notes for the doc team start here:

In the process of deploying RHEL OSP on top of RHEL-HA+RHEL-LB we found a series of issues that are timer-related.

Similar problems can be seen even without RHEL-HA, but let's try to keep it simple.

We need a paragraph somewhere in the documentation where we describe the setup of a LoadBalancer placed in between the different OSP services/APIs.

The issue boils down to "make sure that LoadBalancer timeouts are higher than service(s) heartbeat intervals".

Engineering is still working on determining what good timer values are for fast service recovery operations.

With the default settings, the LoadBalancer needs to allow at least a 60-second timeout on client and server connections.

The haproxy config (our specific RHEL-LB) would look like:

global
    daemon
defaults
    mode tcp
    maxconn 10000
    timeout connect 60s
    timeout client 60s
    timeout server 60s

Note that we still need to verify whether "connect 60s" is necessary or whether we can use smaller values.

We will keep this bugzilla updated as we complete our investigation.

Thanks
Fabio
Comment 33 Gordon Sim 2014-01-29 04:24:00 EST
(In reply to Fabio Massimo Di Nitto from comment #32)
> The issue boils down to "make sure that LoadBalancer timeouts are higher
> than service(s) heartbeat intervals".

And in the case of qpidd, the LoadBalancer timeout should be greater than *two times* the heartbeat interval set on the qpid client connections (as AMQP defines a timeout to be two missed heartbeats). E.g. if the LoadBalancer timeout is 60, the qpid heartbeat should be less than 30 seconds.
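
Put together with the settings from comment #32, a consistent pairing might look like this (the heartbeat value of 20 is just one example satisfying 2 x 20 < 60, not a tested recommendation):

# haproxy defaults (as in comment #32)
    timeout client 60s
    timeout server 60s

# /etc/neutron/neutron.conf
qpid_heartbeat = 20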
Comment 34 Ryan O'Hara 2014-01-29 09:37:55 EST
(In reply to Fabio Massimo Di Nitto from comment #32)
> haproxy (our specific RHEL-LB) would look like:
> 
> global
>     daemon
> defaults
>     mode tcp
>     maxconn 10000
>     timeout connect 60s
>     timeout client 60s
>     timeout server 60s
> 
> note that we still need to verify if "connect 60s" is necessary or we can
> use smaller values.

Use a much smaller value for 'timeout connect'; around 4-5s should do. It might also be worth mentioning that 'timeout client' and 'timeout server' should be equivalent.
Comment 35 Ryan O'Hara 2014-01-29 11:06:03 EST
Also, it is probably better to leave the default timeouts set to something reasonable for all proxies, and then set 'timeout client' and 'timeout server' sufficiently large for qpid in the qpid proxy definition itself. The 'timeout client' goes in the qpid frontend config and 'timeout server' in the qpid backend config.
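
A sketch of that layout (the 120s values are illustrative "sufficiently large" choices, and the bind address is a placeholder):

defaults
    mode tcp
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend qpid-in
    bind 192.0.2.10:5672
    timeout client 120s
    default_backend qpid

backend qpid
    timeout server 120s
    balance roundrobin
    server qpid1 10.34.60.119:5672 check inter 2s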
Comment 36 Martin Lopes 2014-02-04 18:27:21 EST
This bug is being assigned to Summer Long, who is now the designated docs specialist for Compute.
Comment 37 Fabio Massimo Di Nitto 2014-02-04 23:31:06 EST
Summer, I should be able to provide you with more details sometime toward the end of next week. We are in the process of testing those timers right now and tuning them.
Comment 38 Summer Long 2014-02-17 18:46:48 EST
We're past the freeze for A2, moving over to A3.
Comment 39 Summer Long 2014-03-02 20:11:50 EST
Fabio, you were going to provide more details? We're starting on A3. Thanks, Summer
Comment 40 Fabio Massimo Di Nitto 2014-03-02 23:47:50 EST
(In reply to Summer Long from comment #39)
> Fabio, you were going to provide more details? We're starting on A3. thanks,
> Summer

Summer, I would love to, but we are still investigating. We have found that many OpenStack components still don't manage timers properly (bugs in the code). Until I have all the fixes in the various packages, I am unable to do a full test and suggest values.
Comment 41 Summer Long 2014-03-12 21:27:24 EDT
Moving to 5. Too late to get info in for tomorrow's A3 freeze, and we're branching the ICG afterwards to Icehouse.
Comment 42 Summer Long 2014-04-07 22:37:59 EDT
Fabio, how are those tests going?
Comment 45 Summer Long 2014-05-05 03:09:38 EDT
Martin, now that HA/foreman has been ironed out, this might be ready (talk to Scott R.). Assigning to you as the new 'HA' guy.
Comment 55 Martin Lopes 2015-04-28 21:47:30 EDT
Thanks, closing bug.
Comment 56 Andrew Dahms 2015-05-12 18:17:02 EDT
Moving to Sprint 5 tracker.
