Bug 1060711

Summary: neutron qpid reconnection delay must be more accurate
Product: Red Hat OpenStack Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: openstack-neutron Assignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA QA Contact: yfried
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.0 CC: apevec, breeler, chrisw, ihrachys, lpeer, majopela, nyechiel, oblaut, yeylon
Target Milestone: z4 Keywords: OtherQA, Rebase, ZStream
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-2013.2.3-3.el6ost Doc Type: Bug Fix
Doc Text:
Cause: The Qpid reconnection delay was capped at 60 seconds. Consequence: In HA environments, if the Qpid service was not available at connection time, the client could take up to 60 seconds before trying to reconnect. Fix: Changed the default reconnection retry time to 5 seconds for 4.0. Result: If Qpid is unavailable at connection time, the client retries within 5 seconds.
Story Points: ---
Clone Of: 1060689 Environment:
Last Closed: 2014-05-29 20:19:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1080561    

Description Fabio Massimo Di Nitto 2014-02-03 12:46:24 UTC
+++ This bug was initially created as a clone of Bug #1060689 +++

The current loop is:

        delay = 1
        while True:
            # Close the session if necessary
            if self.connection.opened():
                try:
                    self.connection.close()
                except qpid_exceptions.ConnectionError:
                    pass

            broker = self.brokers[attempt % len(self.brokers)]
            attempt += 1

            try:
                self.connection_create(broker)
                self.connection.open()
            except qpid_exceptions.ConnectionError, e:
                msg_dict = dict(e=e, delay=delay)
                msg = _("Unable to connect to AMQP server: %(e)s. "
                        "Sleeping %(delay)s seconds") % msg_dict
                LOG.error(msg)
                time.sleep(delay)
                delay = min(2 * delay, 60)

This can lead to a waiting time of over 60 seconds if the qpid server is not immediately available at startup.

60 seconds is too long for HA environments, where timers need to be very aggressive to reduce downtime to the very minimum.

This is a blocker for HA deployments.
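
To illustrate the waiting time described above, here is a minimal sketch that reproduces only the delay arithmetic of the quoted loop (start at 1, double after each failure, cap at 60). The names `doubling_delays` and `time_before_attempt` are hypothetical helpers for this illustration and are not part of neutron or oslo-rpc; the time spent inside each connection attempt is ignored.

```python
from itertools import islice


def doubling_delays(start=1, cap=60):
    """Yield the sleep durations of the quoted loop: delay = min(2 * delay, cap)."""
    delay = start
    while True:
        yield delay
        delay = min(2 * delay, cap)


def time_before_attempt(n, delays):
    """Total seconds slept before the n-th connection attempt (n starts at 1)."""
    return sum(islice(delays, n - 1))


# First eight sleep durations: [1, 2, 4, 8, 16, 32, 60, 60]
schedule = list(islice(doubling_delays(), 8))
print(schedule)

# Seconds slept before the 7th attempt: 1 + 2 + 4 + 8 + 16 + 32 = 63
print(time_before_attempt(7, doubling_delays()))
```

Under these assumptions, a broker that comes back shortly after the sixth failed attempt is not retried until 63 seconds after the first failure, and every later retry is a full 60 seconds apart.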

Comment 1 Ihar Hrachyshka 2014-02-03 13:07:31 UTC
Note: this patch, if implemented, won't go upstream, since the oslo-rpc code that we use accepts bug fixes only and this patch would be considered too 'featurey'. So we would need to carry our own downstream patch for each service that uses oslo-rpc if we want to see this in the current release.

Comment 4 yfried 2014-04-22 09:12:23 UTC
RHOS 4.0 on RHEL6.5

python-neutron-2013.2.3-4.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-4.el6ost.noarch
openstack-neutron-2013.2.3-4.el6ost.noarch

killed qpid service and checked log:
2014-04-22 12:07:21.221 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-04-22 12:07:22.222 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-04-22 12:07:22.438 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 3 seconds
2014-04-22 12:07:22.439 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 3 seconds
2014-04-22 12:07:24.223 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 3 seconds
2014-04-22 12:07:25.439 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds
2014-04-22 12:07:25.440 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds
2014-04-22 12:07:27.224 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds
2014-04-22 12:07:29.440 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 5 seconds
2014-04-22 12:07:29.441 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 5 seconds
2014-04-22 12:07:31.226 20972 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 5 seconds
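
The log above appears to interleave several reconnection sequences from the same process. The entries at 12:07:21.221, 22.222, 24.223, 27.224 and 31.226 report sleeps of 1, 2, 3, 4 and 5 seconds, and the gaps between them match those sleeps. That is consistent with a delay that grows by one second per failure up to a 5 second cap. The sketch below models that schedule only; it is an assumption inferred from the log, not the actual downstream patch, and `linear_delays` is a hypothetical name.

```python
from itertools import islice


def linear_delays(start=1, step=1, cap=5):
    """Yield sleep durations that grow by `step` per failure, capped at `cap`.

    Assumed schedule, inferred from the verification log; the real patch
    may be implemented differently.
    """
    delay = start
    while True:
        yield delay
        delay = min(delay + step, cap)


# First seven sleep durations: [1, 2, 3, 4, 5, 5, 5]
observed = list(islice(linear_delays(), 7))
print(observed)
```

With such a schedule, the longest gap between two connection attempts is 5 seconds instead of 60.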

Comment 6 errata-xmlrpc 2014-05-29 20:19:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0516.html