Bug 1094781
| Summary: | Federated link ends up in Connecting state forever after connecting to shutting down broker | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Pavel Moravec <pmoravec> | |
| Component: | qpid-cpp | Assignee: | Gordon Sim <gsim> | |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 2.3 | CC: | esammons, freznice, gsim, iboverma, jross, lzhaldyb, mcressma | |
| Target Milestone: | 2.5.1 | |||
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | qpid-cpp-0.18-23 | Doc Type: | Bug Fix | |
| Doc Text: |
Previously, a gap in time existed between a socket opening and the mechanism for communicating about the connection starting. If the socket was disconnected within this gap there was no way to signal this event back to the federation code, which initiated the socket open request. This resulted in the socket remaining in the connected state. The connection communication mechanism is now started as soon as the socket connection is requested, which allows the federation code to be aware of any socket state changes.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1097559 (view as bug list) | Environment: | ||
| Last Closed: | 2014-06-30 10:31:05 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1097559 | |||
Appears also in 0.22-36 and in today's upstream. Possible workaround in deleting the link does *not* work - the link remains in the same bad shape even after "link del". Commit 1594277 from [~gsim] in branch 'qpid/trunk' The issue has been fully fixed, tested on RHEL 5.9 i/x, 6.4 i/x, 7.0 x on 2.5.1 candidate packages: python-qpid-0.18-12.el5_10 python-qpid-qmf-0.18-24.el5_10 qpid-cpp-client-0.18-25.el5_10 qpid-cpp-client-devel-0.18-25.el5_10 qpid-cpp-client-devel-docs-0.18-25.el5_10 qpid-cpp-client-rdma-0.18-25.el5_10 qpid-cpp-client-ssl-0.18-25.el5_10 qpid-cpp-mrg-debuginfo-0.18-25.el5_10 qpid-cpp-server-0.18-25.el5_10 qpid-cpp-server-cluster-0.18-25.el5_10 qpid-cpp-server-devel-0.18-25.el5_10 qpid-cpp-server-ha-0.18-25.el5_10 qpid-cpp-server-rdma-0.18-25.el5_10 qpid-cpp-server-ssl-0.18-25.el5_10 qpid-cpp-server-store-0.18-25.el5_10 qpid-cpp-server-xml-0.18-25.el5_10 qpid-java-client-0.18-8.el5_9 qpid-java-common-0.18-8.el5_9 qpid-java-example-0.18-8.el5_9 qpid-jca-0.18-8.el5 qpid-jca-xarecovery-0.18-8.el5 qpid-jca-zip-0.18-8.el5 qpid-qmf-0.18-24.el5_10 qpid-qmf-debuginfo-0.18-24.el5_10 qpid-qmf-devel-0.18-24.el5_10 qpid-tests-0.18-2.el5 qpid-tools-0.18-10.el5_9 rh-qpid-cpp-tests-0.18-25.el5_10 ruby-qpid-qmf-0.18-24.el5_10 python-qpid-0.18-12.el6.noarch python-qpid-qmf-0.18-24.el6.i686 qpid-cpp-client-0.18-25.el6.i686 qpid-cpp-client-devel-0.18-25.el6.i686 qpid-cpp-client-devel-docs-0.18-25.el6.noarch qpid-cpp-client-rdma-0.18-25.el6.i686 qpid-cpp-client-ssl-0.18-25.el6.i686 qpid-cpp-debuginfo-0.18-25.el6.i686 qpid-cpp-server-0.18-25.el6.i686 qpid-cpp-server-cluster-0.18-25.el6.i686 qpid-cpp-server-devel-0.18-25.el6.i686 qpid-cpp-server-ha-0.18-25.el6.i686 qpid-cpp-server-rdma-0.18-25.el6.i686 qpid-cpp-server-ssl-0.18-25.el6.i686 qpid-cpp-server-store-0.18-25.el6.i686 qpid-cpp-server-xml-0.18-25.el6.i686 qpid-java-client-0.18-8.el6_4.noarch qpid-java-common-0.18-8.el6_4.noarch qpid-java-example-0.18-8.el6_4.noarch qpid-jca-0.18-8.el6.noarch qpid-jca-xarecovery-0.18-8.el6.noarch qpid-jca-zip-0.18-8.el6.noarch qpid-qmf-0.18-24.el6.i686 qpid-qmf-debuginfo-0.18-24.el6.i686 qpid-qmf-devel-0.18-24.el6.i686 qpid-tests-0.18-2.el6.noarch qpid-tools-0.18-10.el6_4.noarch rh-qpid-cpp-tests-0.18-25.el6.i686 ruby-qpid-qmf-0.18-24.el6.i686 python-qpid-0.18-12.el7.noarch python-qpid-qmf-0.18-24.el7.x86_64 qpid-cpp-client-0.18-25.el7.x86_64 qpid-cpp-client-devel-0.18-25.el7.x86_64 qpid-cpp-client-devel-docs-0.18-25.el7.noarch qpid-cpp-client-rdma-0.18-25.el7.x86_64 qpid-cpp-client-ssl-0.18-25.el7.x86_64 qpid-cpp-debuginfo-0.18-25.el7.x86_64 qpid-cpp-server-0.18-25.el7.x86_64 qpid-cpp-server-cluster-0.18-25.el7.x86_64 qpid-cpp-server-devel-0.18-25.el7.x86_64 qpid-cpp-server-ha-0.18-25.el7.x86_64 qpid-cpp-server-rdma-0.18-25.el7.x86_64 qpid-cpp-server-ssl-0.18-25.el7.x86_64 qpid-cpp-server-store-0.18-25.el7.x86_64 qpid-qmf-0.18-24.el7.x86_64 qpid-qmf-debuginfo-0.18-24.el7.x86_64 qpid-qmf-devel-0.18-24.el7.x86_64 qpid-tools-0.18-10.el7.noarch rh-qpid-cpp-tests-0.18-25.el7.x86_64 All symptoms are gone: * qpidd federation route not staying in Connecting|Closed by peer state now residing mainly in Waiting|Connection refused or Waiting|Closed by peer state * qpidd starts trying to reconnect forever * when modified 'server' with wait at the end qpid-route finally shows state toggling as expected (Waiting <-> Operational) -> VERIFIED Note: Reproduction was pretty easy on RHEL5, but on RHEL6 the time gap is effectively much shorter, there it is necessary to enable logging to slow qpidd a bit down (log-enable=debug+). All tested on KVM VMs with 2 cores. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0804.html |
Description of problem: Having federation link with source broker S and destination broker D (such that TCP connection is initiated by D and messages flow from S to D), if the link is attempting to reconnect to S while S is just shutting down, there is a probability the link will stay in Connecting state forever. Version-Release number of selected component (if applicable): 0.18-11, 0.18-14, 0.18-20 How reproducible: 100% after some time Steps to Reproduce: 1. Mimic broker S by simple python program: import socket import sys # Create a TCP/IP socket sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # Bind the socket to the port server_address = ('localhost', 10000) print >>sys.stderr, 'starting up on %s port %s' % server_address sock.bind(server_address) # Listen for incoming connections sock.listen(1) # Wait for a connection print >>sys.stderr, 'waiting for a connection' connection, client_address = sock.accept() 2. In one terminal, run it in a loop: while true; do python server.py; done 2a. rather for observation: run tcpdump on port 10000 3. In another terminal, create federation link to this "server": qpid-route link add localhost:5672 localhost:10000 4. Wait few seconds and generate whatever traffic to the broker to make it busy, i.e.: qpid-send -a amq.fanout -m 1000000 --content-size=1000 5. Check tcpdump when it stops logging new traffic and execute how many times you wish: qpid-route link list Actual results: Everytime and forever, the link status will be Connecting like: Host Port Transport Durable State Last Error ============================================================================= localhost 10000 tcp N Connecting Closed by peer (expected observation is that python "server" cant bind to port 10000 due to "Address already in use" for some time - that is expected as previous TCP connection is in some FIN_WAIT-like state so far; but even if the "server" can bind to the port after a while, the broker does not attempt to reconnect) Expected results: Link status flapps between Waiting and Connecting, until the server is ready again and the link is Operational (wont happen in this scenario due to the "server.py" implementation) Additional info: The key is, the qpid broker can't send initial "AMQP 0-10" frame to the peer. I.e. the bug appears if and only if: - TCP connection is fully established (3way handshake) such that qpid::broker::connect method returns success - but closed so fast such that Link::established is not invoked / broker does not react on the connection establishment That is why it helps / speedups reproducer to put the broker under load.