Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1094781

Summary: Federated link ends up in Connecting state forever after connecting to shutting down broker
Product: Red Hat Enterprise MRG
Reporter: Pavel Moravec <pmoravec>
Component: qpid-cpp
Assignee: Gordon Sim <gsim>
Status: CLOSED ERRATA
QA Contact: Frantisek Reznicek <freznice>
Severity: high
Docs Contact:
Priority: high
Version: 2.3
CC: esammons, freznice, gsim, iboverma, jross, lzhaldyb, mcressma
Target Milestone: 2.5.1
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: qpid-cpp-0.18-23
Doc Type: Bug Fix
Doc Text:
Previously, a gap in time existed between a socket opening and the mechanism for communicating about the connection starting. If the socket was disconnected within this gap there was no way to signal this event back to the federation code, which initiated the socket open request. This resulted in the socket remaining in the connected state. The connection communication mechanism is now started as soon as the socket connection is requested, which allows the federation code to be aware of any socket state changes.
Story Points: ---
Clone Of:
: 1097559 (view as bug list)
Environment:
Last Closed: 2014-06-30 10:31:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1097559    

Description Pavel Moravec 2014-05-06 13:30:16 UTC
Description of problem:
With a federation link between source broker S and destination broker D (such that the TCP connection is initiated by D and messages flow from S to D), if the link attempts to reconnect to S while S is just shutting down, there is a chance that the link will stay in the Connecting state forever.


Version-Release number of selected component (if applicable):
0.18-11, 0.18-14, 0.18-20


How reproducible:
100% after some time


Steps to Reproduce:
1. Mimic broker S by simple python program:

import socket
import sys

# Create a TCP/IP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Bind the socket to the port
server_address = ('localhost', 10000)
print('starting up on %s port %s' % server_address, file=sys.stderr)
sock.bind(server_address)

# Listen for incoming connections
sock.listen(1)

# Wait for a connection; the script exits (closing the socket) as soon
# as one is accepted, mimicking a broker that dies mid-connect
print('waiting for a connection', file=sys.stderr)
connection, client_address = sock.accept()

2. In one terminal, run it in a loop:
while true; do python server.py; done

2a. rather for observation: run tcpdump on port 10000

3. In another terminal, create federation link to this "server":
qpid-route link add localhost:5672 localhost:10000

4. Wait a few seconds and generate some traffic to the broker to make it busy, e.g.:
qpid-send -a amq.fanout -m 1000000 --content-size=1000

5. When tcpdump stops logging new traffic, check the link status; execute as many times as you wish:
qpid-route link list

Actual results:
Every time and forever, the link status will be Connecting, like:

Host            Port    Transport Durable  State             Last Error
=============================================================================
localhost       10000   tcp          N     Connecting        Closed by peer

(the expected observation is that the python "server" can't bind to port 10000 due to "Address already in use" for some time - that is expected, as the previous TCP connection is still in some FIN_WAIT-like state; but even once the "server" can bind to the port again, the broker does not attempt to reconnect)
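As an editorial aside, not part of the original reproducer: the transient "Address already in use" failure can be avoided by setting the SO_REUSEADDR socket option before bind(), which lets the server rebind while the previous connection still lingers in a FIN_WAIT/TIME_WAIT-like state. A minimal sketch (binding to an ephemeral port here rather than the reproducer's port 10000):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow rebinding the address while an old connection is still in
# TIME_WAIT/FIN_WAIT; without this, bind() fails with EADDRINUSE
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
reuse = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
sock.bind(('localhost', 0))  # the reproducer uses ('localhost', 10000)
sock.listen(1)
sock.close()
print('SO_REUSEADDR set:', reuse != 0)
```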


Expected results:
The link status flaps between Waiting and Connecting until the server is ready again and the link becomes Operational (the Operational state won't be reached in this scenario due to the "server.py" implementation).


Additional info:
The key is that the qpid broker can't send the initial "AMQP 0-10" frame to the peer. I.e. the bug appears if and only if:
- the TCP connection is fully established (the 3-way handshake completes), so the qpid::broker::connect method returns success
- but it is closed so fast that Link::established is not invoked / the broker does not react to the connection establishment

That is why putting the broker under load helps / speeds up the reproducer.
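The race can be demonstrated outside qpid with plain sockets (an illustrative sketch, not qpid code): connect() succeeds because the 3-way handshake completes, but by the time the client would send its initial protocol header the peer has already closed, and recv() reports end-of-stream:

```python
import socket
import threading

def flaky_server(port_holder, ready):
    """Accept one connection and close it immediately, like the dying broker S."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(('localhost', 0))              # ephemeral port for the demo
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    conn.close()                            # close before any protocol exchange
    srv.close()

port_holder, ready = [], threading.Event()
t = threading.Thread(target=flaky_server, args=(port_holder, ready))
t.start()
ready.wait()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(('localhost', port_holder[0]))  # succeeds: 3-way handshake completes
data = cli.recv(1024)                       # b'' -> peer already closed the socket
print('connected, then peer closed:', data == b'')
cli.close()
t.join()
```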

Comment 1 Pavel Moravec 2014-05-06 13:44:56 UTC
Appears also in 0.22-36 and in today's upstream.

The possible workaround of deleting the link does *not* work - the link remains in the same bad shape even after "link del".

Comment 3 Pavel Moravec 2014-05-14 05:48:48 UTC
Commit 1594277 from [~gsim] in branch 'qpid/trunk'

Comment 8 Frantisek Reznicek 2014-06-26 12:15:35 UTC
The issue has been fully fixed, tested on RHEL 5.9 i/x, 6.4 i/x, 7.0 x on 2.5.1 candidate packages:

python-qpid-0.18-12.el5_10
python-qpid-qmf-0.18-24.el5_10
qpid-cpp-client-0.18-25.el5_10
qpid-cpp-client-devel-0.18-25.el5_10
qpid-cpp-client-devel-docs-0.18-25.el5_10
qpid-cpp-client-rdma-0.18-25.el5_10
qpid-cpp-client-ssl-0.18-25.el5_10
qpid-cpp-mrg-debuginfo-0.18-25.el5_10
qpid-cpp-server-0.18-25.el5_10
qpid-cpp-server-cluster-0.18-25.el5_10
qpid-cpp-server-devel-0.18-25.el5_10
qpid-cpp-server-ha-0.18-25.el5_10
qpid-cpp-server-rdma-0.18-25.el5_10
qpid-cpp-server-ssl-0.18-25.el5_10
qpid-cpp-server-store-0.18-25.el5_10
qpid-cpp-server-xml-0.18-25.el5_10
qpid-java-client-0.18-8.el5_9
qpid-java-common-0.18-8.el5_9
qpid-java-example-0.18-8.el5_9
qpid-jca-0.18-8.el5
qpid-jca-xarecovery-0.18-8.el5
qpid-jca-zip-0.18-8.el5
qpid-qmf-0.18-24.el5_10
qpid-qmf-debuginfo-0.18-24.el5_10
qpid-qmf-devel-0.18-24.el5_10
qpid-tests-0.18-2.el5
qpid-tools-0.18-10.el5_9
rh-qpid-cpp-tests-0.18-25.el5_10
ruby-qpid-qmf-0.18-24.el5_10

python-qpid-0.18-12.el6.noarch
python-qpid-qmf-0.18-24.el6.i686
qpid-cpp-client-0.18-25.el6.i686
qpid-cpp-client-devel-0.18-25.el6.i686
qpid-cpp-client-devel-docs-0.18-25.el6.noarch
qpid-cpp-client-rdma-0.18-25.el6.i686
qpid-cpp-client-ssl-0.18-25.el6.i686
qpid-cpp-debuginfo-0.18-25.el6.i686
qpid-cpp-server-0.18-25.el6.i686
qpid-cpp-server-cluster-0.18-25.el6.i686
qpid-cpp-server-devel-0.18-25.el6.i686
qpid-cpp-server-ha-0.18-25.el6.i686
qpid-cpp-server-rdma-0.18-25.el6.i686
qpid-cpp-server-ssl-0.18-25.el6.i686
qpid-cpp-server-store-0.18-25.el6.i686
qpid-cpp-server-xml-0.18-25.el6.i686
qpid-java-client-0.18-8.el6_4.noarch
qpid-java-common-0.18-8.el6_4.noarch
qpid-java-example-0.18-8.el6_4.noarch
qpid-jca-0.18-8.el6.noarch
qpid-jca-xarecovery-0.18-8.el6.noarch
qpid-jca-zip-0.18-8.el6.noarch
qpid-qmf-0.18-24.el6.i686
qpid-qmf-debuginfo-0.18-24.el6.i686
qpid-qmf-devel-0.18-24.el6.i686
qpid-tests-0.18-2.el6.noarch
qpid-tools-0.18-10.el6_4.noarch
rh-qpid-cpp-tests-0.18-25.el6.i686
ruby-qpid-qmf-0.18-24.el6.i686


python-qpid-0.18-12.el7.noarch
python-qpid-qmf-0.18-24.el7.x86_64
qpid-cpp-client-0.18-25.el7.x86_64
qpid-cpp-client-devel-0.18-25.el7.x86_64
qpid-cpp-client-devel-docs-0.18-25.el7.noarch
qpid-cpp-client-rdma-0.18-25.el7.x86_64
qpid-cpp-client-ssl-0.18-25.el7.x86_64
qpid-cpp-debuginfo-0.18-25.el7.x86_64
qpid-cpp-server-0.18-25.el7.x86_64
qpid-cpp-server-cluster-0.18-25.el7.x86_64
qpid-cpp-server-devel-0.18-25.el7.x86_64
qpid-cpp-server-ha-0.18-25.el7.x86_64
qpid-cpp-server-rdma-0.18-25.el7.x86_64
qpid-cpp-server-ssl-0.18-25.el7.x86_64
qpid-cpp-server-store-0.18-25.el7.x86_64
qpid-qmf-0.18-24.el7.x86_64
qpid-qmf-debuginfo-0.18-24.el7.x86_64
qpid-qmf-devel-0.18-24.el7.x86_64
qpid-tools-0.18-10.el7.noarch
rh-qpid-cpp-tests-0.18-25.el7.x86_64


All symptoms are gone:
 * the qpidd federation route no longer stays in the Connecting|Closed by peer state;
   it now resides mainly in the Waiting|Connection refused or Waiting|Closed by peer state
 * qpidd keeps trying to reconnect indefinitely
 * with a modified 'server' that waits at the end, qpid-route finally shows the state toggling as expected (Waiting <-> Operational)
 

-> VERIFIED

Note: Reproduction was pretty easy on RHEL 5, but on RHEL 6 the time gap is effectively much shorter; there it is necessary to enable logging to slow qpidd down a bit (log-enable=debug+). All tested on KVM VMs with 2 cores.

Comment 10 errata-xmlrpc 2014-06-30 10:31:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0804.html