Bug 526299

Summary: the clustered broker seems to sometimes not send a close-ok before shutting down the socket
Product: Red Hat Enterprise MRG
Reporter: Rafael H. Schloming <rafaels>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Priority: low
Version: 1.2
CC: aconway, cctrieloff, tross
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Previously, the Python client reported a connection aborted error when a socket was closed prior to receiving a 'ConnectionCloseOk' control. This happened intermittently when run against a clustered broker. With this update, the 'ConnectionCloseOk' control is always sent before the socket is closed.
Last Closed: 2010-10-14 16:15:50 UTC

Description Rafael H. Schloming 2009-09-29 17:16:13 UTC
Description of problem:

The Python client reports a connection aborted error if the socket is closed prior to receiving a ConnectionCloseOk control. This happens intermittently when run against a clustered broker. Running the client tests in a loop and capturing logging confirms that the broker sometimes omits the ConnectionCloseOk control from the normal shutdown dialog.
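For illustration, the client's view of the shutdown dialog can be sketched as a small classifier: after sending ConnectionClose, the client reads controls until the socket closes, and the shutdown is only clean if ConnectionCloseOk arrived first. This is a hypothetical sketch using the control names from the logs below, not the actual qpid.messaging implementation.

```python
def classify_close(received_controls):
    """Classify a connection shutdown from the client's point of view.

    The client sends ConnectionClose and then reads until the peer
    closes the socket. If a ConnectionCloseOk was received first, the
    shutdown was clean; otherwise the client must report the
    connection as aborted.
    """
    for control in received_controls:
        if control == "ConnectionCloseOk":
            return "clean"
    return "aborted"

# A well-behaved broker answers the close before dropping the socket:
assert classify_close(["SessionDetached", "ConnectionCloseOk"]) == "clean"
# The buggy clustered broker sometimes closes the socket first:
assert classify_close(["SessionDetached"]) == "aborted"
```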

Version-Release number of selected component (if applicable):

Qpid trunk.

How reproducible:

It's intermittent on any single test run, but running the tests in a loop will reproduce the problem before too long.

Steps to Reproduce:
1. Start a clustered broker with logging enabled.
2. Run the following loop from the python directory:

# run until a failure occurs
while ./qpid-python-test qpid.tests.* -Ddelay=0.1 -v DEBUG -H > /tmp/out.txt; do true; done

3. Examine /tmp/out.txt for the client logs.
4. Examine the broker logs.
  
Actual results:

The end of /tmp/out.txt should look something like this:

  2009-09-29 12:54:06,444 DEBUG SENT: SessionAttach(name='c1165341-a509-48e2-b150-aa8f1fb4e9a9:0')
  2009-09-29 12:54:06,444 DEBUG SENT: SessionCommandPoint(command_id=serial(0), command_offset=0)
  2009-09-29 12:54:06,444 DEBUG SENT: SessionDetach(name='c1165341-a509-48e2-b150-aa8f1fb4e9a9:0')
  2009-09-29 12:54:06,445 DEBUG SENT: ConnectionClose(reply_code=200)
  2009-09-29 12:54:06,449 DEBUG SENT: '\x0f\x00\x008\x00\x00\x00\x00\x00\x00\x00\x00\x02\x01\x01\x00\x00&c1165341-a509-\
48e2-b150-aa8f1fb4e9a9:0\x0f\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x02\x07\x03\x00\x00\x00\x00\x00\x00\x00\x00\x0\
0\x00\x00\x00\x00\x0f\x00\x008\x00\x00\x00\x00\x00\x00\x00\x00\x02\x03\x01\x00\x00&c1165341-a509-48e2-b150-aa8f1fb4e9a9\
:0\x0f\x00\x00\x12\x00\x00\x00\x00\x00\x00\x00\x00\x01\x0b\x01\x00\x00\xc8'
  2009-09-29 12:54:06,451 DEBUG READ: '\x0f\x00\x008\x00\x00\x00\x00\x00\x00\x00\x00\x02\x02\x01\x00\x00&c1165341-a509-\
48e2-b150-aa8f1fb4e9a9:0\x0f\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x02\x07\x03\x00\x00\x00\x00\x00\x00\x00\x00\x0\
0\x00\x00\x00\x00\x0f\x00\x009\x00\x00\x00\x00\x00\x00\x00\x00\x02\x04\x03\x00\x00&c1165341-a509-48e2-b150-aa8f1fb4e9a9\
:0\x00'
  2009-09-29 12:54:06,452 DEBUG RCVD: SessionAttached(name='c1165341-a509-48e2-b150-aa8f1fb4e9a9:0')
  2009-09-29 12:54:06,452 DEBUG RCVD: SessionCommandPoint(command_id=serial(0), command_offset=0)
  2009-09-29 12:54:06,452 DEBUG RCVD: SessionDetached(name='c1165341-a509-48e2-b150-aa8f1fb4e9a9:0', code=0)
  2009-09-29 12:54:06,454 DEBUG ABORTED: <socket._socketobject object at 0x9308fb4>
qpid.tests.messaging.SessionTests.testStop .............................. fail
Error during teardown:
  Traceback (most recent call last):
    File "./qpid-python-test", line 301, in run
      phase()
    File "/home/rhs/qpid/python/qpid/tests/messaging.py", line 57, in teardown
      self.conn.close()
    File "<string>", line 6, in close
    File "/home/rhs/qpid/python/qpid/messaging.py", line 228, in close
      self.disconnect()
    File "<string>", line 6, in disconnect
    File "/home/rhs/qpid/python/qpid/messaging.py", line 194, in disconnect
      self._ewait(lambda: not self._driver._connected)
    File "/home/rhs/qpid/python/qpid/messaging.py", line 144, in _ewait
      self._check_error(exc)
    File "/home/rhs/qpid/python/qpid/messaging.py", line 140, in _check_error
      raise exc(*self.error)
  ConnectionError: connection aborted

The exact test that fails may vary, since the failure can occur during any connection close.

The broker logs should look something like this:

2009-09-29 12:54:06 trace SENT [127.0.0.1:43395]: Frame[BEbe; channel=0; {SessionAttachedBody: name=c1165341-a509-48e2-b150-aa8f1fb4e9a9:0; }]
2009-09-29 12:54:06 trace SENT [127.0.0.1:43395]: Frame[BEbe; channel=0; {SessionCommandPointBody: command-id=0; command-offset=0; }]
2009-09-29 12:54:06 trace SENT [127.0.0.1:43395]: Frame[BEbe; channel=0; {SessionDetachedBody: name=c1165341-a509-48e2-b150-aa8f1fb4e9a9:0; code=0; }]
2009-09-29 12:54:06 debug cluster(192.168.1.101:12528 READY) local close of replicated connection 192.168.1.101:12528-90(local)
2009-09-29 12:54:06 trace MCAST 192.168.1.101:12528-90: {ClusterConnectionDeliverCloseBody: }
2009-09-29 12:54:06 trace cluster(192.168.1.101:12528 READY) DLVR: Event[192.168.1.101:12528-90 Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=1; }]]
2009-09-29 12:54:06 trace cluster(192.168.1.101:12528 READY) DLVR 7453:  Frame[BEbe; channel=0; {ClusterConnectionDeliverDoOutputBody: limit=1; }] control 192.168.1.101:12528-90
2009-09-29 12:54:06 trace cluster(192.168.1.101:12528 READY) DLVR: Event[192.168.1.101:12528-90 Frame[BEbe; channel=0; {ClusterConnectionDeliverCloseBody: }]]
2009-09-29 12:54:06 trace cluster(192.168.1.101:12528 READY) DLVR 7454:  Frame[BEbe; channel=0; {ClusterConnectionDeliverCloseBody: }] control 192.168.1.101:12528-90
2009-09-29 12:54:06 info cluster(192.168.1.101:12528 READY) connection closed 192.168.1.101:12528-90
2009-09-29 12:54:06 debug cluster(192.168.1.101:12528 READY) deleted connection: 192.168.1.101:12528-90(local)

Expected results:

The loop should be able to run indefinitely without failure.

Additional info:

On one run I saw the SessionDetached control omitted as well.

Comment 1 Alan Conway 2009-10-01 20:49:48 UTC
Fixed in r820783. 

There was a race where the broker could close the socket before flushing
all the frames in the write queue.
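The race described above can be illustrated with a toy model (not the actual qpid-cpp I/O code): if close() tears down the socket while frames such as ConnectionCloseOk are still sitting in the write queue, those frames are silently dropped. The fix corresponds to draining the queue before shutting the socket down.

```python
class Connection:
    """Toy model of a broker-side connection with a write queue."""

    def __init__(self):
        self.write_queue = []   # frames queued but not yet on the wire
        self.sent = []          # frames actually written to the socket
        self.closed = False

    def queue_frame(self, frame):
        self.write_queue.append(frame)

    def flush(self):
        # Drain every queued frame onto the wire.
        while self.write_queue:
            self.sent.append(self.write_queue.pop(0))

    def close(self):
        # The fix: flush pending frames first, then close the socket.
        # The buggy path closed the socket without this flush, so a
        # queued ConnectionCloseOk could be lost.
        self.flush()
        self.closed = True

conn = Connection()
conn.queue_frame("SessionDetached")
conn.queue_frame("ConnectionCloseOk")
conn.close()
assert conn.sent == ["SessionDetached", "ConnectionCloseOk"]
```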

Comment 3 Martin Prpič 2010-10-08 12:13:55 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the python client reported a connection aborted error when a socket was closed prior to receiving a 'ConnectionCloseOk' control. This happened intermittently when run against a clustered broker. With this update, the 'ConnectionCloseOk' control is always send before the shutdown of a socket.

Comment 4 Douglas Silas 2010-10-11 09:49:18 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the python client reported a connection aborted error when a socket was closed prior to receiving a 'ConnectionCloseOk' control. This happened intermittently when run against a clustered broker. With this update, the 'ConnectionCloseOk' control is always send before the shutdown of a socket.+Previously, the Python client reported a connection aborted error when a socket was closed prior to receiving a 'ConnectionCloseOk' control. This happened intermittently when run against a clustered broker. With this update, the 'ConnectionCloseOk' control is always sent before the socket is closed.

Comment 6 errata-xmlrpc 2010-10-14 16:15:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html