Bug 852067

Summary: Qpid python client unable to receive last message from the queue

Product: Red Hat Enterprise MRG
Component: python-qpid
Version: Development
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: 3.1
Hardware: Unspecified
OS: Unspecified
Keywords: Patch, TestCaseProvided
Fixed In Version: python-qpid-0.30-2
Doc Type: Bug Fix
Type: Bug

Reporter: Frantisek Reznicek <freznice>
Assignee: Pavel Moravec <pmoravec>
QA Contact: Frantisek Reznicek <freznice>
CC: esammons, gsim, iboverma, jross, pmoravec, tross, trusnak

Last Closed: 2015-04-14 13:46:15 UTC

Attachments: Patch proposal

Description Frantisek Reznicek 2012-08-27 13:45:32 UTC
Description of problem:

The Qpid python client is unable to receive the last message from the queue.

Testing scenario:
- multi node scenario (machines A,B,C,D)
- broker running on A only
- all nodes run a sync test which sends and finally receives messages from queues.

Repeating the test multiple times leads to a situation where one of the nodes (let's say C, RHEL 6.x) is not able to read the last message from its queue - seen as the test stopping and waiting for the sync message.

At this moment qpid-stat -q proves the message is there, ready to fetch:
- While the A-D sync tests are running, no one is able to get that message.
- Once the A-D sync tests are stopped, any client can get that message.

This behavior is NOT expected. Reproduced twice on different machines (all virtualized).
In all cases the sync test instance which was not able to receive the last message was RHEL 6.3 i686/x86_64 (A, B, C, D were RHEL 5.8 i386/x86_64 and RHEL 6.3 i686/x86_64).
See below for sync test and transcript.


Version-Release number of selected component (if applicable):
# rpm -qa | grep qpid | sort
python-qpid-0.14-10.el6_3.noarch
python-qpid-qmf-0.14-13.el6_3.i686
qpid-cpp-client-0.14-20.el6_3.i686
qpid-cpp-client-devel-0.14-20.el6_3.i686
qpid-cpp-client-devel-docs-0.14-20.el6_3.noarch
qpid-cpp-client-rdma-0.14-20.el6_3.i686
qpid-cpp-client-ssl-0.14-20.el6_3.i686
qpid-cpp-debuginfo-0.14-20.el6_3.i686
qpid-cpp-server-0.14-20.el6_3.i686
qpid-cpp-server-cluster-0.14-20.el6_3.i686
qpid-cpp-server-devel-0.14-20.el6_3.i686
qpid-cpp-server-rdma-0.14-20.el6_3.i686
qpid-cpp-server-ssl-0.14-20.el6_3.i686
qpid-cpp-server-store-0.14-20.el6_3.i686
qpid-cpp-server-xml-0.14-20.el6_3.i686
qpid-java-client-0.18-1.el6.noarch
qpid-java-common-0.18-1.el6.noarch
qpid-java-example-0.18-1.el6.noarch
qpid-jca-0.18-1.el6.noarch
qpid-jca-xarecovery-0.18-1.el6.noarch
qpid-jca-zip-0.18-1.el6.noarch
qpid-qmf-0.14-13.el6_3.i686
qpid-qmf-debuginfo-0.14-13.el6_3.i686
qpid-qmf-devel-0.14-13.el6_3.i686
qpid-tests-0.14-1.el6_2.noarch
qpid-tools-0.14-5.el6_3.noarch
rh-qpid-cpp-tests-0.14-20.el6_3.i686
ruby-qpid-0.7.946106-2.el6.i686
ruby-qpid-qmf-0.14-13.el6_3.i686


How reproducible:
100%

Steps to Reproduce:
0. extract the package
1. start broker on node A
2. on A, B, C, D run: PYTHONPATH=$(pwd)/mrg_grid_test_infrastructure python ./sync2.py <host-A> <host-A> <host-B> <host-C> <host-D>
3. wait for the hang, then run qpid-stat -q on node A
4. try to get the message using any drain client (no success)
5. qpid-stat -q on node A shows the message still there
6. stop all sync tests launched in step 2
7. try to get the message using any drain client (success)
8. qpid-stat -q on node A shows all messages received
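
For reference, a minimal drain-style receive in python-qpid looks roughly like the sketch below (illustration only; the broker address and queue name are placeholders, and this is not the drain example shipped in /usr/share/doc):

  from qpid.messaging import Connection, Empty

  conn = Connection("host-A:5672")              # placeholder broker address
  conn.open()
  try:
    session = conn.session()
    receiver = session.receiver("node-host-C")  # placeholder queue name
    try:
      msg = receiver.fetch(timeout=10)          # in the failure mode, this never returns
      print msg
      session.acknowledge(msg)
    except Empty:
      print "no message arrived within the timeout"
  finally:
    conn.close()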

  
Actual results:
The python qpid client does not see the last message and blocks others from getting it.

Expected results:
The python qpid client should see all messages and should not block others from getting them.

Comment 2 Frantisek Reznicek 2012-08-27 14:00:12 UTC
A
[15:19:07| INFO     ] :: [Multisync]: sync done, round 17
Iteration: 17
[15:19:07| INFO     ] :: [Multisync]: SENT SYNC | bdd2735f9b03299080c98b8db9478c49486764b3
[15:19:07| INFO     ] :: [Multisync]: sync sent fromdhcp-27-79.brq.redhat.com
[15:19:07| INFO     ] :: [Multisync]: Master - dhcp-27-79.brq.redhat.com
[15:19:07| INFO     ] :: [Multisync]: AMQP master-dhcp-27-79.brq.redhat.com
[15:19:07| INFO     ] :: [Multisync]: RECV SYNC | 3da2eb4231ea7b37dbe6f5addffb5ee7e4631377
[15:19:07| INFO     ] :: [Multisync]: AMQP master-dhcp-27-79.brq.redhat.com
[15:19:07| INFO     ] :: [Multisync]: RECV SYNC | bc517aaa1d5bd04323df09069e9580545756cc45
[15:19:07| INFO     ] :: [Multisync]: AMQP master-dhcp-27-79.brq.redhat.com
[15:19:07| INFO     ] :: [Multisync]: RECV SYNC | bdd2735f9b03299080c98b8db9478c49486764b3
[15:19:07| DEBUG    ] :: [Multisync]: Stored SYNC | bdd2735f9b03299080c98b8db9478c49486764b3
[15:19:07| DEBUG    ] :: [Multisync]: Stored SYNC | bc517aaa1d5bd04323df09069e9580545756cc45
[15:19:07| DEBUG    ] :: [Multisync]: Stored SYNC | 3da2eb4231ea7b37dbe6f5addffb5ee7e4631377
[15:20:32| LOG:PASS ] :: testcase 1 - Multisync example

B
[15:19:07| INFO     ] :: [Multisync]: sync done, round 17
Iteration: 17
[15:19:07| INFO     ] :: [Multisync]: SENT SYNC | 3da2eb4231ea7b37dbe6f5addffb5ee7e4631377
[15:19:07| INFO     ] :: [Multisync]: sync sent fromdhcp-27-69.brq.redhat.com
[15:20:31| LOG:PASS ] :: testcase 1 - Multisync example

C
[15:19:08| INFO     ] :: [Multisync]: sync done, round 16
Iteration: 16
[15:19:08| INFO     ] :: [Multisync]: SENT SYNC | 4d5b2ccc4bd18f5936852a842457e774ed703466
[15:19:08| INFO     ] :: [Multisync]: sync sent fromdhcp-27-251.brq.redhat.com

Eventually, when ^C is pressed, the sync client is inside a fetch:

^C[15:20:32| LOG:PASS ] :: testcase 1 - Multisync example
Traceback (most recent call last):
  File "./sync2.py", line 295, in <module>
    main()
  File "./sync2.py", line 283, in main
    sync.block('red',mlist,0)   #
  File "./sync2.py", line 259, in block
    self.node_q.receive({'content':'COMMIT','TYPE':'SYNC','ROUND':self.sync_round}, timeout)
  File "./sync2.py", line 188, in receive
    msgs = self.get_msgs(self.get_recv_q(), timeout)
  File "./sync2.py", line 69, in get_msgs
    r = self.get_session().receiver(amqp_addr)
  File "<string>", line 6, in receiver
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 616, in receiver
    receiver._ewait(lambda: receiver.linked)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 973, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 208, in _ewait
    result = self._wait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 193, in _wait
    return self._waiter.wait(predicate, timeout=timeout)
  File "/usr/lib/python2.6/site-packages/qpid/concurrency.py", line 57, in wait
    self.condition.wait(3)
  File "/usr/lib/python2.6/site-packages/qpid/concurrency.py", line 96, in wait
    sw.wait(timeout)
  File "/usr/lib/python2.6/site-packages/qpid/compat.py", line 53, in wait
    ready, _, _ = select([self], [], [], timeout)


D
[15:19:07| INFO     ] :: [Multisync]: sync done, round 17
Iteration: 17
[15:19:07| INFO     ] :: [Multisync]: SENT SYNC | bc517aaa1d5bd04323df09069e9580545756cc45
[15:19:07| INFO     ] :: [Multisync]: sync sent fromdhcp-27-147.brq.redhat.com




Ability/inability to get the message:

# clients running
[root@dhcp-27-251 trusnaktest]# /usr/share/doc/python-qpid-0.14/examples/api/drain -b dhcp-27-79.brq.redhat.com -c 1 'node-dhcp-27-251.brq.redhat.com'
[root@dhcp-27-79 ~]# qpid-stat -q | grep -v 'Y        Y'
Queues
  queue                                          dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==============================================================================================================================
  node-dhcp-27-79.brq.redhat.com                                        0    17     17       0    102      102         0     2
  master-dhcp-27-79.brq.redhat.com                                      0    71     71       0    284      284         0     2
  node-dhcp-27-69.brq.redhat.com                                        0    17     17       0    102      102         0     2
  node-dhcp-27-251.brq.redhat.com                                       1    17     16       6    102       96         1     2
  node-dhcp-27-147.brq.redhat.com                                       0    17     17       0    102      102         1     2


# clients stopped
[root@dhcp-27-251 trusnaktest]# /usr/share/doc/python-qpid-0.14/examples/api/drain -b dhcp-27-79.brq.redhat.com -c 1 'node-dhcp-27-251.brq.redhat.com'
Message(subject=u'node-route-key', redelivered=True, properties={'x-amqp-0-10.routing-key': u'node-route-key', u'qpid.subject': u'node-route-key', u'TYPE': u'SYNC', u'ROUND': 17, u'id': u'622de5ee6d30ca07f7a6f0f68b5bf75b3ab2992d'}, content='COMMIT')

[root@dhcp-27-79 ~]# qpid-stat -q | grep -v 'Y        Y'
Queues
  queue                                          dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ==============================================================================================================================
  node-dhcp-27-79.brq.redhat.com                                        0    17     17       0    102      102         0     2
  master-dhcp-27-79.brq.redhat.com                                      0    71     71       0    284      284         0     2
  node-dhcp-27-69.brq.redhat.com                                        0    17     17       0    102      102         0     2
  node-dhcp-27-251.brq.redhat.com                                       0    17     17       0    102      102         0     2
  node-dhcp-27-147.brq.redhat.com                                       0    17     17       0    102      102         0     2

Comment 3 Tomas Rusnak 2012-08-27 15:04:25 UTC
Retested with:

RHEL6:
python-qpid-0.14-11.el6_3.noarch
qpid-cpp-server-0.14-21.el6_3

RHEL5:
python-qpid-0.14-7.el5
qpid-cpp-server-0.14-18.el5


Combinations:

  Broker    | Node     | Result
  ----------+----------+-------
  RHEL5/64  | RHEL6/64 | Failed
  RHEL5/64  | RHEL6/32 | Failed
  RHEL6/64  | RHEL6/32 | Failed
  RHEL6/32  | RHEL6/64 | Failed
  RHEL6/32  | RHEL5/64 | OK
  RHEL6/64  | RHEL5/64 | OK

It looks like the problem is somewhere in the new python-qpid library.

Comment 4 Frantisek Reznicek 2012-08-27 15:29:46 UTC
I confirm that the last released version does not suffer from the behavior described above (tested on another set of machines, RHEL 5.7 i386/x86_64 and RHEL 6.2 x86_64):

2.1.2 (0.14)              All OK, no issue encountered

  python-qpid-0.14-6.el5.noarch
  qpid-cpp-*-0.14-14.el5.i386
  qpid-java-*-0.14-3.el5.noarch
  *qpid-qmf-*0.14-9.el5.i386
  qpid-tests-0.14-1.el5.noarch
  qpid-tools-0.14-2.el5.noarch
  
  python-qpid-0.14-6.el5.noarch
  qpid-cpp-*-0.14-14.el5.x86_64
  qpid-java-*-0.14-3.el5.noarch
  *qpid-qmf-*-0.14-9.el5.x86_64
  qpid-tests-0.14-1.el5.noarch
  qpid-tools-0.14-2.el5.noarch
  
  python-qpid-0.14-8.el6.noarch
  qpid-cpp-*-0.14-16.el6.x86_64
  qpid-java-*-0.14-3.el6.noarch
  *qpid-qmf-*0.14-7.el6_2.x86_64
  qpid-tests-0.14-1.el6_2.noarch
  qpid-tools-0.14-2.el6_2.noarch


2.0.3 (0.10)/ qpid-0.12   All OK, no issue encountered

  python-qpid-0.10-1.el5.noarch
  qpid-cpp-*-0.10-8.el5.i386
  qpid-java-*-0.10-9.el5.noarch
  *qpid-qmf*-0.10-10.el5.i386
  qpid-tests-0.10-1.el5.noarch
  qpid-tools-0.10-6.el5.noarch
  
  python-qpid-0.10-1.el5.noarch
  qpid-cpp-*-0.10-8.el5.x86_64
  qpid-java-*-0.10-9.el5.noarch
  *qpid-qmf*-0.10-10.el5.x86_64
  qpid-tests-0.10-1.el5.noarch
  qpid-tools-0.10-6.el5.noarch
  
  python-qpid-0.12-1.el6.noarch
  qpid-cpp-*-0.12-6.el6.x86_64
  qpid-java-*-0.10-11.el6.noarch
  *qpid-qmf*-0.12-6.el6.x86_64
  qpid-tests-0.12-1.el6.noarch
  qpid-tools-0.12-2.el6.noarch


Regression keyword set.

Comment 6 Rafael H. Schloming 2012-08-27 16:19:04 UTC
Is there any particular reason we believe this is a python client issue as opposed to a broker issue? I just reviewed the changes to the python client and there isn't anything that obviously looks like it could cause this sort of thing unless we're getting down to the level of a pre-existing bug being triggered by changes in the timing of unrelated code.

Given that it's intermittent it could also be a broker regression that is only showing up for the python client because it is slow. Would it be possible to test the old client against the new broker and vice versa to try and isolate the fault?

Comment 7 Tomas Rusnak 2012-08-27 16:45:26 UTC
I tried that, as written in my Comment #3: I had different versions of the python client on RHEL5 and RHEL6. With the old python client against the current qpid-cpp-server packages, everything works.

Comment 8 Rafael H. Schloming 2012-08-30 14:31:46 UTC
Oops, didn't notice Comment #3, sorry. I'm still a bit baffled as to what could cause this. There really haven't been any substantive changes to the python client in quite a while. I'll see if I can track down the exact revisions and do a diff.

Comment 13 Rafael H. Schloming 2012-09-04 16:30:38 UTC
got it, thanks

Comment 14 Rafael H. Schloming 2012-09-04 17:52:21 UTC
Ok, I have access to the test infrastructure code now, but I'm still having trouble reproducing. I've started up a broker and I ran two instances of sync2.py with the following arguments:

python ~/Downloads/sync2.py localhost 10.193.23.192

Each one of these sends the following message:

 MessageTransfer(destination='amq.direct', id=serial(11), sync=True, headers=(DeliveryProperties(routing_key='master-route-key'), MessageProperties(application_headers={'qpid.subject': 'master-route-key', 'HOST': 'venture', 'TYPE': 'SYNC', 'ROUND': 1, 'id': 'f279fc2e343e6f7bc8a4c42ddc2c11e3b6fc0cfb'})), payload='SYNC')

After sending the above message, they both proceed to busy-loop, repeatedly subscribing to the 'node-venture' queue:

2012-09-04 13:22:35,134 DEBUG SENT[222c998]: MessageSubscribe(queue='node-venture', destination='21', accept_mode=0, acquire_mode=0, id=serial(194))
2012-09-04 13:22:35,134 DEBUG SENT[222c998]: MessageSetFlowMode(destination='21', flow_mode=0, id=serial(195), sync=True)
2012-09-04 13:22:35,137 DEBUG RCVD[222c998]: SessionCompleted(commands=[0-195])
2012-09-04 13:22:35,138 DEBUG SENT[222c998]: MessageFlow(destination='21', unit=1, value=4294967295L, id=serial(196))
2012-09-04 13:22:35,139 DEBUG SENT[222c998]: MessageFlow(destination='21', unit=0, value=1L, id=serial(197))
2012-09-04 13:22:35,139 DEBUG SENT[222c998]: ExecutionSync(id=serial(198), sync=True)
2012-09-04 13:22:35,140 DEBUG SENT[222c998]: MessageFlush(destination='21', id=serial(199), sync=True)
2012-09-04 13:22:35,142 DEBUG RCVD[222c998]: SessionCompleted(commands=[0-198])
2012-09-04 13:22:35,143 DEBUG RCVD[222c998]: SessionCompleted(commands=[0-199])
2012-09-04 13:22:35,144 DEBUG SENT[222c998]: MessageFlow(destination='21', unit=1, value=4294967295L, id=serial(200))
2012-09-04 13:22:35,144 DEBUG SENT[222c998]: MessageFlow(destination='21', unit=0, value=1L, id=serial(201))
2012-09-04 13:22:35,145 DEBUG SENT[222c998]: MessageCancel(destination='21', id=serial(202), sync=True)
2012-09-04 13:22:35,147 DEBUG RCVD[222c998]: SessionCompleted(commands=[0-202])

Is this what is supposed to be happening? I assume from the comments that something should be sending more messages than this, but I can't seem to find the code that would do that.

Would it be possible for you to put the following in the python script that hangs and send me the output?

from qpid.log import *
enable("qpid.messaging.io.ops", DEBUG)

Comment 15 Rafael H. Schloming 2012-09-04 19:49:30 UTC
Ok, I believe I managed to track down the issue, and I think it's actually a problem in the test code. First off, the reason for the odd behavior I mentioned above was that both nodes thought they were the master, so I modified the script so I could explicitly control the 'node' variable via the environment:

    self.node = os.environ.get("NODE", platform.node())
    print "NODE=%s" % self.node

That done, I was able to reproduce the problem as described above, and do so with protocol logging enabled. That resulted in the following trace:

2012-09-04 14:29:18,140 DEBUG SENT[262ccb0]: MessageTransfer(destination='amq.direct', id=serial(1630), sync=True, headers=(DeliveryProperties(routing_key='master-route-key'), MessageProperties(application_headers={'qpid.subject': 'master-route-key', 'HOST': 'localhost', 'TYPE': 'SYNC', 'ROUND': 38, 'id': '536f9d816442c3f183a6fec709bae114c478c55c'})), payload='SYNC')
2012-09-04 14:29:18,144 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1630])
2012-09-04 14:29:18,146 DEBUG SENT[262ccb0]: MessageSubscribe(queue='node-localhost', destination='156', accept_mode=0, acquire_mode=0, id=serial(1631))
2012-09-04 14:29:18,147 DEBUG SENT[262ccb0]: MessageSetFlowMode(destination='156', flow_mode=0, id=serial(1632), sync=True)
2012-09-04 14:29:18,150 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1632])
2012-09-04 14:29:18,152 DEBUG SENT[262ccb0]: MessageFlow(destination='156', unit=1, value=4294967295L, id=serial(1633))
2012-09-04 14:29:18,153 DEBUG SENT[262ccb0]: MessageFlow(destination='156', unit=0, value=1L, id=serial(1634))
2012-09-04 14:29:18,157 DEBUG SENT[262ccb0]: ExecutionSync(id=serial(1635), sync=True)
2012-09-04 14:29:18,158 DEBUG SENT[262ccb0]: MessageFlush(destination='156', id=serial(1636), sync=True)
2012-09-04 14:29:18,194 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1635])
2012-09-04 14:29:18,195 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1636])
2012-09-04 14:29:18,197 DEBUG SENT[262ccb0]: MessageFlow(destination='156', unit=1, value=4294967295L, id=serial(1637))
2012-09-04 14:29:18,198 DEBUG SENT[262ccb0]: MessageFlow(destination='156', unit=0, value=1L, id=serial(1638))
2012-09-04 14:29:18,199 DEBUG SENT[262ccb0]: MessageCancel(destination='156', id=serial(1639), sync=True)
2012-09-04 14:29:18,202 DEBUG RCVD[262ccb0]: MessageTransfer(destination=u'156', accept_mode=0, acquire_mode=0, id=serial(41), headers=[MessageProperties(application_headers={u'qpid.subject': u'node-route-key', u'TYPE': u'SYNC', u'ROUND': 38, u'id': u'628b85ead31b18e0b0bb7a84177ff79f159831ef'}), DeliveryProperties(exchange=u'amq.direct', routing_key=u'node-route-key')], payload='COMMIT')
2012-09-04 14:29:18,204 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1639])
2012-09-04 14:29:18,206 DEBUG SENT[262ccb0]: MessageSubscribe(queue='node-localhost', destination='157', accept_mode=0, acquire_mode=0, id=serial(1640))
2012-09-04 14:29:18,207 DEBUG SENT[262ccb0]: MessageSetFlowMode(destination='157', flow_mode=0, id=serial(1641), sync=True)
2012-09-04 14:29:18,210 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1641])
2012-09-04 14:29:18,213 DEBUG SENT[262ccb0]: MessageFlow(destination='157', unit=1, value=4294967295L, id=serial(1642))
2012-09-04 14:29:18,213 DEBUG SENT[262ccb0]: MessageFlow(destination='157', unit=0, value=1L, id=serial(1643))
2012-09-04 14:29:18,214 DEBUG SENT[262ccb0]: ExecutionSync(id=serial(1644), sync=True)
2012-09-04 14:29:18,215 DEBUG SENT[262ccb0]: MessageFlush(destination='157', id=serial(1645), sync=True)
2012-09-04 14:29:18,234 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1644])
2012-09-04 14:29:18,235 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1645])
2012-09-04 14:29:18,237 DEBUG SENT[262ccb0]: MessageFlow(destination='157', unit=1, value=4294967295L, id=serial(1646))
2012-09-04 14:29:18,237 DEBUG SENT[262ccb0]: MessageFlow(destination='157', unit=0, value=1L, id=serial(1647))
2012-09-04 14:29:18,239 DEBUG SENT[262ccb0]: MessageCancel(destination='157', id=serial(1648), sync=True)
2012-09-04 14:29:18,242 DEBUG RCVD[262ccb0]: SessionCompleted(commands=[0-1648])

Careful examination shows that because the receiver being used to poll the queue is always assigned a capacity of 1, there is actually a window in which the message transfer can arrive just prior to cancelling the subscription. If this happens, the message is counted as acquired by the broker, but it is never seen by the test application, since the application has already thrown away that receiver.
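
To make the window concrete, the interleaving looks roughly like this (a schematic only, phrased in terms of the commands visible in the trace above):

  client                                  broker
  ------                                  ------
  message.flow (credit for one message)
  execution.sync / message.flush
                                          session.completed (queue looked empty)
  fetch times out, the test gives up
  message.cancel sent                     message.transfer already in flight;
                                          the broker has marked the message
                                          acquired for this subscription
  session.completed for the cancel

  The transfer is then queued locally for a receiver the test has already
  discarded, so no application code ever fetches or releases it.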

Modifying the test to remove the setting of receiver capacity to 1 (commented out below) should fix the problem:

  def get_msgs(self, amqp_addr, timeout=1):
    ''' receive messages from queue and store into self.msgs '''
    msgs = {}
    r = self.get_session().receiver(amqp_addr)
#    r.capacity = 1
    #ts1 = time.time();# !!!
    #self.logger.rlLogLevel("[debug]: %s" % (ts1),1); #!!!
    try:
      while True:
        msg = r.fetch(timeout)
        msgs[msg.properties['id']] = msg

        self.logger.rlLogLevel("[Multisync]: AMQP %s" % (amqp_addr),1)
        self.logger.rlLogLevel("[Multisync]: RECV %s | %s" % (msg.content,msg.properties['id']),1)
        r.session.acknowledge(msg)
    except Empty:
      #self.logger.rlLogLevel("[debug]: Empty",1); #!!!
      r.close()
    #ts2 = time.time();# !!!
    #self.logger.rlLogLevel("[debug]: %s dur:(%d)" % (ts2, ts2-ts1),1); #!!!

    return msgs

I'm not sure why this issue wasn't encountered before; the semantics have remained unchanged between the two versions. I can only speculate that some unrelated code changes may have tweaked the timing enough to make it show up more frequently.

Can you verify that the fix works in your test environment? If so, I think we can close this as NOTABUG, although we could probably add a feature request for a log warning in this circumstance to make it clearer what is going on.

Comment 16 Gordon Sim 2012-09-04 20:36:09 UTC
Shouldn't closing the receiver cause all prefetched messages that have not been delivered to the application to be released?

Comment 17 Rafael H. Schloming 2012-09-04 20:50:44 UTC
It certainly could do that too. I'd still be tempted to have a log message, though (perhaps at info level), since in cases like this it is indicative of something weird going on.

Comment 18 Frantisek Reznicek 2012-09-05 07:11:49 UTC
I confirm that changing the capacity back to the default fixes the bad behavior. Retested in a long stress run of the 4-node sync on RHEL 5.8 i386/x86_64 and 6.3 i686/x86_64.

Comment 28 Pavel Moravec 2013-08-27 11:17:59 UTC
Created attachment 790905 [details]
Patch proposal

The client really does not behave as https://bugzilla.redhat.com/show_bug.cgi?id=852067#c16 suggests it should.

Two code changes were required:

1) When receiving a message for a receiver that is closing or closed, release it back. That is the change in the do_message_transfer method.

2) The key one: when closing a receiver (and sending message.cancel), check for messages received for this receiver but not yet fetched by the application, and release them back as well. That is the change in do_unlink.
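
In outline, the two changes amount to something like the toy model below (a self-contained sketch for illustration only; the real patch operates on python-qpid's driver internals and differs in detail - do_message_transfer, do_unlink, rcv.closing/rcv.closed, and session.incoming are names taken from this report, the rest is hypothetical):

  class ToyReceiver(object):
    def __init__(self):
      self.closing = False
      self.closed = False

  class ToySession(object):
    def __init__(self):
      self.incoming = []   # (receiver, message) pairs delivered but not fetched
      self.released = []   # stands in for messages released back to the broker

    def do_message_transfer(self, rcv, msg):
      # change 1: a transfer for a closing/closed receiver is released back
      # immediately instead of being queued for the application
      if rcv.closing or rcv.closed:
        self.released.append(msg)
      else:
        self.incoming.append((rcv, msg))

    def do_unlink(self, rcv):
      # change 2: on receiver close, release every message delivered to this
      # receiver that the application never fetched (one pass over incoming,
      # in the spirit of gotcha (a) below)
      rcv.closing = True
      kept = []
      for r, m in self.incoming:
        if r is rcv:
          self.released.append(m)
        else:
          kept.append((r, m))
      self.incoming = kept
      rcv.closed = True

  s, r = ToySession(), ToyReceiver()
  s.do_message_transfer(r, "COMMIT")   # message arrives just before the close
  s.do_unlink(r)                       # ...and is released rather than stranded
  assert s.released == ["COMMIT"] and s.incoming == []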

There are two gotchas, however:
a) The code in do_unlink is quite inefficient (especially the repeated calls to "rcv.session._pop(rcv)", which rescan the first items of a bigger session.incoming list every time); there should be a pop_messages_for_receiver method in the Session class that traverses the "incoming" list just once.
b) I still *think* the interleaving of code execution below could lead to the same issue:
  - thread1 starts do_message_transfer and passes the "if rcv.closing or rcv.closed:" test
  - thread2 executes the whole receiver close; at that point there is still no message in the session.incoming list
  - thread1 continues in do_message_transfer and adds the message to session.incoming

Now the message sits in session.incoming forever and, again, is never released.

Anyway, I ran 1000 iterations of the reproducer(*) several times, without any issue.

(*) by changing the loop in main of sync2.py to "for i in range(1000):"

Comment 29 Pavel Moravec 2013-08-27 11:19:55 UTC
Rafael, could you please review the patch and gotchas in https://bugzilla.redhat.com/show_bug.cgi?id=852067#c28 ?

Thanks.

Comment 30 Rafael H. Schloming 2013-08-28 12:19:01 UTC
I checked out the patch and it looks good to me.

Comment 31 Pavel Moravec 2013-09-27 13:56:02 UTC
Committed upstream in r1526901.

Comment 33 Pavel Moravec 2013-10-04 09:12:53 UTC
(In reply to Pavel Moravec from comment #31)
> Committed upstream in r1526901.

I wrongly set the MODIFIED state; the fix is committed upstream only.

Comment 35 Frantisek Reznicek 2015-03-18 09:15:09 UTC
The issue has finally been fixed, as proven by extensive test execution on 4 nodes (RHEL 6.6 i686/x86_64, 7.1 x86_64) with the following packages:

  # rpm -qa | grep qpid | sort | grep -v debuginfo
  python-qpid-0.30-5.el6.noarch
  python-qpid-qmf-0.30-5.el6.i686
  qpid-cpp-client-0.30-7.el6.i686
  qpid-cpp-client-devel-0.30-7.el6.i686
  qpid-cpp-client-devel-docs-0.30-7.el6.noarch
  qpid-cpp-client-rdma-0.30-7.el6.i686
  qpid-cpp-server-0.30-7.el6.i686
  qpid-cpp-server-devel-0.30-7.el6.i686
  qpid-cpp-server-ha-0.30-7.el6.i686
  qpid-cpp-server-linearstore-0.30-7.el6.i686
  qpid-cpp-server-rdma-0.30-7.el6.i686
  qpid-cpp-server-xml-0.30-7.el6.i686
  qpid-java-client-0.30-4.el6.noarch
  qpid-java-common-0.30-4.el6.noarch
  qpid-java-example-0.30-4.el6.noarch
  qpid-jca-0.22-2.el6.noarch
  qpid-jca-xarecovery-0.22-2.el6.noarch
  qpid-proton-c-0.7-4.el6.i686
  qpid-proton-c-devel-0.7-4.el6.i686
  qpid-qmf-0.30-5.el6.i686
  qpid-qmf-devel-0.30-5.el6.i686
  qpid-tools-0.30-4.el6.noarch
  ruby-qpid-0.7.946106-2.el6.i686
  
  # rpm -qa | grep qpid | sort | grep -v debuginfo
  python-qpid-0.30-5.el7.noarch
  python-qpid-qmf-0.30-5.el7.x86_64
  qpid-cpp-client-0.30-7.el7.x86_64
  qpid-cpp-client-devel-0.30-7.el7.x86_64
  qpid-cpp-client-devel-docs-0.30-7.el7.noarch
  qpid-cpp-client-rdma-0.30-7.el7.x86_64
  qpid-cpp-server-0.30-7.el7.x86_64
  qpid-cpp-server-devel-0.30-7.el7.x86_64
  qpid-cpp-server-ha-0.30-7.el7.x86_64
  qpid-cpp-server-linearstore-0.30-7.el7.x86_64
  qpid-cpp-server-rdma-0.30-7.el7.x86_64
  qpid-java-client-0.30-5.el7.noarch
  qpid-java-common-0.30-5.el7.noarch
  qpid-java-example-0.30-5.el7.noarch
  qpid-proton-c-0.7-5.el7.x86_64
  qpid-proton-c-devel-0.7-5.el7.x86_64
  qpid-qmf-0.30-5.el7.x86_64
  qpid-qmf-devel-0.30-5.el7.x86_64
  qpid-tools-0.30-4.el7.noarch

-> VERIFIED

Comment 37 errata-xmlrpc 2015-04-14 13:46:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0805.html