Bug 1181005

Summary: receiver.fetch raises KeyError after network glitch
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: python-qpidAssignee: Ernie <eallen>
Status: CLOSED ERRATA QA Contact: Zdenek Kraus <zkraus>
Severity: high Docs Contact:
Priority: high    
Version: 3.0CC: eallen, iboverma, jross, mcressma, pematous, zkraus
Target Milestone: 3.2Keywords: WorkAround
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: python-qpid-0.34-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-08 13:10:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1169416    
Attachments:
Description Flags
Check to see if channel is associated with a session before using it
none
set connection.reconnect to false on socket errors
none
Enable force flag in SessionManager::attach none

Description Pavel Moravec 2015-01-12 07:37:16 UTC
Description of problem:
Calling receiver.fetch(timeout=10) in a loop, when network drops packages for a while, this causes uncaught exception KeyError.

High prio/sev due to https://bugzilla.redhat.com/show_bug.cgi?id=1169416#c5.

Upstream qpid fails with infinite loop (see QPID-6297 for details).


Version-Release number of selected component (if applicable):
python-qpid-0.22-19.el6.noarch
python-qpid-0.30-3.el6.noarch


How reproducible:
100%


Steps to Reproduce:
0. decrease TCP keepalives for faster reproducer:
echo 2 > /proc/sys/net/ipv4/tcp_retries2

1. Run below script against qpidd:
#!/usr/bin/env python
from qpid.messaging import *
import datetime

conn = Connection("localhost:5672", reconnect=1)
timeout=10

try:
  conn.open()
  sess = conn.session()

  recv = sess.receiver("testQueue;{create:always}")
  
  while (1):
    print "%s: before fetch, timeout=%s" %(datetime.datetime.now(), timeout)
    msg = Message()
    try:
      msg = recv.fetch(timeout=timeout)
    except ReceiverError, e:
      print e
    print "%s: after fetch, msg=%s" %(datetime.datetime.now(), msg)

  sess.close()

except ReceiverError, e:
  print e
except KeyboardInterrupt:
  pass

conn.close()

2. Simulate network outage:
iptables -A OUTPUT -p tcp --dport 5672 -j REJECT; date

3. Once python script writes "No handlers could be found for logger "qpid.messaging"", flush iptables (iptables -F)

4. Wait 10 seconds


Actual results:
python-qpid-0.22* raises exception:
File "/root/Python_MRG/test_gofer_like.py", line 18, in <module> msg = recv.fetch(timeout=timeout)
File "<string>", line 6, in fetch
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 1041, in fetch self._ecwait(lambda: not self.draining)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait result = self._ewait(lambda: self.closed or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 993, in _ewait result = self.session._ewait(lambda: self.error or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 580, in _ewait result = self.connection._ewait(lambda: self.error or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 219, in _ewait self.check_error()
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 212, in check_error raise e InternalError: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 663, in write op.dispatch(self)
File "/usr/lib/python2.7/site-packages/qpid/ops.py", line 84, in dispatch getattr(target, handler)(self, *args)
File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 888, in do_session_detached sst = self._sessions.pop(dtc.channel) KeyError: 0

python-qpid-0.30* raises exception:
  File "Python_MRG/test_gofer_like.py", line 18, in <module>
    msg = recv.fetch(timeout=timeout)
  File "<string>", line 6, in fetch
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1066, in fetch
    self._ecwait(lambda: not self.draining)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1018, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 595, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
..
RuntimeError: maximum recursion depth exceeded


Expected results:
No exception raised


Additional info:

Comment 2 Ernie 2015-05-20 13:30:59 UTC
The infinite recursion in python-qpid-0.30* is solved by removing the patch for https://issues.apache.org/jira/browse/QPID-5852. That patch is being removed for 3.2.

Still looking into the KeyError exception.

Comment 3 Ernie 2015-05-21 19:50:09 UTC
Created attachment 1028385 [details]
Check to see if channel is associated with a session before using it

https://reviews.apache.org/r/34560/

Comment 4 Ernie 2015-05-22 16:53:44 UTC
Created attachment 1028876 [details]
set connection.reconnect to false on socket errors

From the review: Alan Conway wrote:
>There is something more going on here. If the client is re-connecting then it >should be re-connecting on an entirely new connection so sessions should not be >able to clash, the new connection should have no session on it.
>It sounds to me like somehow the client is re-using the original connection >after you unblock it at the firewall, which is definitely not right - there >could be all kinds of invalid state in that connections sessions. If the client >decides a connection is faulty it should definitively close it and forget it >before re-connecting and re-establishing sessions. You need to track down >how/why it is managing to use the old connection after it has failed.

The engine is reusing the connection because the connection has a reconnect argument. So the new patch will not try and reconnect using the same connection on a socket error. This has the effect of raising the error back to the client.

Comment 5 Ernie 2015-06-10 17:06:01 UTC
Created attachment 1037366 [details]
Enable force flag in SessionManager::attach

Enables the force flag in SessionManager::attach and sends force=True in python engine if the engine is started with a connection that has existing sessions.

Comment 6 Ernie 2015-06-10 17:13:44 UTC
Committed r1684716
http://svn.apache.org/viewvc?view=revision&revision=1684716

Comment 12 Zdenek Kraus 2015-09-30 12:09:40 UTC
This issue was tested on RHEL 6 i386 && x86_64 and RHEL 7 x86_64 with following packages:

python-qpid-0.34-2
python-qpid-proton-0.10-2
python-qpid-proton-doc-0.10-2
python-qpid-qmf-0.34-4
qpid-cpp-client-0.34-5
qpid-cpp-client-devel-0.34-5
qpid-cpp-client-devel-docs-0.34-5
qpid-cpp-client-rdma-0.34-5
qpid-cpp-debuginfo-0.34-5
qpid-cpp-server-0.34-5
qpid-cpp-server-devel-0.34-5
qpid-cpp-server-ha-0.34-5
qpid-cpp-server-linearstore-0.34-5
qpid-cpp-server-rdma-0.34-5
qpid-cpp-server-xml-0.34-5
qpid-java-client-0.30-7
qpid-java-common-0.30-7
qpid-java-example-0.30-7
qpid-jms-client-0.5.0-2
qpid-jms-client-docs-0.5.0-2
qpid-jms-client-examples-0.5.0-2
qpid-jms-client-maven-repo-0.5.0-2
qpid-proton-c-0.10-2
qpid-proton-java-0.10.0-1
qpid-proton-java-maven-repo-0.10.0-1
qpid-qmf-0.34-4
qpid-tools-0.34-1


-> VERIFIED

Comment 14 errata-xmlrpc 2015-10-08 13:10:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1879.html