Bug 1181005 - receiver.fetch raises KeyError after network glitch
Summary: receiver.fetch raises KeyError after network glitch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: python-qpid
Version: 3.0
Hardware: All
OS: Linux
high
high
Target Milestone: 3.2
: ---
Assignee: Ernie
QA Contact: Zdenek Kraus
URL:
Whiteboard:
Depends On:
Blocks: 1169416
TreeView+ depends on / blocked
 
Reported: 2015-01-12 07:37 UTC by Pavel Moravec
Modified: 2015-10-08 13:10 UTC (History)
6 users (show)

Fixed In Version: python-qpid-0.34-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-08 13:10:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Check to see if channel is associated with a session before using it (1.17 KB, patch)
2015-05-21 19:50 UTC, Ernie
no flags Details | Diff
set connection.reconnect to false on socket errors (2.00 KB, patch)
2015-05-22 16:53 UTC, Ernie
no flags Details | Diff
Enable force flag in SessionManager::attach (2.14 KB, patch)
2015-06-10 17:06 UTC, Ernie
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-6297 0 None None None Never
Red Hat Product Errata RHEA-2015:1879 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.2 Release 2015-10-08 17:07:53 UTC

Description Pavel Moravec 2015-01-12 07:37:16 UTC
Description of problem:
Calling receiver.fetch(timeout=10) in a loop, when network drops packages for a while, this causes uncaught exception KeyError.

High prio/sev due to https://bugzilla.redhat.com/show_bug.cgi?id=1169416#c5.

Upstream qpid fails with infinite loop (see QPID-6297 for details).


Version-Release number of selected component (if applicable):
python-qpid-0.22-19.el6.noarch
python-qpid-0.30-3.el6.noarch


How reproducible:
100%


Steps to Reproduce:
0. decrease TCP keepalives for faster reproducer:
echo 2 > /proc/sys/net/ipv4/tcp_retries2

1. Run below script against qpidd:
#!/usr/bin/env python
from qpid.messaging import *
import datetime

conn = Connection("localhost:5672", reconnect=1)
timeout=10

try:
  conn.open()
  sess = conn.session()

  recv = sess.receiver("testQueue;{create:always}")
  
  while (1):
    print "%s: before fetch, timeout=%s" %(datetime.datetime.now(), timeout)
    msg = Message()
    try:
      msg = recv.fetch(timeout=timeout)
    except ReceiverError, e:
      print e
    print "%s: after fetch, msg=%s" %(datetime.datetime.now(), msg)

  sess.close()

except ReceiverError, e:
  print e
except KeyboardInterrupt:
  pass

conn.close()

2. Simulate network outage:
iptables -A OUTPUT -p tcp --dport 5672 -j REJECT; date

3. Once python script writes "No handlers could be found for logger "qpid.messaging"", flush iptables (iptables -F)

4. Wait 10 seconds


Actual results:
python-qpid-0.22* raises exception:
File "/root/Python_MRG/test_gofer_like.py", line 18, in <module> msg = recv.fetch(timeout=timeout)
File "<string>", line 6, in fetch
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 1041, in fetch self._ecwait(lambda: not self.draining)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait result = self._ewait(lambda: self.closed or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 993, in _ewait result = self.session._ewait(lambda: self.error or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 580, in _ewait result = self.connection._ewait(lambda: self.error or predicate(), timeout)
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 219, in _ewait self.check_error()
File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 212, in check_error raise e InternalError: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 663, in write op.dispatch(self)
File "/usr/lib/python2.7/site-packages/qpid/ops.py", line 84, in dispatch getattr(target, handler)(self, *args)
File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 888, in do_session_detached sst = self._sessions.pop(dtc.channel) KeyError: 0

python-qpid-0.30* raises exception:
  File "Python_MRG/test_gofer_like.py", line 18, in <module>
    msg = recv.fetch(timeout=timeout)
  File "<string>", line 6, in fetch
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1066, in fetch
    self._ecwait(lambda: not self.draining)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1018, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 595, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
..
RuntimeError: maximum recursion depth exceeded


Expected results:
No exception raised


Additional info:

Comment 2 Ernie 2015-05-20 13:30:59 UTC
The infinite recursion in python-qpid-0.30* is solved by removing the patch for https://issues.apache.org/jira/browse/QPID-5852. That patch is being removed for 3.2.

Still looking into the KeyError exception.

Comment 3 Ernie 2015-05-21 19:50:09 UTC
Created attachment 1028385 [details]
Check to see if channel is associated with a session before using it

https://reviews.apache.org/r/34560/

Comment 4 Ernie 2015-05-22 16:53:44 UTC
Created attachment 1028876 [details]
set connection.reconnect to false on socket errors

From the review: Alan Conway wrote:
>There is something more going on here. If the client is re-connecting then it >should be re-connecting on an entirely new connection so sessions should not be >able to clash, the new connection should have no session on it.
>It sounds to me like somehow the client is re-using the original connection >after you unblock it at the firewall, which is definitely not right - there >could be all kinds of invalid state in that connections sessions. If the client >decides a connection is faulty it should definitively close it and forget it >before re-connecting and re-establishing sessions. You need to track down >how/why it is managing to use the old connection after it has failed.

The engine is reusing the connection because the connection has a reconnect argument. So the new patch will not try and reconnect using the same connection on a socket error. This has the effect of raising the error back to the client.

Comment 5 Ernie 2015-06-10 17:06:01 UTC
Created attachment 1037366 [details]
Enable force flag in SessionManager::attach

Enables the force flag in SessionManager::attach and sends force=True in python engine if the engine is started with a connection that has existing sessions.

Comment 6 Ernie 2015-06-10 17:13:44 UTC
Committed r1684716
http://svn.apache.org/viewvc?view=revision&revision=1684716

Comment 12 Zdenek Kraus 2015-09-30 12:09:40 UTC
This issue was tested on RHEL 6 i386 && x86_64 and RHEL 7 x86_64 with following packages:

python-qpid-0.34-2
python-qpid-proton-0.10-2
python-qpid-proton-doc-0.10-2
python-qpid-qmf-0.34-4
qpid-cpp-client-0.34-5
qpid-cpp-client-devel-0.34-5
qpid-cpp-client-devel-docs-0.34-5
qpid-cpp-client-rdma-0.34-5
qpid-cpp-debuginfo-0.34-5
qpid-cpp-server-0.34-5
qpid-cpp-server-devel-0.34-5
qpid-cpp-server-ha-0.34-5
qpid-cpp-server-linearstore-0.34-5
qpid-cpp-server-rdma-0.34-5
qpid-cpp-server-xml-0.34-5
qpid-java-client-0.30-7
qpid-java-common-0.30-7
qpid-java-example-0.30-7
qpid-jms-client-0.5.0-2
qpid-jms-client-docs-0.5.0-2
qpid-jms-client-examples-0.5.0-2
qpid-jms-client-maven-repo-0.5.0-2
qpid-proton-c-0.10-2
qpid-proton-java-0.10.0-1
qpid-proton-java-maven-repo-0.10.0-1
qpid-qmf-0.34-4
qpid-tools-0.34-1


-> VERIFIED

Comment 14 errata-xmlrpc 2015-10-08 13:10:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-1879.html


Note You need to log in before you can comment on or make changes to this bug.