Bug 1444206

Summary: python-qpid crashes with "assert rcv.received < rcv.impending"
Product: Red Hat Enterprise MRG
Component: python-qpid
Version: Development
Hardware: x86_64
OS: Linux
Status: ASSIGNED
Severity: unspecified
Priority: unspecified
Reporter: Derek Whatley <dwhatley>
Assignee: Kim van der Riet <kim.vdriet>
QA Contact: Messaging QE <messaging-qe-bugs>
CC: jross, kim.vdriet
Target Milestone: ---
Target Release: ---
Type: Bug
Attachments:
  Proposed fix

Description Derek Whatley 2017-04-20 20:44:25 UTC
======================
Description of problem:
======================

On several of the RHUI instances I am operating, the RHUA (qpid broker) loses its connection to the CDSs (clients) roughly 1-2 days after the broker has been restarted.

The problem appears to be related to an assertion being made by the qpid driver.
https://github.com/apache/qpid-python/blob/master/qpid/messaging/driver.py#L1403

/var/log/pulp/pulp.log fills up with error messages similar to the one included at the bottom of this BZ report.
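For context, the failing assertion is a flow-control (credit window) invariant: `received` counts message transfers delivered to the receiver so far, while `impending` is the upper bound granted by credit. A transfer arriving when `received == impending` trips the assert. The following is a minimal pure-Python sketch of that invariant, not qpid code; only the names `received` and `impending` come from the driver, everything else (the `FakeReceiver` class and its methods) is hypothetical:

```python
# Sketch of the credit-window invariant checked in do_message_transfer().
# FakeReceiver is illustrative only; it is not part of python-qpid.

class FakeReceiver:
    def __init__(self, credit):
        self.received = 0        # transfers delivered so far
        self.impending = credit  # upper bound granted by flow control

    def grant(self, n):
        """Credit replenishment (normally driven by message_flow)."""
        self.impending += n

    def transfer(self):
        """The check do_message_transfer() makes before accepting a message."""
        assert self.received < self.impending, \
            "%s, %s" % (self.received, self.impending)
        self.received += 1

rcv = FakeReceiver(credit=2)
rcv.transfer()
rcv.transfer()   # window now exhausted: received == impending == 2

# If credit is never replenished (e.g. the Engine was closed after a
# heartbeat timeout) but a late transfer is still dispatched, the
# invariant fails exactly as in the pulp.log traceback:
try:
    rcv.transfer()
except AssertionError as e:
    print("AssertionError: %s" % e)
```

The "8710, 8710" in the log is this same equal-counters state, just with a larger credit window.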


============================================================
Version-Release number of selected component (if applicable):
============================================================
python-qpid-0.14-11.el6_3.noarch
gofer-0.65.1-1.el6.noarch
pulp-0.0.263-45.el6.noarch

================
How reproducible:
================
The same RHUI instances reliably break down and need a broker restart several times a week each.


==================
Steps to Reproduce:
==================
1. Set up RHUI instance
2. Let CDS sync run as scheduled
3. CDS connection will be lost every X days due to:

assert rcv.received < rcv.impending, "%s, %s" % (rcv.received, rcv.impending)
AssertionError: 8710, 8710 (visible in /var/log/pulp/pulp.log)

===============
Additional info:
===============
(Excerpt from /var/log/pulp/pulp.log)

2017-04-20 14:07:36,289 12640:139879593780992: pulp.server.api.synchronizers:INFO: synchronizers:825 preserve metadata flag is set; skipping metadata update
2017-04-20 14:07:36,306 12640:139879593780992: pulp.server.api.synchronizers:INFO: synchronizers:346 skipping distribution imports from sync process
2017-04-20 14:07:36,407 12640:139879593780992: pulp.server.api.synchronizers:INFO: synchronizers:420 No metadata for 'File Sync' present; no files to import to repo..
2017-04-20 14:07:36,441 12640:139879593780992: pulp.server.api.synchronizers:INFO: synchronizers:383 Loading comps group info from: /var/lib/pulp//repos/content/dist/rhel/rhui/server/6/6Server/x86_64/jbeap/7.0/source/SRPMS/repodata/cb329cadcd7701d5b4f752e6ef8fbe7fe2bbc3a74b971d47c3ab3682c5e67121-comps.xml
2017-04-20 14:07:36,446 12640:139879593780992: pulp.server.api.synchronizers:INFO: synchronizers:405 updateinfo is found in repomd.xml, it's path is /var/lib/pulp//repos/content/dist/rhel/rhui/server/6/6Server/x86_64/jbeap/7.0/source/SRPMS/repodata/b2a7d81a6dffba53e2981370ab45694b184ba7115b467d7ffbde15b8d14bce5a-updateinfo.xml.gz
2017-04-20 14:07:38,088 12640:139880373942016: gofer.messaging.consumer:ERROR: consumer:72 failed:
None
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/gofer/messaging/consumer.py", line 64, in run
    msg = self.__fetch(receiver)
  File "/usr/lib/python2.6/site-packages/gofer/messaging/consumer.py", line 100, in __fetch
    return receiver.fetch(timeout=self.WAIT)
  File "<string>", line 6, in fetch
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 1010, in fetch
    self._ecwait(lambda: self.linked)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
    result = self._ewait(lambda: self.closed or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 973, in _ewait
    result = self.session._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
    self.check_error()
  File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 202, in check_error
    raise self.error
InternalError: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 655, in write
    op.dispatch(self)
  File "/usr/lib/python2.6/site-packages/qpid/ops.py", line 84, in dispatch
    getattr(target, handler)(self, *args)
  File "/usr/lib/python2.6/site-packages/qpid/messaging/driver.py", line 1293, in do_message_transfer
    assert rcv.received < rcv.impending, "%s, %s" % (rcv.received, rcv.impending)
AssertionError: 8710, 8710

Comment 1 Derek Whatley 2017-04-25 14:35:22 UTC
After speaking with Jeff Ortel, I believe the core issue is connection failure between the RHUA's qpid broker and the RHUA's qpid client.

Connections from localhost (qpidd broker :5674) to localhost (pulp qpid client) are entering an "error" state as indicated by /var/log/messages on the RHUA:

sudo grep -i 'qpid' /var/log/messages* | less

# Error messages start piling up in /var/log/messages
 
qpidd[9301]: 2017-04-05 22:54:05 error Connection 127.0.0.1:5674-127.0.0.1:46880 timed out: closing
qpidd[9301]: 2017-04-06 11:03:55 error Connection 127.0.0.1:5674-127.0.0.1:54796 timed out: closing
qpidd[9301]: 2017-04-06 23:36:19 error Connection 127.0.0.1:5674-127.0.0.1:36286 timed out: closing
qpidd[9301]: 2017-04-07 12:14:26 error Connection 127.0.0.1:5674-127.0.0.1:55290 timed out: closing

# RHUA is unable to connect to CDSs and is now restarted.

qpidd[9301]: 2017-04-07 17:08:03 notice Shut down
qpidd[29184]: 2017-04-07 17:08:03 notice SASL disabled: No Authentication Performed
qpidd[29184]: 2017-04-07 17:08:03 notice Listening on TCP/TCP6 port 5672
qpidd[29184]: 2017-04-07 17:08:03 notice Listening for SSL connections on TCP port 5674
qpidd[29184]: 2017-04-07 17:08:03 notice Broker running

# In this example, the problem returns after ~5 hours.

qpidd[29184]: 2017-04-07 22:00:55 error Connection 127.0.0.1:5674-127.0.0.1:60316 timed out: closing
qpidd[29184]: 2017-04-08 04:10:22 error Connection 127.0.0.1:5674-127.0.0.1:51204 timed out: closing
qpidd[29184]: 2017-04-08 16:07:38 error Connection 127.0.0.1:5674-127.0.0.1:41936 timed out: closing
qpidd[29184]: 2017-04-08 22:02:55 error Connection 127.0.0.1:5674-127.0.0.1:44304 timed out: closing
qpidd[29184]: 2017-04-09 10:03:27 error Connection 127.0.0.1:5674-127.0.0.1:56836 timed out: closing

Comment 2 Derek Whatley 2017-04-25 14:46:20 UTC
> # RHUA is unable to connect to CDSs and is now restarted.

To clarify: at this point the pulp-server (qpidd, httpd, mongod) service running on the RHUA is restarted. The entire RHUA machine is _not_ restarted.

Comment 3 Kim van der Riet 2017-06-02 16:45:08 UTC
Created attachment 1284486 [details]
Proposed fix

Investigation revealed three issues:

1. The heartbeat timeout used by gofer (10 sec) is rather low; the timeout error occurs whenever any node experiences a processing delay of more than 20 sec (2 * timeout). For these instances, that seems to happen every couple of hours. The heartbeat should be set to a higher value (60 sec is suggested).

2. Once a timeout occurs, the driver's Engine object is closed, but no other part of the driver checks for this state. Since the delay in point 1 above is temporary, normal processing eventually resumes, and message transfers are then attempted on the closed Engine object, resulting in the Python error "assert rcv.received < rcv.impending".

3. In addition to this, a bug in the driver logic prevents the connection from being reestablished.

This patch addresses issues 2 and 3 above. It has been tested on a running instance of RHUI with the connection timeout set to 1 sec (which causes a heartbeat timeout within several seconds to several minutes); the connection is successfully reestablished with no assertions.
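The arithmetic in point 1 can be sketched as follows. This assumes the broker closes a connection after roughly two missed heartbeat intervals, which matches the "timed out: closing" messages above; the helper function is illustrative, not part of gofer or python-qpid:

```python
# Illustrative sketch of the heartbeat math in point 1 above.
# Assumption: the broker drops a connection once it sees no traffic
# for about 2 * heartbeat seconds ("timed out: closing").

def times_out(processing_delay, heartbeat):
    """Would a stall of `processing_delay` seconds kill the connection?"""
    return processing_delay > 2 * heartbeat

# With gofer's current 10 sec heartbeat, a 25 sec stall on any node
# is enough to drop the link:
print(times_out(25, heartbeat=10))   # True

# With the suggested 60 sec heartbeat, stalls of up to 120 sec are
# tolerated:
print(times_out(25, heartbeat=60))   # False
```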