Bug 1444206
| Summary: | python-qpid crashes with "assert rcv.received < rcv.impending" | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Derek Whatley <dwhatley> | ||||
| Component: | python-qpid | Assignee: | Kim van der Riet <kim.vdriet> | ||||
| Status: | ASSIGNED --- | QA Contact: | Messaging QE <messaging-qe-bugs> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | Development | CC: | jross, kim.vdriet | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | Type: | Bug | |||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Derek Whatley
2017-04-20 20:44:25 UTC
After speaking with Jeff Ortel, I believe the core issue is connection failure between the RHUA's qpid broker and the RHUA's qpid client. Connections from localhost (qpidd broker :5674) to localhost (pulp qpid client) are entering an "error" state as indicated by /var/log/messages on the RHUA: sudo grep -i 'qpid' /var/log/messages* | less # Error messages start piling up in /var/log/messages qpidd[9301]: 2017-04-05 22:54:05 error Connection 127.0.0.1:5674-127.0.0.1:46880 timed out: closing qpidd[9301]: 2017-04-06 11:03:55 error Connection 127.0.0.1:5674-127.0.0.1:54796 timed out: closing qpidd[9301]: 2017-04-06 23:36:19 error Connection 127.0.0.1:5674-127.0.0.1:36286 timed out: closing qpidd[9301]: 2017-04-07 12:14:26 error Connection 127.0.0.1:5674-127.0.0.1:55290 timed out: closing # RHUA is unable to connect to CDSs and is now restarted. qpidd[9301]: 2017-04-07 17:08:03 notice Shut down qpidd[29184]: 2017-04-07 17:08:03 notice SASL disabled: No Authentication Performed qpidd[29184]: 2017-04-07 17:08:03 notice Listening on TCP/TCP6 port 5672 qpidd[29184]: 2017-04-07 17:08:03 notice Listening for SSL connections on TCP port 5674 qpidd[29184]: 2017-04-07 17:08:03 notice Broker running # In this example, the problem returns after ~5 hours. qpidd[29184]: 2017-04-07 22:00:55 error Connection 127.0.0.1:5674-127.0.0.1:60316 timed out: closing qpidd[29184]: 2017-04-08 04:10:22 error Connection 127.0.0.1:5674-127.0.0.1:51204 timed out: closing qpidd[29184]: 2017-04-08 16:07:38 error Connection 127.0.0.1:5674-127.0.0.1:41936 timed out: closing qpidd[29184]: 2017-04-08 22:02:55 error Connection 127.0.0.1:5674-127.0.0.1:44304 timed out: closing qpidd[29184]: 2017-04-09 10:03:27 error Connection 127.0.0.1:5674-127.0.0.1:56836 timed out: closing
> # RHUA is unable to connect to CDSs and is now restarted.
To clarify: at this point the pulp-server (qpidd, httpd, mongod) service running on the RHUA is restarted. The entire RHUA machine is _not_ restarted.
Created attachment 1284486 [details]
Proposed fix
Investigation of this issue revealed 3 issues:
1. The heartbeat timeout value used by gofer (10sec) is rather low, and the timeout error occurs when a processing delay on any node of > 20 secs (2 * timeout) occurs. For these instances, that seems to occur every couple of hours. The heartbeat value should be set to a higher value (suggest 60 sec).
2. Once a timeout occurs, the driver Engine object is closed, but no check is made elsewhere in the driver for this state, and as the delay in point 1. above is temporary, once normal processing continues, message transfers are being attempted on the closed Engine object, resulting in the Python error "assert rcv.received < rcv.impending".
3. In addition to this, a bug in the driver logic prevents the connection from being reestablished.
This patch addresses issues 2. and 3. above. They have been tested on a running instance of RHUI with the connection timeout set to 1 sec (which causes a heartbeat timeout within several seconds to several minutes), and the connection is successfully reestablished with no assertions.
|