Bug 1264461

Summary: goferd connection leak when using heartbeats
Product: Red Hat Satellite
Component: katello-agent
Version: 6.1.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Pavel Moravec <pmoravec>
Assignee: Justin Sherrill <jsherril>
QA Contact: jcallaha
CC: bbuckingham, bkearney, cwelton, gsim, jortel, mmccune, omaciel, sauchter, tscherf, xdmoon
Target Milestone: Unspecified
Target Release: Unused
Keywords: Triaged
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-10-15 18:40:02 UTC

Description Pavel Moravec 2015-09-18 14:01:22 UTC
Description of problem:
With heartbeats enabled per bz1249890 and some traffic passing through the qdrouterd->goferd connection, connection timeouts can occur as an expected side effect - see

https://access.redhat.com/solutions/1614693

for the reasoning.

BUT goferd does _not_ drop the timed-out TCP connections.
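
For context, the timeout itself is easy to provoke with proton's python BlockingConnection, which gofer builds on. The following is a minimal sketch, not gofer's actual code - the URL and queue address are hypothetical, and the exact exception class varies across proton versions:

from proton.utils import BlockingConnection

URL = "amqp://satellite.example.com:5648"  # hypothetical; plain AMQP for brevity
ADDRESS = "pulp.agent.demo"                # hypothetical queue address

# heartbeat=2 asks the peer to expect a frame at least every 2 seconds
conn = BlockingConnection(URL, heartbeat=2)
receiver = conn.create_receiver(ADDRESS)
try:
    # block waiting for a message; a connection timeout in the meantime
    # surfaces as an exception here
    receiver.receive(timeout=600)
except Exception as exc:
    print("Disconnected: %s" % exc)
    conn.close()  # with the affected proton, the TCP socket could still be left open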


Version-Release number of selected component (if applicable):
gofer-2.6.2-2.el7sat.noarch with fix per bz1249890


How reproducible:
100%


Steps to Reproduce:
1. Have an external capsule (a content host should work as well) and ensure heartbeats are enabled per bz1249890
2. On the capsule / content host, run "netstat -anp | grep 5647"
3. Synchronize the capsule (or install/remove a package on a content host via hammer/WebUI)
4. Ensure goferd on the capsule / content host detects timed-out connections, logged to /var/log/messages:

Sep 18 15:47:54 my-capsule goferd: [INFO][worker-0] root:525 - Disconnected

5. Repeat steps 2-4 as many times as you wish (a small watcher sketch for step 2 follows this list)
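
Purely as a convenience for steps 2 and 5, a loop like the one below can track the connection count and the total Recv-Q. A sketch in python - port 5647 is from this report, everything else is illustrative:

import subprocess
import time

PORT = ":5647"

while True:
    out = subprocess.check_output(["netstat", "-anp"]).decode("utf-8", "replace")
    rows = [l.split() for l in out.splitlines() if l.startswith("tcp") and PORT in l]
    recv_q = sum(int(r[1]) for r in rows)  # Recv-Q is the 2nd column
    print("%d connections on %s, total Recv-Q: %d" % (len(rows), PORT, recv_q))
    time.sleep(10)

A growing connection count (one per "Disconnected" log line) together with a non-zero, growing Recv-Q matches the leak described below.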


Actual results:
- netstat will show an increasing number of TCP connections from goferd (on the capsule) to qdrouterd (on the Satellite) - exactly one new connection per "Disconnected" log entry.
- old TCP connections will show a (sometimes growing) Recv-Q (2nd column of "netstat -anp" output); this is because qdrouterd keeps sending a heartbeat frame every second on those connections, while goferd no longer fetches the frames from the socket.


Expected results:
- no connection leak
- Recv-Q should be zero almost every time


Additional info:

Comment 2 Pavel Moravec 2015-09-21 07:39:56 UTC
When disabling SSL, qdrouterd drops the connection while goferd keeps its side in CLOSE_WAIT - i.e. the router has closed its end, but goferd never closes the socket.

Comment 5 Gordon Sim 2015-09-22 11:48:21 UTC
(In reply to Pavel Moravec from comment #4)
> Reproduced the bug outside gofer:
> 
> https://issues.apache.org/jira/browse/PROTON-1000?focusedCommentId=14901368&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14901368

I've committed a fix for this: https://git-wip-us.apache.org/repos/asf?p=qpid-proton.git;h=b8b2b34.

Comment 7 Jeff Ortel 2015-09-28 20:21:43 UTC
This is a general bug in the proton BlockingConnection: https://issues.apache.org/jira/browse/PROTON-1010.
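
For anyone stuck on an unfixed proton: the obvious client-side pattern is to close the BlockingConnection explicitly, as sketched below with hypothetical names. Note that if the leak sat inside BlockingConnection itself, as PROTON-1010 indicates, even this would not fully release the socket - which is why the fix had to land in proton:

from proton import Message
from proton.utils import BlockingConnection

def send_once(url, address):
    # url/address are hypothetical inputs, e.g. ("amqp://host:5648", "pulp.agent.demo")
    conn = BlockingConnection(url, heartbeat=2)
    try:
        sender = conn.create_sender(address)
        sender.send(Message(body=u"ping"))
    finally:
        # always close explicitly; pre-fix, the underlying socket could
        # still leak despite this (PROTON-1010)
        conn.close()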

Comment 10 Bryan Kearney 2015-09-29 13:51:32 UTC
Errata is created, moving back to NEW.

Comment 12 jcallaha 2015-10-07 16:00:44 UTC
Verified on Satellite-6.1.0-20151006.0

Installed and removed multiple packages on two content hosts with no errors.

[root@ibm-x3550m3-13 ~]# netstat -anp | grep 5647
tcp        0      0 10.34.36.141:54696      10.16.4.111:5647        ESTABLISHED 29378/python        
tcp        1      0 10.34.36.141:54695      10.16.4.111:5647        CLOSE_WAIT  29378/python        
[root@ibm-x3550m3-13 ~]# netstat -anp | grep 5647
tcp        0      0 10.34.36.141:54696      10.16.4.111:5647        ESTABLISHED 29378/python        
tcp        0      1 10.34.36.141:54814      10.16.4.111:5647        SYN_SENT    29378/python        
tcp        1      0 10.34.36.141:54695      10.16.4.111:5647        CLOSE_WAIT  29378/python        
[root@ibm-x3550m3-13 ~]# netstat -anp | grep 5647
tcp        0      0 10.34.36.141:54696      10.16.4.111:5647        ESTABLISHED 29378/python        
tcp       37      0 10.34.36.141:54814      10.16.4.111:5647        ESTABLISHED 29378/python        
tcp        1      0 10.34.36.141:54695      10.16.4.111:5647        CLOSE_WAIT  29378/python        


/var/log/messages output
------------------------
gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://rhsm-qe-1.rhq.lab.eng.bos.redhat.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
root:485 - connecting to rhsm-qe-1.rhq.lab.eng.bos.redhat.com:5647...
gofer.messaging.adapter.proton.connection:103 - connected: amqps://rhsm-qe-1.rhq.lab.eng.bos.redhat.com:5647
root:505 - connected to rhsm-qe-1.rhq.lab.eng.bos.redhat.com:5647
gofer.rmi.dispatcher:600 - call: Content.install() sn=f73d1f92-c55c-495e-846a-709f99a87a4b data={u'task_id': u'97c5dae5-5eff-4f53-8c2f-a55deb746e84', u'consumer_id': u'4d010848-3728-4c2a-a0a7-d0cecf0f6ea4'}
pulp_rpm.handlers.rpmtools:164 - Installed: firefox-38.3.0-2.el7_1.x86_64
gofer.agent.rmi:129 - sn=f73d1f92-c55c-495e-846a-709f99a87a4b processed in: 1.126 (minutes)
root:525 - Disconnected

Comment 14 errata-xmlrpc 2015-10-15 18:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1910