Bug 1272758

Summary: goferd memory leak when installing and removing a package to the content host
Product: Red Hat Satellite Reporter: Pavel Moravec <pmoravec>
Component: katello-agent Assignee: Jeff Ortel <jortel>
Status: CLOSED ERRATA QA Contact: Peter Ondrejka <pondrejk>
Severity: high Docs Contact:
Priority: high    
Version: 6.1.2 CC: akofink, aperotti, avaddara, bbuckingham, bmbouter, cwelton, daniele, daviddavis, dkliban, dnagarat, ggainey, ipanova, jortel, mcressma, mhrivnak, oshtaier, pcreech, rchan, sauchter, swadeley, ttereshc
Target Milestone: Unspecified Keywords: Triaged
Target Release: Unused   
Hardware: noarch   
OS: Linux   
Whiteboard:
Fixed In Version: qpid-proton-0.9-16 Doc Type: Bug Fix
Doc Text:
When a package is repeatedly installed and removed on the same Content Host, goferd accumulates memory over time. This has been fixed by locally settling received messages in the qpid-proton library.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-21 12:57:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1353215, 1399395    
Attachments:
Description: tcpdump of unencrypted goferd<->qdrouterd communication (Flags: none)

Description Pavel Moravec 2015-10-18 14:19:08 UTC
Description of problem:
When a package is installed and removed on the same Content Host in a loop, goferd on the content host accumulates memory.

The increase isn't huge, but it is already evident after several iterations.


Version-Release number of selected component (if applicable):
Satellite 6.1.3
gofer-2.6.6-2.el7sat.noarch
python-gofer-2.6.6-2.el7sat.noarch
python-gofer-proton-2.6.6-2.el7sat.noarch
python-qpid-proton-0.9-7.el7.x86_64
qpid-proton-c-0.9-7.el7.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Run this script on Satellite:

while true; do 
  hammer -u admin -p adminPassword content-host package remove --content-host-id UUID-of-contentHost --organization-id 1 --packages sysstat
  sleep 2
  hammer -u admin -p adminPassword content-host package install --content-host-id UUID-of-contentHost --organization-id 1 --packages sysstat
  sleep 2
done

2. Monitor goferd memory usage on the content host, e.g.:

while true; do ps aux | grep -v grep | grep goferd; sleep 10; done


Actual results:
Memory consumption grows over time (I saw goferd reach 250 MB RSS).

Stopping the (un)install script only stops further memory growth; the already allocated memory isn't released back.


Expected results:
No memory growth.


Additional info:
I tried to manually disable heartbeats and/or switch to non-SSL traffic, with ambiguous results: memory was growing in either case, but more slowly and only up to some threshold.

Comment 1 Pavel Moravec 2015-10-18 14:21:27 UTC
Curiously, when replacing the package (un)install with a capsule sync (which also goes via goferd), _no_ memory increase is observed at all.

That suggests the leak is in gofer / katello-agent itself, in the parts that manipulate packages, i.e. the leak should not be in proton / the proton reactor.

Comment 3 Pavel Moravec 2015-10-18 16:18:45 UTC
(In reply to Pavel Moravec from comment #1)
> Curiously, when replacing the package (un)install with a capsule sync (which
> also goes via goferd), _no_ memory increase is observed at all.
> 
> That suggests the leak is in gofer / katello-agent itself, in the parts that
> manipulate packages, i.e. the leak should not be in proton / the proton
> reactor.

I have to take back this statement.

Letting my reproducer (with the capsule sync modification) run for a longer time, it also leads to an increase in goferd memory consumption, though really slowly.

Comment 5 Jeff Ortel 2016-02-01 19:31:13 UTC
Found a (small) 384 byte / request memory leak in goferd which has been fixed upstream in gofer 2.7.  Found a (big) 5 kB / message memory leak in python-qpid-proton and filed: https://issues.apache.org/jira/browse/PROTON-1120.
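
For illustration only (not the tooling used in this report), a minimal sketch of how a per-request or per-message leak of this size can be quantified: sample the process RSS before and after a batch of operations and divide. run_once below is a hypothetical callable standing in for one request/message round trip.

import os

def rss_kb(pid=None):
    """Return the VmRSS of a process in kB (Linux; reads /proc/<pid>/status)."""
    pid = pid or os.getpid()
    with open('/proc/{0}/status'.format(pid)) as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

def leak_per_call(run_once, iterations=1000):
    """Average RSS growth per call of run_once(), in kB."""
    before = rss_kb()
    for _ in range(iterations):
        run_once()
    after = rss_kb()
    return float(after - before) / iterations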

Comment 6 Jeff Ortel 2016-02-02 15:16:41 UTC
PROTON-1120 marked as fixed in proton 0.12.

Comment 8 Pavel Moravec 2016-04-14 14:33:52 UTC
This bug is _not_ fixed.

1) Backporting PROTON-1120 to Sat 6.1 (which should fix most of the leak) shows the same memory growth as before. No improvement noticed.

2) Trying to reproduce on Sat6.2 beta:
- PROTON-1120 is _not_ backported there
- hit "testing-blocker" https://bugzilla.redhat.com/show_bug.cgi?id=1327236 (package install via goferd does not work in Sat6.2)

Comment 10 Pavel Moravec 2016-04-18 06:11:47 UTC
(In reply to Jeff Ortel from comment #5)
> Found a (small) 384 byte / request memory leak in goferd which has been fixed
> upstream in gofer 2.7.  Found a (big) 5 kB / message memory leak in
> python-qpid-proton and filed:
> https://issues.apache.org/jira/browse/PROTON-1120.

Jeff,
can you please point me to the fix in gofer for the small memory leak and check whether the PROTON-1120 backport really helps with the big one? After just backporting PROTON-1120, I see no improvement.

Comment 11 Jeff Ortel 2016-04-18 15:52:07 UTC
The gofer patch (https://github.com/jortel/gofer/pull/50/files#diff-686b5bbf53616059db94e8e97d2d724bR163) was merged to master and included in the 2.7.4+ builds.  IIRC, I tested with a source build of proton (master) at the commit hash referenced in the JIRA instead of applying the patch, and observed that the proton leak was fixed.  With both patches applied, I leak-checked goferd (again) in isolation using proton and observed no memory growth.  This suggests a more significant leak still exists in the katello-agent plugin.  The plugin uses the Pulp agent handler framework, which uses the YUM lib for RPM package operations.  I think the YUM lib is a likely culprit and will focus my memory profiling efforts there to begin with.

Comment 12 Jeff Ortel 2016-04-19 15:14:49 UTC
Running a test using the YUM lib in isolation, I see the RSS start at ~27 MB, then grow to and remain at ~194 MB.  Using YUM causes a fair amount of additional code and libs to be loaded.  At this point, I'm not sure whether the memory footprint can be attributed to this or to data stored by YUM in global variables.  While running the test, I observed the RSS between ~250 MB and ~290 MB.

TEST
_____________________
import gc
import os

from time import sleep
from yum import YumBase
from yum.Errors import InstallError


def install(*names):
    yb = YumBase()
    try:
        for pattern in names:
            yb.install(pattern=pattern)
        yb.resolveDeps()
        yb.processTransaction()
    finally:
        yb.close()


def erase(*names):
    yb = YumBase()
    try:
        for pattern in names:
            yb.remove(pattern=pattern)
        yb.resolveDeps()
        yb.processTransaction()
    finally:
        yb.close()


def pmap(pid):
    print os.system('pmap -x {0} | grep total'.format(pid))


if __name__ == '__main__':
    pid = os.getpid()
    pmap(pid)
    sleep(12)
    for n in range(3):
        erase('zsh')
        install('zsh')
    gc.collect()
    pmap(pid)
    sleep(30)

Comment 13 Jeff Ortel 2016-04-19 16:35:09 UTC
Running the test 100 times shows an initial increase of ~174 MB, then a steady increase of ~80 MB up to ~290 MB; that is, ~800 KB per erase/install operation.  I'm not suggesting this is the only thing to investigate, just posting the information.
________________
total kB          335140   27040   13560
total kB          529304  201124  183532
total kB          521636  193536  175944
total kB          524524  196400  178808
total kB          527952  199904  182312
total kB          530720  202668  185076
total kB          532500  204460  186868
total kB          535984  208016  190424
total kB          539360  211304  193712
total kB          543200  215056  197464
total kB          545504  217344  199752
total kB          548320  220112  202520
total kB          551136  223036  205444
total kB          553184  225052  207460
total kB          555232  227116  209524
total kB          557536  229568  211976
.....
.....
total kB          608112  280036  262444
total kB          608624  280568  262976
total kB          608112  280048  262456
total kB          608368  280228  262636
total kB          608880  280780  263188
total kB          608880  280868  263276
total kB          609392  281292  263700
total kB          610160  282064  264472
total kB          608880  280912  263320
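
As a rough cross-check of the ~800 KB per operation figure: in pmap -x output the three numbers on the "total" line are Kbytes, RSS and Dirty, so taking the RSS column of the second and last samples above, and assuming the 100 erase/install iterations mentioned in the comment:

start_rss = 201124   # kB, second sample (after the initial warm-up jump)
end_rss = 280912     # kB, last sample
iterations = 100     # assumed: one erase/install pair per iteration
print((end_rss - start_rss) / float(iterations))   # ~798 kB per operation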

Comment 15 pulp-infra@redhat.com 2016-05-02 17:23:15 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 16 pulp-infra@redhat.com 2016-05-02 17:23:18 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 18 Jeff Ortel 2016-05-04 19:45:30 UTC
Bug filed against YUM: https://bugzilla.redhat.com/show_bug.cgi?id=1333157

Comment 20 Pavel Moravec 2016-05-06 18:29:17 UTC
FYI, I think there is a memory leak even with qpid-proton-0.9-16, outside the scope of the yum BZ.

When I add the marked line to /usr/lib/python2.7/site-packages/gofer/rmi/dispatcher.py:

    def dispatch(self, document):
        """
        Dispatch the requested RMI.
        :param document: A request document.
        :type document: Document
        :return: The result.
        :rtype: any
        """
        try:
            self.log(document)
            auth = self.auth(document)
            request = Request(document.request)
            raise NotPermitted(request)  # <------------- added
            method = RMI(request, auth, self.catalog)
            return method()
        except Exception:
            log.exception(str(document))
            return Return.exception()

to raise an exception instead of calling the RMI command, then:
- no package is (un)installed by goferd
- memory increase is still present

Comment 21 Pavel Moravec 2016-05-08 10:30:26 UTC
Created attachment 1155034 [details]
tcpdump of unencrypted goferd<->qdrouterd communication

Testing the 0.9-16 proton version further, there is something fishy in several places:

1) Taking a tcpdump of the unencrypted AMQP communication between goferd and qdrouterd, goferd or proton sends data on a connection that is already in CLOSE_WAIT (!). Open the attached tcpdump and apply the filter "tcp.stream eq 4" to see it.

In this TCP stream:
- gofer responds on some sender to something
- then it goes idle
- the connection times out, so it is closed on the AMQP and also the TCP layer from the qdrouterd side
- later on, gofer tries to attach a sender and closes the AMQP session; by this time, the TCP connection is already in CLOSE_WAIT
I *think* this has something in common with the memory leak.

2) Trying to reproduce it, I found 3 different places where some memory is leaked:

2a) reproducer (imports shared by the reproducers below):

from time import sleep
from uuid import uuid4
from proton.utils import BlockingConnection

sleep(100)
for i in range(1,100):
  sleep(1)
  try:
    conn = BlockingConnection("proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647", ssl_domain=None, heartbeat=1)
    print "conn created"
    sleep(2)
    conn.close()
    conn = None
  except Exception, e:
    print e
    conn.close()
    pass
print "done"
sleep(100)

shows a memory increase (measured during the final sleep(100)) of 2244 kB in RSS.


2b) reproducer:
sleep(100)
for i in range(1,100):
  sleep(1)
  try:
    conn = BlockingConnection("proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647", ssl_domain=None, heartbeat=1)
    print "conn created"
    sleep(2)
    try:
      snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
    except Exception:
      pass
    conn.close()
    conn = None
  except Exception, e:
    print e
    conn.close()
    pass
print "done"
sleep(100)

shows a memory increase (measured during the final sleep(100)) of 4872 kB in RSS. So adding the conn.create_sender call on a timed-out but not yet closed connection adds another ~2.6 MB leak (per 100 calls).

2c) reproducer:
sleep(100)
for i in range(1,100):
  sleep(1)
  try:
    conn = BlockingConnection("proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647", ssl_domain=None, heartbeat=2)
    print "conn created"
    sleep(20)
    try:
      snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
      snd.close()
    except Exception:
      print "passed exception"
      pass
    if (conn):
      conn.close()
    snd = None
    conn = None
  except Exception, e:
    print e
    conn.close()
    pass
print "done"
sleep(100)

shows a memory increase (measured during the final sleep(100)) of 3804 kB in RSS, so even calling create_sender on a timed-out, CLOSE_WAIT connection consumes an additional ~1.6 MB of memory (per 100 calls).

(this reproducer mimics sending the attach frame on a CLOSE_WAIT connection)
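
A small worked check of how those three figures relate (numbers taken directly from the reproducers above; the variable names are just shorthand for 2a-2c):

base = 2244               # 2a: connect + close only, kB RSS over ~100 loops
with_sender = 4872        # 2b: 2a plus create_sender on a timed-out connection
with_sender_close = 3804  # 2c: 2a plus create_sender/close after a longer idle
print(with_sender - base)         # 2628 kB -> the "another ~2.6 MB" above
print(with_sender_close - base)   # 1560 kB -> the "additional ~1.6 MB" above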

Comment 25 pulp-infra@redhat.com 2016-05-23 14:00:28 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 27 pulp-infra@redhat.com 2016-11-14 17:02:11 UTC
The Pulp upstream bug status is at CLOSED - WONTFIX. Updating the external tracker on this bug.

Comment 28 pulp-infra@redhat.com 2016-11-14 18:32:13 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

Comment 29 pulp-infra@redhat.com 2016-11-21 21:01:59 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 31 Jeff Ortel 2016-11-28 21:49:21 UTC
Here is the relevant bug:  https://bugzilla.redhat.com/show_bug.cgi?id=1334916

Comment 33 Andrew Kofink 2017-02-22 15:11:49 UTC
Would it be possible to build a hotfix for Satellite 6.2.z?

Satellite 6.2.7 Versions:
qpid-proton-c-0.9-16.el7.x86_64
pulp-server-2.8.7.5-1.el7sat.noarch

Comment 34 Jeff Ortel 2017-02-22 17:52:29 UTC
Looks like the actual fix: https://bugzilla.redhat.com/show_bug.cgi?id=1334916 has already been released in RHEL.  Suggest upgrading subscription-manager.

Comment 40 errata-xmlrpc 2018-02-21 12:57:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0338