Bug 1272758
| Field | Value |
| --- | --- |
| Summary | goferd memory leak when installing and removing a package to the content host |
| Product | Red Hat Satellite |
| Component | katello-agent |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 6.1.2 |
| Target Milestone | Unspecified |
| Target Release | Unused |
| Hardware | noarch |
| OS | Linux |
| Reporter | Pavel Moravec <pmoravec> |
| Assignee | Jeff Ortel <jortel> |
| QA Contact | Peter Ondrejka <pondrejk> |
| Keywords | Triaged |
| CC | akofink, aperotti, avaddara, bbuckingham, bmbouter, cwelton, daniele, daviddavis, dkliban, dnagarat, ggainey, ipanova, jortel, mcressma, mhrivnak, oshtaier, pcreech, rchan, sauchter, swadeley, ttereshc |
| Fixed In Version | qpid-proton-0.9-16 |
| Doc Type | Bug Fix |
| Doc Text | When repeatedly installing and removing a package on the same Content Host, goferd accumulates memory over time. This has been fixed by locally settling received messages in the qpid-proton library. |
| Story Points | --- |
| Last Closed | 2018-02-21 12:57:01 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1353215, 1399395 |
Description (Pavel Moravec, 2015-10-18 14:19:08 UTC)
Curiously, when replacing the package (un)install with a capsule sync (which also goes via goferd), _no_ memory increase is observed at all.

That suggests the leak is in gofer / katello-agent itself, in the parts that manipulate packages, i.e. the leak should not be in proton / the proton reactor.

(In reply to Pavel Moravec from comment #1)
> Curiously, when replacing package (un)install by capsule sync (that goes via
> goferd as well), _no_ memory increase is observed at all.
>
> That suggests the leak is in gofer / katello-agent itself, in the parts
> manipulating with packages. I.e. the leak should not be in proton / proton
> reactor.

I have to take back this statement. Letting my reproducer (with the capsule sync modification) run for a longer time, it also leads to a memory consumption increase in goferd, though really slowly.

Found a (small) 384 byte / request memory leak in goferd which has been fixed upstream in gofer 2.7. Found a (big) 5 kB / message memory leak in python-qpid-proton and filed: https://issues.apache.org/jira/browse/PROTON-1120.

PROTON-1120 marked as fixed in proton 0.12.

This bug is _not_ fixed:

1) Backporting PROTON-1120 to Sat6.1 (which should fix most of the leak) shows the same memory consumption leak as before. No improvement noticed.

2) Trying to reproduce on the Sat6.2 beta:
- PROTON-1120 is _not_ backported there
- hit the "testing-blocker" https://bugzilla.redhat.com/show_bug.cgi?id=1327236 (package install via goferd does not work in Sat6.2)

(In reply to Jeff Ortel from comment #5)
> Found a (small) 384 byte / request memory leak in goferd which as been fixed
> upstream in gofer 2.7. Found a (big) 5 kB / message memory leak in
> python-qpid-proton and filed:
> https://issues.apache.org/jira/browse/PROTON-1120.

Jeff, can you please point me to the fix in gofer for the small memory leak and check whether the PROTON-1120 backport really helps with the big one? After just backporting PROTON-1120, I see no improvement.

The gofer patch: https://github.com/jortel/gofer/pull/50/files#diff-686b5bbf53616059db94e8e97d2d724bR163 was merged to master and was included in the 2.7.4+ builds. IIRC, I tested with a source build of proton (master) using the commit hash referenced in the JIRA instead of applying the patch and observed the proton leak fixed.

With both patches applied, I leak-checked goferd (again) in isolation using proton and observed no memory growth. This suggests a more significant leak still exists in the katello-agent plugin. This plugin uses the Pulp agent handler framework, which uses the YUM lib for RPM package operations. I think the YUM lib is a likely culprit and will focus my memory profiling efforts there to begin with.

Running a test using the YUM lib in isolation, I see the RSS start at ~27 MB and grow to, and remain at, ~194 MB. Using YUM causes a fair amount of additional code and libs to be loaded. At this point, I'm not sure whether the memory footprint can be attributed to this or to data stored by YUM in global variables. While running the test, I observed the RSS between ~250 MB and ~290 MB.
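
As an aside, per-operation figures like the 384 byte/request and 5 kB/message numbers above are typically obtained with the same basic pattern as the test below: run the suspect operation in a loop, force a garbage collection, and compare the process RSS (and, to tell Python-level leaks from native ones, the live object count) before and after. Below is a minimal, generic sketch of that pattern on Linux; the helper names are illustrative and are not part of gofer, proton, or the test that follows.

```python
# Sketch only: generic per-operation leak measurement on Linux.
# The helper names (rss_kb, measure_growth) are illustrative; they are not
# taken from gofer, proton, or this bug report.
import gc
import re


def rss_kb():
    # Current RSS of this process, read from /proc (Linux only).
    with open('/proc/self/status') as f:
        return int(re.search(r'VmRSS:\s+(\d+) kB', f.read()).group(1))


def measure_growth(operation, iterations=100):
    gc.collect()
    objects_before = len(gc.get_objects())
    rss_before = rss_kb()
    for _ in range(iterations):
        operation()
    gc.collect()
    rss_delta = rss_kb() - rss_before
    objects_delta = len(gc.get_objects()) - objects_before
    # Report total and per-operation RSS growth plus the change in the
    # number of live Python objects.
    print('RSS growth: %d kB (%.1f kB/op), live objects: %+d'
          % (rss_delta, float(rss_delta) / iterations, objects_delta))
```

If the RSS keeps climbing while the object count stays roughly flat, the growth is most likely happening in a C extension rather than in Python objects.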
TEST

```python
import gc
import os

from time import sleep

from yum import YumBase
from yum.Errors import InstallError


def install(*names):
    yb = YumBase()
    try:
        for pattern in names:
            yb.install(pattern=pattern)
        yb.resolveDeps()
        yb.processTransaction()
    finally:
        yb.close()


def erase(*names):
    yb = YumBase()
    try:
        for pattern in names:
            yb.remove(pattern=pattern)
        yb.resolveDeps()
        yb.processTransaction()
    finally:
        yb.close()


def pmap(pid):
    print os.system('pmap -x {0} | grep total'.format(pid))


if __name__ == '__main__':
    pid = os.getpid()
    pmap(pid)
    sleep(12)
    for n in range(3):
        erase('zsh')
        install('zsh')
        gc.collect()
        pmap(pid)
    sleep(30)
```

Running the test 100 times shows an initial increase of 174 MB, then a steady increase of 80 MB up to 290 MB, or ~800 KB per erase/install operation. Not suggesting this is the only thing to investigate; just posting the information.

Output (the columns are Kbytes, RSS and Dirty, as printed by pmap -x):

```
total kB  335140   27040   13560
total kB  529304  201124  183532
total kB  521636  193536  175944
total kB  524524  196400  178808
total kB  527952  199904  182312
total kB  530720  202668  185076
total kB  532500  204460  186868
total kB  535984  208016  190424
total kB  539360  211304  193712
total kB  543200  215056  197464
total kB  545504  217344  199752
total kB  548320  220112  202520
total kB  551136  223036  205444
total kB  553184  225052  207460
total kB  555232  227116  209524
total kB  557536  229568  211976
.....
.....
total kB  608112  280036  262444
total kB  608624  280568  262976
total kB  608112  280048  262456
total kB  608368  280228  262636
total kB  608880  280780  263188
total kB  608880  280868  263276
total kB  609392  281292  263700
total kB  610160  282064  264472
total kB  608880  280912  263320
```

The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Bug filed against YUM: https://bugzilla.redhat.com/show_bug.cgi?id=1333157

FYI, I think there is a memory leak even with qpid-proton-0.9-16, outside the yum BZ. When I modify /usr/lib/python2.7/site-packages/gofer/rmi/dispatcher.py:

```python
def dispatch(self, document):
    """
    Dispatch the requested RMI.
    :param document: A request document.
    :type document: Document
    :return: The result.
    :rtype: any
    """
    try:
        self.log(document)
        auth = self.auth(document)
        request = Request(document.request)
        raise NotPermitted(request)  # <------------- added
        method = RMI(request, auth, self.catalog)
        return method()
    except Exception:
        log.exception(str(document))
        return Return.exception()
```

to raise an exception instead of calling the RMI command, then:
- no package is (un)installed by goferd
- the memory increase is still present
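
One rough way to watch that growth on a live system is to sample goferd's RSS together with its TCP connection states from the outside at regular intervals; connections stuck in CLOSE_WAIT turn out to be relevant in the next comment. The following is only a sketch, assuming psutil is installed; matching the process by command line and the 60-second interval are illustrative choices, not taken from this bug.

```python
# Sketch only: sample goferd's RSS and TCP connection states from outside.
# Assumes psutil is available; process matching and interval are illustrative.
import time
import psutil


def find_goferd():
    # Find the first process whose command line mentions goferd.
    for proc in psutil.process_iter():
        try:
            if 'goferd' in ' '.join(proc.cmdline()):
                return proc
        except psutil.Error:
            continue
    raise RuntimeError('goferd process not found')


def monitor(interval=60):
    proc = find_goferd()
    while True:
        rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
        # Reading another process's sockets generally requires root.
        states = [c.status for c in proc.connections(kind='tcp')]
        print('rss=%.1f MB, tcp connections=%d, close_wait=%d'
              % (rss_mb, len(states), states.count(psutil.CONN_CLOSE_WAIT)))
        time.sleep(interval)


if __name__ == '__main__':
    monitor()
```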

Created attachment 1155034 [details]
tcpdump of unencrypted goferd<->qdrouterd communication

Testing the 0.9-16 proton version further, there is something fishy in more places:

1) Taking a tcpdump of the unencrypted AMQP communication between goferd and qdrouterd, goferd or proton sends data to a connection that is already in CLOSE_WAIT (!). Open the attached tcpdump and apply the filter "tcp.stream eq 4" to see it. In this TCP stream:
- gofer responds on some sender to something
- then it goes idle
- the connection times out, so it is closed on the AMQP and also the TCP layer from the qdrouterd side
- later on, gofer tries to attach a sender and closes the AMQP session; at this point the TCP connection is already in CLOSE_WAIT

I *think* this has something to do with the memory leak.

2) Trying to reproduce it, I found 3 different places where some memory is leaked.

2a) reproducer:

```python
# Imports implied by the reproducers below (not shown in the original
# comment); BlockingConnection comes from proton.utils.
from time import sleep
from uuid import uuid4

from proton.utils import BlockingConnection

sleep(100)
for i in range(1, 100):
    sleep(1)
    try:
        conn = BlockingConnection(
            "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647",
            ssl_domain=None, heartbeat=1)
        print "conn created"
        sleep(2)
        conn.close()
        conn = None
    except Exception, e:
        print e
        conn.close()
        pass
print "done"
sleep(100)
```

shows a memory increase (taken during "sleep(100)") of 2244 kB in RSS.

2b) reproducer:

```python
sleep(100)
for i in range(1, 100):
    sleep(1)
    try:
        conn = BlockingConnection(
            "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647",
            ssl_domain=None, heartbeat=1)
        print "conn created"
        sleep(2)
        try:
            snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
        except Exception:
            pass
        conn.close()
        conn = None
    except Exception, e:
        print e
        conn.close()
        pass
print "done"
sleep(100)
```

shows a memory increase (taken during "sleep(100)") of 4872 kB in RSS. So adding the conn.create_sender call on a timed-out but not yet closed connection adds another 2.6 MB of leak (per 100 calls).

2c) reproducer:

```python
sleep(100)
for i in range(1, 100):
    sleep(1)
    try:
        conn = BlockingConnection(
            "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5647",
            ssl_domain=None, heartbeat=2)
        print "conn created"
        sleep(20)
        try:
            snd = conn.create_sender("pulp.bz1272758", name=str(uuid4()))
            snd.close()
        except Exception:
            print "passed exception"
            pass
        if (conn):
            conn.close()
        snd = None
        conn = None
    except Exception, e:
        print e
        conn.close()
        pass
print "done"
sleep(100)
```

shows a memory increase (taken during "sleep(100)") of 3804 kB in RSS. So even calling create_sender on a timed-out, CLOSE_WAIT connection consumes an additional 1.6 MB of memory (per 100 calls). (This reproducer mimics sending the attach frame on a CLOSE_WAIT connection.)

The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

The Pulp upstream bug status is at CLOSED - WONTFIX. Updating the external tracker on this bug.

The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Here is the relevant bug: https://bugzilla.redhat.com/show_bug.cgi?id=1334916

Would it be possible to build a hotfix for Satellite 6.2.z?

Satellite 6.2.7 versions:
- qpid-proton-c-0.9-16.el7.x86_64
- pulp-server-2.8.7.5-1.el7sat.noarch

Looks like the actual fix, https://bugzilla.redhat.com/show_bug.cgi?id=1334916, has already been released in RHEL. Suggest upgrading subscription-manager.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0338