Description of problem:
Some race condition when (un)installing a package (or, I guess, also applying an erratum) from Satellite 6 on a Content Host registered to an external Capsule triggers a memory leak in qdrouterd. Debugging further, the number of qdrouterd links increases over time (see reproducer). It is not known at the moment whether the same scenario applies to qdrouterd on the Satellite as well (it is expected to, though).

Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-11.el7.x86_64

How reproducible:
100% within 30 minutes

Steps to Reproduce:
1. Have Sat6, an external Capsule and a Content Host registered to the Capsule (ensure goferd is running on the client host).
2. On the Capsule, monitor every minute or so (see also the monitoring sketch below):

qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -l
ps aux | grep qdrouterd | grep -v grep

(how to get qdstat: it ships in the qpid-dispatch-tools package, which is not accessible to Capsules by default; use brew or scp /usr/bin/qdstat from the Satellite)

3. In a loop, install and uninstall a package to/from the Content Host:

action=install
while true; do
        hammer -u admin -p adminPassword content-host package $action --organization-id 1 --content-host-id hostUUID --packages sos
        if [ "$action" == "install" ]; then
                action=remove
        else
                action=install
        fi
        date
        sleep 5
        date
done

4. Monitor the outputs from step 2.

Actual results:
- memory of qdrouterd grows over time
- the links list in the qdstat -l output grows over time

Expected results:
Neither the memory usage nor the number of qdrouterd links grows over time.

Additional info:
(just restarting goferd is _not_ a reproducer leading to the leak, btw)
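A minimal monitoring sketch for step 2 (python2, like the reproducer below; an illustration only - it assumes qdstat is installed on the Capsule, exactly one qdrouterd process, and the certificate paths from the steps above):

#!/usr/bin/python
# Sketch: print a timestamped qdrouterd link count and RSS once a minute.
# Assumptions: qdstat is on the PATH and a single qdrouterd process runs.
import subprocess
import time

QDSTAT = ["qdstat", "-b", "0.0.0.0:5647",
          "--ssl-certificate", "/etc/pki/katello/qpid_router_client.crt",
          "--ssl-key", "/etc/pki/katello/qpid_router_client.key",
          "--ssl-trustfile", "/etc/pki/katello/certs/katello-default-ca.crt",
          "-l"]

while True:
    out = subprocess.check_output(QDSTAT)
    # Count the data rows of the "Router Links" table; a rough count only,
    # since the header layout differs between qdstat versions.
    links = sum(1 for line in out.splitlines()
                if line.lstrip().startswith(("endpoint", "inter-router")))
    rss = subprocess.check_output("ps -o rss= -C qdrouterd",
                                  shell=True).strip()
    print("%s links=%d rss_kb=%s" % (time.ctime(), links, rss))
    time.sleep(60)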
Standalone reproducer outside Satellite6, using just a python script calling the proton reactor API:

1) Have a qpid broker and link routing to/from the pulp.task queue there (create the queue in qpidd).

2) Have this script connecting to qdrouterd on port 5648 (reproducible independently of whether SSL is used or not):

# cat bz1312419-qdrouterd-memLeak-links.py
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton import Message
from proton.utils import BlockingConnection
import traceback
import random

conn = BlockingConnection("proton+amqp://0.0.0.0:5648", ssl_domain=None, heartbeat=2)
while True:
    sleep(random.uniform(0.3, 3))
    try:
        # create a sender, then sleep past the 2s heartbeat before sending
        snd = conn.create_sender("pulp.task", name=str(uuid4()))
        print "sleeping.."
        sleep(2.5)
        snd.send(Message(body="test", durable=True))
        snd.close()
    except ConnectionException:
        try:
            if conn:
                conn.close()
            conn = BlockingConnection("proton+amqp://0.0.0.0:5648", ssl_domain=None, heartbeat=2)
        except Exception, e:
            print e
            print(traceback.format_exc())
#

3) Call the script - calling it sequentially, you see the mem.leak (use the qdstat -l and ps commands to verify). The bug seems to be non-deterministic; the link leak occurs only sometimes.

4) Call the script 10 times in parallel (see the driver sketch below) and qdrouterd segfaults within a very few minutes with this backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe99fc700 (LWP 18411)]
qd_link_pn (link=link@entry=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:768
768         return link->pn_link;
(gdb) bt full
#0  qd_link_pn (link=link@entry=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:768
No locals.
#1  0x00007ffff7bc5c28 in qd_router_flow (context=0x7fffc8046230, discard=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1261
        link = 0x0
        pn_link = <optimized out>
        delta = <optimized out>
        le = 0x7fffc8046230
#2  0x00007ffff7bc8f36 in invoke_deferred_calls (conn=conn@entry=0x7fffd8023860, discard=discard@entry=false) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:304
        calls = {head = 0x0, tail = <optimized out>, scratch = 0x7fffc8047770, size = <optimized out>}
        dc = 0x7fffc8047770
#3  0x00007ffff7bc9bed in process_connector (cxtr=0x7fffd8015c00, qd_server=0x8fcc10) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:397
        ctx = 0x7fffd8023860
        events = <optimized out>
        passes = <optimized out>
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
        work_done = 1
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7fffd8015c00
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x8fcc10
#5  0x00007ffff773bdc5 in start_thread (arg=0x7fffe99fc700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7fffe99fc700
        now = <optimized out>

My understanding of the error:
- key factor: send a message to qdrouterd on a heartbeat-timeouted sender
- the router seems to (sometimes?) re-enable the link - this is where the link leak originates
- when the link is then attempted to be used (by some real connection/link?), a segfault can occur
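A trivial driver sketch for step 4 (an illustration; it assumes the reproducer is saved under the filename shown above - adjust the path as needed):

#!/usr/bin/python
# Sketch: launch 10 parallel instances of the reproducer from step 2.
import subprocess

procs = [subprocess.Popen(["python", "bz1312419-qdrouterd-memLeak-links.py"])
         for _ in range(10)]
for p in procs:
    p.wait()  # the instances loop forever; interrupt with Ctrl-C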
Trying to reproduce this again (to see whether the backtrace can be a different one, matching a customer's _different_ backtrace), some new observations were found:

1) the segfault appears more often on the Satellite's qdrouterd (following the python script reproducer, outside Sat) than on the Capsule's qdrouterd.
2) the backtrace is always the same (so far).
(In reply to Pavel Moravec from comment #4)
> Trying to reproduce this again (to see whether the backtrace can be a
> different one, matching a customer's _different_ backtrace), some new
> observations were found:
>
> 1) the segfault appears more often on the Satellite's qdrouterd (following
> the python script reproducer, outside Sat) than on the Capsule's qdrouterd.

I mean: having link routing from the Capsule's qdrouterd via the Satellite's qdrouterd to qpidd on the Satellite, and running the script against the Capsule's qdrouterd, it is the qdrouterd on the _Satellite_ that segfaults more often. With the same backtrace.
There is another bug open as well around what happens when you uninstall and install a package over and over again; there are tasks failing too when you do this. Is this in any way related? bz 1293911
(In reply to mlinden from comment #6)
> There is another bug open as well around what happens when you uninstall
> and install a package over and over again; there are tasks failing too when
> you do this. Is this in any way related? bz 1293911

These sound like two separate issues:

- here (bz1312419): an FD leak in qdrouterd
- there (bz1293911): a concurrency bug in the Actions::Katello::System::GenerateApplicability task (which apparently can't be run in parallel)

The reproducer from bz1293911 does not trigger bz1312419, since it does not use qdrouterd / goferd. The reproducer from this bz1312419 *can* trigger bz1293911, since it installs and uninstalls a package, which triggers GenerateApplicability as well (though with a lower frequency, since there is some overhead in katello agent / goferd that needs some time before the yum command is really called).
(In reply to Pavel Moravec from comment #3)
> My understanding of the error:
> - key factor: send a message to qdrouterd on a heartbeat-timeouted sender
> - the router seems to (sometimes?) re-enable the link - this is where the
> link leak originates
> - when the link is then attempted to be used (by some real
> connection/link?), a segfault can occur

Another backtrace of a qdrouterd segfault that is _assumed_ to have the same cause:

#0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#1  pn_record_get (record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:120
#2  0x00007f4d633323c3 in pn_connection_get_context (conn=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#3  0x00007f4d63575e25 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:780
#4  0x00007f4d63583445 in router_link_attach_handler (context=0x2382e40, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1612
#5  0x00007f4d63575285 in handle_link_open (container=<optimized out>, pn_link=0x7f4d4e790090) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:201
#6  process_handler (unused=<optimized out>, qd_conn=0x7f4d5000cbe0, container=0x22f8620) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:389
#7  handler (handler_context=0x22f8620, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f4d5000cbe0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:496
#8  0x00007f4d6358742c in process_connector (cxtr=0x7f4d50010340, qd_server=0x21f9b80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:394
#9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622

Here, the cause is that link assignment failed (due to all links being consumed/in use) and proton returned null instead of a pointer to a link object. The same sounds possible for the original backtrace as well.
FYI, on 0.4-11 (and I guess on any other version as well), the leak is visible in a coredump when printing the router links:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fe26acc985f in sys_cond_wait (cond=<optimized out>, held_mutex=0x1fb5e10) at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:107
#2  0x00007fe26acd2a79 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:508
#3  0x00007fe26acd3820 in qd_server_run (qd=0x1cf4030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:961
#4  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffee370bf2d "/etc/qpid-dispatch/qdrouterd.conf", python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#5  0x0000000000401950 in main (argc=3, argv=0x7ffee3709de8) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
(gdb) frame 3
#3  0x00007fe26acd3820 in qd_server_run (qd=0x1cf4030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:961
961         thread_run((void*) qd_server->threads[0]);
(gdb) p qd_server->qd->router->links
$1 = {head = 0x7fe24000c240, tail = 0x20cfa80, scratch = 0x0, size = 52}
(gdb)

Here, the router claims it has 52 links while it actually had approx. 5 only.
Created attachment 1170419 [details] hotfix packages
@Pavel, I was attempting to verify this bug on Sat 6.2 GA Snap 20.1 and I observed that the links are not increasing, but the memory leak is still there. The observations within a half-hour run:

1. Links: after an hour, the links are still only the expected ones, as below:

# qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -l
Router Links
  type          dir  rindex  class  address          phase  event-fifo  msg-fifo
  ================================================================================
  inter-router  in   0       -      -                -      0           -
  inter-router  out  1       local  qdhello          0      0
  endpoint      out  -       -      -                -      0           0
  endpoint      in   -       -      -                -      0           -
  endpoint      in   -       -      -                -      0           -
  endpoint      out  -       local  temp.ZmfbPJOoog  0      0

# qdstat -l
LinkDetached: sender 0769f652-0d61-4ed4-b5a2-84416dacaeaf-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')
# qdstat -l
LinkDetached: sender 9ad682e1-bf40-4d58-a929-9f26390844eb-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')
# qdstat -l
LinkDetached: sender 0bd2a4f7-29b7-485c-83be-2b3ed37c00da-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')

2. The memory leak is still observed: every minute it occupies, on average, an additional 100 kB of memory, as below:

# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440744 24108 ?        Ssl  Jul13   2:00 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440744 24312 ?        Ssl  Jul13   2:00 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24496 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24580 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24764 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf

This behavior still makes it doubtful to mark this bug as Verified. So can you please confirm whether to verify this bug or reassign it for a fix?
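For reference, the ~100 kB/min figure above can be estimated with a sketch like this (an illustration only; it assumes a single qdrouterd process and that the ps RSS column is in kB):

#!/usr/bin/python
# Sketch: estimate the average RSS growth of qdrouterd in kB per minute.
import subprocess
import time

def rss_kb():
    # RSS (in kB) of the single qdrouterd process
    out = subprocess.check_output("ps -o rss= -C qdrouterd", shell=True)
    return int(out.split()[0])

start_rss = rss_kb()
start_time = time.time()
time.sleep(1800)  # half an hour, matching the test above
minutes = (time.time() - start_time) / 60.0
print("avg growth: %.1f kB/min" % ((rss_kb() - start_rss) / minutes))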
I second that even the 0.4-12 package suffers from a memory leak - see my previous updates. I think the reproducer hit:

- a link leak in qdrouterd that causes an immediate memory leak and also a segfault after a while - this seems to be fixed
- another memory leak (approx. 10 kB per one package install) that is still unfixed

Technically, the BZ has not been completely fixed, since the same reproducer is still valid for one of the two negative consequences. The still-present memory leak has a simple reproducer based on a .py script using the proton reactor library only - no Satellite bits. See https://bugzilla.redhat.com/show_bug.cgi?id=1312419#c32 or the script in an update above.

We can have one BZ per separate memory leak, claiming this BZ fixes just the link leak and a new one covers the remaining leak. But then I would strongly suggest having it fixed in 6.2.0 as well - because there is feedback from Andrew's customer that the 0.4-12 version can consume all RAM of the Satellite within a day. That does not sound - in customer eyes - like we really fixed the leak.

To sum it up, I recommend either:
- mark this as verified and open a new blocker+ sat6.2.0+ BZ requesting a fix for the remaining leak
- or return this as FailedQA, since a nontrivial leak is still present (and I warned about this in the past already)
@Pavel, I would like to return this bug to the Assigned (FailedQA) state, because the memory leak issue is still there and it is good to keep the whole story in one bug instead of branching it. So, as per my comment 44 and comment 45, I am moving this bz to the Assigned state.

Thanks,
Jitendra
I believe this failed because the tests were done on qpid-dispatch-0.4-11, since we had not actually pulled 0.4-12 into the Satellite repositories yet. This was accidentally put into the ON_QA state due to fields being set on the bug that told our tooling it was ready to go to ON_QA. I will properly tag and bring this in, and flip this back to ON_QA when 0.4-12 is in the repositories.
(In reply to Eric Helms from comment #48)
> I believe this failed because the tests were done on qpid-dispatch-0.4-11,
> since we had not actually pulled 0.4-12 into the Satellite repositories
> yet. This was accidentally put into the ON_QA state due to fields being set
> on the bug that told our tooling it was ready to go to ON_QA. I will
> properly tag and bring this in, and flip this back to ON_QA when 0.4-12 is
> in the repositories.

OK; anyway, per my private testing of 0.4-12, the 2nd mem.leak is still there even on that version - see comment #30 and #32.
Let's leave this bug as-is and continue to see if we can find a resolution to the memory leak (vs. closing this one out). The end user just needs the leak resolved, and we will work towards finding a solution.
Hi Eric,

The rate of increase on 0.4-13 is the same; I don't find it any different from 0.4-12. The rate is the same, at an avg. of 100 KB per minute, as mentioned in my last test comment 44. If that rate is expected, then I am happy to close this bug.
Verified! @ Sat 6.2 Snap 21.0 GA

As per the discussions and a half-hour test done on qpid 0.4-13, I can say that:

1. The number of links is not growing over time; that is fixed.
2. A memory leak is still observed, but compared to earlier versions / snaps it looks better; only a minor memory leak is observed.

More details about the memory leak on 0.4-13 are in comment 51, based on which I FailedQA this earlier; but compared to earlier snaps, the minor memory leak is acceptable. Also, for that minor leak a new bug has been created: https://bugzilla.redhat.com/show_bug.cgi?id=1358948, which will be fixed in later z-stream builds. So as of now I am verifying this bug, with the understanding that a minor leak is still observed but will be fixed in z-stream builds.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1501
Created attachment 1189617 [details] Valgrind output for qdrouterd that ran with reproducer from comment #1 (5 loops)
Created attachment 1189618 [details] Valgrind output for qdrouterd that ran with reproducer from comment #1 (50 loops)