Bug 1312419 - Memory leak and occasional segfault in qdrouterd when (un)installing a package from Satellite6
Summary: Memory leak and occasional segfault in qdrouterd when (un)installing a package from Satellite6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: katello-agent
Version: 6.1.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Jitendra Yejare
URL:
Whiteboard:
Depends On:
Blocks: 1328299
 
Reported: 2016-02-26 16:19 UTC by Pavel Moravec
Modified: 2021-08-30 10:44 UTC
CC List: 17 users

Fixed In Version: qpid-dispatch-0.4-13
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1328299 (view as bug list)
Environment:
Last Closed: 2016-07-27 11:02:51 UTC
Target Upstream Version:
Embargoed:


Attachments
hotfix packages (580.00 KB, application/x-tar)
2016-06-21 21:53 UTC, Mike McCune
Valgrind output for qdrouterd that ran with reproducer from comment #1 (5 loops) (5.18 MB, text/plain)
2016-08-10 12:28 UTC, Barbora Vassova
Valgrind output for qdrouterd that ran with reproducer from comment #1 (50 loops) (5.30 MB, text/plain)
2016-08-10 12:29 UTC, Barbora Vassova


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1340375 0 medium CLOSED Backport DISPATCH-154 to Satellite / qpid-dispatch-router 2021-08-30 10:43:40 UTC
Red Hat Knowledge Base (Solution) 2280251 0 None None None 2016-04-26 13:28:45 UTC

Internal Links: 1340375

Description Pavel Moravec 2016-02-26 16:19:55 UTC
Description of problem:
Some race condition when (un)installing a package (or, I guess, also an erratum) from Satellite6 to a Content Host registered to an external Capsule triggers a memory leak in qdrouterd.

Debugging further, the number of qdrouterd links increases over time (see reproducer).

It is not known ATM if the same scenario applies to qdrouterd on the Satellite as well (it is expected to, though).


Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-11.el7.x86_64


How reproducible:
100% in 30 mins


Steps to Reproduce:
1. Have Sat6, external Capsule and a Content Host registered to the Capsule (ensure goferd on the client host is running)
2. On Capsule, monitor every minute or so:

qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -l

ps aux | grep qdrouterd | grep -v grep

(how to get qdstat: it comes from the qpid-dispatch-tools package, which is not available to capsules by default - use brew or scp /usr/bin/qdstat from the Sat)
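
The two monitoring commands above can also be wrapped in a small polling script. A minimal sketch (hypothetical helper, not from the original report; the three-line header offset used to count qdstat table rows is an assumption):

#!/usr/bin/python
# hypothetical helper: sample the qdrouterd link count and RSS once a minute
import subprocess
import time

QDSTAT = ["qdstat", "-b", "0.0.0.0:5647",
          "--ssl-certificate", "/etc/pki/katello/qpid_router_client.crt",
          "--ssl-key", "/etc/pki/katello/qpid_router_client.key",
          "--ssl-trustfile", "/etc/pki/katello/certs/katello-default-ca.crt",
          "-l"]

while True:
    out = subprocess.check_output(QDSTAT)
    # crude link count: total lines minus the title/header/ruler lines
    links = max(len(out.splitlines()) - 3, 0)
    # sum RSS (kB) over all qdrouterd processes
    ps = subprocess.check_output(["ps", "-C", "qdrouterd", "-o", "rss="])
    rss_kb = sum(int(x) for x in ps.split())
    print time.strftime("%H:%M:%S"), "links ~%d, rss %d kB" % (links, rss_kb)
    time.sleep(60)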

3. In a loop, install and uninstall a package to/from the Content Host:

action=install
while true; do
  hammer -u admin -p adminPassword content-host package $action --organization-id 1 --content-host-id hostUUID --packages sos
  if [ "$action" == "install" ]; then
    action=remove
  else action=install
  fi
  date
  sleep 5
  date
done

4. Monitor outputs of 2.

Actual results:
the memory of qdrouterd grows over time
the links list in qdstat -l output grows over time


Expected results:
Neither the memory nor the number of qdrouterd links grows over time


Additional info:

Comment 2 Pavel Moravec 2016-02-26 17:26:47 UTC
(just restarting goferd is _not_ a reproducer leading to the leak, btw)

Comment 3 Pavel Moravec 2016-02-29 10:43:08 UTC
Standalone reproducer outside Satellite6, using just a python script calling the proton reactor API:

1) Have a qpid broker and link routing to/from a pulp.task queue there (create the queue in qpidd)

2) Have this script connecting to qdrouterd on port 5648 (reproducible regardless of whether SSL is used):

# cat bz1312419-qdrouterd-memLeak-links.py
#!/usr/bin/python

from time import sleep
from uuid import uuid4

from proton import ConnectionException
from proton import Message

from proton.utils import BlockingConnection

import traceback
import random

# The 2s heartbeat appears to be the key ingredient: the sleep(2.5) below
# outlives it, so the router times out the connection while the client
# still holds the sender.
conn = BlockingConnection("proton+amqp://0.0.0.0:5648", ssl_domain=None, heartbeat=2)

while True:
  sleep(random.uniform(0.3,3))
  try:
    snd = conn.create_sender("pulp.task", name=str(uuid4()))
    print "sleeping.."
    sleep(2.5)  # exceed the heartbeat window before sending
    snd.send(Message(body="test", durable=True))
    snd.close()
  except ConnectionException:
    # the connection was dropped (heartbeat timeout); reopen it and retry
    try:
      if conn:
        conn.close()
        conn = BlockingConnection("proton+amqp://0.0.0.0:5648", ssl_domain=None, heartbeat=2)
    except Exception, e:
      print e
      print(traceback.format_exc())
#

3) Call the script - running it sequentially, you see the mem.leak (use the qdstat -l and ps commands above to verify). The bug seems not deterministic; the link leak occurs only sometimes.
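
For step 4 below, a minimal launcher sketch (hypothetical helper; assumes the script above sits in the current directory):

#!/usr/bin/python
# hypothetical launcher: run N copies of the reproducer in parallel
import subprocess

N = 10
procs = [subprocess.Popen(["python", "bz1312419-qdrouterd-memLeak-links.py"])
         for _ in range(N)]
for p in procs:
    p.wait()   # the children run until qdrouterd dies or they are interrupted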

4) Call the script 10 times in parallel and qdrouterd segfaults within a few minutes, with this backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe99fc700 (LWP 18411)]
qd_link_pn (link=link@entry=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:768
768	    return link->pn_link;
(gdb) bt full
#0  qd_link_pn (link=link@entry=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:768
No locals.
#1  0x00007ffff7bc5c28 in qd_router_flow (context=0x7fffc8046230, discard=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1261
        link = 0x0
        pn_link = <optimized out>
        delta = <optimized out>
        le = 0x7fffc8046230
#2  0x00007ffff7bc8f36 in invoke_deferred_calls (conn=conn@entry=0x7fffd8023860, discard=discard@entry=false)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:304
        calls = {head = 0x0, tail = <optimized out>, scratch = 0x7fffc8047770, size = <optimized out>}
        dc = 0x7fffc8047770
#3  0x00007ffff7bc9bed in process_connector (cxtr=0x7fffd8015c00, qd_server=0x8fcc10)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:397
        ctx = 0x7fffd8023860
        events = <optimized out>
        passes = <optimized out>
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
        work_done = 1
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7fffd8015c00
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x8fcc10
#5  0x00007ffff773bdc5 in start_thread (arg=0x7fffe99fc700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7fffe99fc700
        now = <optimized out>



My understanding of the error:
- key factor: a message is sent to qdrouterd on a sender whose heartbeat has timed out
- the router (sometimes?) seems to re-enable(?) the link - this is where the link leak originates
- when the link is later attempted to be used (by some real connection/link?), a segfault can occur
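
A distilled sketch of that suspected sequence (illustration only, not a confirmed minimal reproducer; assumes the router on 0.0.0.0:5648 and the pulp.task link route from step 1):

#!/usr/bin/python
from time import sleep
from proton import Message
from proton.utils import BlockingConnection

# open with a 2s heartbeat, then idle past it: the router times out the
# connection while the client still holds the sender
conn = BlockingConnection("proton+amqp://0.0.0.0:5648", ssl_domain=None, heartbeat=2)
snd = conn.create_sender("pulp.task")
sleep(2.5)                                    # exceed the heartbeat window
snd.send(Message(body="test", durable=True))  # send on the timed-out sender
                                              # (may raise ConnectionException locally)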

Comment 4 Pavel Moravec 2016-03-09 08:29:15 UTC
Trying to reproduce this again (to see whether the backtrace can be a different one, matching a customer's _different_ backtrace), I found some new observations:

1) the segfault appears more often on the Satellite's qdrouterd (following the python script reproducer, outside Sat) than on the Capsule's qdrouterd.

2) the backtrace is always the same (so far)

Comment 5 Pavel Moravec 2016-03-09 08:30:49 UTC
(In reply to Pavel Moravec from comment #4)
> Trying to reproduce this again (to see whether the backtrace can be a
> different one, matching a customer's _different_ backtrace), I found some
> new observations:
> 
> 1) the segfault appears more often on the Satellite's qdrouterd (following
> the python script reproducer, outside Sat) than on the Capsule's qdrouterd.

I mean: having link routing from the Capsule's qdrouterd via the Satellite's qdrouterd to qpidd on the Satellite, and running the script against the Capsule's qdrouterd, it is the qdrouterd on the _Satellite_ that segfaults more often. With the same backtrace.

Comment 6 mlinden 2016-04-01 15:05:29 UTC
There is another bug open as well around what happens if you uninstall and install a package over and over again; there are tasks failing too when you do this. Is this in any way related? bz 1293911

Comment 7 Pavel Moravec 2016-04-02 10:10:04 UTC
(In reply to mlinden from comment #6)
> There is another bug open as well around what happens if you uninstall and
> install a package over and over again; there are tasks failing too when you
> do this. Is this in any way related? bz 1293911

These sound like two separate issues:
- here (bz1312419): an FD leak in qdrouterd
- there (bz1293911): a concurrency bug in the Actions::Katello::System::GenerateApplicability task (which apparently can't be run in parallel)

The reproducer from bz1293911 does not trigger bz1312419, since it does not use qdrouterd / goferd.

The reproducer from this bz1312419 *can* trigger bz1293911, since it installs and uninstalls a package, which triggers GenerateApplicability as well (though with lower frequency, since there is some overhead in katello-agent / goferd that needs some time before the yum command is really called).

Comment 8 Pavel Moravec 2016-04-06 13:06:41 UTC
(In reply to Pavel Moravec from comment #3)

> My understanding of the error:
> - key factor: send a message to qdrouterd on a heartbeat-timeouted sender
> - seems the router (sometimes?) re-enables(?) the link - here the link leak
> origins
> - when the link is attempted to be used (by some real connection/link?),
> segfault can occur

Another backtrace of a qdrouterd segfault that is _assumed_ to have the same cause:

#0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
#1  pn_record_get (record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:120
#2  0x00007f4d633323c3 in pn_connection_get_context (conn=<optimized out>)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
#3  0x00007f4d63575e25 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:780
#4  0x00007f4d63583445 in router_link_attach_handler (context=0x2382e40, link=<optimized out>)
    at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1612
#5  0x00007f4d63575285 in handle_link_open (container=<optimized out>, pn_link=0x7f4d4e790090)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:201
#6  process_handler (unused=<optimized out>, qd_conn=0x7f4d5000cbe0, container=0x22f8620)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:389
#7  handler (handler_context=0x22f8620, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f4d5000cbe0)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:496
#8  0x00007f4d6358742c in process_connector (cxtr=0x7f4d50010340, qd_server=0x21f9b80)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:394
#9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622

Here, the cause is that link assignment failed (due to all links being consumed/in use) and proton returned null instead of a pointer to a link object. This sounds possible for the original backtrace as well.

Comment 16 Pavel Moravec 2016-05-04 06:45:11 UTC
FYI, on 0.4-11 (and I guess on any other version as well), the leak is visible in a coredump when printing the router links:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fe26acc985f in sys_cond_wait (cond=<optimized out>, held_mutex=0x1fb5e10)
    at /usr/src/debug/qpid-dispatch-0.4/src/posix/threading.c:107
#2  0x00007fe26acd2a79 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:508
#3  0x00007fe26acd3820 in qd_server_run (qd=0x1cf4030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:961
#4  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffee370bf2d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2)
    at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
#5  0x0000000000401950 in main (argc=3, argv=0x7ffee3709de8) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
(gdb) frame 3
#3  0x00007fe26acd3820 in qd_server_run (qd=0x1cf4030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:961
961	    thread_run((void*) qd_server->threads[0]);
(gdb) p qd_server->qd->router->links
$1 = {head = 0x7fe24000c240, tail = 0x20cfa80, scratch = 0x0, size = 52}
(gdb) 

Here, the router claims it has 52 links while it actually had only approx. 5.

Comment 39 Mike McCune 2016-06-21 21:53:06 UTC
Created attachment 1170419 [details]
hotfix packages

Comment 44 Jitendra Yejare 2016-07-14 08:20:48 UTC
@Pavel,

I was attempting to verify this bug on Sat 6.2 GA Snap 20.1, and I observed that the links are not increasing but the memory leak is still there:

Observations within a half hour of running:

1. Links:
After an hour, there is still only the one expected link, as below:

-> # qdstat -b 0.0.0.0:5647 --ssl-certificate /etc/pki/katello/qpid_router_client.crt --ssl-key /etc/pki/katello/qpid_router_client.key --ssl-trustfile /etc/pki/katello/certs/katello-default-ca.crt -l
Router Links
  type          dir  rindex  class  address          phase  event-fifo  msg-fifo
  ================================================================================
  inter-router  in   0       -      -                -      0           -
  inter-router  out  1       local  qdhello                 0           0
  endpoint      out  -       -      -                -      0           0
  endpoint      in   -       -      -                -      0           -
  endpoint      in   -       -      -                -      0           -
  endpoint      out  -       local  temp.ZmfbPJOoog         0           0
# qdstat -l
LinkDetached: sender 0769f652-0d61-4ed4-b5a2-84416dacaeaf-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')
# qdstat -l
LinkDetached: sender 9ad682e1-bf40-4d58-a929-9f26390844eb-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')
# qdstat -l
LinkDetached: sender 0bd2a4f7-29b7-485c-83be-2b3ed37c00da-$management to $management closed due to: Condition('amqp:not-found', 'Node not found: $management')


2. The memory leak is still observed:

Every minute it occupies an additional ~100 KB of memory on average, as below:
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440744 24108 ?        Ssl  Jul13   2:00 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440744 24312 ?        Ssl  Jul13   2:00 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24496 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24580 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
# ps aux | grep qdrouterd | grep -v grep
qdroute+ 13199  0.1  0.1 440888 24764 ?        Ssl  Jul13   2:01 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
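
For reference, the growth rate can be estimated directly from the RSS column (6th field) of the ps samples. A small sketch, assuming the samples above were taken roughly one minute apart:

#!/usr/bin/python
# estimate the leak rate from periodic ps RSS samples (kB)
rss = [24108, 24312, 24496, 24580, 24764]   # RSS values from the outputs above
deltas = [b - a for a, b in zip(rss, rss[1:])]
print "avg growth: %.0f kB/interval" % (float(sum(deltas)) / len(deltas))
# -> avg growth: 164 kB/interval for these five samples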

This behavior makes me doubtful about marking this bug as Verified. So can you please confirm whether to verify this bug or reassign it for a fix?

Comment 45 Pavel Moravec 2016-07-15 07:18:47 UTC
I second that even the 0.4-12 package suffers from a memory leak - see my previous updates. I think the reproducer hit:

- a link leak in qdrouterd that causes an immediate memory leak and also a segfault after a while - this seems to be fixed
- another memory leak (approx. 10kB per one package install) that is still unfixed

Technically, the BZ has not been completely fixed, since the same reproducer is still valid for one of the two negative consequences.

The still-present memory leak has a simple reproducer based on a .py script using the proton reactor library only - no Satellite bits. See https://bugzilla.redhat.com/show_bug.cgi?id=1312419#c32 or the script in an update above.

We could have one BZ per separate memory leak, claiming this BZ fixes just the link leak and a new one covers the remaining leak. But then I would highly suggest having it fixed in 6.2.0 as well, because there is feedback from Andrew's customer that the 0.4-12 version can consume all the RAM of the Satellite within a day. In the customer's eyes, that does not sound like we really fixed the leak.

To sum it up, I recommend either:
- mark this as verified and open a new blocker+ sat6.2.0+ BZ requesting a fix for the remaining leak
- or return this as FailedQA, since a nontrivial leak is still present (and I warned about this in the past already)

Comment 46 Jitendra Yejare 2016-07-15 09:56:20 UTC
@Pavel,


I would like to return this bug to the Assigned (FailedQA) state because the memory leak issue is still there, and it is better to keep the whole story in one bug instead of branching it.

So, as per comment 44 and comment 45, I am moving this bz to the Assigned state.

Thanks,
Jitendra

Comment 48 Eric Helms 2016-07-15 11:57:48 UTC
I believe this failed because the tests were done on qpid-dispatch-0.4-11, as we had not actually pulled 0.4-12 into the Satellite repositories yet. This was accidentally put into the ON_QA state due to fields being set on the bug that told our tooling it was ready to go to ON_QA. I will be properly tagging and bringing this in, and flipping this back to ON_QA when 0.4-12 is in the repositories.

Comment 49 Pavel Moravec 2016-07-15 12:01:18 UTC
(In reply to Eric Helms from comment #48)
> I believe this failed because the tests were done on qpid-dispatch-0.4-11,
> as we had not actually pulled 0.4-12 into the Satellite repositories yet.
> This was accidentally put into the ON_QA state due to fields being set on
> the bug that told our tooling it was ready to go to ON_QA. I will be
> properly tagging and bringing this in, and flipping this back to ON_QA when
> 0.4-12 is in the repositories.

OK. Anyway, per my private testing of 0.4-12, the 2nd mem.leak is still present even there - see comments #30 and #32.

Comment 50 Mike McCune 2016-07-18 15:41:37 UTC
Let's leave this bug as-is and continue to see if we can find a resolution to the memory leak (vs closing this one out). The end user just needs the leak resolved, and we will work towards finding a solution.

Comment 54 Jitendra Yejare 2016-07-21 13:17:56 UTC
Hi Eric,

The rate of increase on 0.4-13 is the same, and I don't find it any different from 0.4-12.
The rate is the same, at an avg. of 100 KB per minute, as mentioned in my last test in comment 44.

If that rate is expected, then I am happy to close this bug.

Comment 56 Jitendra Yejare 2016-07-22 09:17:10 UTC
Verified !

@ Sat 6.2 Snap 21.0 GA

As per the discussions and a half-hour test done on qpid 0.4-13, I can say that:
1. The number of links is not growing over time; this is fixed.
2. A memory leak is still observed, but compared to earlier versions / snaps it looks better; only a minor memory leak remains. More details about the memory leak on 0.4-13 are in comment 51, which I FailedQA earlier, but compared to the earlier snaps the minor memory leak is acceptable.

Also, for that minor leak a new bug has been created: https://bugzilla.redhat.com/show_bug.cgi?id=1358948 - it will be fixed in later z-stream builds.

So, as of now, I am verifying this bug with the understanding that a minor leak is still observed but will be fixed in z-stream builds.

Comment 58 Bryan Kearney 2016-07-27 11:02:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1501

Comment 61 Barbora Vassova 2016-08-10 12:28:45 UTC
Created attachment 1189617 [details]
Valgrind output for qdrouterd that ran with reproducer from comment #1 (5 loops)

Comment 62 Barbora Vassova 2016-08-10 12:29:51 UTC
Created attachment 1189618 [details]
Valgrind output for qdrouterd that ran with reproducer from comment #1 (50 loops)

