Bug 1492355 - sporadic deadlock of qdrouterd on bursts of goferd (dis)connection requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.11
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Roman Plevka
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-16 19:26 UTC by Pavel Moravec
Modified: 2022-07-09 09:22 UTC (History)

Fixed In Version: qpid-dispatch-0.4-27
Doc Text:
Clone Of:
: 1530689 (view as bug list)
Environment:
Last Closed: 2018-02-05 13:55:32 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1491160 | 0 | high | CLOSED | qdrouterd segfault when processing bursts of goferd requests | 2024-10-01 16:05:47 UTC
Red Hat Product Errata | RHSA-2018:0273 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Satellite 6 security, bug fix, and enhancement update | 2018-02-08 00:35:29 UTC

Internal Links: 1491160

Description Pavel Moravec 2017-09-16 19:26:59 UTC
Description of problem:
As a side effect of the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 , qdrouterd on the Capsule was found deadlocked and not reacting to anything.

Since *some* deadlock of a Capsule's qdrouterd was recently detected at a customer, the scenario from https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 is expected to reliably mimic a real situation.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64
libqpid-dispatch-0.4-22.el7sat.x86_64


How reproducible:
100% within 30 minutes


Steps to Reproduce:
1. Follow https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 


Actual results:
qdrouterd on the Capsule doesn't react to "kill" (until I use "kill -9", of course), has many connections in CLOSE_WAIT, and doesn't react to anything else.
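
The "many close waits" symptom can be quantified without extra tools. Below is a minimal sketch that counts sockets in CLOSE_WAIT from text in the format of /proc/net/tcp. The helper name count_close_wait and the optional port filter are illustrative assumptions and not part of any Satellite tooling. Port 5647 in the usage note is assumed to be the qdrouterd client listener and should be checked against the local qdrouterd.conf.

```python
# Sketch: count TCP sockets in CLOSE_WAIT from /proc/net/tcp-style text.
# count_close_wait is a hypothetical helper, not an existing tool.

CLOSE_WAIT = "08"  # hex state code the Linux kernel prints for CLOSE_WAIT


def count_close_wait(proc_net_tcp_text, local_port=None):
    """Return the number of CLOSE_WAIT sockets in the given text.

    If local_port is set, only sockets bound to that local port are counted.
    """
    count = 0
    # The first line of /proc/net/tcp is a column header; skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        if state != CLOSE_WAIT:
            continue
        if local_port is not None:
            # Addresses are printed as HEXIP:HEXPORT.
            port = int(local_addr.rsplit(":", 1)[1], 16)
            if port != local_port:
                continue
        count += 1
    return count
```

On the Capsule this could be fed with the contents of /proc/net/tcp (and /proc/net/tcp6), for example count_close_wait(open("/proc/net/tcp").read(), local_port=5647). A count that keeps growing while qdrouterd stays unresponsive would match the symptom described above.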


Expected results:
no deadlock.


Additional info:
gdb shows:

(gdb) thread apply all bt full

Thread 4 (Thread 0x7f29db69c1c0 (LWP 91774)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe490
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29db27e9c0 in qd_server_run (qd=0x1ffc030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
        qd_server = 0x226fbe0
        i = <optimized out>
#7  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
        st = {st_dev = 64768, st_ino = 100760246, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 36, st_blksize = 4096, st_blocks = 0, st_atim = {
            tv_sec = 1493139759, tv_nsec = 0}, st_mtim = {tv_sec = 1505493173, tv_nsec = 576980480}, st_ctim = {tv_sec = 1505493173, tv_nsec = 576980480}, __unused = {0, 0, 0}}
        d = <optimized out>
#8  0x0000000000401950 in main (argc=3, argv=0x7ffeb32d04e8) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
        config_path = 0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf"
        python_pkgdir = 0x402401 "/usr/lib/qpid-dispatch/python"
        pidfile = 0x0
        user = 0x0
        daemon_mode = false
        long_options = {{name = 0x40245b "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x402462 "include", has_arg = 1, flag = 0x0, val = 73}, {name = 0x40246a "daemon", 
            has_arg = 0, flag = 0x0, val = 100}, {name = 0x402471 "pidfile", has_arg = 1, flag = 0x0, val = 80}, {name = 0x402479 "user", has_arg = 1, flag = 0x0, val = 85}, {
            name = 0x40247e "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

Thread 3 (Thread 0x7f29cdb08700 (LWP 91778)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe620
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29cdb08700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29cdb08700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817521284864, -4328525230991775127, 0, 139817521285568, 139817521284864, 0, 4449106029563308649, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 2 (Thread 0x7f29ce309700 (LWP 91777)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe300
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ce309700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ce309700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817529677568, -4328525230991775127, 0, 139817529678272, 139817529677568, 0, 4449100535763266153, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 1 (Thread 0x7f29ceb0a700 (LWP 91776)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b22e4760
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ceb0a700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ceb0a700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817538070272, -4328525230991775127, 0, 139817538070976, 139817538070272, 0, 4449099435714767465, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
(gdb) 

See the core file /root/core.91774 on dell-per430-14.

Comment 3 Ted Ross 2017-09-18 15:51:59 UTC
I believe this was fixed in DISPATCH-518.

https://issues.apache.org/jira/browse/DISPATCH-518

Comment 9 Pavel Moravec 2017-09-30 13:06:13 UTC
Pre-verified as fixed in build https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=600616 (qpid-dispatch-0.4-27.el7sat).

Comment 15 Roman Plevka 2018-01-15 13:36:57 UTC
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:

- Create a Docker image based on RHEL with katello-consumer-ca and katello-agent installed (but not yet registered to the Satellite).
- As the startup script, run the subscription-manager registration and append a conditional loop that starts the gofer daemon on some trigger (I mounted an external directory and made the loop check for the presence of a file).

- Start up many containers (tried with 10, 30, and 50).
- After all the containers are up and their registration has finished (verify by listing the content hosts in Satellite and checking that no more requests are arriving at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, which starts goferd on all containers simultaneously.
- Observe that the number of pulp.agent* queues jumps by the number of running containers within a moment.
- Watch the logs for any errors.


- No errors detected.
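
The waiting loop from the startup script could look roughly like the sketch below. This is not the script actually used for verification. The function names, the trigger path, and the command used to start goferd are all assumptions made for illustration.

```python
import os
import subprocess
import time

# Sketch of a "wait for a trigger file, then start the daemon" loop.
# Names, paths and the start command are hypothetical.


def wait_for_trigger(path, poll_interval=1.0, timeout=None):
    """Block until `path` exists. Return True, or False if `timeout` expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def start_after_trigger(path, command, **wait_kwargs):
    """Run `command` once the trigger file appears; return its exit code.

    Returns None if the wait timed out and the command was never started.
    """
    if wait_for_trigger(path, **wait_kwargs):
        return subprocess.call(command)
    return None
```

A container entry point might then call start_after_trigger("/mnt/trigger/go", ["service", "goferd", "start"]) after registration, where both the mount point and the start command are placeholders. Because every container polls the same mounted directory, creating the file once releases all of them at nearly the same time, which produces the burst of goferd connections.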

Comment 18 errata-xmlrpc 2018-02-05 13:55:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273

