Bug 1492355 - sporadic deadlock of qdrouterd on bursts of goferd (dis)connection requests
Summary: sporadic deadlock of qdrouterd on bursts of goferd (dis)connection requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.11
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Roman Plevka
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-16 19:26 UTC by Pavel Moravec
Modified: 2022-07-09 09:22 UTC (History)

Fixed In Version: qpid-dispatch-0.4-27
Doc Text:
Clone Of:
Clones: 1530689
Environment:
Last Closed: 2018-02-05 13:55:32 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1491160 | 0 | high | CLOSED | qdrouterd segfault when processing bursts of goferd requests | 2022-07-09 09:22:04 UTC
Red Hat Product Errata | RHSA-2018:0273 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Satellite 6 security, bug fix, and enhancement update | 2018-02-08 00:35:29 UTC

Internal Links: 1491160

Description Pavel Moravec 2017-09-16 19:26:59 UTC
Description of problem:
As a side effect of running the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 , qdrouterd on the Capsule was found deadlocked and not reacting to anything.

Since *some* deadlock of the Capsule's qdrouterd was recently detected at a customer, the scenario from https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 is expected to reliably mimic a real situation.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64
libqpid-dispatch-0.4-22.el7sat.x86_64


How reproducible:
100% within 30 minutes


Steps to Reproduce:
1. Follow https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 


Actual results:
qdrouterd on the Capsule doesn't react to "kill" (until I use "kill -9", of course), holds many connections in CLOSE_WAIT, and doesn't react to anything else.
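One way to quantify the "many connections in CLOSE_WAIT" symptom is to count sockets in that state from /proc/net/tcp. The following is a minimal sketch, not part of the original report. It counts host-wide rather than per process, and the function names are made up for illustration. It relies on the Linux convention that the state column in /proc/net/tcp holds hex code 08 for CLOSE_WAIT.

```python
from pathlib import Path

CLOSE_WAIT = "08"  # hex state code for CLOSE_WAIT in /proc/net/tcp


def count_close_wait(proc_net_tcp_text: str) -> int:
    """Count CLOSE_WAIT sockets in the text of /proc/net/tcp (or tcp6)."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: [0] slot, [1] local address, [2] remote address, [3] state
        if len(fields) > 3 and fields[3] == CLOSE_WAIT:
            count += 1
    return count


def count_close_wait_on_host() -> int:
    """Sum CLOSE_WAIT sockets over the IPv4 and IPv6 tables (host-wide)."""
    total = 0
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(table)
        if path.exists():
            total += count_close_wait(path.read_text())
    return total
```

A count that keeps growing while the router stops answering would match the behaviour described above. Restricting the count to qdrouterd's own sockets would need extra work, such as matching socket inodes against /proc/<pid>/fd.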


Expected results:
no deadlock.


Additional info:
gdb shows:

(gdb) thread apply all bt full

Thread 4 (Thread 0x7f29db69c1c0 (LWP 91774)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe490
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29db27e9c0 in qd_server_run (qd=0x1ffc030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
        qd_server = 0x226fbe0
        i = <optimized out>
#7  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf", 
    python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
        st = {st_dev = 64768, st_ino = 100760246, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 36, st_blksize = 4096, st_blocks = 0, st_atim = {
            tv_sec = 1493139759, tv_nsec = 0}, st_mtim = {tv_sec = 1505493173, tv_nsec = 576980480}, st_ctim = {tv_sec = 1505493173, tv_nsec = 576980480}, __unused = {0, 0, 0}}
        d = <optimized out>
#8  0x0000000000401950 in main (argc=3, argv=0x7ffeb32d04e8) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
        config_path = 0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf"
        python_pkgdir = 0x402401 "/usr/lib/qpid-dispatch/python"
        pidfile = 0x0
        user = 0x0
        daemon_mode = false
        long_options = {{name = 0x40245b "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x402462 "include", has_arg = 1, flag = 0x0, val = 73}, {name = 0x40246a "daemon", 
            has_arg = 0, flag = 0x0, val = 100}, {name = 0x402471 "pidfile", has_arg = 1, flag = 0x0, val = 80}, {name = 0x402479 "user", has_arg = 1, flag = 0x0, val = 85}, {
            name = 0x40247e "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

Thread 3 (Thread 0x7f29cdb08700 (LWP 91778)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe620
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29cdb08700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29cdb08700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817521284864, -4328525230991775127, 0, 139817521285568, 139817521284864, 0, 4449106029563308649, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 2 (Thread 0x7f29ce309700 (LWP 91777)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe300
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ce309700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ce309700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817529677568, -4328525230991775127, 0, 139817529678272, 139817529677568, 0, 4449100535763266153, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 1 (Thread 0x7f29ceb0a700 (LWP 91776)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b22e4760
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ceb0a700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ceb0a700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817538070272, -4328525230991775127, 0, 139817538070976, 139817538070272, 0, 4449099435714767465, 4449073203975907945}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
(gdb) 

See /root/core.91774 on dell-per430-14 there.
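The report does not state the root cause. One possible reading of the backtrace, offered purely as an assumption, is that every thread is blocked in a write() to a wakeup pipe that is full and that no thread is left to drain. The sketch below is not qpid-dispatch code. It only demonstrates the general mechanism: a pipe with no reader accepts a bounded number of bytes, after which a non-blocking write fails immediately and a blocking write would wait indefinitely.

```python
import os


def fill_pipe_nonblocking() -> int:
    """Write chunks to a pipe nobody reads until the kernel refuses more.

    Returns the number of bytes accepted before the pipe was full.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode lets us observe the "pipe full" condition safely.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x" * 4096)
            except BlockingIOError:
                # Pipe is full. With a blocking descriptor, write() would
                # not return until another thread drained the read end.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

If all worker threads ended up in that blocking wait at once, none would remain to read from the pipe, which would look like the hang shown in the backtrace. Whether this is what DISPATCH-518 addressed is not confirmed here.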

Comment 3 Ted Ross 2017-09-18 15:51:59 UTC
I believe this was fixed in DISPATCH-518.

https://issues.apache.org/jira/browse/DISPATCH-518

Comment 9 Pavel Moravec 2017-09-30 13:06:13 UTC
pre-verified as fixed in a build: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=600616   (qpid-dispatch-0.4-27.el7sat)

Comment 15 Roman Plevka 2018-01-15 13:36:57 UTC
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:

- Create a Docker image based on RHEL with katello-consumer-ca and katello-agent installed (but not yet registered to Satellite).
- As the startup script, run the subscription-manager registration, followed by a conditional loop that starts the gofer daemon on some trigger (I mounted an external directory and made the loop check for the presence of a file).

- Start many containers (tried with 10, 30, 50).
- After all containers are up and their registration has finished (verify by listing the content hosts in Satellite and checking that no more requests arrive at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, so that goferd starts on all containers simultaneously.
- Observe that the number of pulp.agent* queues jumps by the number of running containers within a moment.
- Watch the logs for any errors.


- No errors detected.
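The waiting loop from the startup script could look roughly like the sketch below. It is only an illustration of the idea, not the script used for verification. The function names are made up, and the trigger path and the command that starts goferd are left as parameters because the report does not give them.

```python
import subprocess
import time
from pathlib import Path
from typing import List, Optional


def wait_for_trigger(trigger: Path, poll_seconds: float = 1.0,
                     timeout: Optional[float] = None) -> bool:
    """Poll until `trigger` exists. Return False if `timeout` expires first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not trigger.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def start_after_trigger(trigger: Path, command: List[str],
                        poll_seconds: float = 1.0) -> int:
    """Block until the trigger file appears, then run `command`.

    `command` is whatever starts goferd in the container (an assumption:
    the report does not say how the daemon was launched).
    """
    wait_for_trigger(trigger, poll_seconds)
    return subprocess.call(command)
```

With the trigger file on a directory mounted into every container, creating that one file releases all containers at nearly the same time, which produces the connection burst the test needs.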

Comment 18 errata-xmlrpc 2018-02-05 13:55:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0273

