Description of problem:
As a side effect of https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 , a deadlocked qdrouterd was observed on the Capsule; it did not react to anything. Since *some* deadlock of a Capsule's qdrouterd was recently detected at a customer, the scenario from https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3 is expected to reliably mimic a real situation.

Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-16.el7.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64
libqpid-dispatch-0.4-22.el7sat.x86_64

How reproducible:
100% within 30 minutes

Steps to Reproduce:
1. Follow https://bugzilla.redhat.com/show_bug.cgi?id=1491160#c3

Actual results:
qdrouterd on the Capsule does not react to "kill" (until I specify "kill -9", of course), has many connections in CLOSE_WAIT, and does not react to anything.

Expected results:
No deadlock.

Additional info:
gdb shows:

(gdb) thread apply all bt full

Thread 4 (Thread 0x7f29db69c1c0 (LWP 91774)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe490
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29db27e9c0 in qd_server_run (qd=0x1ffc030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
        qd_server = 0x226fbe0
        i = <optimized out>
#7  0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf", python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
        st = {st_dev = 64768, st_ino = 100760246, st_nlink = 3, st_mode = 16877, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 36, st_blksize = 4096, st_blocks = 0,
          st_atim = {tv_sec = 1493139759, tv_nsec = 0}, st_mtim = {tv_sec = 1505493173, tv_nsec = 576980480}, st_ctim = {tv_sec = 1505493173, tv_nsec = 576980480}, __unused = {0, 0, 0}}
        d = <optimized out>
#8  0x0000000000401950 in main (argc=3, argv=0x7ffeb32d04e8) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
        config_path = 0x7ffeb32d255d "/etc/qpid-dispatch/qdrouterd.conf"
        python_pkgdir = 0x402401 "/usr/lib/qpid-dispatch/python"
        pidfile = 0x0
        user = 0x0
        daemon_mode = false
        long_options = {{name = 0x40245b "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x402462 "include", has_arg = 1, flag = 0x0, val = 73},
          {name = 0x40246a "daemon", has_arg = 0, flag = 0x0, val = 100}, {name = 0x402471 "pidfile", has_arg = 1, flag = 0x0, val = 80},
          {name = 0x402479 "user", has_arg = 1, flag = 0x0, val = 85}, {name = 0x40247e "help", has_arg = 0, flag = 0x0, val = 104},
          {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

Thread 3 (Thread 0x7f29cdb08700 (LWP 91778)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe620
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29cdb08700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29cdb08700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817521284864, -4328525230991775127, 0, 139817521285568, 139817521284864, 0, 4449106029563308649, 4449073203975907945},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 2 (Thread 0x7f29ce309700 (LWP 91777)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b5ebe300
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ce309700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ce309700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817529677568, -4328525230991775127, 0, 139817529678272, 139817529677568, 0, 4449100535763266153, 4449073203975907945},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

Thread 1 (Thread 0x7f29ceb0a700 (LWP 91776)):
#0  0x00007f29dadf56ad in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#2  <signal handler called>
No locals.
#3  0x00007f29dadf56ab in write () at ../sysdeps/unix/syscall-template.S:81
No locals.
#4  0x00007f29db273ff0 in qdpn_driver_wakeup (d=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:828
        count = <optimized out>
#5  0x00007f29db27da74 in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:677
        work_done = <optimized out>
        timer = <optimized out>
        thread = <optimized out>
        work = <optimized out>
        cxtr = 0x7f29b22e4760
        conn = <optimized out>
        ctx = <optimized out>
        error = <optimized out>
        poll_result = <optimized out>
        qd_server = 0x226fbe0
#6  0x00007f29dadeee25 in start_thread (arg=0x7f29ceb0a700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x7f29ceb0a700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139817538070272, -4328525230991775127, 0, 139817538070976, 139817538070272, 0, 4449099435714767465, 4449073203975907945},
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#7  0x00007f29da34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
(gdb)

See /root/core.91774 on dell-per430-14 there.
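All four threads in the backtrace above sit in write() inside qdpn_driver_wakeup. One possible reading, which is an assumption and not confirmed anywhere in this report, is that the wakeup pipe is a blocking pipe that has filled up, so every further write() waits for a reader that never runs. The Python sketch below is not qpid-dispatch code; it only illustrates the general behaviour of a pipe that nobody drains. With the write end set to non-blocking, the writer gets an error (EAGAIN) once the buffer is full instead of hanging. The function name fill_pipe_nonblocking is made up for illustration.

```python
import os


def fill_pipe_nonblocking(chunk=b"x" * 4096):
    """Write to a pipe that is never read until the kernel buffer is full.

    Returns the number of bytes accepted before the pipe refused more data.
    Illustrative only: this is not how qpid-dispatch manages its wakeup pipe.
    """
    read_fd, write_fd = os.pipe()
    # With a blocking write end, the loop below would hang forever once the
    # pipe is full, because nothing ever reads from read_fd.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        # EAGAIN: the pipe is full; a blocking descriptor would sleep here.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print("pipe accepted", fill_pipe_nonblocking(), "bytes before filling up")
```

The capacity reported depends on the platform, so the sketch prints it rather than assuming a value.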
I believe this was fixed in DISPATCH-518. https://issues.apache.org/jira/browse/DISPATCH-518
Pre-verified as fixed in build qpid-dispatch-0.4-27.el7sat: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=600616
VERIFIED on satellite-6.2.14-1.0.el7sat.noarch

I also tried the following:
- Create a Docker image based on RHEL with katello-consumer-ca and katello-agent installed (but not yet registered to the Satellite).
- As the startup script, run the subscription-manager registration, followed by a conditional loop that starts the gofer daemon on some trigger (I mounted an external directory and had the loop check for the presence of a file).
- Start many containers (tried with 10, 30 and 50).
- After all the containers are up and their registration has finished (verify by listing the content hosts in Satellite and checking that no more requests arrive at the /rhsm endpoint), pull the trigger (in my case, create the file) to break the waiting loop, so that goferd starts on all containers simultaneously.
- Observe that the number of pulp.agent* queues jumps by the number of running containers within a moment.
- Watch the logs for any errors.

No errors detected.
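The waiting loop from the startup script can be sketched roughly as follows. This is a minimal Python illustration, not the script that was actually used. The function names, the polling interval and the command-line handling are assumptions, and the trigger path and the command that starts goferd are passed in as arguments rather than guessed.

```python
import os
import subprocess
import sys
import time


def wait_for_trigger(trigger_path, poll_seconds=1.0, timeout=None):
    """Poll until trigger_path exists. Return True, or False if timeout expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(trigger_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def run_after_trigger(trigger_path, command, poll_seconds=1.0, timeout=None):
    """Wait for the trigger file, then run command.

    Returns the command's exit code, or None if the wait timed out.
    """
    if not wait_for_trigger(trigger_path, poll_seconds, timeout):
        return None
    return subprocess.call(command)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (hypothetical): script.py <trigger-file> <command> [args...]
    # The trigger file would live on the directory mounted into every
    # container; the command is whatever starts goferd in the image.
    sys.exit(run_after_trigger(sys.argv[1], sys.argv[2:]))
```

Because every container polls the same mounted path, creating the file once releases all of them within one polling interval, which is what makes the goferd start-ups nearly simultaneous.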
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0273