Bug 2109451
| Summary: | virsh command hangs after the host runs several auto test cases | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | yalzhang <yalzhang> |
| Component: | sssd | Assignee: | Alexey Tikhonov <atikhono> |
| Status: | CLOSED ERRATA | QA Contact: | sssd-qe |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 9.1 | CC: | atikhono, bugproxy, cagney, dzheng, fjin, fweimer, jen, jil, jsuchane, lcheng, lizhu, mprivozn, pbrezina, pkrempa, sgadekar, sipoyare, smitterl, virt-maint, weizhan, yoyang |
| Target Milestone: | rc | Keywords: | TestBlocker, Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | sync-to-jira | | |
| Fixed In Version: | sssd-2.7.3-2.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-11-15 11:17:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2066305, 2091421, 2111582 | | |
Description (yalzhang@redhat.com, 2022-07-21 09:37:26 UTC)
Sorry, the same attachment was attached twice...

---

There are two threads in the stack trace which appear to be stuck in a similar place:

```
Thread 8 (Thread 0x7fb73effd640 (LWP 169334) "prio-rpc-virtqe"):
#0  futex_wait (private=0, expected=2, futex_word=0x7fb74c05f088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fb74c05f088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007fb74f8091b2 in lll_mutex_lock_optimized (mutex=0x7fb74c05f088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fb74c05f088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007fb74c0566da in sss_mt_lock (ctx=0x7fb74c05f080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7fb74c05f080 <gr_mc_ctx>, name=0x7fb74c05c042 "group") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7fb74c05c042 "group", ctx=ctx@entry=0x7fb74c05f080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007fb74c05738e in sss_nss_mc_getgrgid (gid=gid@entry=0, result=result@entry=0x7fb73effc8e0, buffer=buffer@entry=0x7fb728004360 "", buflen=buflen@entry=1024) at src/sss_client/nss_mc_group.c:194
#8  0x00007fb74c059bc0 in _nss_sss_getgrgid_r (gid=0, result=0x7fb73effc8e0, buffer=0x7fb728004360 "", buflen=1024, errnop=0x7fb73effd5b8) at src/sss_client/nss_group.c:534
#9  0x00007fb74f87be47 in __getgrgid_r (gid=gid@entry=0, resbuf=resbuf@entry=0x7fb73effc8e0, buffer=0x7fb728004360 "", buflen=1024, result=result@entry=0x7fb73effc8d0) at ../nss/getXXbyYY_r.c:274
#10 0x00007fb74fe6eab9 in virGetGroupEnt (gid=0) at ../src/util/virutil.c:637
#11 virGetGroupName (gid=0) at ../src/util/virutil.c:687
#12 0x00007fb74fe27365 in virIdentityGetSystem () at ../src/util/viridentity.c:337
#13 0x0000558baea8a1be in remoteClientFreePrivateCallbacks (priv=0x558bb0aee890) at ../src/remote/remote_daemon_dispatch.c:1712
#14 remoteClientCloseFunc (client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:1781
#15 0x00007fb74ff26b21 in virNetServerClientCloseLocked (client=0x558bb0ad1560) at ../src/rpc/virnetserverclient.c:982
#16 0x00007fb74ff26c3d in virNetServerClientClose (client=0x558bb0ad1560) at ../src/rpc/virnetserverclient.c:1020
#17 0x00007fb74ff2b3ea in virNetServerHandleJob (jobOpaque=0x558bb0aeece0, opaque=0x558bb0ac4880) at ../src/rpc/virnetserver.c:166
#18 0x00007fb74fe66433 in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:164
#19 0x00007fb74fe659e9 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#20 0x00007fb74f805d92 in start_thread (arg=<optimized out>) at pthread_create.c:443
#21 0x00007fb74f7a59e0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7fb74e341ac0 (LWP 169327) "virtqemud"):
#0  futex_wait (private=0, expected=2, futex_word=0x7fb74c05f108 <pw_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fb74c05f108 <pw_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007fb74f8091b2 in lll_mutex_lock_optimized (mutex=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007fb74c0566da in sss_mt_lock (ctx=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7fb74c05f100 <pw_mc_ctx>, name=0x7fb74c05c03b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7fb74c05c03b "passwd", ctx=ctx@entry=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007fb74c0571ee in sss_nss_mc_getpwuid (uid=uid@entry=0, result=result@entry=0x7ffe32f87c70, buffer=buffer@entry=0x558bb0b00200 "", buflen=buflen@entry=1024) at src/sss_client/nss_mc_passwd.c:187
#8  0x00007fb74c0590ae in _nss_sss_getpwuid_r (uid=0, result=0x7ffe32f87c70, buffer=0x558bb0b00200 "", buflen=1024, errnop=0x7fb74e341a38) at src/sss_client/nss_passwd.c:259
#9  0x00007fb74f87db0f in __getpwuid_r (uid=uid@entry=0, resbuf=resbuf@entry=0x7ffe32f87c70, buffer=<optimized out>, buflen=1024, result=result@entry=0x7ffe32f87c60) at ../nss/getXXbyYY_r.c:274
#10 0x00007fb74fe6e531 in virGetUserEnt (uid=0, name=name@entry=0x7ffe32f87cf0, group=group@entry=0x0, dir=dir@entry=0x0, shell=shell@entry=0x0, quiet=quiet@entry=false) at ../src/util/virutil.c:569
#11 0x00007fb74fe6e81a in virGetUserName (uid=<optimized out>) at ../src/util/virutil.c:681
#12 0x00007fb74fe27323 in virIdentityGetSystem () at ../src/util/viridentity.c:330
#13 0x0000558baea8a1be in remoteClientFreePrivateCallbacks (priv=0x558bb0af85d0) at ../src/remote/remote_daemon_dispatch.c:1712
#14 remoteClientCloseFunc (client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:1781
#15 0x00007fb74ff26b21 in virNetServerClientCloseLocked (client=0x558bb0af0b30) at ../src/rpc/virnetserverclient.c:982
#16 0x00007fb74ff2ad99 in virNetServerProcessClients (srv=0x558bb0ac4880) at ../src/rpc/virnetserver.c:950
#17 0x00007fb74ff2aebd in daemonServerProcessClients (payload=<optimized out>, key=<optimized out>, opaque=<optimized out>) at ../src/rpc/virnetdaemon.c:699
#18 0x00007fb74fe1eebe in virHashForEach (table=<optimized out>, iter=0x7fb74ff2aeb0 <daemonServerProcessClients>, opaque=0x0) at ../src/util/virhash.c:367
#19 0x00007fb74ff28d9b in virNetDaemonRun (dmn=0x558bb0ac3820) at ../src/rpc/virnetdaemon.c:808
#20 0x0000558baea57603 in main (argc=<optimized out>, argv=<optimized out>) at ../src/remote/remote_daemon.c:1211
```

One is stuck in getpwuid_r and the other in getgrgid_r. Both of those functions end up in the NSS module of sssd. I'm not sure, though, how to diagnose such a thing.

---

It seems virtqemud is waiting somewhere:

```
# gdb -p `pidof virtqemud`
......
[New LWP 338557]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfos, use: dnf debuginfo-install libnghttp2-1.43.0-5.el9.x86_64 libselinux-3.4-3.el9.x86_64 p11-kit-0.24.1-2.el9.x86_64 yajl-2.1.0-20.el9.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
futex_wait (private=0, expected=2, futex_word=0x7fbcfc033088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
146       int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb)
```

---

Yeah, I tend to agree that the problem is not actually in libvirt but in sssd, because the deadlock does not occur between fork() and exec(); it occurs in the main thread (basically). I think this should be switched over to sssd for further investigation.

---

virsh hangs for me too. I've been using:

* CTC2 recommended compose id: RHEL-9.1.0-20220718.0
* sssd version: sssd-2.7.3-1.el9.s390x

I'm not familiar with gstack, so I can't produce a similarly detailed trace. virsh returns with:

```
error: Unable to register disconnect callback
error: no valid connection
error: internal error: connection closed due to keepalive timeout
```

I'm setting priority and urgency critical to raise visibility. I hope that's okay, but our CTC2 sign-offs look blocked on this. Please shout if you disagree.
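
For reference, per-thread backtraces like the ones above can be captured without an interactive gdb session. A minimal sketch using the stock gdb tooling (it assumes the gdb package, which ships gstack and gcore, plus debuginfo for virtqemud and the sssd client libraries; the output path is illustrative):

```sh
# Assumes the gdb package (providing gstack and gcore) is installed and
# virtqemud is the stuck daemon.
pid=$(pidof virtqemud)

# quick one-shot stack of every thread
gstack "$pid"

# full backtraces of all threads, non-interactively
gdb -p "$pid" -batch -ex 'thread apply all bt'

# write a coredump for offline analysis or sharing
gcore -o /tmp/virtqemud "$pid"
```
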
---

(In reply to smitterl from comment #12)
> virsh hangs for me too. I've been using:
>
> * CTC2 recommended compose id: RHEL-9.1.0-20220718.0
> * sssd version: sssd-2.7.3-1.el9.s390x
>
> I'm not familiar with gstack, so I can't produce a similarly detailed trace.
> virsh returns with:
>
> error: Unable to register disconnect callback
> error: no valid connection
> error: internal error: connection closed due to keepalive timeout
>
> I'm setting priority and urgency critical to raise visibility. I hope that's
> okay, but our CTC2 sign-offs look blocked on this. Please shout if you
> disagree.

Forgot to mention that for me this seems to reproduce always (100%). I run our acceptance_general test job with a 7.9 guest on an s390x LPAR with the indicated compose id. Other jobs running on beaker's virtual KVM hosts seem to be running fine; I'm not sure if a specific test case might cause this behavior. I'll see if this reproduces with the same job (acceptance_general_guest7, e.g. [1]):

a) with a later compose id and the same job ID
b) with the same compose on a beaker virtual KVM host (s390x-kvm-XXX)

---

(In reply to smitterl from comment #13)
> (In reply to smitterl from comment #12)
> > virsh hangs for me too. I've been using:
> >
> > * CTC2 recommended compose id: RHEL-9.1.0-20220718.0
> > * sssd version: sssd-2.7.3-1.el9.s390x
> >
> > I'm not familiar with gstack, so I can't produce a similarly detailed trace.

Confirmed offline with Peter that my trace is similar to the highlights in comment #5.

---

Hi,

does 'virsh' fork under the hood? I.e. can this be similar to https://bugzilla.redhat.com/show_bug.cgi?id=2069097#c16 ?

(In reply to Peter Krempa from comment #5)
> There are two threads in the stack trace which appear to be stuck in a
> similar place:
...
> I'm not sure, though, how to diagnose such a thing

Would it be possible to collect a coredump of the stuck process (and inspect the content of `gr_mc_ctx->mutex`, in particular `__data.__owner`, to see who holds the mutex)?

---

(In reply to Alexey Tikhonov from comment #16)
> Hi,
>
> does 'virsh' fork under the hood?
> I.e. can this be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=2069097#c16 ?

The process in question is 'virtqemud', and that indeed does fork heavily. The code path in question runs in the main process, though (recognizable by its many threads); once we fork, we either exec() or do something trivial and exit, and we never spawn more threads after forking. So it should not have a problem with that particular usage, at least according to the description in the comment you've linked.

> (In reply to Peter Krempa from comment #5)
> > There are two threads in the stack trace which appear to be stuck in a
> > similar place:
> ...
> > I'm not sure, though, how to diagnose such a thing
>
> Would it be possible to collect a coredump of the stuck process (and inspect
> the content of `gr_mc_ctx->mutex`, in particular `__data.__owner`, to see who
> holds the mutex)?

Our QE seems to be able to reproduce it reliably, so I'm fairly sure they can get you a coredump or get a debugger and analyze the mutex.

Yalan, can you please either get a coredump or attach a debugger to a stuck virtqemud, switch to the thread, and get the required values? Thanks.

---

(In reply to Peter Krempa from comment #18)
> Yalan, can you please either get a coredump or attach a debugger to a stuck
> virtqemud, switch to the thread, and get the required values? Thanks.

In the latter case please also provide the output of `info threads`.

---

As the coredump file is too large to be attached to bugzilla, I have uploaded it to google drive. Please download it here: https://drive.google.com/file/d/1HYbQR09rv0COCkA8rMTMci1C_yZD17cW/view?usp=sharing

Thank you!

---

Hi Florian,

Could you please help (again) to debug an issue with mutexes:

```
[Switching to thread 1 (Thread 0x7f2e1d7ebac0 (LWP 621826))]
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f2e1c315088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f2e1c315088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007f2e1ecb31b2 in lll_mutex_lock_optimized (mutex=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007f2e1c30c6da in sss_mt_lock (ctx=0x7f2e1c315080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7f2e1c315080 <gr_mc_ctx>, name=0x7f2e1c312042 "group") at src/sss_client/nss_mc_common.c:155
(gdb) frame 2
#2  0x00007f2e1ecb31b2 in lll_mutex_lock_optimized (mutex=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
49        lll_lock (mutex->__data.__lock, private);
(gdb) p *mutex
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967295, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\377\377\377\377", '\000' <repeats 23 times>, __align = 0}
(gdb) frame 1
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f2e1c315088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
50        futex_wait ((unsigned int *) futex, 2, private);  /* Wait if *futex == 2.  */
```

Why is this locked? With "__lock = 0" and "__owner = 0", the mutex isn't locked? And what does "__nusers = 4294967295" mean?
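
For context on those fields: on Linux/glibc, pthread_mutex_lock() records the owning thread's TID in `__owner` and increments `__nusers`, and pthread_mutex_unlock() clears `__owner` and decrements `__nusers`. A minimal glibc-specific sketch of the healthy states (it peeks at internal fields through the public union, so it is illustrative only, not portable):

```c
/* Build with: gcc -pthread dump_mutex.c -o dump_mutex
 * Prints the same glibc-internal mutex fields seen in the gdb dump above. */
#include <pthread.h>
#include <stdio.h>

static void dump(const char *when, pthread_mutex_t *m)
{
    printf("%-8s __lock=%d __owner=%d __nusers=%u\n",
           when, m->__data.__lock, m->__data.__owner, m->__data.__nusers);
}

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    dump("init", &m);      /* all fields zero */
    pthread_mutex_lock(&m);
    dump("locked", &m);    /* __lock != 0, __owner == our TID, __nusers == 1 */
    pthread_mutex_unlock(&m);
    dump("unlocked", &m);  /* back to all zero */
    return 0;
}
```

Balanced lock/unlock calls never reach `__owner == 0` with `__nusers == 4294967295` (i.e. (unsigned)-1); that combination looks like a decrement applied to a counter that was already 0.
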
---

Florian is tied up with another project, but I can help. However, it looks like you've already figured out the problem; let me know if you need any further support.

> what does "__nusers = 4294967295" mean?

__nusers is a 32-bit unsigned value set to 0-1 (i.e., -1), which, looking at the mutex code, suggests the mutex code decremented something that should have been >0.

More interesting is the sequence I lifted from Bug 2111582:

```
#5  sss_nss_mc_init_ctx (ctx=0x7f229c35c100 <pw_mc_ctx>, name=0x7f229c35903b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7f229c35903b "passwd", ctx=ctx@entry=0x7f229c35c100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007f229c353aa7 in sss_nss_mc_getpwnam (name=name@entry=0x7f22241191b0 "1000", name_len=name_len@entry=4, result=result@entry=0x7f2239ffa400, buffer=buffer@entry=0x7f22240b5b00 "nginx", buflen=buflen@entry=1024) at src/sss_client/nss_mc_passwd.c:110
```

And this value of ctx:

```
$1 = {initialized = UNINITIALIZED, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967295, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\377\377\377\377", '\000' <repeats 23 times>, __align = 0}, fd = -1, seed = 0, mmap_base = 0x0, mmap_size = 0, data_table = 0x0, dt_size = 0, hash_table = 0x0, ht_size = 0, active_threads = 1}
```

If we assume .initialized==INITIALIZED, we find this code on the error path:

```c
if (ctx->initialized == INITIALIZED) {
    ctx->initialized = RECYCLED;
}
if (ctx->initialized == RECYCLED && ctx->active_threads == 0) {
    /* just one thread should call munmap */
    sss_mt_lock(ctx);
    if (ctx->initialized == RECYCLED) {
        sss_nss_mc_destroy_ctx(ctx);
    }
    sss_mt_unlock(ctx);
}
```

A look at sss_nss_mc_destroy_ctx() shows it zeroing out the locked mutex; that's not good for the subsequent unlock() call, but it does provide an explanation for why __nusers==-1.

---

Upstream PR: https://github.com/SSSD/sssd/pull/6289

---

(In reply to cagney from comment #27)
> A look at sss_nss_mc_destroy_ctx() shows it zeroing out the locked mutex;

Yeap, already figured this out, thanks.

---

Pushed PR: https://github.com/SSSD/sssd/pull/6289

* `master`
    * 0f3a761ed9d654a61f8caed8eae3863c518b9911 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
* `sssd-2-7`
    * 03142f8de42faf4f75465d24d3be9a49c2dd86f7 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
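
The commit subject summarizes the fix: the mutex is moved out of the context it protects, so wiping the context can no longer clobber a lock another thread holds or is blocked on. A minimal sketch of that pattern, under illustrative names (cache_ctx and destroy_ctx are not sssd's real identifiers, and the `__data` field access is glibc-specific, demonstration only):

```c
/* Build with: gcc -pthread mc_ctx_fix.c -o mc_ctx_fix */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct cache_ctx {
    int initialized;
    int fd;
    /* Before the fix, the pthread_mutex_t lived in here, so the memset()
     * below zeroed a mutex other threads still referenced; the eventual
     * unlock then decremented __nusers from 0 to (unsigned)-1, matching
     * the state seen in the coredump. */
};

/* After the fix: the lock lives in static storage and survives any
 * destruction/re-initialization of the context it guards. */
static pthread_mutex_t cache_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct cache_ctx ctx;

static void destroy_ctx(struct cache_ctx *c)
{
    /* Wiping the whole struct is now harmless to the lock. */
    memset(c, 0, sizeof(*c));
}

int main(void)
{
    pthread_mutex_lock(&cache_mutex);
    destroy_ctx(&ctx);                  /* mutex untouched */
    pthread_mutex_unlock(&cache_mutex); /* balanced: no underflow */
    printf("__nusers = %u\n", cache_mutex.__data.__nusers); /* prints 0 */
    return 0;
}
```
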
---

Ran the "domiftune" auto cases (53 cases) twice and cannot reproduce the issue; the bug is fixed.

---

*** Bug 2120965 has been marked as a duplicate of this bug. ***

---

------- Comment From tstaudt.com 2022-08-25 08:46 EDT -------

(In reply to comment #14)
> Also, can you please tell us what version of libvirt do you have?

libvirt-8.5.0-1.el9.s390x

We currently can't update due to another bug we're checking, but will eventually try the new version if required.

---

------- Comment From tstaudt.com 2022-09-13 08:28 EDT -------

Hello Red Hat,

it seems that the duplicate bug
Red Hat Bug 2120965 - RHEL9.1 Nightly Build - virsh becomes unusable after running virsh commands in parallel
is not fixed with this update.

How do you suggest we move on? I fear re-opening will mess up the mirroring. Thanks.

---

See the fedora bugs:

* Bug 2075736 - has a workaround: do not shut down the daemons
* Bug 2111582 - fixed by sssd, it seems
* Bug 2123828 - workaround is to restart virtqemud before a run

---

(In reply to IBM Bug Proxy from comment #40)
> ------- Comment From tstaudt.com 2022-09-13 08:28 EDT -------
> Hello Red Hat,
>
> it seems that the duplicate bug
> Red Hat Bug 2120965 - RHEL9.1 Nightly Build - virsh becomes unusable after
> running virsh commands in parallel
> is not fixed with this update.
>
> How do you suggest we move on? I fear re-opening will mess up the mirroring.

I think you can re-open bz 2120965; there is no mirroring.

---

------- Comment From tstaudt.com 2022-09-14 03:42 EDT -------

I have reopened Red Hat bug 2120965, and the correct mirror link between that and the IBM bug will be restored later today. Thanks.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8325