Bug 2109451 - virsh command hangs after the host runs several automated test cases
Summary: virsh command hangs after the host runs several automated test cases
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: sssd
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Alexey Tikhonov
QA Contact: sssd-qe
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks: 2066305 2091421 2111582
 
Reported: 2022-07-21 09:37 UTC by yalzhang@redhat.com
Modified: 2022-11-15 13:08 UTC (History)
CC List: 20 users

Fixed In Version: sssd-2.7.3-2.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-15 11:17:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-128597 0 None None None 2022-07-21 09:50:27 UTC
Red Hat Issue Tracker SSSD-4909 0 None None None 2022-07-30 10:18:45 UTC
Red Hat Product Errata RHBA-2022:8325 0 None None None 2022-11-15 11:18:00 UTC

Internal Links: 2120965

Description yalzhang@redhat.com 2022-07-21 09:37:26 UTC
Description of problem:
virsh commands hang after the host runs several automated test cases

Version-Release number of selected component (if applicable):
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64
kernel-5.14.0-130.el9.x86_64

How reproducible:
Not determined exactly, but very high

Steps to Reproduce:
1. Many systems running automated tests hit the problem that the tests time out and virsh commands hang forever. For example:

# virsh list
^C
# virsh -c network:///system 
Welcome to virsh, the virtualization interactive terminal.
Type:  'help' for help with commands
       'quit' to quit

virsh # 

# systemctl status virtqemud
● virtqemud.service - Virtualization qemu daemon
     Loaded: loaded (/usr/lib/systemd/system/virtqemud.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-07-20 22:08:24 EDT; 7h ago
TriggeredBy: ● virtqemud-admin.socket
             ● virtqemud-ro.socket
             ● virtqemud.socket
       Docs: man:virtqemud(8)
             https://libvirt.org
   Main PID: 169327 (virtqemud)
      Tasks: 18 (limit: 32768)
     Memory: 12.8M
        CPU: 156ms
     CGroup: /system.slice/virtqemud.service
             └─169327 /usr/sbin/virtqemud --timeout 120

The virtqemud daemon's stack:

Thread 1 (Thread 0x7fb74e341ac0 (LWP 169327) "virtqemud"):
#0  futex_wait (private=0, expected=2, futex_word=0x7fb74c05f108 <pw_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fb74c05f108 <pw_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007fb74f8091b2 in lll_mutex_lock_optimized (mutex=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007fb74c0566da in sss_mt_lock (ctx=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7fb74c05f100 <pw_mc_ctx>, name=0x7fb74c05c03b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7fb74c05c03b "passwd", ctx=ctx@entry=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007fb74c0571ee in sss_nss_mc_getpwuid (uid=uid@entry=0, result=result@entry=0x7ffe32f87c70, buffer=buffer@entry=0x558bb0b00200 "", buflen=buflen@entry=1024) at src/sss_client/nss_mc_passwd.c:187
#8  0x00007fb74c0590ae in _nss_sss_getpwuid_r (uid=0, result=0x7ffe32f87c70, buffer=0x558bb0b00200 "", buflen=1024, errnop=0x7fb74e341a38) at src/sss_client/nss_passwd.c:259
#9  0x00007fb74f87db0f in __getpwuid_r (uid=uid@entry=0, resbuf=resbuf@entry=0x7ffe32f87c70, buffer=<optimized out>, buflen=1024, result=result@entry=0x7ffe32f87c60) at ../nss/getXXbyYY_r.c:274
#10 0x00007fb74fe6e531 in virGetUserEnt (uid=0, name=name@entry=0x7ffe32f87cf0, group=group@entry=0x0, dir=dir@entry=0x0, shell=shell@entry=0x0, quiet=quiet@entry=false) at ../src/util/virutil.c:569
#11 0x00007fb74fe6e81a in virGetUserName (uid=<optimized out>) at ../src/util/virutil.c:681
#12 0x00007fb74fe27323 in virIdentityGetSystem () at ../src/util/viridentity.c:330
#13 0x0000558baea8a1be in remoteClientFreePrivateCallbacks (priv=0x558bb0af85d0) at ../src/remote/remote_daemon_dispatch.c:1712
#14 remoteClientCloseFunc (client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:1781
#15 0x00007fb74ff26b21 in virNetServerClientCloseLocked (client=0x558bb0af0b30) at ../src/rpc/virnetserverclient.c:982
#16 0x00007fb74ff2ad99 in virNetServerProcessClients (srv=0x558bb0ac4880) at ../src/rpc/virnetserver.c:950
#17 0x00007fb74ff2aebd in daemonServerProcessClients (payload=<optimized out>, key=<optimized out>, opaque=<optimized out>) at ../src/rpc/virnetdaemon.c:699
#18 0x00007fb74fe1eebe in virHashForEach (table=<optimized out>, iter=0x7fb74ff2aeb0 <daemonServerProcessClients>, opaque=0x0) at ../src/util/virhash.c:367
#19 0x00007fb74ff28d9b in virNetDaemonRun (dmn=0x558bb0ac3820) at ../src/rpc/virnetdaemon.c:808
#20 0x0000558baea57603 in main (argc=<optimized out>, argv=<optimized out>) at ../src/remote/remote_daemon.c:1211

Actual results:
virsh commands hang after running several test cases

Expected results:
virsh should not hang

Additional info:

Comment 4 yalzhang@redhat.com 2022-07-21 10:04:52 UTC
Sorry, the same attachment was attached twice...

Comment 5 Peter Krempa 2022-07-21 10:06:48 UTC
There are two threads in the stack trace which appear to be stuck in a similar place:

Thread 8 (Thread 0x7fb73effd640 (LWP 169334) "prio-rpc-virtqe"):
#0  futex_wait (private=0, expected=2, futex_word=0x7fb74c05f088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fb74c05f088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007fb74f8091b2 in lll_mutex_lock_optimized (mutex=0x7fb74c05f088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fb74c05f088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007fb74c0566da in sss_mt_lock (ctx=0x7fb74c05f080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7fb74c05f080 <gr_mc_ctx>, name=0x7fb74c05c042 "group") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7fb74c05c042 "group", ctx=ctx@entry=0x7fb74c05f080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007fb74c05738e in sss_nss_mc_getgrgid (gid=gid@entry=0, result=result@entry=0x7fb73effc8e0, buffer=buffer@entry=0x7fb728004360 "", buflen=buflen@entry=1024) at src/sss_client/nss_mc_group.c:194
#8  0x00007fb74c059bc0 in _nss_sss_getgrgid_r (gid=0, result=0x7fb73effc8e0, buffer=0x7fb728004360 "", buflen=1024, errnop=0x7fb73effd5b8) at src/sss_client/nss_group.c:534
#9  0x00007fb74f87be47 in __getgrgid_r (gid=gid@entry=0, resbuf=resbuf@entry=0x7fb73effc8e0, buffer=0x7fb728004360 "", buflen=1024, result=result@entry=0x7fb73effc8d0) at ../nss/getXXbyYY_r.c:274
#10 0x00007fb74fe6eab9 in virGetGroupEnt (gid=0) at ../src/util/virutil.c:637
#11 virGetGroupName (gid=0) at ../src/util/virutil.c:687
#12 0x00007fb74fe27365 in virIdentityGetSystem () at ../src/util/viridentity.c:337
#13 0x0000558baea8a1be in remoteClientFreePrivateCallbacks (priv=0x558bb0aee890) at ../src/remote/remote_daemon_dispatch.c:1712
#14 remoteClientCloseFunc (client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:1781
#15 0x00007fb74ff26b21 in virNetServerClientCloseLocked (client=0x558bb0ad1560) at ../src/rpc/virnetserverclient.c:982
#16 0x00007fb74ff26c3d in virNetServerClientClose (client=0x558bb0ad1560) at ../src/rpc/virnetserverclient.c:1020
#17 0x00007fb74ff2b3ea in virNetServerHandleJob (jobOpaque=0x558bb0aeece0, opaque=0x558bb0ac4880) at ../src/rpc/virnetserver.c:166
#18 0x00007fb74fe66433 in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:164
#19 0x00007fb74fe659e9 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#20 0x00007fb74f805d92 in start_thread (arg=<optimized out>) at pthread_create.c:443
#21 0x00007fb74f7a59e0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7fb74e341ac0 (LWP 169327) "virtqemud"):
#0  futex_wait (private=0, expected=2, futex_word=0x7fb74c05f108 <pw_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fb74c05f108 <pw_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007fb74f8091b2 in lll_mutex_lock_optimized (mutex=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fb74c05f108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007fb74c0566da in sss_mt_lock (ctx=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7fb74c05f100 <pw_mc_ctx>, name=0x7fb74c05c03b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7fb74c05c03b "passwd", ctx=ctx@entry=0x7fb74c05f100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007fb74c0571ee in sss_nss_mc_getpwuid (uid=uid@entry=0, result=result@entry=0x7ffe32f87c70, buffer=buffer@entry=0x558bb0b00200 "", buflen=buflen@entry=1024) at src/sss_client/nss_mc_passwd.c:187
#8  0x00007fb74c0590ae in _nss_sss_getpwuid_r (uid=0, result=0x7ffe32f87c70, buffer=0x558bb0b00200 "", buflen=1024, errnop=0x7fb74e341a38) at src/sss_client/nss_passwd.c:259
#9  0x00007fb74f87db0f in __getpwuid_r (uid=uid@entry=0, resbuf=resbuf@entry=0x7ffe32f87c70, buffer=<optimized out>, buflen=1024, result=result@entry=0x7ffe32f87c60) at ../nss/getXXbyYY_r.c:274
#10 0x00007fb74fe6e531 in virGetUserEnt (uid=0, name=name@entry=0x7ffe32f87cf0, group=group@entry=0x0, dir=dir@entry=0x0, shell=shell@entry=0x0, quiet=quiet@entry=false) at ../src/util/virutil.c:569
#11 0x00007fb74fe6e81a in virGetUserName (uid=<optimized out>) at ../src/util/virutil.c:681
#12 0x00007fb74fe27323 in virIdentityGetSystem () at ../src/util/viridentity.c:330
#13 0x0000558baea8a1be in remoteClientFreePrivateCallbacks (priv=0x558bb0af85d0) at ../src/remote/remote_daemon_dispatch.c:1712
#14 remoteClientCloseFunc (client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:1781
#15 0x00007fb74ff26b21 in virNetServerClientCloseLocked (client=0x558bb0af0b30) at ../src/rpc/virnetserverclient.c:982
#16 0x00007fb74ff2ad99 in virNetServerProcessClients (srv=0x558bb0ac4880) at ../src/rpc/virnetserver.c:950
#17 0x00007fb74ff2aebd in daemonServerProcessClients (payload=<optimized out>, key=<optimized out>, opaque=<optimized out>) at ../src/rpc/virnetdaemon.c:699
#18 0x00007fb74fe1eebe in virHashForEach (table=<optimized out>, iter=0x7fb74ff2aeb0 <daemonServerProcessClients>, opaque=0x0) at ../src/util/virhash.c:367
#19 0x00007fb74ff28d9b in virNetDaemonRun (dmn=0x558bb0ac3820) at ../src/rpc/virnetdaemon.c:808
#20 0x0000558baea57603 in main (argc=<optimized out>, argv=<optimized out>) at ../src/remote/remote_daemon.c:1211

One is stuck in getpwuid_r and the second one in getgrgid_r. Both of those functions seem to end up in the NSS module of sssd.

I'm not sure though how to diagnose such a thing
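
For reference, a minimal sketch (not libvirt code) of the lookup pattern visible in frame #9 above: glibc resolves getpwuid_r()/getgrgid_r() through NSS, and with "sss" configured in /etc/nsswitch.conf the call is forwarded to sssd's client library, which is where both threads are blocked.

/* Minimal standalone example; compile with: gcc -o pwdemo pwdemo.c (hypothetical file name) */
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    struct passwd pw, *res = NULL;
    char buf[1024];

    /* Same API as frame #9: glibc dispatches this through the NSS modules
     * listed in /etc/nsswitch.conf, including libnss_sss. */
    if (getpwuid_r(0, &pw, buf, sizeof(buf), &res) == 0 && res != NULL)
        printf("uid 0 resolves to user \"%s\"\n", res->pw_name);
    return 0;
}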

Comment 10 yalzhang@redhat.com 2022-07-24 00:36:40 UTC
It seems virtqemud is waiting somewhere:
#  gdb -p `pidof virtqemud`
......
[New LWP 338557]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfos, use: dnf debuginfo-install libnghttp2-1.43.0-5.el9.x86_64 libselinux-3.4-3.el9.x86_64 p11-kit-0.24.1-2.el9.x86_64 yajl-2.1.0-20.el9.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
futex_wait (private=0, expected=2, futex_word=0x7fbcfc033088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
146	  int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb)

Comment 11 Michal Privoznik 2022-07-26 12:13:32 UTC
Yeah, I tend to agree that the problem is not actually in libvirt but in sssd, because the deadlock does not occur between fork() and exec(); it occurs in the main thread (basically). I think this should be switched over to sssd for further investigation.

Comment 12 smitterl 2022-07-26 16:12:22 UTC
virsh hangs for me too. Been using

CTC2 recommended compose id RHEL-9.1.0-20220718.0

sssd version
sssd-2.7.3-1.el9.s390x

I'm not familiar with gstack, can't produce a similarly detailed trace. virsh returns with:

error: Unable to register disconnect callback
error: no valid connection
error: internal error: connection closed due to keepalive timeout


I'm setting priority and urgency critical to raise visibility, I hope that's okay but our CTC2 sign-offs look blocked on this. Please, shout if you disagree.

Comment 13 smitterl 2022-07-26 16:25:43 UTC
(In reply to smitterl from comment #12)
> virsh hangs for me too. Been using
> 
> CTC2 recommended compose id RHEL-9.1.0-20220718.0
> 
> sssd version
> sssd-2.7.3-1.el9.s390x
> 
> I'm not familiar with gstack, can't produce a similarly detailed trace.
> virsh returns with:
> 
> error: Unable to register disconnect callback
> error: no valid connection
> error: internal error: connection closed due to keepalive timeout
> 
> 
> I'm setting priority and urgency critical to raise visibility, I hope that's
> okay but our CTC2 sign-offs look blocked on this. Please, shout if you
> disagree.

Forgot to mention that for me this seems to reproduce always (100%) when I run our acceptance_general test job with a 7.9 guest on an s390x LPAR with the indicated compose ID. Other jobs running on beaker's virtual KVM hosts seem to be running fine; not sure if a specific test case might cause this behavior.

I'll see if this reproduces with the same job (acceptance_general_guest7, e.g. [1])
a) with a later compose id and the same job ID
b) with the same compose on beaker virtual KVM host (s390x-kvm-XXX)

Comment 15 smitterl 2022-07-26 16:37:12 UTC
(In reply to smitterl from comment #13)
> (In reply to smitterl from comment #12)
> > virsh hangs for me too. Been using
> > 
> > CTC2 recommended compose id RHEL-9.1.0-20220718.0
> > 
> > sssd version
> > sssd-2.7.3-1.el9.s390x
> > 
> > I'm not familiar with gstack, can't produce a similarly detailed trace.

Confirmed offline with Peter that my trace is similar to the highlights in comment #5.

Comment 16 Alexey Tikhonov 2022-07-26 17:01:08 UTC
Hi,

does 'virsh' fork under the hood?
I.e. can this be similar to https://bugzilla.redhat.com/show_bug.cgi?id=2069097#c16 ?

(In reply to Peter Krempa from comment #5)
> There are two threads in the stack trace which appear to be stuck in a
> similar place:
...
> I'm not sure though how to diagnose such a thing

Would it be possible to collect coredump of stuck process (and inspect content of `gr_mc_ctx->mutex`, in particular `__data.__owner` to see who holds the mutex)?

Comment 18 Peter Krempa 2022-07-26 18:31:34 UTC
(In reply to Alexey Tikhonov from comment #16)
> Hi,
> 
> does 'virsh' fork under the hood?
> I.e. can this be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=2069097#c16 ?

The process in question is 'virtqemud' and that indeed does fork heavily. The code path in question runs in the main process, though (noted by it having many threads; once we fork we either exec() or do something trivial and exit, but never spawn more threads), so it should not have a problem with that particular usage, at least according to the description in the comment you've linked.
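
For context, a generic illustration of the classic fork-plus-lock hazard (ordinary pthread behaviour, not libvirt or sssd code, and not necessarily the exact scenario in the linked comment; all names below are made up): if a process forks while another thread holds a mutex, the child inherits the locked mutex but not the thread that would unlock it, so the lock can never be taken in the child.

/* Compile with: gcc -pthread fork_lock_demo.c  (hypothetical file name) */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);   /* held by this thread until the process exits */
    pause();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, holder, NULL);
    sleep(1);                    /* crude way to let the holder take the lock */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the inherited mutex is locked, but the holder thread does
         * not exist here, so a plain pthread_mutex_lock() would block forever. */
        if (pthread_mutex_trylock(&lock) != 0)
            fprintf(stderr, "child: lock is held and nobody is left to release it\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}

Because virtqemud's forked children only exec() or do trivial work before exiting, they never attempt to take such an inherited lock in the child.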


> (In reply to Peter Krempa from comment #5)
> > There are two threads in the stack trace which appear to be stuck in a
> > similar place:
> ...
> > I'm not sure though how to diagnose such a thing
> 
> Would it be possible to collect coredump of stuck process (and inspect
> content of `gr_mc_ctx->mutex`, in particular `__data.__owner` to see who
> holds the mutex)?

Our QE seems to be able to reproduce it reliably, so I'm fairly sure they can get you a coredump or get a debugger and analyze the mutex.

Yalan, can you please either get a coredump or attach a debugger to a stuck virtqemud, switch to the thread and get the required values? Thanks.

Comment 19 Alexey Tikhonov 2022-07-26 18:41:39 UTC
(In reply to Peter Krempa from comment #18)
> > 
> > Would it be possible to collect coredump of stuck process (and inspect
> > content of `gr_mc_ctx->mutex`, in particular `__data.__owner` to see who
> > holds the mutex)?
> 
> Our QE seems to be able to reproduce it reliably, so I'm fairly sure they
> can get you a coredump or get a debugger and analyze the mutex.
> 
> Yalan, can you please either get a coredump or attach a debugger to a stuck
> virtqemud, switch to the thread and get the required values? Thanks.

In the latter case please also provide output of `info threads`.

Comment 20 yalzhang@redhat.com 2022-07-27 09:37:31 UTC
As the coredump file is too large to be attached to Bugzilla, I have uploaded it to Google Drive. Please download it here:
https://drive.google.com/file/d/1HYbQR09rv0COCkA8rMTMci1C_yZD17cW/view?usp=sharing
Thank you!

Comment 21 Alexey Tikhonov 2022-07-27 18:46:11 UTC
Hi Florian,

Could you please help (again) to debug an issue with mutexes:


[Switching to thread 1 (Thread 0x7f2e1d7ebac0 (LWP 621826))]

(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f2e1c315088 <gr_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f2e1c315088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
#2  0x00007f2e1ecb31b2 in lll_mutex_lock_optimized (mutex=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:89
#4  0x00007f2e1c30c6da in sss_mt_lock (ctx=0x7f2e1c315080 <gr_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7f2e1c315080 <gr_mc_ctx>, name=0x7f2e1c312042 "group") at src/sss_client/nss_mc_common.c:155

(gdb) frame 2
#2  0x00007f2e1ecb31b2 in lll_mutex_lock_optimized (mutex=0x7f2e1c315088 <gr_mc_ctx+8>) at pthread_mutex_lock.c:49
49	    lll_lock (mutex->__data.__lock, private);

(gdb) p *mutex
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967295, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = '\000' <repeats 12 times>, "\377\377\377\377", '\000' <repeats 23 times>, __align = 0}

(gdb) frame 1
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f2e1c315088 <gr_mc_ctx+8>, private=0) at lowlevellock.c:50
50	      futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2.  */


  --  Why is this locked? With "__lock = 0" && "__owner = 0" the mutex doesn't appear to be locked, so why does the thread block? And what does "__nusers = 4294967295" mean?

Comment 24 Siddhesh Poyarekar 2022-07-29 13:09:47 UTC
Florian is tied up with another project but I can help.  However it looks like you've already figured out the problem; let me know if you need any further support.

Comment 27 Cagney 2022-07-30 04:18:22 UTC
> what does "__nusers = 4294967295" mean?

__nusers is a 32-bit unsigned value set to 0-1 (i.e., -1), which, looking at the mutex code, suggests that something was decremented that should have been > 0.

More interesting is the sequence I lifted from Bug 2111582:

#5  sss_nss_mc_init_ctx (ctx=0x7f229c35c100 <pw_mc_ctx>, name=0x7f229c35903b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7f229c35903b "passwd", ctx=ctx@entry=0x7f229c35c100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007f229c353aa7 in sss_nss_mc_getpwnam (name=name@entry=0x7f22241191b0 "1000", name_len=name_len@entry=4, result=result@entry=0x7f2239ffa400, buffer=buffer@entry=0x7f22240b5b00 "nginx", buflen=buflen@entry=1024) at src/sss_client/nss_mc_passwd.c:110

And this value of ctx:

$1 = {initialized = UNINITIALIZED, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967295, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 12 times>, "\377\377\377\377", '\000' <repeats 23 times>, __align = 0}, fd = -1, seed = 0, mmap_base = 0x0, mmap_size = 0, data_table = 0x0, dt_size = 0, hash_table = 0x0, ht_size = 0, active_threads = 1}

If we assume .initialized==INITIALIZED, we find this code on the error path:

        if (ctx->initialized == INITIALIZED) {
            ctx->initialized = RECYCLED;
        }
        if (ctx->initialized == RECYCLED && ctx->active_threads == 0) {
            /* just one thread should call munmap */
            sss_mt_lock(ctx);
            if (ctx->initialized == RECYCLED) {
                sss_nss_mc_destroy_ctx(ctx);
            }
            sss_mt_unlock(ctx);
        }

A look at sss_nss_mc_destroy_ctx() shows it zeroing out the locked mutex; that's not good for the unlock() call that follows, but it does provide an explanation for why __nusers == -1.
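
That failure mode can be shown in isolation. A minimal sketch follows (hypothetical demo_* names, not sssd code): a mutex embedded in a context struct is wiped while held, and on glibc the subsequent unlock decrements the internal user counter from 0 to -1, which matches the "__nusers = 4294967295" seen in the gdb output.

/* Compile with: gcc -pthread mc_ctx_demo.c  (hypothetical file name) */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct demo_ctx {                /* stands in for the sssd memcache client context */
    pthread_mutex_t mutex;       /* embedded in the context it protects */
    int initialized;
};

static struct demo_ctx ctx = { PTHREAD_MUTEX_INITIALIZER, 1 };

static void demo_destroy(struct demo_ctx *c)
{
    memset(c, 0, sizeof(*c));    /* wipes the mutex the caller still holds */
}

int main(void)
{
    pthread_mutex_lock(&ctx.mutex);
    demo_destroy(&ctx);               /* mutex state is now all zeroes */
    pthread_mutex_unlock(&ctx.mutex); /* on glibc this decrements __nusers: 0 -> -1 */

    /* glibc-specific peek at the internals, only to show the corrupted state
     * reported in the gdb output above. */
    printf("__nusers = %u\n", ctx.mutex.__data.__nusers);
    return 0;
}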

Comment 28 Alexey Tikhonov 2022-07-30 10:15:25 UTC
Upstream PR: https://github.com/SSSD/sssd/pull/6289

Comment 30 Alexey Tikhonov 2022-07-30 10:26:43 UTC
(In reply to cagney from comment #27)
> 
> A look at sss_nss_mc_destroy_ctx() shows it zeroing out the locked mutex;

Yeap, already figured this out, thanks.

Comment 32 Alexey Tikhonov 2022-08-03 08:48:32 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6289

* `master`
    * 0f3a761ed9d654a61f8caed8eae3863c518b9911 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
* `sssd-2-7`
    * 03142f8de42faf4f75465d24d3be9a49c2dd86f7 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
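
A minimal sketch of the approach named in those commit messages (hypothetical demo_* names, not the actual sssd patch): the mutex lives in static storage outside the context it protects, so destroying or re-initializing the context can no longer clobber the lock that serializes those operations.

/* Compiles as a standalone translation unit: gcc -pthread -c mc_ctx_fix_demo.c */
#include <pthread.h>
#include <string.h>

struct demo_ctx {
    int initialized;
    /* ... memory-cache state, but no mutex embedded here ... */
};

/* The lock lives in static storage, outside the struct it serializes. */
static pthread_mutex_t demo_ctx_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct demo_ctx demo_ctx;

static void demo_destroy(struct demo_ctx *c)
{
    memset(c, 0, sizeof(*c));    /* safe now: the mutex is not part of *c */
}

void demo_recycle(void)
{
    pthread_mutex_lock(&demo_ctx_mutex);
    demo_destroy(&demo_ctx);
    pthread_mutex_unlock(&demo_ctx_mutex);   /* still a valid, intact mutex */
}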

Comment 34 yalzhang@redhat.com 2022-08-09 11:43:03 UTC
Ran the automated "domiftune" cases (53 cases) twice and could not reproduce the issue; the bug is fixed.

Comment 38 Peter Krempa 2022-08-25 13:06:40 UTC
*** Bug 2120965 has been marked as a duplicate of this bug. ***

Comment 39 IBM Bug Proxy 2022-08-25 13:16:03 UTC
------- Comment From tstaudt.com 2022-08-25 08:46 EDT-------
(In reply to comment #14)
> Also, can you please tell us what version of libvirt do you have?

libvirt-8.5.0-1.el9.s390x

We currently can't update due to another bug we're checking, but will eventually try the new version if required.

Comment 40 IBM Bug Proxy 2022-09-13 12:30:54 UTC
------- Comment From tstaudt.com 2022-09-13 08:28 EDT-------
Hello Red Hat,

it seems that the duplicate bug
Red Hat Bug 2120965 - RHEL9.1 Nightly Build - virsh becomes unusable after running virsh commands in parallel
is not fixed with this update.

How do you suggest we move on? I fear re-opening will mess up the mirroring.
Thanks.

Comment 41 Cagney 2022-09-13 13:22:17 UTC
See the fedora bugs:
Bug 2075736 - has workaround by not shutting down daemons
Bug 2111582 - fixed by sssd it seems
Bug 2123828 - workaround is to restart virtqemud before a run

Comment 42 Alexey Tikhonov 2022-09-13 16:57:49 UTC
(In reply to IBM Bug Proxy from comment #40)
> ------- Comment From tstaudt.com 2022-09-13 08:28 EDT-------
> Hello Red Hat,
> 
> it seems that the duplicated bug
> Red Hat Bug 2120965 - RHEL9.1 Nightly Build - virsh becomes unusable after
> running virsh commands in parallel
> is not fixed with this update.
> 
> How do you suggest we move on? I fear re-opening will mess up the mirroring.

I think you can re-open bz 2120965; there is no mirroring.

Comment 43 IBM Bug Proxy 2022-09-14 07:50:58 UTC
------- Comment From tstaudt.com 2022-09-14 03:42 EDT-------
I have reopened Red Hat bug 2120965, and the correct mirror link between that and the IBM bug will be restored later today.
Thanks.

Comment 45 errata-xmlrpc 2022-11-15 11:17:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8325

