Bug 2111582 - virtqemud deadlocking
Summary: virtqemud deadlocking
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 36
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Alexey Tikhonov
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: sync-to-jira
Duplicates: 2116164 2120680
Depends On: 2109451
Blocks:
 
Reported: 2022-07-27 14:36 UTC by Cagney
Modified: 2023-04-25 18:55 UTC
CC: 23 users

Fixed In Version: sssd-2.7.4-1.fc36
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-25 18:55:15 UTC
Type: Bug
Embargoed:


Attachments
backtrace from virtqemud (21.93 KB, text/plain)
2022-07-27 14:36 UTC, Cagney


Links
Red Hat Issue Tracker SSSD-4910 (last updated 2022-07-30 10:27:44 UTC)

Description Cagney 2022-07-27 14:36:58 UTC
Created attachment 1899683 [details]
backtrace from virtqemud

Description of problem:

Cloning an issue found in bug 2075736.

Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. enable the modular virt daemons
2. configure the modular daemons to not time out, via:
# grep ARGS /etc/sysconfig/virt*
/etc/sysconfig/virtnetworkd:VIRTNETWORKD_ARGS=""
/etc/sysconfig/virtqemud:VIRTQEMUD_ARGS=""
/etc/sysconfig/virtstoraged:VIRTSTORAGED_ARGS=""
3. load up libvirt with lots of creating, starting, stopping and deleting


Actual results:

virt-install, for instance, hangs (for hours)

Additional info:

Workaround 1 is:
  sudo gstack $(pgrep virtqemud) > /dev/null
which will unblock virtqemud (presumably attaching with ptrace interrupts the futex syscall the lock is blocked in, and the lock then succeeds on retry)

Workaround 2 is to stick with the monolithic daemon?

Comment 1 Cagney 2022-07-29 17:16:08 UTC
Backtrace (with debuginfo) of the rpc-libvirtd thread stuck in a futex wait:

(gdb) thread 22 bt
[Switching to thread 22 (Thread 0x7f2239ffb640 (LWP 56946))]
#0  futex_wait (private=0, expected=2, futex_word=0x7f229c35c108 <pw_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
Downloading 0.00 MB source file /usr/src/debug/glibc-2.35-15.fc36.x86_64/nptl/../sysdeps/nptl/futex-internal.h
146	  int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f229c35c108 <pw_mc_ctx+8>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f229c35c108 <pw_mc_ctx+8>, private=0) at lowlevellock.c:49
#2  0x00007f22c78901e2 in lll_mutex_lock_optimized (mutex=0x7f229c35c108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7f229c35c108 <pw_mc_ctx+8>) at pthread_mutex_lock.c:93
#4  0x00007f229c353736 in sss_mt_lock (ctx=0x7f229c35c100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:61
#5  sss_nss_mc_init_ctx (ctx=0x7f229c35c100 <pw_mc_ctx>, name=0x7f229c35903b "passwd") at src/sss_client/nss_mc_common.c:155
#6  sss_nss_mc_get_ctx (name=name@entry=0x7f229c35903b "passwd", ctx=ctx@entry=0x7f229c35c100 <pw_mc_ctx>) at src/sss_client/nss_mc_common.c:226
#7  0x00007f229c353aa7 in sss_nss_mc_getpwnam (name=name@entry=0x7f22241191b0 "1000", name_len=name_len@entry=4, result=result@entry=0x7f2239ffa400, buffer=buffer@entry=0x7f22240b5b00 "nginx", buflen=buflen@entry=1024)
    at src/sss_client/nss_mc_passwd.c:110
#8  0x00007f229c355d89 in _nss_sss_getpwnam_r (name=0x7f22241191b0 "1000", result=0x7f2239ffa400, buffer=0x7f22240b5b00 "nginx", buflen=1024, errnop=0x7f2239ffb5b8) at src/sss_client/nss_passwd.c:153
#9  0x00007f22c78dc3a3 in __getpwnam_r (name=name@entry=0x7f22241191b0 "1000", resbuf=resbuf@entry=0x7f2239ffa400, buffer=<optimized out>, buflen=1024, result=result@entry=0x7f2239ffa3f0) at ../nss/getXXbyYY_r.c:273
#10 0x00007f22c7f5d74a in virGetUserIDByName (name=0x7f22241191b0 "1000", uid=0x7f2239ffa4b4, missing_ok=missing_ok@entry=false) at ../src/util/virutil.c:714
#11 0x00007f22c7f5e3ef in virGetUserID (user=user@entry=0x7f22241191b0 "1000", uid=uid@entry=0x7f2239ffa4b4) at ../src/util/virutil.c:755
#12 0x00007f22c7f5e575 in virParseOwnershipIds (label=<optimized out>, uidPtr=uidPtr@entry=0x7f2239ffa544, gidPtr=gidPtr@entry=0x7f2239ffa540) at ../src/util/virutil.c:1320
#13 0x00007f229e4b4166 in qemuDomainGetImageIds
    (cfg=cfg@entry=0x7f225c20a010, vm=vm@entry=0x7f225c5521f0, src=src@entry=0x7f22a40be1b0, parentSrc=parentSrc@entry=0x7f22a40be1b0, uid=uid@entry=0x7f2239ffa544, gid=gid@entry=0x7f2239ffa540) at ../src/qemu/qemu_domain.c:7374
#14 0x00007f229e4bf02d in qemuDomainDetermineDiskChain (driver=0x7f225c205d80, vm=0x7f225c5521f0, disk=0x7f22a80e3aa0, disksrc=0x7f22a40be1b0, report_broken=true) at ../src/qemu/qemu_domain.c:7594
#15 0x00007f229e5558db in qemuProcessPrepareHostStorage (flags=17, vm=0x7f225c5521f0, driver=0x7f225c205d80) at ../src/qemu/qemu_process.c:6691
#16 qemuProcessPrepareHost (driver=0x7f225c205d80, vm=<optimized out>, flags=17) at ../src/qemu/qemu_process.c:7066
#17 0x00007f229e55b811 in qemuProcessStart
    (conn=0x7f22a0003b30, driver=0x7f225c205d80, vm=0x7f225c5521f0, updatedCPU=0x0, asyncJob=QEMU_ASYNC_JOB_START, migrateFrom=0x0, migrateFd=-1, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=<optimized out>) at ../src/qemu/qemu_process.c:7845
#18 0x00007f229e578726 in qemuDomainObjStart.constprop.0 (conn=0x7f22a0003b30, driver=driver@entry=0x7f225c205d80, vm=0x7f225c5521f0, flags=flags@entry=0, asyncJob=QEMU_ASYNC_JOB_START) at ../src/qemu/qemu_driver.c:6372
#19 0x00007f229e4dac6d in qemuDomainCreateWithFlags (dom=0x7f22b0022b80, flags=0) at ../src/qemu/qemu_driver.c:6424
#20 0x00007f22c812a816 in virDomainCreate (domain=domain@entry=0x7f22b0022b80) at ../src/libvirt-domain.c:6740
#21 0x000056079b57f33e in remoteDispatchDomainCreate (server=0x56079bd5c880, msg=0x56079f5d9100, args=<optimized out>, rerr=0x7f2239ffa970, client=0x56079d0f6200) at src/remote/remote_daemon_dispatch_stubs.h:4947
#22 remoteDispatchDomainCreateHelper (server=0x56079bd5c880, client=0x56079d0f6200, msg=0x56079f5d9100, rerr=0x7f2239ffa970, args=<optimized out>, ret=0x0) at src/remote/remote_daemon_dispatch_stubs.h:4926
#23 0x00007f22c801ae1f in virNetServerProgramDispatchCall (msg=0x56079f5d9100, client=0x56079d0f6200, server=0x56079bd5c880, prog=0x56079bdcb010) at ../src/rpc/virnetserverprogram.c:428
#24 virNetServerProgramDispatch (prog=0x56079bdcb010, server=0x56079bd5c880, client=0x56079d0f6200, msg=0x56079f5d9100) at ../src/rpc/virnetserverprogram.c:302
#25 0x00007f22c801b332 in virNetServerProcessMsg (srv=srv@entry=0x56079bd5c880, client=0x56079d0f6200, prog=<optimized out>, msg=0x56079f5d9100) at ../src/rpc/virnetserver.c:140
#26 0x00007f22c801b441 in virNetServerHandleJob (jobOpaque=0x56079d1edf00, opaque=0x56079bd5c880) at ../src/rpc/virnetserver.c:160
#27 0x00007f22c7f5827d in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:164
#28 0x00007f22c7f5778b in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#29 0x00007f22c788ce2d in start_thread (arg=<optimized out>) at pthread_create.c:442
#30 0x00007f22c79121b0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) print *ctx
$1 = {initialized = UNINITIALIZED, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967295, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 12 times>, "\377\377\377\377", '\000' <repeats 23 times>, __align = 0}, fd = -1, seed = 0, mmap_base = 0x0, mmap_size = 0, data_table = 0x0, dt_size = 0, hash_table = 0x0, ht_size = 0, active_threads = 1}
(gdb) 

#10 0x00007f22c7f5d74a in virGetUserIDByName (name=0x7f22241191b0 "1000", uid=0x7f2239ffa4b4, missing_ok=missing_ok@entry=false) at ../src/util/virutil.c:714
714	    while ((rc = getpwnam_r(name, &pwbuf, strbuf, strbuflen, &pw)) == ERANGE) {

Shouldn't name == "cagney"? Ah, probably because the VM has:

  <seclabel type='static' model='dac' relabel='yes'>
   <label>1000:107</label>
   <imagelabel>1000:107</imagelabel>
  </seclabel>
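
For illustration, here is a minimal, hypothetical C sketch (not the actual sssd code; all names are made up) of the failure mode suggested by the print *ctx output above and by the eventual upstream fix: a mutex embedded in the memory-cache context gets wiped during context destruction / re-initialization while another thread is already blocked on it, so the waiter is never woken. The zeroed mutex with __nusers wrapped to 4294967295 (-1) looks consistent with an unlock having been performed on an already-wiped lock.

/*
 * Hypothetical sketch of the deadlock pattern: a pthread mutex embedded in
 * a context struct is wiped (memset) during context re-initialization while
 * another thread is already blocked on it in futex_wait().  The blocked
 * thread is never woken, because the eventual unlock operates on a zeroed
 * mutex and therefore never issues a FUTEX_WAKE.  NOT the sssd code, just
 * an illustration of the failure mode.
 *
 * Build: gcc -pthread sketch.c -o sketch   (the program intentionally hangs)
 */
#include <pthread.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

struct mc_ctx {                     /* stand-in for the client MC context */
    pthread_mutex_t mutex;          /* BUG: mutex lives inside the struct */
    int fd;
    /* ... mmap base, table sizes, etc. ... */
};

static struct mc_ctx ctx = { .mutex = PTHREAD_MUTEX_INITIALIZER, .fd = -1 };

static void *lookup_thread(void *arg)
{
    (void)arg;
    /* Parks in futex_wait() because main() holds the mutex. */
    pthread_mutex_lock(&ctx.mutex);
    puts("lookup thread got the lock");      /* never reached */
    pthread_mutex_unlock(&ctx.mutex);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_mutex_lock(&ctx.mutex);          /* "holder" of the lock */
    pthread_create(&t, NULL, lookup_thread, NULL);
    sleep(1);                                /* let the waiter block */

    /* Context "destruction / re-initialization" wipes the whole struct,
     * including the mutex the other thread is blocked on (undefined
     * behaviour, but it illustrates what effectively happened here). */
    memset(&ctx, 0, sizeof(ctx));

    /* The unlock now sees an "unlocked, no waiters" mutex and never wakes
     * the blocked thread -> it sleeps in futex_wait() forever. */
    pthread_mutex_unlock(&ctx.mutex);

    pthread_join(t, NULL);                   /* hangs, like virtqemud */
    return 0;
}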

Comment 2 yalzhang@redhat.com 2022-07-29 23:45:38 UTC
It is very similar to bug 2109451. Could you please check the sssd version?

Comment 3 Cagney 2022-07-30 01:22:25 UTC
$ rpm -q sssd
sssd-2.7.3-1.fc36.x86_64

Comment 4 Cagney 2022-07-30 02:23:28 UTC
Probably the same; marking this as blocked by that.

This bug locks up in getpwnam*().
Bug 2109451 locks up in getpwuid*(),

so fixing the latter should fix this.

Comment 5 Alexey Tikhonov 2022-07-30 10:20:08 UTC
(In reply to cagney from comment #4)
> Probably the same

It is.
Fix already posted for review upstream: https://github.com/SSSD/sssd/pull/6289

Comment 6 Cagney 2022-07-30 13:18:59 UTC
This was extracted from Bug 2075736

Comment 7 Alexey Tikhonov 2022-08-03 08:48:17 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6289

* `master`
    * 0f3a761ed9d654a61f8caed8eae3863c518b9911 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
* `sssd-2-7`
    * 03142f8de42faf4f75465d24d3be9a49c2dd86f7 - CLIENT:MC: store context mutex outside of context as it should survive context destruction / re-initialization
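
For readers not familiar with the sssd internals, a hypothetical sketch of the direction the commit message above describes (keep the lock outside the object it protects, so wiping or re-initializing the context can never corrupt a mutex other threads are blocked on). Illustrative only, with made-up names; not the actual patch:

#include <pthread.h>
#include <string.h>

struct mc_ctx {
    int fd;
    /* mmap base, table sizes, etc. -- but no mutex in here any more */
};

/* The mutex lives outside the context it protects, so it survives
 * context destruction / re-initialization. */
static pthread_mutex_t mc_ctx_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct mc_ctx ctx = { .fd = -1 };

static void mc_ctx_reinit(void)
{
    pthread_mutex_lock(&mc_ctx_mutex);
    memset(&ctx, 0, sizeof(ctx));   /* safe: does not touch the mutex */
    ctx.fd = -1;
    pthread_mutex_unlock(&mc_ctx_mutex);
}

int main(void)
{
    /* Other threads can block on mc_ctx_mutex without any risk of the
     * lock being wiped underneath them. */
    mc_ctx_reinit();
    return 0;
}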

Comment 8 Alexey Tikhonov 2022-08-08 08:22:19 UTC
*** Bug 2116164 has been marked as a duplicate of this bug. ***

Comment 9 Cagney 2022-08-16 13:46:01 UTC
Did this fix get pushed to Fedora?

Comment 10 Alexey Tikhonov 2022-08-16 14:30:53 UTC
(In reply to cagney from comment #9)
> Did this fix get pushed to Fedora?

Not yet.

Comment 11 Alexey Tikhonov 2022-08-17 12:49:42 UTC
Moreover, an additional patch is required: https://github.com/SSSD/sssd/pull/6305

Comment 12 Lukas Slebodnik 2022-08-18 09:45:11 UTC
Maybe it would be faster to temporarily revert the related changes introduced in 2.7.3:
* https://sssd.io/release-notes/sssd-2.7.3.html

It would be nice to unblock Fedora users as fast as possible.
This BZ has been open for more than 3 weeks.

Comment 13 Alexey Tikhonov 2022-08-18 14:14:20 UTC
https://github.com/SSSD/sssd/pull/6305 was accepted upstream and tested by QE in bz 2116389

Comment 14 Alexey Tikhonov 2022-08-19 15:33:08 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/6305

* `master`
    * 4ac93d9c5df59cdb7f397b4467f1c1c4822ff757 - CLIENT:MC: pointer to the context mutex shouldn't be touched
    * 579cc0b266d5f8954bc71cfcd3fe68002d681a5f - CLIENT:MC: -1 is more appropriate initial value for fd
* `sssd-2-7`
    * d386e94ef49d95d7305a3e6578e41a2cf61dfc5c - CLIENT:MC: pointer to the context mutex shouldn't be touched
    * 0eae0862069e4bbbdd87b809193fc873f3003cff - CLIENT:MC: -1 is more appropriate initial value for fd
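
On the second commit ("-1 is more appropriate initial value for fd"): presumably the point is that 0 is a valid file descriptor (normally stdin), so a zero-initialized context could lead cleanup code to act on a descriptor the client never opened, whereas -1 is the conventional "not open" sentinel. A small hypothetical illustration (not the sssd code):

#include <unistd.h>

struct mc_ctx_example {
    int fd;                          /* -1 means "no cache file open" */
};

static void ctx_close(struct mc_ctx_example *c)
{
    if (c->fd >= 0) {                /* guard works because -1 is never
                                        a valid descriptor */
        close(c->fd);
        c->fd = -1;
    }
}

int main(void)
{
    struct mc_ctx_example c = { .fd = -1 };  /* start from the sentinel */
    ctx_close(&c);                           /* no-op, as intended; with
                                                fd = 0 this would close
                                                stdin */
    return 0;
}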

Comment 15 Alexey Tikhonov 2022-08-23 14:26:26 UTC
*** Bug 2120680 has been marked as a duplicate of this bug. ***

Comment 16 Lukas Slebodnik 2022-08-25 14:32:33 UTC
(In reply to Alexey Tikhonov from comment #14)
> Pushed PR: https://github.com/SSSD/sssd/pull/6305
> 
> * `master`
>     * 4ac93d9c5df59cdb7f397b4467f1c1c4822ff757 - CLIENT:MC: pointer to the
> context mutex shouldn't be touched
>     * 579cc0b266d5f8954bc71cfcd3fe68002d681a5f - CLIENT:MC: -1 is more
> appropriate initial value for fd
> * `sssd-2-7`
>     * d386e94ef49d95d7305a3e6578e41a2cf61dfc5c - CLIENT:MC: pointer to the
> context mutex shouldn't be touched
>     * 0eae0862069e4bbbdd87b809193fc873f3003cff - CLIENT:MC: -1 is more
> appropriate initial value for fd

It is great news that this is already fixed upstream.
But it looks like another week has passed and Fedora users are still affected.
Is there any ETA for when the patches will be backported to Fedora?
Or could the related changes introduced in 2.7.3 at least be temporarily reverted?

We (Fedora users) would appreciate some progress here :-)

Comment 17 Cagney 2022-09-01 12:24:03 UTC
It seems to help.
I'm back to virsh operations taking 1s (they should take essentially no time) rather than hanging. I'll follow that up in bug 2075736.

Comment 18 Ben Cotton 2023-04-25 18:27:44 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

