Given that sssd lacks a proper cache-change notification system that clients can use to know when to gather data, clients like GOA need to poll sssd constantly. When using the KCM backend we've noticed that this has a considerable impact on memory and CPU usage (and therefore battery usage on laptops). I suggest we switch to "default_ccache_name = KEYRING:" in /etc/krb5.conf.d/kcm_default_ccache until #1551648 is resolved.
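For reference, the change would amount to something like this in /etc/krb5.conf.d/kcm_default_ccache (a sketch: the stock file shipped by sssd-kcm sets KCM:, and KEYRING:persistent:%{uid} is the pre-KCM Fedora default syntax):

[libdefaults]
    default_ccache_name = KEYRING:persistent:%{uid}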
People who want to revert to keyring for "performance" reasons can simply dnf uninstall sssd-kcm. I would rather address why GOA is hammering the caches so much that it becomes a performance issue.
(In reply to Simo Sorce from comment #1)
> People who want to revert to keyring for "performance" reasons can simply
> dnf uninstall sssd-kcm. I would rather address why GOA is hammering the
> caches so much that it becomes a performance issue.

A bit unfair: first, all users are impacted by this, whether they notice or care is another matter. We could also argue that people who use kerberos at all (which I suspect is a small minority of our installed base) could just dnf install sssd-kcm, and so removing it altogether is the way to go.

My understanding is that GOA can't fix this without a notification mechanism from the cache to avoid polling. I will let rishi chime in (added him to needinfo), and indeed, if we can find a way to fix this from GOA, I'm fine with that too.

However, with my Fedora Workstation hat on, this change introduced a performance regression that was not present before and that should have been addressed. Most users do not use kerberos services. If neither upstream (GOA and SSSD) is willing to do the work to satisfy the other's requirements, I'm leaning towards suggesting what's best for the bulk of the userbase and reverting the change. Both parties have known this was an issue from the start, and two Fedora releases later there has been no progress.
Unless GOA constantly polls the cache at a very high rate there is no way there can be a performance/battery issue here. The question is: why is GOA hammering KCM? Sounds like a bug in GOA (note that I use sssd-kcm regularly and I do not have high CPU usage, so it is GOA causing it somehow).
(In reply to Simo Sorce from comment #3)
> Unless GOA constantly polls the cache at a very high rate there is no way
> there can be a performance/battery issue here.
> The question is: why is GOA hammering KCM? Sounds like a bug in GOA (note
> that I use sssd-kcm regularly and I do not have high CPU usage, so it is
> GOA causing it somehow).

Well, that's exactly what's happening. Quoting rishi from https://bugzilla.redhat.com/show_bug.cgi?id=1551648#c28:

> If the credentials cache doesn't support any notification mechanism, goa-
> identity-service polls the credentials cache every 5 seconds using the libkrb5
> API to stay in sync with any changes done using the krb5 CLI (i.e., kinit,
> kdestroy, etc.).

I do not know why it needs to do that; that's for rishi to clarify.
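To illustrate the kind of polling rishi describes, here is a minimal libkrb5 sketch (illustrative only, not GOA's actual code; `ctx` would come from krb5_init_context()):

```C
#include <krb5.h>

/* Sketch: check whether the default ccache currently holds a principal.
 * Run on a 5-second timer, every call like this wakes up sssd-kcm and
 * costs at least one KCM round trip, even when nothing has changed. */
static int ccache_has_principal(krb5_context ctx)
{
    krb5_ccache cc;
    krb5_principal princ;
    int found = 0;

    if (krb5_cc_default(ctx, &cc) != 0)   /* resolves e.g. KCM:1000 */
        return 0;
    if (krb5_cc_get_principal(ctx, cc, &princ) == 0) {
        found = 1;
        krb5_free_principal(ctx, princ);
    }
    krb5_cc_close(ctx, cc);
    return found;
}
```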
GOA uses the same code paths for KCM and KEYRING caches, but I won't be surprised if GOA's use of libkrb5 isn't 100% accurate, which manifests as weird misbehaviour with KCM caches. I learnt from Fabiano that KCM has gotten a lot better over time, so I will give it a shot again and see what I can come up with.
The KCM part of libkrb5 is not the best-tested part either, so the issue might be anywhere between goa->libkrb5->sssd_kcm.

What would help the most would be a reliable reproducer, or reproducing the issue on some VM or beaker machine we can poke at.
(In reply to Jakub Hrozek from comment #6)
> The KCM part of libkrb5 is not the best-tested part either, so the issue
> might be anywhere between goa->libkrb5->sssd_kcm.
>
> What would help the most would be a reliable reproducer, or reproducing
> the issue on some VM or beaker machine we can poke at.

You just need a vanilla F28 or F29 Workstation install; after you log in to your default desktop session you should see sssd_kcm eating CPU time.
(In reply to Alberto Ruiz from comment #7)
> (In reply to Jakub Hrozek from comment #6)
> > The KCM part of libkrb5 is not the best-tested part either, so the issue
> > might be anywhere between goa->libkrb5->sssd_kcm.
> >
> > What would help the most would be a reliable reproducer, or reproducing
> > the issue on some VM or beaker machine we can poke at.
>
> You just need a vanilla F28 or F29 Workstation install; after you log in to
> your default desktop session you should see sssd_kcm eating CPU time.

Nope, not happening here (as it was not happening for the last ~2 years I've been running KCM on my laptop..)
(In reply to Jakub Hrozek from comment #8) > Nope, not happening here (as it was not happening for the last ~2 years I've > been running KCM on my laptop..) What does "here" mean? Is your personal machine a Fedora Workstation 29 vanilla install?
Created attachment 1502929 [details] Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation Here's a screenshot of sssd_kcm showing up in top. The process goes up and down in CPU consumption at regular intervals of a couple of seconds.
(In reply to Alberto Ruiz from comment #9)
> (In reply to Jakub Hrozek from comment #8)
> > Nope, not happening here (as it was not happening for the last ~2 years I've
> > been running KCM on my laptop..)
>
> What does "here" mean? Is your personal machine a Fedora Workstation 29
> vanilla install?

Yes, since about a week ago when I got a new laptop. Before that, I was using a different machine, also with Workstation and KCM, and didn't see the issue there either.
(In reply to Jakub Hrozek from comment #11)
> (In reply to Alberto Ruiz from comment #9)
> > (In reply to Jakub Hrozek from comment #8)
> > > Nope, not happening here (as it was not happening for the last ~2 years I've
> > > been running KCM on my laptop..)
> >
> > What does "here" mean? Is your personal machine a Fedora Workstation 29
> > vanilla install?
>
> Yes, since about a week ago when I got a new laptop. Before that, I was
> using a different machine, also with Workstation and KCM, and didn't see the
> issue there either.

I can consistently reproduce on a freshly installed VM and on my own machine. Would you please create a VM as I suggested originally and check?
Alberto,
do you configure the machine in any specific way after first boot?
Do you join an IPA/AD domain?
Do you set up GOA to automatically fetch some kerberos credentials?
Do you just *not* use any kerberos authentication?

I believe the "auth" setup you apply on your vanilla VM may be important for being able to reproduce.
(In reply to Simo Sorce from comment #13)
> Alberto,
> do you configure the machine in any specific way after first boot?
> Do you join an IPA/AD domain?
> Do you set up GOA to automatically fetch some kerberos credentials?
> Do you just *not* use any kerberos authentication?
>
> I believe the "auth" setup you apply on your vanilla VM may be important
> for being able to reproduce.

What I meant by vanilla is that I just create a plain /etc/passwd user: no kerberos auth, no online accounts of any type, nothing. What a typical user would do when installing Fedora for the first time.
(In reply to Alberto Ruiz from comment #14)
> (In reply to Simo Sorce from comment #13)
> > Alberto,
> > do you configure the machine in any specific way after first boot?
> > Do you join an IPA/AD domain?
> > Do you set up GOA to automatically fetch some kerberos credentials?
> > Do you just *not* use any kerberos authentication?
> >
> > I believe the "auth" setup you apply on your vanilla VM may be important
> > for being able to reproduce.
>
> What I meant by vanilla is that I just create a plain /etc/passwd user:
> no kerberos auth, no online accounts of any type, nothing. What a typical
> user would do when installing Fedora for the first time.

Ah, OK, then I don't have a vanilla setup, as the very first thing I do is configure my redhat.com account...
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue; we are sorry that we were not able to fix it before Fedora 29 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Please file a new BZ with a better subject and a more deterministic, minimal reproducer.
#c0 is pretty clear to me. The reporter is asking for the ccache backend to be switched back to KEYRING due to battery drain on laptops caused by having to poll KCM.
> #c0 is pretty clear to me. The reporter is asking for the ccache backend to be switched back to KEYRING due to battery drain on laptops caused by having to poll KCM.

I do not have any problem with kcm on my laptop, and sssd-kcm works for many others. Sure, there might be issues in some cases, but it does not make sense to change the default for everyone to KEYRING due to corner cases. If somebody wants to change the default for everyone then it needs to be processed in the Fedora way, as the inverse of https://fedoraproject.org/wiki/Releases/27/ChangeSet#Kerberos_KCM_credential_cache_by_default

Removing /etc/krb5.conf.d/kcm_default_ccache or even sssd-kcm is a trivial solution for changing the default back to KEYRING for affected people.
Lukas, you are not a member of the SSSD team and you are not a maintainer of the sssd component. We welcome your contributions, but please refrain from making any decisions for us. I'm reopening the bug.

Since we can't reproduce it I do not think that switching the default is justified, but we should keep tracking it. There were many fixes in KCM over the past year, so perhaps it is no longer an issue? In the meantime, we should prioritize https://pagure.io/SSSD/sssd/issue/3568
(In reply to Alberto Ruiz from comment #10)
> Created attachment 1502929 [details]
> Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation
>
> Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> down in CPU consumption at regular intervals of a couple of seconds.

On this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is negligible, but it doesn't look *that* terrible (especially taking into account that on my F29 laptop gnome-shell usually "eats" 10+% at any given random moment...)
(In reply to Alexey Tikhonov from comment #21)
> (In reply to Alberto Ruiz from comment #10)
> > Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> > down in CPU consumption at regular intervals of a couple of seconds.
>
> On this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is
> negligible, but it doesn't look *that* terrible (especially taking into
> account that on my F29 laptop gnome-shell usually "eats" 10+% at any given
> random moment...)

I am not impressed by that argument; 1% of the CPU is outrageously high for a pervasive daemon that is not even used by most people. You are second only to GNOME Shell there, and that has an impact on battery life.

My question is: why is it consuming 1%? What value is it bringing to the end user? Is it worth it? Can we improve it?

If every little daemon used GNOME Shell as an excuse to eat obscene amounts of RAM or CPU, the default Fedora install would be unusable. Please take ownership of the impact of your design decisions on the overall system and the end users.
(In reply to Alberto Ruiz from comment #22)
> > > Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> > > down in CPU consumption at regular intervals of a couple of seconds.
> >
> > On this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is
> > negligible, but it doesn't look *that* terrible (especially taking into
> > account that on my F29 laptop gnome-shell usually "eats" 10+% at any given
> > random moment...)
>
> I am not impressed by that argument; 1% of the CPU is outrageously high for
> a pervasive daemon that is not even used by most people

I just wanted to make sure we are on the same page when we talk about "considerable impact on memory and CPU usage".

Well, if this is it, then it is really easily reproducible. Typical top output on my F29 VM:

 2930 test     20   0 3071680 252424 105284 S  1.7  6.3  0:09.29 gnome-shell
 3074 root     20   0  253684  30388   9100 S  1.0  0.8  0:01.98 sssd_kcm  <-----------
 1376 dirsrv   20   0 1716528 355488  70532 S  0.7  8.8  0:56.30 ns-slapd
 3487 test     20   0  569700  43788  32984 S  0.7  1.1  0:01.03 gnome-terminal-
 2055 pkiuser  20   0 3759372 509716  30400 S  0.3 12.6  0:30.09 java
 3067 test     20   0  471820  15476  13372 S  0.3  0.4  0:00.14 goa-identity-se
 3575 test     20   0  228912   5180   4280 R  0.3  0.1  0:00.99 top
    1 root     20   0  171624  14444   9392 S  0.0  0.4  0:02.22 systemd

But reading bz 1551648, GOA's polling does seem to be the root cause...
I am surprised sssd-kcm uses that much CPU on any given system; do we have any idea what it is doing? Is there some other process in the system that keeps querying the kerberos ccache, causing it to do work? Or do we have some timed tasks we can rearrange to avoid firing when there is nothing to do?
(In reply to Simo Sorce from comment #24)
> I am surprised sssd-kcm uses that much CPU on any given system; do we have
> any idea what it is doing?
> Is there some other process in the system that keeps querying the kerberos
> ccache, causing it to do work?

I got the impression that this process was goa (according to https://bugzilla.redhat.com/show_bug.cgi?id=1551648#c27)
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle. Changing version to 32.
Hi, with current master I no longer see sssd_kcm using this much CPU. Maybe it was already fixed.
Hopefully. Still seeing it with:

sssd-kcm-2.2.3-13.fc31.x86_64
gnome-online-accounts-3.34.1-1.fc31.x86_64
*** Bug 1670451 has been marked as a duplicate of this bug. ***
Can we please change the title of this bug to reflect what issue is being tracked?

If the issue really is an RFE to change back to keyring, I'd propose to close this bug as WONTFIX, because KCM is the only sane way to share ccaches with containers and Flatpaks, for example. Namespacing the keyring helps but does not allow you to build a portal for Flatpak, for example, which is what is needed in that case.
Being bitten by it here as well. A bit of history:

At work we use Fedora laptops connected to FreeIPA. In the past (around Fedora 30), we had this issue and we removed sssd-kcm to work around it. We retested with Fedora 31 and it was working fine, so we re-enabled sssd-kcm on the laptops.

Now that we have upgraded to Fedora 32, we see it again: sssd_kcm hitting 100% CPU, and the only way to make it spin down is killing goa-identity-service.
By then manually starting goa-identity-service like the following:

$ G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service

I get hundreds of lines per second on the screen with these entries:

(process:62345): libgoaidentity-DEBUG: 16:21:41.900: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.900: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.912: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.913: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Waiting for next operation
(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Refreshing identities
(process:62345): libgoaidentity-DEBUG: 16:21:41.929: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity unexpired
(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity 'simone.caronni' refreshed
(process:62345): libgoaidentity-DEBUG: 16:21:41.932: GoaIdentityService: credentials for account simone.caronni ensured for 73340 seconds
(process:62345): libgoaidentity-DEBUG: 16:21:41.944: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.944: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.946: GoaIdentityService: identity simone.caronni expired
(process:62345): libgoaidentity-DEBUG: 16:21:41.948: GoaIdentityService: could not ensure credentials for account simone.caronni: GDBus.Error:org.gnome.OnlineAccounts.Error.NotAuthorized: No credentials found in the keyring
(process:62345): libgoaidentity-DEBUG: 16:21:41.957: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.957: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.967: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.967: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Waiting for next operation
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Refreshing identities
And goa-daemon --replace keeps on printing this:

(goa-daemon:63090): GoaBackend-WARNING **: 16:24:52.576: secret_password_lookup_sync() returned NULL
goa-daemon-Message: 16:24:52.576: /org/gnome/OnlineAccounts/Accounts/account_1591340703_0: Setting AttentionNeeded to TRUE because EnsureCredentials() failed with: No credentials found in the keyring (goa-error-quark, 4)
goa-daemon-Message: 16:24:52.785: /org/gnome/OnlineAccounts/Accounts/account_1591340703_0: Setting AttentionNeeded to FALSE because EnsureCredentials() succeded
If I clear the multiple expired credential caches, goa-identity-service stops polling like crazy. It seems it's trying to renew the KCM entries that are expired instead of using the valid one:

$ klist -l
Principal name       Cache name
--------------       ----------
simone.caronni.XX    KCM:46200010:44571 (Expired)
simone.caronni.XX    KCM:46200010:17939
simone.caronni.XX    KCM:46200010:40637 (Expired)
simone.caronni.XX    KCM:46200010:97892 (Expired)

(realm redacted)
Related KCM bug: https://bugzilla.redhat.com/show_bug.cgi?id=1780308
In case it helps others: I had only one expired entry in `klist -l`, renewed it with `kinit`, and then the sssd-kcm process dropped from 50-100% CPU down to <=1%. The issue happened, AFAICT, for the first time in my long life with Fedora (since 19 or 20, now on 32). And just because it took me a while to find out: GOA = gnome-online-accounts (I assume).
This is interesting: so it seems that GOA is polling for all available tickets (even expired ones), which triggers the high CPU usage?

SSSD has implemented a notification mechanism and we have currently provided a scratch build to the desktop team to see if it fits their needs. But this is something that should be checked in GOA anyway.
Not sure it's the same issue, but I recently noticed sssd-kcm briefly (10-20 seconds) spiking to 100% CPU each time I bring up a remote VM's console in virt-manager (over a remote connection using SSH, with pubkey authentication). My machine is otherwise a self-contained F32 (local account, no need to use kerberos besides whatever is involved with fedpkg). Happy to collect whatever test data might be useful to troubleshoot, please advise.
I think a 10-20 second spike may be normal, because KCM is not running in the background all the time. It is started only when there is some activity, so this may be just kcm starting and serving the request. But feel free to attach sssd_kcm.log (/var/log/sssd/sssd_kcm.log) after setting debug_level = 0x3ff0 in /etc/sssd/sssd.conf:

[kcm]
debug_level = 0x3ff0

Output of the ssh connection with -vvv may also be helpful to see if there is supposed to be any interaction with kcm.
(In reply to Pavel Březina from comment #41)
> I think a 10-20 second spike may be normal, because KCM is not running in
> the background all the time. It is started only when there is some activity,
> so this may be just kcm starting and serving the request.

Not only does kcm spike the CPU for 20-ish seconds when I try to open a VM console in virt-manager, the VM console opening itself is delayed for roughly the same interval (i.e., it's waiting for sssd_kcm to spin its wheels, then give up, before moving on with the connection).

> But feel free to
> attach sssd_kcm.log (/var/log/sssd/sssd_kcm.log) after setting
> debug_level = 0x3ff0 in /etc/sssd/sssd.conf:
>
> [kcm]
> debug_level = 0x3ff0

There was no sssd.conf file in /etc/sssd/ on my machine. I created it with the content you indicated above, then did "systemctl enable sssd.socket" before opening one of my VMs in virt-manager. After the whole song and dance was over, I shut down both the sssd_kcm .service and .socket, but there was only one line in the log file:

(2020-08-24 10:52:23): [kcm] [orderly_shutdown] (0x0010): SIGTERM: killing children

> Output of the ssh connection with -vvv may also be helpful to see if there
> is supposed to be any interaction with kcm.

It's plausible that the ssh client tries a bunch of auth methods in some sort of order, and that it tries kerberos auth before falling back to pubkey, which is what I'm using between my workstation and the hypervisor running the VMs. That would explain why sssd_kcm "spinning" delays my VM guest consoles coming up. But I don't know how I would add "-vvv" to the ssh client session started by virt-manager. I did a plain command-line `ssh -vvv root@server` from the same workstation and am attaching that log.

If there's anything else I can collect in terms of "telemetry", please advise. Thanks!
Created attachment 1712383 [details] output of `ssh -vvv` while sssd_kcm socket is enabled
> I think a 10-20 second spike may be normal, because KCM is not running in the background all the time. It is started only when there is some activity, so this may be just kcm starting and serving the request.

If this is in fact what's happening, it may be a good idea to revisit the decision to have KCM shut itself off. 10-20 seconds of CPU spike on a modern CPU is a *lot* of activity.
Gabriel, ssh is really trying Kerberos authentication (gssapi) before pubkey:

debug1: Next authentication method: gssapi-with-mic
debug1: Unspecified GSS failure.  Minor code may provide more information
Configuration file does not specify default realm
debug1: Unspecified GSS failure.  Minor code may provide more information
Configuration file does not specify default realm

Enable kcm logging like this:

echo -e "\n[kcm]\ndebug_level=0x3ff0" >> /etc/sssd/sssd.conf
chmod 600 /etc/sssd/sssd.conf
systemctl restart sssd
systemctl stop sssd-kcm

Then reproduce the issue; /var/log/sssd/sssd_kcm.log should be populated.
Created attachment 1712821 [details] sssd_kcm.log collected while opening VM guest over ssh in virt-manager
(In reply to Gabriel Somlo from comment #42)
> Not only does kcm spike the CPU for 20-ish seconds when I try to open a VM
> console in virt-manager, the VM console opening itself is delayed for
> roughly the same interval (i.e., it's waiting for sssd_kcm to spin its
> wheels, then give up, before moving on with the connection).

Something interesting I notice when making a connection to a remote QEMU/KVM instance: opening the VM guest console causes virt-manager to start several ssh processes, one after the other, to handle the qemu+ssh remote connection.

$ virt-manager -c 'qemu+ssh://root.test/system?keyfile=id_rsa'

## Wait for guest console to load/display

$ ps -ef | egrep 'virt-manager|ssh'
justin  8827  2496  3 16:08 ? 00:00:17 /usr/bin/python3 /usr/share/virt-manager/virt-manager
justin 10126  8827  0 16:15 ? 00:00:00 ssh -l root -i id_rsa -T -e none -- fedora-32.jstephen.test sh -c 'if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U /var/run/libvirt/libvirt-sock'
justin 10316  8827  0 16:16 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10387  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10423  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10502  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10524  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10559  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10575  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin 10621  8827  0 16:17 ? 00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then CMD="nc -q 0 127.0.0.1 5900";else CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'

strace of sssd_kcm indicates that these ssh commands are the actual clients sending requests through libkrb5 to KCM (as opposed to /usr/libexec/goa-identity-service in the earlier comments). I believe each forked ssh command attempts GSSAPI client authentication, resulting in the KCM CPU utilization increase and the evident virt-manager delay.
(In reply to Gabriel Somlo from comment #46)

One thing I don't quite understand from these logs is why KCM receives 900+ GET_CRED_BY_UUID operation requests in less than 5 seconds:

$ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return' | wc -l
928
I can confirm Eric's comment 38: a single expired entry caused the process to spike up to 100%, and kinit calmed it down. Not sure why. Even then, the process keeps coming back at 1-2% when the system is idling; it sounds like something is going on even at idle.
(In reply to jstephen from comment #47)
> One thing I don't quite understand from these logs is why KCM receives
> 900+ GET_CRED_BY_UUID operation requests in less than 5 seconds:
>
> $ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return' | wc -l
> 928

We have seen this also in the GOA case (70k requests in two minutes). This is obviously what's spiking the CPU, but we were never able to reproduce it. Perhaps the key to reproducing it is to let credentials expire. Justin, can you try that please?
Some Kerberos context if it's helpful:

In both MIT krb5 and Heimdal, GET_CRED_BY_UUID is produced by krb5_cc_next_cred - in other words, during iteration. Simplified slightly, the code usually resembles:

```C
krb5_cc_cursor cur;
krb5_creds creds;
krb5_error_code ret;

krb5_cc_start_seq_get(context, cache, &cur);
while ((ret = krb5_cc_next_cred(context, cache, &cur, &creds)) == 0) {
    do_something(creds);                       /* inspect the credential */
    krb5_free_cred_contents(context, &creds);  /* free the per-iteration copy */
}
krb5_cc_end_seq_get(context, cache, &cur);
```

Unless an internal error occurs, krb5 will just return the status code from KCM as the result code from krb5_cc_next_cred. If expired credentials are related, maybe the status codes in that case need investigation?
(In reply to Pavel Březina from comment #49)
> (In reply to jstephen from comment #47)
> > One thing I don't quite understand from these logs is why KCM receives
> > 900+ GET_CRED_BY_UUID operation requests in less than 5 seconds:
> >
> > $ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return' | wc -l
> > 928
>
> We have seen this also in the GOA case (70k requests in two minutes). This
> is obviously what's spiking the CPU, but we were never able to reproduce it.
> Perhaps the key to reproducing it is to let credentials expire. Justin, can
> you try that please?

I have been unable to reproduce this, even with multiple expired TGTs in the ccache (setting the system date forward a day to force ticket expiration).

I am consistently seeing goa-identity-service trigger a refresh operation every 5 seconds on my Fedora 32 system:

(process:30194): libgoaidentity-DEBUG: 03:54:13.447: GoaKerberosIdentityManager: Refreshing identities
(process:30194): libgoaidentity-DEBUG: 03:54:18.448: GoaKerberosIdentityManager: Refreshing identities
(process:30194): libgoaidentity-DEBUG: 03:54:23.447: GoaKerberosIdentityManager: Refreshing identities
(process:30194): libgoaidentity-DEBUG: 03:54:28.448: GoaKerberosIdentityManager: Refreshing identities

(In reply to Simone Caronni from comment #34)
> (process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Refreshing identities
> [...]
> (process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity unexpired
> [...]
> (process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Refreshing identities

The logs from that comment show a refresh happening much more frequently than every 5 seconds:

(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Refreshing identities
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Refreshing identities

In the gnome-online-accounts code, there are other ways a refresh is triggered in addition to the 5-second polling timeout:

1  301 src/goaidentity/goakerberosidentitymanager.c <<on_identity_unexpired>> schedule_refresh (self);
2  327 src/goaidentity/goakerberosidentitymanager.c <<on_identity_needs_refresh>> schedule_refresh (self);
3 1320 src/goaidentity/goakerberosidentitymanager.c <<on_credentials_cache_changed>> schedule_refresh (self);
4 1326 src/goaidentity/goakerberosidentitymanager.c <<on_polling_timeout>> schedule_refresh (self);
5 1499 src/goaidentity/goakerberosidentitymanager.c <<goa_kerberos_identity_manager_initable_init>> schedule_refresh (self);

The first of these callers may be related to this problem:

(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity unexpired

It would be great to see a larger sample of debug logs from goa-identity-service while sssd_kcm is utilizing significant CPU. I hope that will give some clues to continue the investigation, or let us reproduce this issue locally.

# G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service
(In reply to Robbie Harwood from comment #50)
> Unless an internal error occurs, krb5 will just return the status code from
> KCM as the result code from krb5_cc_next_cred. If expired credentials are
> related, maybe the status codes in that case need investigation?

So if there is only one ticket in the ccache, krb5_cc_next_cred() should be called only once, and the result code should be what, precisely? KRB5_CC_END?
Almost. You're right that KRB5_CC_END is the correct code to stop iteration when there are no more credentials. (In practice, any nonzero value will stop iteration, since applications can't really do any recovery there anyway.) However, in your scenario, krb5_cc_next_cred() gets called twice: the first time it should succeed and return the ticket in the ccache, and the second time it should indicate completion by returning KRB5_CC_END.
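Concretely, for a single-ticket cache the iteration sketched earlier plays out like this (same variable names as the snippet above; error handling omitted):

```C
/* First call: returns 0 and fills `creds` with the one ticket. */
ret = krb5_cc_next_cred(context, cache, &cur, &creds);

/* Second call: nothing left to return, so it yields KRB5_CC_END. */
ret = krb5_cc_next_cred(context, cache, &cur, &creds);
```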
There does not seem to be any problem with the return code, since the end of the cursor is not tracked by KCM but by Kerberos. [1]

I wanted to debug it with Lukas (comment #48) but unfortunately he can not reproduce it anymore. Gabriel, can you still reproduce it? If yes, could you please set debug_level=0xfff0 (more verbose than what you set previously) and attach new logs? Would a remote debugging session (e.g. over tmate) be possible for you?

[1] https://github.com/krb5/krb5/blob/master/src/lib/krb5/ccache/cc_kcm.c#L787
Also the output of klist -l and klist -A -C might be helpful to see how many credentials are in the ccache.
Created attachment 1714776 [details] output of klist
Created attachment 1714778 [details]
sssd_kcm.log (xz compressed) with debug_level=0xfff0

Pavel: if the stuff I attached doesn't shed any light on the problem, I'll set up a "clean" VM from the same kickstart I used on my desktop, and configure it to allow access to the same hypervisor I'm connecting to with virt-manager. I'll figure out a way to get you access to it so you can try for yourself. But it might take a few days, so hopefully the logs contain a better clue :)

Thanks much, --Gabriel
Landed on this bug after a google search, so here are my two cents: Fedora 32, fresh install a while ago + regular dnf upgrades. Last week I started noticing my laptop fan spinning all the time, and top showed sssd_kcm consuming 100% CPU. Rebooted the laptop on the weekend, kept it on yesterday, noticed it again today (Tuesday). Unfortunately I couldn't debug this further. I removed sssd_kcm given that AFAIK it's not essential to my desktop use-case.
Thank you. I can reproduce high CPU load by artificially obtaining lots of service tickets.

The problem is that the KCM protocol implemented by sssd-kcm is not very efficient: when a process needs to iterate over the ccache, one round trip between the process and sssd-kcm is required for every credential. The more credentials there are, the more time it takes sssd-kcm to answer each request. To make things worse, gssapi, which is used by ssh when it tries to connect to a host via Kerberos, goes over each ccache several times (16 times if I remember correctly), so Gabriel's particular situation requires around 5k requests to sssd-kcm (300 credentials * 16), which creates this heavy load and delay. Also, gnome-online-accounts polls the ccache every 5 seconds, which also takes some time for larger numbers of credentials.

I have identified at least two bottlenecks (ccache encryption and parsing JSON data). Justin is going to work on that.

Gabriel, there are lots of duplicate refresh_time control credentials, which causes your ccache to grow. These are a special type of record stored in the ccache that is used by various mechanisms; however, there should be only one such record. I was not able to reproduce it and get more than one refresh time. Can you please describe how you use Kerberos — how do you kinit? Can you check after which operation the number of entries in klist -C grows?
Pavel, All I really ever do with Kerberos is use it to occasionally update RPM packages I maintain in fedora: ``` kinit somlo ``` Over time, I ended up with all that cruft in the output of `klist -C` I showed you before, and the following in the output of just `klist` witout `-C`: ``` 08/09/2020 20:41:18 08/10/2020 20:39:03 HTTP/src.fedoraproject.org renew until 08/16/2020 20:39:03 config: fast_avail(krbtgt/FEDORAPROJECT.ORG) = yes 08/09/2020 20:39:06 08/10/2020 20:39:03 krbtgt/FEDORAPROJECT.ORG renew until 08/16/2020 20:39:03 config: pa_type(krbtgt/FEDORAPROJECT.ORG) = 151 08/09/2020 20:42:18 08/10/2020 20:39:03 HTTP/koji.fedoraproject.org renew until 08/16/2020 20:39:03 ``` Based on your last message, I dug around and found `kdestroy -A`. After running it, sssd-kcm now only goes up to 20% CPU for a few seconds while I bring up a remote VM over a `qemu+ssh://root@hypervisor` connection, not enough to start my laptop's fans whining :) I just re-ran `kinit somlo`, and all I see in `klist -C` is: ``` Ticket cache: KCM:1000 Default principal: somlo Valid starting Expires Service principal config: fast_avail(krbtgt/FEDORAPROJECT.ORG) = yes 09/23/2020 11:56:23 09/24/2020 11:56:19 krbtgt/FEDORAPROJECT.ORG renew until 09/30/2020 11:56:19 config: pa_type(krbtgt/FEDORAPROJECT.ORG) = 2 ``` So the way I see it, I was having two problems: 1. an accumulation of cruft in my ccache, which I cleaned up with `kdestroy -A` 2. the fact that sssd-kcm does suboptimal things when the cache grows unwieldy (which I now realize may be only *partially* its fault...) Armed with this knowledge, I will now re-enable sssd-kcm.socket on my system (I was only firing it up selectively during RPM maintenance jobs), and use `kdestroy -A` liberally whenever things get noticeably screwed up over time, which I will start noticing whenever I bring up remote VMs in virt-manager. Hope this makes sense, and gives you some useful hints as to what, if anything, may still be broken... Thanks, --Gabriel
Thanks, I will try to reproduce it with Fedora services. If you don't require the tickets to outlive a reboot, you can set ccache_storage = memory in the [kcm] section; this is much faster, but the tickets are not persistent.
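That is, something like this in /etc/sssd/sssd.conf (a minimal sketch; keep any other [kcm] options, such as debug_level, that you already have):

[kcm]
ccache_storage = memory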
Upstream ticket: https://github.com/SSSD/sssd/issues/5349
Upstream PR: https://github.com/SSSD/sssd/pull/5375
I'm hitting this as well. My laptop uses krb5 to authenticate to an el8 NFS server with autofs, and I have dozens of strange entries in root's cache:

[root@phos ~]# klist -A
Ticket cache: KCM:0:75962
Default principal: nfs/phos.ktdreyer.com

Valid starting       Expires              Service principal
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
[tons of identical entries snipped here]

Killing the sssd_kcm PID or stopping the systemd service immediately spawns a new process with 100% CPU again. I've tried clearing every cache by running "kdestroy -A" as my normal UID and as root, with no effect. I tried creating /etc/sssd/sssd.conf with "ccache_storage = memory" in a "[kcm]" section, with no effect.
I had the same issue with gdm and console logins (not with ssh). In a FreeIPA environment, the only host (FC33) with KCM hangs at login. I suppose it is the only host using KCM because it was the only one installed from scratch with FC32; all the others were upgraded from previous FC releases.

No GOA installed; KCM runs at 100%. The logs show hundreds of GET_CRED_BY_UUID requests. klist indeed shows hundreds of tickets:

$ klist -A
Ticket cache: KCM:1172000026:72700
Default principal: host/host1.lan.example.com.COM

Valid starting       Expires              Service principal
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
(hundreds of these)

As these are about a host principal, I suppose they were generated by FreeIPA, not by user logins. I don't quite understand in which situations they are generated, and why they are dated at epoch 0. I do hope PR https://github.com/SSSD/sssd/pull/5375 will improve the situation, but I remain in the dark about what causes the accumulation of this kind of ticket in the first place.
The entries you see in the KCM ccache are normal entries generated by GSS-Proxy. They are encrypted, and the dates on the outer entry (the encrypted ticket) are fake; the real ones are inside these entries and accessible to the gssproxy process. Nothing in FreeIPA actually generates these KCM entries. It looks like you have some software that runs under UID 1172000026 and is configured to authenticate with the host keytab through GSS-Proxy. At least, this is not a default for any FreeIPA-related configuration.
As Alexander says. However, those entries should not pile up; every new entry should replace the previous one, so we need to investigate why this is not happening and why they are piling up instead.
Alexander, Simo, thanks for your clarifications, it helps. I will look for any process that might be trying to authenticate.
Roberto, generally the process is initiated by an NFS client mount and performed by rpc.gssd, intercepted by gss-proxy.
Simo, good point. Indeed we have automounted home folders, but they are CephFS mounts, not NFS.
About "piling up": I removed all existing entries with kdestroy -A (and checked klist -A; it was empty). After restarting sssd-kcm and gdm (gdm was confused), the login from gdm worked perfectly, but immediately after that all the entries were back. That is, they didn't pile up over time; one login was enough.
Please disregard my last comment. It actually isn't true. After removing all entries it looks good, also after a login. I had issued the command as root (which apparently also needs some cleaning up).
Pushed PR: https://github.com/SSSD/sssd/pull/5375

* `master`
    * 325de5a5bb97ba026be6d22492bea8ab2605f1b5 - secrets: remove base64 enctype
    * 39277cdadd317b0ab86cdd37de0616bc3eecbe6a - secrets: move attrs names to macros
    * 9c1b51d057390fb5b26151f814a480911cda4cc9 - secrets: default to "plaintext" if "enctype" attr is missing
    * bf127d4f3f42e5b2afe25e512211439bc12a9904 - secrets: fix may_payload_size exceeded debug message
    * c3b314db57c34f64aaca7d74e76a9a955288bb51 - kcm: store credentials list in hash table to avoid cache lookups
    * a370553c90c2ed6df3b94c169c4960a6f978031f - sss_ptr_hash: fix double free for circular dependencies
    * 241ee30da12f564803793ee2b14c1522aabd9235 - kcm: add per-connection data to be shared between requests
    * 194447d35c11eb914f54719491dc5cfaab01b9a1 - kcm: use binary format to store ccache instead of json
    * f17740d831e16449495fff4ec57cc4800aaac83d - kcm: add spaces around operators in kcmsrv_ccache_key.c
    * 15069a647ed6c7f1ead42baa1d421d953c9bc557 - kcm: avoid suppression of cppcheck warning
    * e63a15038ac9c186626e4fdf681a6492031d1e40 - kcm: move sec key parser to separate file so it can be shared
    * 9b1631defdcaa3ea7e87889eb136e7fa935ab4ce - kcm: add json suffix to existing searialization functions
    * b6cc661b9f4162e590137430e945aa321fc13121 - iobuf: add more iobuf functions
    * ed08ba0023e63024bf1c52ae3f6596b9d804d0a5 - secrets: accept binary data instead of string
    * 908c15af9a9f8f0556a588e368e4a0b2e24ace1b - secrets: allow to specify secret's data format
    * 74fdaa64b27e88a6e0f153f8cb59989c572d4294 - kcm: avoid multiple debug messages if sss_sec_put fails
    * b8f28d9aa9d862cf504691c9c3f92941a63fb0a4 - kcm: disable encryption
    * 8edcea8c377e85d037e83065c1904fa4b92c4a39 - kcm: avoid name confusion in GET_CRED_UUID_LIST handlers
    * 47a316c850107f12d406f27abb216e26383dfab7 - kcm: fix typos in debug messages

These patches improve the performance significantly, removing bottlenecks in sssd-kcm. However, it is still rather slow compared to other ccache types; further improvements will require changes to the KCM protocol, which still need to be communicated to and accepted by the protocol owners. I will make the patches available in Fedora shortly.
(In reply to Roberto Cornacchia from comment #75)
> Please disregard my last comment. It actually isn't true. After removing all
> entries it looks good, also after a login. I had issued the command as root
> (which apparently also needs some cleaning up).

If you manage to reproduce it, please open a separate bug so we can investigate.
Pavel, if you could share the Fedora package version when it's available, I would be glad to help with testing.
These patches are now available in Fedora Rawhide (sssd-2.4.0-5) and Fedora 33 (sssd-2.4.0-3; https://bodhi.fedoraproject.org/updates/FEDORA-2020-075cc71cb8). I will keep this BZ open to track further improvements in the KCM protocol.
Additional patch.

* `master`
    * 18b98836ef8e337992f0ecb239a32b9c3cedb750 - kcm: decode base64 encoded secret on upgrade path
New update: https://bodhi.fedoraproject.org/updates/FEDORA-2020-b61158b5ff The previous update did not correctly handle having existing credentials in the ccache.
No FC32 package? I feel like if I upgrade to 33 it will not be an exact reproduction of the previous behavior.
I didn't plan to include it in F32 because the changes are quite large. However, it builds fine; you can try this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=59111394 If it works well, I will create an update for it.
Thanks. I've saved this scratch build (with "koji download-task 59111394") to https://fedorapeople.org/~ktdreyer/bz1645624/koji-task-59111394/ in case koji-gc deletes it.
> People who want to revert to keyring for "performance" reasons can simply dnf uninstall sssd-kcm

Is that supposed to work for AD-enrolled machines? We have some enrolled RHEL 8 multi-user (20-50 users) machines. Changing /etc/krb5.conf.d/kcm_default_ccache to use KEYRING has no effect (kdestroy && su USER && klist still shows a KCM ccache). If I 'dnf remove sssd-kcm', password authentication is broken completely: 'su USER' just fails when given the correct password. Is there anything else that needs to be done after uninstalling sssd-kcm? Clearing the sssd cache or anything similar?

I don't really mind that sssd-kcm burns 100% CPU on one core constantly, but after some uptime of the host it seems to be so overloaded that logins start to fail. I don't really know what in the login process depends on sssd-kcm, but when I tried to debug this problem, I saw endlessly respawning `krb5_child` processes. It seemed like the process got stuck (after "[krb5_child[21032]] [switch_creds] (0x0200): Switch user to ..."), sssd killed it due to a timeout, then restarted the process.

(In our case it's probably not the goa daemon causing the 100% CPU issue, though: we have already blacklisted it, as it spams the systemd journal and makes the log unusable (https://bugzilla.redhat.com/show_bug.cgi?id=1601438).)
(In reply to Johannes Pfau from comment #86)
> > People who want to revert to keyring for "performance" reasons can simply dnf uninstall sssd-kcm
>
> Is that supposed to work for AD-enrolled machines? We have some enrolled
> RHEL 8 multi-user (20-50 users) machines.

Yes, definitely.

> Changing /etc/krb5.conf.d/kcm_default_ccache to use KEYRING has no effect
> (kdestroy && su USER && klist still shows a KCM ccache). If I 'dnf remove
> sssd-kcm'

What do `echo $KRB5CCNAME` and `cat /etc/krb5.conf` say?
Thanks for the suggestions, here's some more information:

> /etc/krb5.conf

I didn't do the initial setup, so I don't know whether this was autogenerated by realm join or whether it has been altered manually...

--------------------------------------------
includedir /etc/krb5.conf.d/
includedir /var/lib/sss/pubconf/krb5.include.d/

[logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

[libdefaults]
    dns_lookup_realm = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    rdns = false
    default_ccache_name = KEYRING:persistent:%{uid}
    default_realm = KIT.EDU

[realms]
    KIT.EDU = {
    }

[domain_realm]
    kit.edu = KIT.EDU
    .kit.edu = KIT.EDU
--------------------------------------------

> echo $KRB5CCNAME

With sssd-kcm installed:
* As normal user, SSH / public key login: empty
* After su USER: "KCM:"
* As root: "KCM:"

If I uninstall sssd-kcm, I can't really get that information, as logins now fail. But if I comment out the lines in "/etc/krb5.conf.d/kcm_default_ccache", restart sssd and log in again, I still get "KCM:". Maybe rebooting the machine after uninstalling sssd-kcm would help, but I can't test that, as there are still some active, long-running sessions on that machine I don't want to interrupt.

With sssd-kcm removed, /var/log/sssd/krb5_child.log contains this:

--------------------------------------------
(2021-01-14 16:55:30): [krb5_child[1841552]] [become_user] (0x0200): Trying to become user [790849740][287600513].
(2021-01-14 16:55:31): [krb5_child[1841552]] [k5c_send_data] (0x0200): Received error code 1432158209
--------------------------------------------

When enabling verbose logging:

--------------------------------------------
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): krb5_child started.
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x1000): total buffer size: [128]
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x0100): cmd [241] uid [790849740] gid [287600513] validate [true] enterprise principal [true] offline [false] UPN [nq5949]
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:790849740] old_ccname: [KCM:] keytab: [/etc/krb5.keytab]
(2021-01-14 16:59:53): [krb5_child[1844742]] [check_use_fast] (0x0100): Not using FAST.
(2021-01-14 16:59:53): [krb5_child[1844742]] [switch_creds] (0x0200): Switch user to [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_open_ccache_as_user] (0x0400): ccache KCM: is missing or empty
(2021-01-14 16:59:53): [krb5_child[1844742]] [switch_creds] (0x0200): Switch user to [0][0].
(2021-01-14 16:59:53): [krb5_child[1844742]] [old_ccache_valid] (0x0400): Saved ccache KCM: doesn't exist, ignoring
(2021-01-14 16:59:53): [krb5_child[1844742]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(2021-01-14 16:59:53): [krb5_child[1844742]] [become_user] (0x0200): Trying to become user [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x2000): Running as [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): Will perform online auth
(2021-01-14 16:59:53): [krb5_child[1844742]] [tgt_req_child] (0x1000): Attempting to get a TGT
(2021-01-14 16:59:53): [krb5_child[1844742]] [get_and_save_tgt] (0x0400): Attempting kinit for realm [KIT.EDU]
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_krb5_expire_callback_func] (0x2000): exp_time: [525869292]
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x2000): Found keytab entry with the realm of the credential.
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x0400): TGT verified using key for [ITIV-WORK5$@KIT.EDU].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [nq5949\@KIT.EDU] might not be correct.
(2021-01-14 16:59:53): [krb5_child[1844742]] [get_and_save_tgt] (0x2000): Running as [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [create_ccache] (0x0020): 991: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [map_krb5_error] (0x0020): 1833: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [k5c_send_data] (0x0200): Received error code 1432158209
(2021-01-14 16:59:53): [krb5_child[1844742]] [pack_response_packet] (0x2000): response packet size: [20]
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): krb5_child completed successfully
--------------------------------------------

Even with sssd-kcm removed, I can successfully run kinit (as user, after login with SSH / public key):

--------------------------------------------
kinit
Password for nq5949:
nq5949@itiv-work5 ~ % klist
Ticket cache: KEYRING:persistent:790849740:790849740
Default principal: nq5949

Valid starting     Expires            Service principal
01/14/21 17:13:15  01/15/21 03:13:15  krbtgt/KIT.EDU
        renew until 01/21/21 17:13:12
--------------------------------------------

So kinit seems to accept these settings, but going through pam / su is broken for some reason. Here's the sssd.conf, for reference:

--------------------------------------------
[sssd]
domains = kit.edu
config_file_version = 2
services = nss, pam

[nss]
filter_groups = root
filter_users = root
enum_cache_timeout = 86400
entry_cache_nowait_percentage = 10

[domain/kit.edu]
ad_domain = kit.edu
krb5_realm = KIT.EDU
realmd_tags = manages-system joined-with-adcli
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = False
override_homedir = /home/%u
fallback_homedir = /home/%u@%d
create_homedir = false
access_provider = simple
auth_provider = ad
simple_allow_groups = [...]
dyndns_update = False
ignore_group_members = False
ldap_use_tokengroups = False
ldap_group_nesting_level = 0
entry_cache_timeout = 86400
entry_cache_user_timeout = 86400
entry_cache_group_timeout = 86400
entry_cache_netgroup_timeout = 86400
refresh_expired_interval = 3600
lookup_family_order = ipv4_only
#debug_level = 0x200
debug_level = 0x3ff0
--------------------------------------------
So after uninstalling sssd-kcm and rebooting, the error message in krb5_child.log changed from `[sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [111][Connection refused]` to `[sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [-1750600181][No KCM server found]`.

After some more tests, it seems sssd caches the `default_ccache_name` somewhere, somehow. sss_cache -E did not seem to solve the problem, but after I deleted `/var/lib/sss/db/*`, password logins started to work again. So, to summarize:

* Remove sssd-kcm
* Stop sssd
* sss_cache -E
* rm -rf /var/lib/sss/db/*
* Start sssd

I think the reason logins started to fail with sssd-kcm is also the krb5_cc_cache_match call in krb5_child: if sssd-kcm is overloaded for some reason and does not respond early enough, the krb5_child task will time out and the login will fail. But that's just a guess, I didn't debug that case further.
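For anyone else hitting this, the summary above translates into roughly the following shell session. This is a sketch, not an exact transcript: the dnf/systemctl invocations are my assumption of the obvious commands, run as root.

# Recovery sequence sketch (assumes a Fedora host with systemd and dnf).
dnf remove -y sssd-kcm       # drop the KCM responder
systemctl stop sssd          # stop sssd before touching its caches
sss_cache -E                 # expire all cached entries
rm -rf /var/lib/sss/db/*     # remove the on-disk caches, including the cached default_ccache_name
systemctl start sssd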
SSSD should be able to handle this change, at least when a new user session is created. If not, then it's a bug; I will test this.
I have the same problem coming back on F33 as well, with sssd 2.4.2-2.fc33. I installed debuginfo and ran a few pstack runs:

Installed:
  sssd-debuginfo-2.4.2-2.fc33.x86_64
  sssd-debugsource-2.4.2-2.fc33.x86_64
  sssd-kcm-debuginfo-2.4.2-2.fc33.x86_64

Complete!

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b0c659f5 in tdb_next_hash_chain () from /lib64/libtdb.so.1
#1 0x00007f70b0c69dc6 in tdb_next_lock () from /lib64/libtdb.so.1
#2 0x00007f70b0c69ffe in tdb_traverse_internal () from /lib64/libtdb.so.1
#3 0x00007f70b0c6a22d in tdb_traverse_read () from /lib64/libtdb.so.1
#4 0x00007f70a31c9cdd in ltdb_traverse_fn () from /usr/lib64/ldb/libldb-tdb-int.so
#5 0x00007f70a3001751 in ldb_kv_search () from /usr/lib64/ldb/libldb-key-value.so
#6 0x00007f70a3003288 in ldb_kv_callback () from /usr/lib64/ldb/libldb-key-value.so
#7 0x00007f70b12f9939 in tevent_common_invoke_timer_handler () from /lib64/libtevent.so.0
#8 0x00007f70b12f9adc in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#9 0x00007f70b12fa2c8 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b0cecc3b in ldb_wait () from /lib64/libldb.so.2
#13 0x00007f70b0cee114 in ldb_search () from /lib64/libldb.so.2
#14 0x00007f70b134215b in sss_sec_list (_num_keys=0x7ffeeb742dc0, _keys=0x7ffeeb742dc8, req=0x56195f13a050, mem_ctx=0x56195f13fc30) at src/util/secrets/secrets.c:964
#15 sss_sec_list (mem_ctx=0x56195f13fc30, req=0x56195f13a050, _keys=0x7ffeeb742dc8, _num_keys=0x7ffeeb742dc0) at src/util/secrets/secrets.c:940
#16 0x000056195ebd1737 in key_by_uuid (mem_ctx=0x56195f13fbc0, sctx=0x56195f13a860, client=<optimized out>, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&", _key=0x7ffeeb742e68) at src/responder/kcm/kcmsrv_ccache_secdb.c:362
#17 0x000056195ebd19b5 in ccdb_secdb_name_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x561960551710, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&") at src/responder/kcm/kcmsrv_ccache_secdb.c:1046
#18 0x000056195ebd5d02 in kcm_ccdb_name_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=0x56195f13a980, client=0x561960551710, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&") at src/responder/kcm/kcmsrv_ccache.c:952
#19 0x000056195ebdbe47 in kcm_op_get_default_ccache_list_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:1787
#20 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#21 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#22 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#23 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#24 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#25 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#26 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#27 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#28 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b114d6ae in epoll_ctl () from /lib64/libc.so.6
#1 0x00007f70b12f57d7 in epoll_update_event () from /lib64/libtevent.so.0
#2 0x00007f70b12f61e4 in epoll_event_set_fd_flags () from /lib64/libtevent.so.0
#3 0x000056195ebcb65e in kcm_send_reply (req_ctx=0x56195fb5f3e0) at src/responder/kcm/kcmsrv_cmd.c:366
#4 kcm_cmd_request_done (req=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:415
#5 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#6 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#7 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#8 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#9 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#11 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#12 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#13 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1 0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2 0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485760, capacity=10485760) at src/util/sss_iobuf.c:33
#3 0x000056195ebcd3dd in kcm_cmd_send (op=0x56195ebed200 <kcm_optable+480>, input=0x56195f139ed8, client=0x56195fb466e0, conn_data=0x56195fb5af70, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195f139e30) at src/responder/kcm/kcmsrv_ops.c:117
#4 kcm_cmd_dispatch (req_ctx=0x56195f139e30, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5 kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6 sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7 kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8 0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9 0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1 0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2 0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485752, capacity=10485752) at src/util/sss_iobuf.c:33
#3 0x000056195ebcd4b4 in kcm_cmd_send (op=<optimized out>, input=0x56195fb5f488, client=0x56195f13ff40, conn_data=<optimized out>, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195fb5f3e0) at src/responder/kcm/kcmsrv_ops.c:161
#4 kcm_cmd_dispatch (req_ctx=0x56195fb5f3e0, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5 kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6 sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7 kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8 0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9 0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b11b0666 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1 0x00007f70b0c67ad1 in tdb_read () from /lib64/libtdb.so.1
#2 0x00007f70b0c6a041 in tdb_traverse_internal () from /lib64/libtdb.so.1
#3 0x00007f70b0c6a22d in tdb_traverse_read () from /lib64/libtdb.so.1
#4 0x00007f70a31c9cdd in ltdb_traverse_fn () from /usr/lib64/ldb/libldb-tdb-int.so
#5 0x00007f70a3001751 in ldb_kv_search () from /usr/lib64/ldb/libldb-key-value.so
#6 0x00007f70a3003288 in ldb_kv_callback () from /usr/lib64/ldb/libldb-key-value.so
#7 0x00007f70b12f9939 in tevent_common_invoke_timer_handler () from /lib64/libtevent.so.0
#8 0x00007f70b12f9adc in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#9 0x00007f70b12fa2c8 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b0cecc3b in ldb_wait () from /lib64/libldb.so.2
#13 0x00007f70b0cee114 in ldb_search () from /lib64/libldb.so.2
#14 0x00007f70b134215b in sss_sec_list (_num_keys=0x7ffeeb742d70, _keys=0x7ffeeb742d78, req=0x56195fb67ff0, mem_ctx=0x56195f13fc30) at src/util/secrets/secrets.c:964
#15 sss_sec_list (mem_ctx=0x56195f13fc30, req=0x56195fb67ff0, _keys=0x7ffeeb742d78, _num_keys=0x7ffeeb742d70) at src/util/secrets/secrets.c:940
#16 0x000056195ebd1737 in key_by_uuid (mem_ctx=0x56195f13fbc0, sctx=0x56195f13a860, client=<optimized out>, uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V", _key=0x7ffeeb742e18) at src/responder/kcm/kcmsrv_ccache_secdb.c:362
#17 0x000056195ebd2a6e in ccdb_secdb_getbyuuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x56195f13f710, uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V") at src/responder/kcm/kcmsrv_ccache_secdb.c:922
#18 0x000056195ebd9c05 in kcm_ccdb_getbyuuid_send (uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V", client=0x56195f13f710, db=0x56195f13a980, ev=0x56195f12bfd0, mem_ctx=<optimized out>) at src/responder/kcm/kcmsrv_ccache.c:859
#19 kcm_op_get_cache_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, op_ctx=0x56195fb4f1d0) at src/responder/kcm/kcmsrv_ops.c:1562
#20 0x000056195ebcc03a in kcm_cmd_queue_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:201
#21 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#22 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#23 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#24 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#25 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#26 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#27 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#28 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#29 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b1119dfb in geteuid () from /lib64/libc.so.6
#1 0x00007f70b1341c89 in sss_sec_new_req (_req=0x7ffeeb742dd0, client=0, url=<optimized out>, sec_ctx=0x56195f13a860, mem_ctx=0x56195f13f1f0) at src/util/secrets/secrets.c:865
#2 sss_sec_new_req (mem_ctx=0x56195f13f1f0, sec_ctx=0x56195f13a860, url=<optimized out>, client=0, _req=0x7ffeeb742dd0) at src/util/secrets/secrets.c:808
#3 0x000056195ebd128e in secdb_dfl_url_req (mem_ctx=0x56195f13f180, sctx=0x56195f13a860, client=<optimized out>, _sreq=0x7ffeeb742e28) at src/responder/kcm/kcmsrv_ccache_secdb.c:326
#4 0x000056195ebd13ce in ccdb_secdb_get_default_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x56196054daa0) at src/responder/kcm/kcmsrv_ccache_secdb.c:775
#5 0x000056195ebd4cf3 in kcm_ccdb_get_default_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=0x56195f13a980, client=0x56196054daa0) at src/responder/kcm/kcmsrv_ccache.c:546
#6 0x000056195ebd7c75 in kcm_op_get_default_ccache_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, op_ctx=<optimized out>) at src/responder/kcm/kcmsrv_ops.c:1657
#7 0x000056195ebcc03a in kcm_cmd_queue_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:201
#8 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#9 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#10 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#13 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#15 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#16 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

[root@pinega ~]# pstack $(pidof sssd_kcm)
#0 0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1 0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2 0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485760, capacity=10485760) at src/util/sss_iobuf.c:33
#3 0x000056195ebcd3dd in kcm_cmd_send (op=0x56195ebed1e8 <kcm_optable+456>, input=0x56195fb68da8, client=0x56195fb6a7d0, conn_data=0x5619605487c0, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195fb68d00) at src/responder/kcm/kcmsrv_ops.c:117
#4 kcm_cmd_dispatch (req_ctx=0x56195fb68d00, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5 kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6 sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7 kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8 0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9 0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
Further performance improvements were just pushed upstream:

sssd: https://github.com/SSSD/sssd/pull/5546
krb5: https://github.com/krb5/krb5/pull/1165

I will coordinate with the Kerberos maintainer to get these changes in, since the fix requires changes in both sssd and krb5.
*** Bug 1945966 has been marked as a duplicate of this bug. ***
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue; we are sorry that we were not able to fix it before Fedora 32 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Created attachment 1782776 [details]
Strace attached to sssd_kcm on an idle system.

Idle system strace histogram.
Created attachment 1782777 [details]
Strace attached to sssd_kcm on a system with ssh establishment.

strace histogram when 4 ssh sessions are being established.
I came across this issue with F33 using sssd_kcm 2.4.2 and have been able to re-create the "single-core high consumption" issue repeatedly. klist shows 1 expired KCM entry:

$ klist -l
Principal name                 Cache name
--------------                 ----------
xxxxxx.xxxxx                   KCM:1000 (Expired)
$

I noticed on an old but not slow system that CPU consumption by sssd_kcm was pinned at 100% while I was establishing an ssh session, which took 35 seconds. That is bad enough, because usually this completes in under 3 seconds. This decrease in performance occurred recently, after installing and using kinit on a project.

Then I ran this test: in 4 separate terminal windows I started a new ssh session to 4 separate hosts at the same time. The test was revealing: ssh session establishment duration increased to 1 minute 40 seconds.

Attached are 2 strace histograms, each covering a measurement period of approx 1 minute 40 secs. The first is sssd_kcm on an idle system; the second is while the 4 ssh sessions are being established. Comparing them shows a stark difference in the number of system calls.

$ uname -a
Linux burtha-f33 5.8.15-301.fc33.x86_64 #1 SMP Thu Oct 15 16:58:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu | grep name
Model name: Intel(R) Core(TM) i7 CPU X 990 @ 3.47GHz
$
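The attachments themselves are not inlined here, but for reference, a per-syscall histogram like the attached ones can be collected with something along these lines (an assumption; the exact invocation used for the attachments isn't stated):

# Attach to the running KCM responder for ~100 s; -c prints a summary
# histogram of syscall counts and times on detach, -f follows forked children.
sudo timeout 100 strace -c -f -p "$(pidof sssd_kcm)"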
*** Bug 1844982 has been marked as a duplicate of this bug. ***
sssd v2.5.0 includes the patch. I haven't found a krb5 release yet that includes the merged patch; I could be wrong, of course.

@Pavel Březina, how is the discussion with the maintainers going about getting the krb5 patch released?
F34 and Rawhide: 1.19.1-6

The maintainer does not want to add it to Fedora 33.
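To confirm that a given machine has picked up the fixed builds, a plain rpm query is enough (the package names are my assumption of where the patches land):

# Compare against sssd >= 2.5.0 and krb5 1.19.1-6 as discussed above.
rpm -q sssd-kcm krb5-libs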
I'm seeing an issue similar to https://bugzilla.redhat.com/show_bug.cgi?id=1945966

This is filling up my logs on a regular basis with sssd_kcm backtrace dumps. Looking at the process with strace, it appears to be interacting with "/usr/libexec/goa-identity-service" for user 500. I do have a couple of expired logins in Gnome Online Accounts, but I wouldn't expect /var/log to be filled up with SSSD backtraces.

(2021-06-09 12:34:21): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
   **********************  PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   * (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Listing all ccaches
   * (2021-06-09 12:34:21): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
   * (2021-06-09 12:34:21): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/500/ccache/]
   * (2021-06-09 12:34:21): [kcm] [local_db_dn] (0x2000): Local path for [persistent/500/ccache/] is [cn=ccache,cn=500,cn=persistent,cn=kcm]
   * (2021-06-09 12:34:21): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/500/ccache/
   * (2021-06-09 12:34:21): [kcm] [secdb_container_url_req] (0x2000): Created request for URL /kcm/persistent/500/ccache/
   * (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x0400): Listing keys at [persistent/500/ccache/]
   * (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=ccache,cn=500,cn=persistent,cn=kcm] with scope=subtree
   * (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x1000): No secrets found
   * (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Found 0 ccaches
   * (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Listing all caches done
   * (2021-06-09 12:34:21): [kcm] [kcm_op_get_default_ccache_reply_step] (0x2000): The default ccache is 500
   * (2021-06-09 12:34:21): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_DEFAULT_CACHE returned [0]: Success
   * (2021-06-09 12:34:21): [kcm] [kcm_send_reply] (0x2000): Sending a reply
   * (2021-06-09 12:34:21): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 8 bytes of payload
   * (2021-06-09 12:34:21): [kcm] [queue_removal_cb] (0x0200): Removed queue for 500
   * (2021-06-09 12:34:21): [kcm] [kcm_send] (0x2000): All data sent!
   * (2021-06-09 12:34:21): [kcm] [get_client_cred] (0x4000): Client [0x563783a5e1b0][15] creds: euid[500] egid[500] pid[3704] cmd_line['krb5-auth-dialog'].
   * (2021-06-09 12:34:21): [kcm] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x563783a5e1b0][15]
   * (2021-06-09 12:34:21): [kcm] [accept_fd_handler] (0x0400): Client [0x563783a5e1b0][15] connected!
   * (2021-06-09 12:34:21): [kcm] [kcm_input_parse] (0x1000): Received message with length 4
   * (2021-06-09 12:34:21): [kcm] [kcm_get_opt] (0x2000): The client requested operation 20
   * (2021-06-09 12:34:21): [kcm] [kcm_cmd_send] (0x0400): KCM operation GET_DEFAULT_CACHE
   * (2021-06-09 12:34:21): [kcm] [kcm_cmd_send] (0x1000): 0 bytes on KCM input
   * (2021-06-09 12:34:21): [kcm] [kcm_op_queue_send] (0x0200): Adding request by 500 to the wait queue
   * (2021-06-09 12:34:21): [kcm] [kcm_op_queue_get] (0x1000): No existing queue for this ID
   * (2021-06-09 12:34:21): [kcm] [kcm_op_queue_send] (0x1000): Queue was empty, running the request immediately
   * (2021-06-09 12:34:21): [kcm] [kcm_op_get_default_ccache_send] (0x1000): Getting client's default ccache
   * (2021-06-09 12:34:21): [kcm] [ccdb_secdb_get_default_send] (0x2000): Getting the default ccache
   * (2021-06-09 12:34:21): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
   * (2021-06-09 12:34:21): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/500/default]
   * (2021-06-09 12:34:21): [kcm] [local_db_dn] (0x2000): Local path for [persistent/500/default] is [cn=default,cn=500,cn=persistent,cn=kcm]
   * (2021-06-09 12:34:21): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/500/default
   * (2021-06-09 12:34:21): [kcm] [secdb_dfl_url_req] (0x2000): Created request for URL /kcm/persistent/500/default
   * (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x0400): Retrieving a secret from [persistent/500/default]
   * (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=default,cn=500,cn=persistent,cn=kcm] with scope=base
   * (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x1000): No secret found
   * (2021-06-09 12:34:21): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
   ********************** BACKTRACE DUMP ENDS HERE *********************************

I did have an expired krb5 credential as user 500 which I've now removed with kdestroy.
We just released sssd-2.6.0, which contains a KCM fix to make sure that the ccache does not pile up with expired tickets. In particular this change:

"KCM now replace the old credential with new one when storing an updated credential that is however already present in the ccache to avoid unnecessary growth of the ccache."
https://github.com/SSSD/sssd/commit/55c5de2d5c84721ca733e2b593894ffe904cdfb4

I believe this was the last thing that led to the performance issues. It is currently in rawhide and in the F35 update https://bodhi.fedoraproject.org/updates/FEDORA-2021-360425682d. I will also backport it to F34 once it gets some testing.

I'm closing this BZ as rawhide. Feel free to open a new BZ if you hit another performance issue.
I've just installed sssd-kcm-2.6.0-1.fc35.x86_64, rebooted, logged into GNOME and in top sssd_kcm is sitting at 93% CPU, with goa-identity-service at 16%. Is there another element I need to upgrade to see the fix?
How many tickets are there in the output of 'klist -A -C'? It is possible that your ccache is still large (with expired credentials piled up) and you need to kdestroy && kinit.
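In shell terms, the check and the suggested reset look roughly like this (the grep pattern for counting ticket lines is my assumption about klist's output format, where ticket lines start with a date):

klist -A -C | grep -c '^[0-9]'   # rough count of ticket lines across all caches
kdestroy -A                      # destroy every cache in the collection
kinit                            # obtain a fresh TGT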
I'm really not convinced this is fixed. I've cleaned the cache with kdestroy and logged back into GNOME. For the first few days things are fine, but now I've got (from top):

 3493 root      20   0  270396  40164   8220 R  97.3   0.3   1506:04 sssd_kcm

goa-identity-service seems to be waking up frequently. There's not much in klist -A -C:

Ticket cache: KCM:1367600004:74262
Default principal: james

Valid starting     Expires            Service principal
22/10/21 21:30:24  23/10/21 21:23:36  krbtgt/CB.ETTLE
22/10/21 21:30:53  23/10/21 21:23:36  host/sparky.cb.ettle

Ticket cache: KCM:1367600004
Default principal: james

Valid starting     Expires            Service principal
20/10/21 22:39:24  21/10/21 21:49:09  krbtgt/CB.ETTLE
James, can you please attach KCM logs? See https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c41 for how to obtain them.

GOA will still poll kcm every 5 seconds, so there will still be some activity until we implement notifications. But 97% CPU with such a small ccache seems strange.
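For completeness (comment 41 is not reproduced here), raising the KCM debug level usually means adding a [kcm] section to sssd.conf and restarting the responder; this is my assumption of what that comment describes:

# Run as root: bump KCM verbosity to the level used elsewhere in this report.
printf '[kcm]\ndebug_level = 0x3ff0\n' >> /etc/sssd/sssd.conf
systemctl restart sssd-kcm    # logs then appear in /var/log/sssd/sssd_kcm.log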
Created attachment 1837017 [details]
Last 1000 lines of level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35

I reinstalled sssd-kcm and restarted with logging configured as described in comment 41. sssd-kcm was again running at 97-98% CPU, with klist -A -C containing the same number of entries. Within a few minutes the log file was ~170 MiB, so I've only attached the last 1000 lines.
KCM is behaving correctly at this point and the overall performance is good, I think. However, your logs show multiple goa requests within a single second (4-5 in this log), and each request triggers multiple KCM operations (which is expected). We definitely want to implement a notification mechanism that will prevent goa from constantly polling; however, in my opinion we should see only a single request every five seconds, not multiple requests each second.

Please attach a full log; if you zip it, it should shrink significantly.
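A quick way to put a number on the polling rate from such a log (a sketch, assuming the timestamped line format shown earlier in this report and that each client connection logs an accept_fd_handler message):

# Count KCM client connections per second in the debug log.
grep 'accept_fd_handler' sssd_kcm.log \
  | cut -d')' -f1 | tr -d '(' \
  | sort | uniq -c | sort -rn | head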
Created attachment 1837187 [details]
level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35

Full debug log attached.
Thank you James.

22:32:22 - looks like you logged in to the system, since the krb5_child from sssd that is responsible for obtaining a tgt for you talks to kcm (idle)
22:32:27 - goa kicks in; this is the first poll from goa, it finishes fast and looks correct (idle)
22:32:32 - another poll from goa, but it starts a never-ending communication between goa and kcm which goes on to the end of the logs at 22:34:27; goa sends more and more requests and kcm never becomes idle again. This generates the cpu load.

Debarshi, can you please advise how to get logs from goa? Can you point me to the relevant code in goa?
You might already be seeing the logs in the systemd journal. To increase the logging level, you can set the G_MESSAGES_DEBUG environment variable to 'all', and restart the GOA daemons. See:
https://wiki.gnome.org/Projects/GnomeOnlineAccounts/Debugging

In this case, I suspect the goa-identity-service process. GOA is a relatively small code-base. You could start from here:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/tree/master/src/goaidentity
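Concretely, that can look like the following (a sketch; goa-identity-service is normally D-Bus activated, so the running instance has to be killed first, and the binary path is the one mentioned earlier in this bug):

# Run as the logged-in user: relaunch the identity service with verbose logging.
killall goa-identity-service
G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service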
Uhm, I see we get to refresh_identities from the poll timeout, but I can't really follow it without a deeper understanding of the code; there are too many goa internals involved. James, can you please try to get the goa logs?
I have had this happen to me a couple of times. I filed https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79 against GOA and did a bit of investigation. It looks like GOA makes the assumption that you'll only ever have a single ccache for each principal in a ccache collection; when this assumption does not hold, GOA gets confused and starts spamming the KCM with requests.
(In reply to Sam Morris from comment #113)
> I have had this happen to me a couple of times. I filed
> https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79 against GOA
> and did a bit of investigation. It looks like GOA makes the assumption that
> you'll only ever have a single ccache for each principal in a ccache
> collection; when this assumption does not hold, GOA gets confused and starts
> spamming the KCM with requests.

Well, at least now we know who the culprit is; thanks for the pointer!
Thank you for looking into it.
See https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79#note_1603310 for a reproducer. I suggest reopening issue 1844982, or reopening this issue and reassigning it to gnome-online-accounts.
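The linked note is not reproduced here, but given the description in the GOA issue (more than one ccache per principal in the collection), a plausible way to end up in that state is something like this (principal and cache name are placeholders):

# Create two KCM ccaches for the same principal inside one collection.
kinit alice@EXAMPLE.COM                           # first (default) cache
kinit -c "KCM:$(id -u):12345" alice@EXAMPLE.COM   # second, explicitly named cache
klist -l                                          # the collection now lists both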
I seem to be on a NEEDINFO for this, but for a few months now I've not been seeing any excessive CPU drain from sssd-kcm.