Bug 1645624 - sssd_kcm eats CPU (being continuously polled by GOA)
Summary: sssd_kcm eats CPU (being continuously polled by GOA)
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Assignee: Pavel Březina
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: sync-to-jira review
Duplicates: 1670451 1844982 1945966
Depends On:
Blocks:
 
Reported: 2018-11-02 16:42 UTC by Alberto Ruiz
Modified: 2023-05-12 13:40 UTC
CC: 41 users

Fixed In Version: sssd-2.6.0-1.fc35
Clone Of:
Environment:
Last Closed: 2021-10-15 12:38:26 UTC
Type: Bug
Embargoed:


Attachments
Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation (1.08 MB, image/png), 2018-11-07 11:06 UTC, Alberto Ruiz
output of `ssh -vvv` while sssd_kcm socket is enabled (15.81 KB, text/plain), 2020-08-24 15:11 UTC, Gabriel Somlo
sssd_kcm.log collected while opening VM guest over ssh in virt-manager (27.33 KB, application/x-xz), 2020-08-27 13:49 UTC, Gabriel Somlo
output of klist (11.38 KB, text/plain), 2020-09-14 13:20 UTC, Gabriel Somlo
sssd_kcm.log (xzip compressed) with debug_level=0xfff0 (135.04 KB, application/x-xz), 2020-09-14 13:23 UTC, Gabriel Somlo
Strace attached to sssd_kcm on an idle system (1.38 KB, text/plain), 2021-05-13 14:12 UTC, Jeremy Whiting
Last 1000 lines of level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35 (116.90 KB, text/plain), 2021-10-25 21:43 UTC, James
level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35 (332.25 KB, application/x-xz), 2021-10-26 13:17 UTC, James


Links
Github SSSD sssd issue 5349: kcm: poor performance with large number of credentials (closed; last updated 2021-02-16 07:38:20 UTC)

Description Alberto Ruiz 2018-11-02 16:42:17 UTC
Given that sssd lacks a proper cache changes notification system that clients can use to know when to gather data, clients like GOA need to poll sssd constantly.

When using the KCM backend we've noticed that this has had a considerable impact on memory and CPU usage (and therefore battery usage on laptops).

I suggest we switch to "default_ccache_name = KEYRING:" in /etc/krb5.conf.d/kcm_default_ccache until #1551648 is resolved.
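For reference, a minimal sketch of the proposed change (assuming the stock Fedora snippet sets KCM: as the default; file contents may differ across releases):

```
# /etc/krb5.conf.d/kcm_default_ccache (sketch)
[libdefaults]
    # shipped default:
    # default_ccache_name = KCM:
    # proposed revert to the kernel keyring:
    default_ccache_name = KEYRING:persistent:%{uid}
```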

Comment 1 Simo Sorce 2018-11-02 17:01:17 UTC
People who want to revert to keyring for "performance" reasons can simply dnf uninstall sssd-kcm. I would rather address why GOA is hammering on the caches so much that it becomes a performance issue.

Comment 2 Alberto Ruiz 2018-11-02 17:27:25 UTC
(In reply to Simo Sorce from comment #1)
> People who want to revert to keyring for "performance" reasons can simply
> dnf uninstall sssd-kcm. I would rather address why GOA is hammering on the
> caches so much that it becomes a performance issue.

That's a bit unfair. First, all users are impacted by this; whether they notice or care is another matter. We could also argue that people who use Kerberos at all (which I suspect is a small minority of our installed base) could just dnf install sssd-kcm, and so removing it altogether is the way to go.

My understanding is that GOA can't fix this without a notification mechanism from the cache to avoid polling. I will let rishi chime in (added him to needinfo), and indeed, if we can find a way to fix this from GOA, I'm fine with that too.

However, with my Fedora Workstation hat on, this change introduced a performance regression that was not present before and should have been addressed. Most users do not use Kerberos services. If neither upstream (GOA nor SSSD) is willing to do the work to satisfy the other's requirements, I'm leaning towards suggesting what's best for the bulk of the userbase and reverting the change. Both parties have known this was an issue from the start, and two Fedora releases later there has been no progress.

Comment 3 Simo Sorce 2018-11-02 17:52:30 UTC
Unless GOA constantly polls the cache at a very high rate, there is no way there can be a performance/battery issue here.
The question is: why is GOA hammering KCM? It sounds like a bug in GOA (note that I use sssd-kcm regularly and I do not have high CPU usage, so it is GOA causing it somehow).

Comment 4 Alberto Ruiz 2018-11-02 18:02:58 UTC
(In reply to Simo Sorce from comment #3)
> Unless GOA constantly polls the cache at a very high rate, there is no way
> there can be a performance/battery issue here.
> The question is: why is GOA hammering KCM? It sounds like a bug in GOA (note
> that I use sssd-kcm regularly and I do not have high CPU usage, so it is GOA
> causing it somehow).

Well that's exactly what's happening, quoting rishi from https://bugzilla.redhat.com/show_bug.cgi?id=1551648#c28 :

> If the credentials cache doesn't support any notification mechanism, goa-
> identity-service polls the credentials cache every 5 seconds using libkrb5 API
> to stay in sync with any changes done using the krb5 CLI (ie., kinit, kdestroy,
> etc.).

I do not know why it needs to do that, that's for rishi to clarify.
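For context, a poll like the one described above boils down to re-reading the ccache with public libkrb5 calls. A minimal sketch of one such polling pass (an illustration, not the actual GOA code path):

```C
#include <krb5.h>
#include <time.h>

/* Sketch: one polling pass over the default ccache, roughly what a
 * 5-second poll performs. Against KCM, every krb5_cc_next_cred() call
 * turns into a round trip to sssd_kcm. */
static void check_ccache_once(void)
{
    krb5_context ctx;
    krb5_ccache cc;
    krb5_cc_cursor cur;
    krb5_creds creds;

    if (krb5_init_context(&ctx) != 0)
        return;
    if (krb5_cc_default(ctx, &cc) == 0) {
        if (krb5_cc_start_seq_get(ctx, cc, &cur) == 0) {
            while (krb5_cc_next_cred(ctx, cc, &cur, &creds) == 0) {
                /* e.g. compare creds.times.endtime against time(NULL) */
                krb5_free_cred_contents(ctx, &creds);
            }
            krb5_cc_end_seq_get(ctx, cc, &cur);
        }
        krb5_cc_close(ctx, cc);
    }
    krb5_free_context(ctx);
}
```

Run every 5 seconds, even this minimal walk scales with the number of credentials in the cache.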

Comment 5 Debarshi Ray 2018-11-06 13:48:46 UTC
GOA uses the same code paths for KCM and KEYRING caches, but I won't be surprised if GOA's use of libkrb5 isn't 100% accurate, which manifests as weird misbehaviour with KCM caches.

I learnt from Fabiano that KCM has gotten a lot better over time, so I will give it a shot again and see what I can come up with.

Comment 6 Jakub Hrozek 2018-11-06 20:18:41 UTC
The KCM part of libkrb5 is not the best tested part either, so the issue might be anywhere between goa->libkrb5->sssd_kcm.

What would help the most would be to have a reliable reproducer or reproduce the issue on some VM or beaker machine we can poke at.

Comment 7 Alberto Ruiz 2018-11-06 23:00:47 UTC
(In reply to Jakub Hrozek from comment #6)
> The KCM part of libkrb5 is not the best tested part either, so the issue
> might be anywhere between goa->libkrb5->sssd_kcm.
> 
> What would help the most would be to have a reliable reproducer or reproduce
> the issue on some VM or beaker machine we can poke at.

You just need a vanilla F28 or F29 Workstation install, after you login to your default desktop session you should see sssd_kcm eating CPU time.

Comment 8 Jakub Hrozek 2018-11-07 08:29:39 UTC
(In reply to Alberto Ruiz from comment #7)
> (In reply to Jakub Hrozek from comment #6)
> > The KCM part of libkrb5 is not the best tested part either, so the issue
> > might be anywhere between goa->libkrb5->sssd_kcm.
> > 
> > What would help the most would be to have a reliable reproducer or reproduce
> > the issue on some VM or beaker machine we can poke at.
> 
> You just need a vanilla F28 or F29 Workstation install, after you login to
> your default desktop session you should see sssd_kcm eating CPU time.

Nope, not happening here (as it was not happening for the last ~2 years I've been running KCM on my laptop..)

Comment 9 Alberto Ruiz 2018-11-07 11:03:10 UTC
(In reply to Jakub Hrozek from comment #8)
> Nope, not happening here (as it was not happening for the last ~2 years I've
> been running KCM on my laptop..)

What does "here" mean? Is your personal machine a Fedora Workstation 29 vanilla install?

Comment 10 Alberto Ruiz 2018-11-07 11:06:33 UTC
Created attachment 1502929 [details]
Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation

Here's a screenshot of sssd_kcm showing up in top. The process goes up and down in CPU consumption at regular intervals of a couple of seconds.

Comment 11 Jakub Hrozek 2018-11-07 11:09:58 UTC
(In reply to Alberto Ruiz from comment #9)
> (In reply to Jakub Hrozek from comment #8)
> > Nope, not happening here (as it was not happening for the last ~2 years I've
> > been running KCM on my laptop..)
> 
> What does "here" mean? Is your personal machine a Fedora Workstation 29
> vanilla install?

Yes, since about a week ago when I got a new laptop. Before that, I was using a different machine also with Workstation and KCM and didn't see the issue there either.

Comment 12 Alberto Ruiz 2018-11-07 11:23:55 UTC
(In reply to Jakub Hrozek from comment #11)
> (In reply to Alberto Ruiz from comment #9)
> > (In reply to Jakub Hrozek from comment #8)
> > > Nope, not happening here (as it was not happening for the last ~2 years I've
> > > been running KCM on my laptop..)
> > 
> > What does "here" mean? Is your personal machine a Fedora Workstation 29
> > vanilla install?
> 
> Yes, since about a week ago when I got a new laptop. Before that, I was
> using a different machine also with Workstation and KCM and didn't see the
> issue there either.

I can consistently reproduce on a fresh installed VM and my own machine. Would you please create a VM as I suggested originally and check?

Comment 13 Simo Sorce 2018-11-07 14:42:04 UTC
Alberto,
do you configure the machine in any specific way after first boot ?
Do you join an IPA/AD domain ?
Do you set up GOA to automatically fetch some kerberos credentials ?
Do you just *not* use any kerberos authentication ?

I believe the "auth" setup you apply on your vanilla VM may be important to be able to reproduce.

Comment 14 Alberto Ruiz 2018-11-07 16:22:16 UTC
(In reply to Simo Sorce from comment #13)
> Alberto,
> do you configure the machine in any specific way after first boot ?
> Do you join an IPA/AD domain ?
> Do you set up GOA to automatically fetch some kerberos credentials ?
> Do you just *not* use any kerberos authentication ?
> 
> I believe the "auth" setup you apply on your vanilla VM may be important to
> be able to reproduce.

What I meant by vanilla was that I just create a plain /etc/passwd user: no Kerberos auth, no online accounts of any type, nothing. What a typical user would do when installing Fedora for the first time.

Comment 15 Jakub Hrozek 2018-11-07 19:48:31 UTC
(In reply to Alberto Ruiz from comment #14)
> (In reply to Simo Sorce from comment #13)
> > Alberto,
> > do you configure the machine in any specific way after first boot ?
> > Do you join an IPA/AD domain ?
> > Do you set up GOA to automatically fetch some kerberos credentials ?
> > Do you just *not* use any kerberos authentication ?
> > 
> > I believe the "auth" setup you apply on your vanilla VM may be important to
> > be able to reproduce.
> 
> What I meant by vanilla was that I just create a plain /etc/passwd user: no
> Kerberos auth, no online accounts of any type, nothing. What a typical user
> would do when installing Fedora for the first time.

Ah, OK, then I don't have a vanilla setup as the very first thing I do is to configure my redhat.com account..

Comment 16 Ben Cotton 2019-10-31 18:48:55 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 17 Lukas Slebodnik 2019-10-31 21:59:47 UTC
Please file a new BZ with a better subject and a more deterministic, minimal reproducer.

Comment 18 Robbie Harwood 2019-11-01 16:32:34 UTC
#c0 is pretty clear to me.  Reporter is asking for ccache backend to be switched back to KEYRING due to battery drain on laptops caused by having to poll KCM.

Comment 19 Lukas Slebodnik 2019-11-05 08:44:20 UTC
> #c0 is pretty clear to me.  Reporter is asking for ccache backend to be switched back to KEYRING due to battery drain on laptops caused by having to poll KCM.

I do not have any problem with KCM on my laptop, and sssd-kcm works for many others.
Sure, there might be issues in some cases, but it does not make sense to change the default to KEYRING for everyone due to corner cases.
If somebody wants to change the default for everyone, then it needs to be processed in the Fedora way, i.e. the inverse of https://fedoraproject.org/wiki/Releases/27/ChangeSet#Kerberos_KCM_credential_cache_by_default

Removing /etc/krb5.conf.d/kcm_default_ccache or even sssd-kcm is a trivial solution for changing the default back to KEYRING for affected people; see the sketch below.
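For the record, either of the following reverts an affected machine (a sketch; run as root):

```
# stop advertising KCM as the default ccache
rm /etc/krb5.conf.d/kcm_default_ccache
# or remove the KCM service entirely
dnf remove sssd-kcm
```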

Comment 20 Pavel Březina 2019-11-06 12:06:03 UTC
Lukas, you are not a member of the SSSD team and you are not a maintainer of the sssd component. We welcome your contributions, but please refrain from making decisions for us.

I'm reopening the bug. Since we can't reproduce it, I do not think that switching the default is justified, but we should keep tracking it. There were many fixes in KCM over the past year, so perhaps it is no longer an issue?

In the meantime, we should prioritize https://pagure.io/SSSD/sssd/issue/3568

Comment 21 Alexey Tikhonov 2019-11-06 12:23:50 UTC
(In reply to Alberto Ruiz from comment #10)
> Created attachment 1502929 [details]
> Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation
> 
> Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> down in CPU consumption at regular intervals of a couple of seconds.


In this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is negligible, but it doesn't look *that* terrible (especially taking into account that on my F29 laptop gnome-shell usually "eats" 10+% at any given random moment in time...)

Comment 22 Alberto Ruiz 2019-11-06 13:58:03 UTC
(In reply to Alexey Tikhonov from comment #21)
> (In reply to Alberto Ruiz from comment #10)
> > Created attachment 1502929 [details]
> > Screenshot of sssd_kcm showing up in top on a vanilla F29 Workstation
> > 
> > Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> > down in CPU consumption at regular intervals of a couple of seconds.
> 
> 
> In this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is
> negligible, but it doesn't look *that* terrible (especially taking into
> account that on my F29 laptop gnome-shell usually "eats" 10+% at any given
> random moment in time...)

I am not impressed by that argument. 1% of the CPU is outrageously high for a pervasive daemon that is not even used by most people; you are second only to GNOME Shell there, and that has an impact on battery life. My question is: why is it consuming 1%? What value is it bringing to the end user? Is it worth it? Can we improve it?

If every little daemon used GNOME Shell as an excuse to eat obscene amounts of RAM or CPU, the default Fedora install would be unusable. Please take ownership of the impact of your design decisions on the overall system and the end users.

Comment 23 Alexey Tikhonov 2019-11-06 15:19:11 UTC
(In reply to Alberto Ruiz from comment #22)
> > > Here's a screenshot of sssd_kcm showing up in top. The process goes up and
> > > down in CPU consumption at regular intervals of a couple of seconds.
> > 
> > In this screenshot sssd_kcm is "eating" 1% of CPU. I don't say this is
> > negligible, but it doesn't look *that* terrible (especially taking into
> > account that on my F29 laptop gnome-shell usually "eats" 10+% at any given
> > random moment in time...)
> 
> I am not impressed by that argument. 1% of the CPU is outrageously high for
> a pervasive daemon that is not even used by most people

I just wanted to make sure we are on the same page when we talk about "considerable impact on memory and CPU usage".

Well, if this is it, then it is really easily reproducible. Typical top on my F29 VM:

 2930 test      20   0 3071680 252424 105284 S   1.7   6.3   0:09.29 gnome-shell                                
 3074 root      20   0  253684  30388   9100 S   1.0   0.8   0:01.98 sssd_kcm         <-----------                            
 1376 dirsrv    20   0 1716528 355488  70532 S   0.7   8.8   0:56.30 ns-slapd                                   
 3487 test      20   0  569700  43788  32984 S   0.7   1.1   0:01.03 gnome-terminal-                            
 2055 pkiuser   20   0 3759372 509716  30400 S   0.3  12.6   0:30.09 java                                       
 3067 test      20   0  471820  15476  13372 S   0.3   0.4   0:00.14 goa-identity-se                            
 3575 test      20   0  228912   5180   4280 R   0.3   0.1   0:00.99 top                                        
    1 root      20   0  171624  14444   9392 S   0.0   0.4   0:02.22 systemd                                    
   

But reading bz 1551648, it seems to be the root cause...

Comment 24 Simo Sorce 2019-11-06 15:41:10 UTC
I am surprised sssd-kcm uses that much CPU on any given system; do we have any idea what it is doing?
Is there some other process in the system that keeps querying the Kerberos ccache, causing it to do work?
Or do we have some timed tasks we can rearrange to avoid firing when there is nothing to do?

Comment 25 Alexey Tikhonov 2019-11-06 15:52:51 UTC
(In reply to Simo Sorce from comment #24)
> I am surprised sssd-kcm uses that much CPU on any given system; do we have
> any idea what it is doing?
> Is there some other process in the system that keeps querying the Kerberos
> ccache, causing it to do work?

I got the impression that this process was goa (according to https://bugzilla.redhat.com/show_bug.cgi?id=1551648#c27).

Comment 28 Ben Cotton 2020-02-11 15:43:15 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle.
Changing version to 32.

Comment 29 Michal Zidek 2020-04-08 10:25:45 UTC
Hi, with current master I no longer see sssd_kcm using this much CPU. Maybe it was already fixed.

Comment 30 Orion Poplawski 2020-06-01 21:36:46 UTC
Hopefully.  Still seeing with:

sssd-kcm-2.2.3-13.fc31.x86_64
gnome-online-accounts-3.34.1-1.fc31.x86_64

Comment 31 Alexey Tikhonov 2020-06-04 11:19:38 UTC
*** Bug 1670451 has been marked as a duplicate of this bug. ***

Comment 32 Simo Sorce 2020-06-04 18:12:41 UTC
Can we please change the title of this bug to reflect what issue is being tracked?

If the issue really is an RFE to change back to keyring, I'd propose to close this bug as WONTFIX, because KCM is the only sane way to share ccaches with containers and flatpaks, for example.
Namespacing the keyring helps but does not allow you to build a portal for flatpak, for example, which is what is needed in that case.

Comment 33 Simone Caronni 2020-06-05 14:11:00 UTC
Being bitten by it here as well. A bit of history:

At work we use Fedora laptops connected to FreeIPA. In the past (around Fedora 30), we had this issue and removed sssd-kcm to work around it. We retested with Fedora 31 and it was working fine, so we re-enabled sssd-kcm on the laptops.

Now that we have upgraded to Fedora 32, we see it again: sssd_kcm hits 100% CPU, and the only way to make it spin down is to kill goa-identity-service.

Comment 34 Simone Caronni 2020-06-05 14:22:50 UTC
Then, starting goa-identity-service manually like this:

$ G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service

I get hundreds of lines per second on the screen with these entries:

(process:62345): libgoaidentity-DEBUG: 16:21:41.900: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.900: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.912: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.913: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Waiting for next operation
(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Refreshing identities
(process:62345): libgoaidentity-DEBUG: 16:21:41.929: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity unexpired
(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity 'simone.caronni' refreshed
(process:62345): libgoaidentity-DEBUG: 16:21:41.932: GoaIdentityService: credentials for account simone.caronni ensured for 73340 seconds
(process:62345): libgoaidentity-DEBUG: 16:21:41.944: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.944: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.946: GoaIdentityService: identity simone.caronni expired
(process:62345): libgoaidentity-DEBUG: 16:21:41.948: GoaIdentityService: could not ensure credentials for account simone.caronni: GDBus.Error:org.gnome.OnlineAccounts.Error.NotAuthorized: No credentials found in the keyring
(process:62345): libgoaidentity-DEBUG: 16:21:41.957: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.957: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.967: GoaKerberosIdentityManager: refreshing identity 'simone.caronni'
(process:62345): libgoaidentity-DEBUG: 16:21:41.967: GoaKerberosIdentity: expiration alarm fired for identity simone.caronni
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Waiting for next operation
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Refreshing identities

Comment 35 Simone Caronni 2020-06-05 14:25:18 UTC
And goa-daemon --replace keeps on printing this:

(goa-daemon:63090): GoaBackend-WARNING **: 16:24:52.576: secret_password_lookup_sync() returned NULL
goa-daemon-Message: 16:24:52.576: /org/gnome/OnlineAccounts/Accounts/account_1591340703_0: Setting AttentionNeeded to TRUE because EnsureCredentials() failed with: No credentials found in the keyring (goa-error-quark, 4)
goa-daemon-Message: 16:24:52.785: /org/gnome/OnlineAccounts/Accounts/account_1591340703_0: Setting AttentionNeeded to FALSE because EnsureCredentials() succeded

Comment 36 Simone Caronni 2020-06-23 09:15:20 UTC
If I clear the multiple expired credential caches, goa-identity-service stops polling like crazy. It seems it's trying to renew the expired KCM entries instead of using the valid one:

$ klist -l
Principal name                 Cache name
--------------                 ----------
simone.caronni.XX  KCM:46200010:44571 (Expired)
simone.caronni.XX  KCM:46200010:17939
simone.caronni.XX  KCM:46200010:40637 (Expired)
simone.caronni.XX  KCM:46200010:97892 (Expired)

(realm redacted)
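For anyone in the same situation, the expired caches can be removed individually with `kdestroy -c` (a sketch using the cache names from the listing above; substitute your own from `klist -l`):

```
kdestroy -c KCM:46200010:44571
kdestroy -c KCM:46200010:40637
kdestroy -c KCM:46200010:97892
```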

Comment 37 Simone Caronni 2020-06-23 09:15:56 UTC
Related KCM bug: https://bugzilla.redhat.com/show_bug.cgi?id=1780308

Comment 38 Eric Lavarde 2020-06-30 15:46:03 UTC
In case it helps others: I had only one expired entry in `klist -l`, renewed it with `kinit`, and then the sssd-kcm process dropped from 50-100% CPU down to <=1%. The issue happened, AFAICT, for the first time in my long life with Fedora (since 19 or 20, now on 32).
And just because it took me a while to find out: GOA = gnome-online-accounts (I assume).

Comment 39 Pavel Březina 2020-07-08 09:23:17 UTC
This is interesting. So it seems that GOA is polling for all available tickets (even expired ones), which triggers the high CPU usage?

SSSD has implemented a notification mechanism and we have provided a scratch build to the desktop team to see if it fits their needs. But this is something that should be checked in GOA anyway.

Comment 40 Gabriel Somlo 2020-08-11 20:06:14 UTC
Not sure it's the same issue, but I recently noticed sssd-kcm briefly (10-20 seconds) spiking to 100% cpu each time I bring up a remote VM's console in virt-manager (over a remote connection using SSH, with pubkey authentication).

My machine is otherwise self-contained F32 (local account, no need to use kerberos besides whatever is involved with fedpkg). Happy to collect whatever test data might be useful to troubleshoot, please advise.

Comment 41 Pavel Březina 2020-08-24 09:05:05 UTC
I think a 10-20 second spike may be normal, because KCM is not running in the background all the time. It is started only when there is some activity, so this may be just kcm starting and serving the request. But feel free to attach sssd_kcm.log (/var/log/sssd/sssd_kcm.log) after setting debug_level = 0x3ff0 in /etc/sssd/sssd.conf:

[kcm]
debug_level = 0x3ff0

Output of the ssh connection with -vvv may also be helpful, to see if there is supposed to be any interaction with kcm.

Comment 42 Gabriel Somlo 2020-08-24 15:11:08 UTC
(In reply to Pavel Březina from comment #41)
> I think a 10-20 second spike may be normal, because KCM is not running in
> the background all the time. It is started only when there is some activity,
> so this may be just kcm starting and serving the request.

Not only does kcm spike the cpu for 20-ish seconds when I try to open a VM console in virt-manager, the vm console opening itself is delayed for roughly the same time interval (i.e., it's waiting for sssd_kcm to spin its wheels, then give up, before moving on with the connection).

> But feel free to
> attach sssd_kcm.log (/var/log/sssd/sssd_kcm.log) after setting
> debug_level = 0x3ff0 in /etc/sssd/sssd.conf:
> 
> [kcm]
> debug_level = 0x3ff0

There was no sssd.conf file in /etc/sssd/ on my machine. I created it with the content you indicated above, then did "systemctl enable sssd.socket" before opening one of my VMs in virt-manager. After the whole song and dance was over, I shut down both sssd_kcm .service and .socket, but there was only one line in the log file:

    (2020-08-24 10:52:23): [kcm] [orderly_shutdown] (0x0010): SIGTERM: killing children

> Output of the ssh connection with -vvv may be also helpful to see if there
> is supposed to be any interaction with kcm.

It's plausible that the ssh client tries a bunch of auth methods in some order, and that it tries Kerberos auth before falling back to pubkey, which is what I'm using between my workstation and the hypervisor running the VMs. That would explain why sssd_kcm "spinning" delays my VM guest consoles coming up.

But I don't know how I would add "-vvv" to the ssh client session started by virt-manager. I did a plain command-line `ssh -vvv root@server` from the same workstation and am attaching that log. If there's anything else I can collect in terms of "telemetry", please advise.

Thanks!

Comment 43 Gabriel Somlo 2020-08-24 15:11:55 UTC
Created attachment 1712383 [details]
output of `ssh -vvv` while sssd_kcm socket is enabled

Comment 44 Robbie Harwood 2020-08-24 17:13:00 UTC
> I think a 10-20 second spike may be normal, because KCM is not running in the background all the time. It is started only when there is some activity, so this may be just kcm starting and serving the request.

If this is in fact what's happening, it may be a good idea to revisit the decision to have KCM shut itself off.  10-20 seconds of CPU spike on a modern CPU is a *lot* of activity.

Comment 45 Pavel Březina 2020-08-27 09:44:02 UTC
Gabriel, ssh is indeed trying Kerberos authentication (gssapi) before pubkey:

debug1: Next authentication method: gssapi-with-mic
debug1: Unspecified GSS failure.  Minor code may provide more information
Configuration file does not specify default realm


debug1: Unspecified GSS failure.  Minor code may provide more information
Configuration file does not specify default realm
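(As an aside: if only pubkey auth is used against that host, the GSSAPI attempt can be skipped on the client side; a sketch, host name hypothetical:

```
# ~/.ssh/config
Host hypervisor.example.com
    GSSAPIAuthentication no
```
)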

Enable kcm logging like this:

echo -e "\n[kcm]\ndebug_level=0x3ff0" >> /etc/sssd/sssd.conf
chmod 600 /etc/sssd/sssd.conf
systemctl restart sssd
systemctl stop sssd-kcm

Then reproduce the issue, /var/log/sssd/sssd_kcm.log should be populated.

Comment 46 Gabriel Somlo 2020-08-27 13:49:55 UTC
Created attachment 1712821 [details]
sssd_kcm.log collected while opening VM guest over ssh in virt-manager

Comment 47 jstephen 2020-08-27 20:35:10 UTC
(In reply to Gabriel Somlo from comment #42)
> (In reply to Pavel Březina from comment #41)
> > I think a 10-20 second spike may be normal, because KCM is not running in
> > the background all the time. It is started only when there is some
> > activity, so this may be just kcm starting and serving the request.
> 
> Not only does kcm spike the cpu for 20-ish seconds when I try to open a VM
> console in virt-manager, the vm console opening itself is delayed for
> roughly the same time interval (i.e., it's waiting for sssd_kcm to spin its
> wheels, then give up, before moving on with the connection).

Something interesting I noticed when making a connection to a remote QEMU/KVM instance: opening the VM guest console causes virt-manager to start several ssh processes, one after the other, to handle the qemu+ssh remote connection:

$ virt-manager -c 'qemu+ssh://root.test/system?keyfile=id_rsa'

  ## Wait for guest console to load/display

$ ps -ef | egrep 'virt-manager|ssh'

justin      8827    2496  3 16:08 ?        00:00:17 /usr/bin/python3 /usr/share/virt-manager/virt-manager
justin     10126    8827  0 16:15 ?        00:00:00 ssh -l root -i id_rsa -T -e none -- fedora-32.jstephen.test sh -c 'if 'nc' -q 2>&1 | grep "requires an argument" >/dev/null 2>&1; then ARG=-q0;else ARG=;fi;'nc' $ARG -U /var/run/libvirt/libvirt-sock'
justin     10316    8827  0 16:16 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10387    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10423    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10502    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10524    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10559    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10575    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'
justin     10621    8827  0 16:17 ?        00:00:00 ssh -l root fedora-32.jstephen.test sh -c 'nc -q 2>&1 | grep "requires an argument" >/dev/null;if [ $? -eq 0 ] ; then   CMD="nc -q 0 127.0.0.1 5900";else   CMD="nc 127.0.0.1 5900";fi;eval "$CMD";'

An strace of sssd_kcm indicates that these ssh commands are the actual client sending requests through libkrb5 to KCM (as opposed to /usr/libexec/goa-identity-service in the earlier comments). I believe that each forked ssh command attempts client GSSAPI authentication, resulting in increased KCM CPU utilization and an evident virt-manager delay.


(In reply to Gabriel Somlo from comment #46)
> Created attachment 1712821 [details]
> sssd_kcm.log collected while opening VM guest over ssh in virt-manager

One thing I don't quite understand from these provided logs is why KCM receives 900+ GET_CRED_BY_UUID operation requests in less than a 5 second duration.

$ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return' | wc -l
928

Comment 48 Lukas Zapletal 2020-09-01 12:23:55 UTC
I can confirm Eric's comment 38: a single expired entry caused the process to spike up to 100%, and kinit calmed it down; I'm not sure why. Even then, the process keeps returning to 1-2% when the system is idling. It sounds like something is going on even at idle.

Comment 49 Pavel Březina 2020-09-03 10:29:34 UTC
(In reply to jstephen from comment #47)
> One thing I don't quite understand from these provided logs is why KCM
> receives 900+ GET_CRED_BY_UUID operation requests in less than a 5 second
> duration.
> 
> $ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return'
> | wc -l
> 928

We have seen this also in the GOA case (70k requests in two minutes). This is obviously what's spiking the CPU, but we were never able to reproduce it. Perhaps the key to reproducing it is to let credentials expire. Justin, can you try it please?

Comment 50 Robbie Harwood 2020-09-03 14:32:56 UTC
Some Kerberos context if it's helpful:

In both MIT krb5 and Heimdal, GET_CRED_BY_UUID is produced by krb5_cc_next_cred - in other words, during iteration.  Simplified slightly, the code usually resembles:

```C
krb5_cc_cursor cur;
krb5_creds creds;
krb5_error_code ret;

krb5_cc_start_seq_get(context, cache, &cur);
/* each successful krb5_cc_next_cred() call is one GET_CRED_BY_UUID round trip */
while ((ret = krb5_cc_next_cred(context, cache, &cur, &creds)) == 0) {
    do_something(creds);
    krb5_free_cred_contents(context, &creds);
}
krb5_cc_end_seq_get(context, cache, &cur);
```

Unless an internal error occurs, krb5 will just return the status code from KCM as the result code from krb5_cc_next_cred.  If expired credentials are related, maybe the status codes in that case need investigation?

Comment 51 jstephen 2020-09-03 17:32:41 UTC
(In reply to Pavel Březina from comment #49)
> (In reply to jstephen from comment #47)
> > One thing I don't quite understand from these provided logs is why KCM
> > receives 900+ GET_CRED_BY_UUID operation requests in less than a 5 second
> > duration.
> > 
> > $ grep -irn '9:43:3[5-8].*GET_CRED_BY_UUID' sssd_kcm.log | grep -vi 'return'
> > | wc -l
> > 928
> 
> We have seen this also in the GOA case (70k requests in two minutes). This
> is obviously what's spiking the cpu but we were never able to reproduce it.
> Perhaps the key to reproduce it is to let credentials expire. Justin, can
> you try it please?

I have been unable to reproduce this, even with multiple expired TGTs in the ccache (setting the system date forward a day to force ticket expiration). I am consistently seeing goa-identity-service trigger a refresh operation every 5 seconds on my Fedora 32 system.

(process:30194): libgoaidentity-DEBUG: 03:54:13.447: GoaKerberosIdentityManager: Refreshing identities                                                                                                                                        
(process:30194): libgoaidentity-DEBUG: 03:54:18.448: GoaKerberosIdentityManager: Refreshing identities                                                                                                                                        
(process:30194): libgoaidentity-DEBUG: 03:54:23.447: GoaKerberosIdentityManager: Refreshing identities                                                                                                                                        
(process:30194): libgoaidentity-DEBUG: 03:54:28.448: GoaKerberosIdentityManager: Refreshing identities     



(In reply to Simone Caronni from comment #34)
> Then, starting goa-identity-service manually like this:
> 
> $ G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service
> 
> I get hundreds of lines per second on the screen with these entries:
> 
> (process:62345): libgoaidentity-DEBUG: 16:21:41.900:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.900: GoaKerberosIdentity:
> expiration alarm fired for identity simone.caronni
> (process:62345): libgoaidentity-DEBUG: 16:21:41.912:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.913: GoaKerberosIdentity:
> expiration alarm fired for identity simone.caronni
> (process:62345): libgoaidentity-DEBUG: 16:21:41.914:
> GoaKerberosIdentityManager: Waiting for next operation
> (process:62345): libgoaidentity-DEBUG: 16:21:41.914:
> GoaKerberosIdentityManager: Refreshing identities
> (process:62345): libgoaidentity-DEBUG: 16:21:41.929:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.930:
> GoaKerberosIdentityManager: identity unexpired
> (process:62345): libgoaidentity-DEBUG: 16:21:41.930:
> GoaKerberosIdentityManager: identity 'simone.caronni'
> refreshed
> (process:62345): libgoaidentity-DEBUG: 16:21:41.932: GoaIdentityService:
> credentials for account simone.caronni ensured for 73340
> seconds
> (process:62345): libgoaidentity-DEBUG: 16:21:41.944:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.944: GoaKerberosIdentity:
> expiration alarm fired for identity simone.caronni
> (process:62345): libgoaidentity-DEBUG: 16:21:41.946: GoaIdentityService:
> identity simone.caronni expired
> (process:62345): libgoaidentity-DEBUG: 16:21:41.948: GoaIdentityService:
> could not ensure credentials for account simone.caronni:
> GDBus.Error:org.gnome.OnlineAccounts.Error.NotAuthorized: No credentials
> found in the keyring
> (process:62345): libgoaidentity-DEBUG: 16:21:41.957:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.957: GoaKerberosIdentity:
> expiration alarm fired for identity simone.caronni
> (process:62345): libgoaidentity-DEBUG: 16:21:41.967:
> GoaKerberosIdentityManager: refreshing identity
> 'simone.caronni'
> (process:62345): libgoaidentity-DEBUG: 16:21:41.967: GoaKerberosIdentity:
> expiration alarm fired for identity simone.caronni
> (process:62345): libgoaidentity-DEBUG: 16:21:41.969:
> GoaKerberosIdentityManager: Waiting for next operation
> (process:62345): libgoaidentity-DEBUG: 16:21:41.969:
> GoaKerberosIdentityManager: Refreshing identities


The logs from this comment show a refresh happening much more frequently than every 5 seconds:

(process:62345): libgoaidentity-DEBUG: 16:21:41.914: GoaKerberosIdentityManager: Refreshing identities
(process:62345): libgoaidentity-DEBUG: 16:21:41.969: GoaKerberosIdentityManager: Refreshing identities

In the gnome-online-accounts code, there are other ways that a refresh is triggered in addition to the 5-second polling timeout:

   1    301  src/goaidentity/goakerberosidentitymanager.c <<on_identity_unexpired>>
             schedule_refresh (self);
   2    327  src/goaidentity/goakerberosidentitymanager.c <<on_identity_needs_refresh>>
             schedule_refresh (self);
   3   1320  src/goaidentity/goakerberosidentitymanager.c <<on_credentials_cache_changed>>
             schedule_refresh (self);
   4   1326  src/goaidentity/goakerberosidentitymanager.c <<on_polling_timeout>>
             schedule_refresh (self);
   5   1499  src/goaidentity/goakerberosidentitymanager.c <<goa_kerberos_identity_manager_initable_init>>
             schedule_refresh (self);


The first of these callers may be related to this problem:

(process:62345): libgoaidentity-DEBUG: 16:21:41.930: GoaKerberosIdentityManager: identity unexpired

It would be great to see a larger sample of debug logs from goa-identity-service when sssd_kcm is utilizing significant CPU. I hope that will give some clues to continue the investigation, or make it possible to reproduce this issue locally.

  # G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service

Comment 52 Pavel Březina 2020-09-04 09:12:50 UTC
(In reply to Robbie Harwood from comment #50)
> Some Kerberos context if it's helpful:
> 
> In both MIT krb5 and Heimdal, GET_CRED_BY_UUID is produced by
> krb5_cc_next_cred - in other words, during iteration.  Simplified slightly,
> the code usually resembles:
> 
> ```C
> krb5_cc_start_seq_get(context, cache, &cur);
> while ((ret = krb5_cc_next_cred(context, cache, &cur, &creds)) == 0) {
>     do_something(creds);
>     krb5_free_cred_contents(context, &creds);
> }
> krb5_cc_end_seq_get(context, cache, &cur);
> ```
> 
> Unless an internal error occurs, krb5 will just return the status code from
> KCM as the result code from krb5_cc_next_cred.  If expired credentials are
> related, maybe the status codes in that case needs investigation?

So if there is only one ticket in the ccache, krb5_cc_next_cred() should be called only once, and the result code should be what, precisely? KRB5_CC_END?

Comment 53 Robbie Harwood 2020-09-04 18:24:21 UTC
Almost.  You're right that KRB5_CC_END is the correct code to stop iteration when there are no more credentials.  (In practice, any nonzero value will stop iteration, since applications can't really do any recovery there anyway.)

However, in your scenario, krb5_cc_next_cred() gets called twice: the first time it should succeed and return the ticket in the ccache, and the second time it should indicate completion by returning KRB5_CC_END.
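Concretely, for a ccache holding exactly one ticket, the iteration looks like this (a sketch, following the semantics just described):

```C
/* first call: returns 0 and fills creds with the single ticket */
ret = krb5_cc_next_cred(context, cache, &cur, &creds);
krb5_free_cred_contents(context, &creds);
/* second call: no more credentials, returns KRB5_CC_END */
ret = krb5_cc_next_cred(context, cache, &cur, &creds);
```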

Comment 54 Pavel Březina 2020-09-14 12:11:54 UTC
There does not seem to be any problem with the return code since the end of the cursor is not tracked by KCM but by Kerberos. [1]

I wanted to debug it with Lukas (comment #48), but unfortunately he cannot reproduce it anymore.

Gabriel, can you still reproduce it? If yes, could you please set debug_level=0xfff0 (more verbose than what you set previously) and attach new logs? Would a remote debugging session (e.g. over tmate) be possible for you?

[1] https://github.com/krb5/krb5/blob/master/src/lib/krb5/ccache/cc_kcm.c#L787

Comment 55 Pavel Březina 2020-09-14 12:43:59 UTC
Also, the output of klist -l and klist -A -C might be helpful to see how many credentials are in the ccache.

Comment 56 Gabriel Somlo 2020-09-14 13:20:39 UTC
Created attachment 1714776 [details]
output of klist

Comment 57 Gabriel Somlo 2020-09-14 13:23:44 UTC
Created attachment 1714778 [details]
sssd_kcm.log (xzip compressed) with debug_level=0xfff0

Pavel: if the stuff I attached doesn't shed any light on the problem, I'll set up a "clean" VM from the same kickstart I used with my desktop, and configure it to allow access to the same hypervisor I'm connecting to using virt-manager. I'll figure out a way to get you access to it to try for yourself. But it might take a few days, so hopefully the logs contain a better clue :)

Thanks much,
--Gabriel

Comment 58 Ademar Reis 2020-09-22 12:56:55 UTC
Landed on this bug after a google search, so here are my two cents: Fedora 32, fresh install a while ago + regular dnf upgrades. Last week I started noticing my laptop fan spinning all the time and top showed sssd_kcm consuming 100% cpu memory. Rebooted the laptop on the weekend, kept it on yesterday, noticed it again today (Tuesday).

Unfortunately I couldn't debug this further. I removed sssd_kcm since, AFAIK, it's not essential to my desktop use case.

Comment 59 Ademar Reis 2020-09-22 12:58:11 UTC
(In reply to Ademar Reis from comment #58)
> Landed on this bug after a google search, so here are my two cents: Fedora
> 32, fresh install a while ago + regular dnf upgrades. Last week I started
> noticing my laptop fan spinning all the time and top showed sssd_kcm
> consuming 100% cpu memory. Rebooted the laptop on the weekend, kept it on
> yesterday, noticed it again today (Tuesday).

s/cpu memory/cpu/

Comment 60 Pavel Březina 2020-09-23 10:42:51 UTC
Thank you. I can reproduce high CPU load by artificially obtaining lots of service tickets.

The problem is that the KCM protocol implemented by sssd-kcm is not very efficient: when a process needs to iterate over a ccache, one round trip between the process and sssd-kcm is required for every credential. The more credentials there are, the more time it takes sssd-kcm to answer each request. To make things worse, GSSAPI (which ssh uses when trying to connect to a host via Kerberos) goes over each ccache several times (16 times, if I remember correctly), so Gabriel's particular situation requires around 5k requests to sssd-kcm (300 credentials * 16), which creates this heavy load and delay. Also, gnome-online-accounts polls the ccache every 5 seconds, which also takes some time for larger numbers of credentials.
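A rough way to gauge the exposure is to count the entries libkrb5 will walk per iteration pass (a sketch; `klist` output formats vary, so the pattern may need adjusting):

```
# credential lines start with a date, config entries with "config:";
# each entry costs at least one KCM round trip per iteration pass
klist -A -C | grep -cE '^([0-9]{2}/|config:)'
```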

I have identified at least two bottlenecks (ccache encryption and parsing JSON data). Justin is going to work on that.

Gabriel, there are lots of duplicate refresh_time control credentials, which causes your ccache to grow. These are a special type of information stored in the ccache that is used by various mechanisms. However, there should be only one such record; I was not able to reproduce it and get more than one refresh time. Can you please describe how you use Kerberos, i.e. how do you kinit? Can you check after which operation the number of entries in klist -C grows?

Comment 61 Gabriel Somlo 2020-09-23 16:07:20 UTC
Pavel,

All I really ever do with Kerberos is use it to occasionally update RPM packages I maintain in fedora:

```
kinit somlo
```

Over time, I ended up with all that cruft in the output of `klist -C` I showed you before, and the following in the output of just `klist` without `-C`:

```
08/09/2020 20:41:18  08/10/2020 20:39:03  HTTP/src.fedoraproject.org
	renew until 08/16/2020 20:39:03
config: fast_avail(krbtgt/FEDORAPROJECT.ORG) = yes
08/09/2020 20:39:06  08/10/2020 20:39:03  krbtgt/FEDORAPROJECT.ORG
	renew until 08/16/2020 20:39:03
config: pa_type(krbtgt/FEDORAPROJECT.ORG) = 151
08/09/2020 20:42:18  08/10/2020 20:39:03  HTTP/koji.fedoraproject.org
	renew until 08/16/2020 20:39:03
```

Based on your last message, I dug around and found `kdestroy -A`. After running it, sssd-kcm now only goes up to 20% CPU for a few seconds while I bring up a remote VM over a `qemu+ssh://root@hypervisor` connection, not enough to start my laptop's fans whining :)

I just re-ran `kinit somlo`, and all I see in `klist -C` is:

```
Ticket cache: KCM:1000
Default principal: somlo

Valid starting       Expires              Service principal
config: fast_avail(krbtgt/FEDORAPROJECT.ORG) = yes
09/23/2020 11:56:23  09/24/2020 11:56:19  krbtgt/FEDORAPROJECT.ORG
	renew until 09/30/2020 11:56:19
config: pa_type(krbtgt/FEDORAPROJECT.ORG) = 2
```

So the way I see it, I was having two problems:

1. an accumulation of cruft in my ccache, which I cleaned up with `kdestroy -A`
2. the fact that sssd-kcm does suboptimal things when the cache grows unwieldy (which I now realize may be only *partially* its fault...)

Armed with this knowledge, I will now re-enable sssd-kcm.socket on my system (I was only firing it up selectively during RPM maintenance jobs), and use `kdestroy -A` liberally whenever things get noticeably screwed up over time, which I will start noticing whenever I bring up remote VMs in virt-manager.

Hope this makes sense, and gives you some useful hints as to what, if anything, may still be broken...

Thanks,
--Gabriel

Comment 62 Pavel Březina 2020-09-29 12:29:50 UTC
Thanks, I will try to reproduce it with fedora services.

If you don't require the tickets to outlive a reboot, you can set ccache_storage = memory in the [kcm] section; this is much faster, but the tickets are not persistent.
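That is, something like this in /etc/sssd/sssd.conf (a sketch):

```
[kcm]
ccache_storage = memory
```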

Comment 63 Pavel Březina 2020-10-05 12:59:07 UTC
Upstream ticket:
https://github.com/SSSD/sssd/issues/5349

Comment 64 Pavel Březina 2020-10-23 15:28:29 UTC
Upstream PR:
https://github.com/SSSD/sssd/pull/5375

Comment 67 Ken Dreyer 2020-11-30 04:36:48 UTC
I'm hitting this as well. My laptop uses krb5 to authenticate to an el8 NFS server with autofs, and I have dozens of strange entries in root's cache:

[root@phos ~]# klist -A
Ticket cache: KCM:0:75962
Default principal: nfs/phos.ktdreyer.com

Valid starting       Expires              Service principal
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
12/31/1969 17:00:00  12/31/1969 17:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
[tons of identical entries snipped here]

Killing the sssd_kcm PID or stopping the systemd service immediately spawns a new process with 100% CPU again.

I've tried clearing every cache by running "kdestroy -A" as my normal UID and as root, with no effect.

I tried creating /etc/sssd/sssd.conf with "ccache_storage = memory" in a "[kcm]" section, with no effect.

Comment 68 Roberto Cornacchia 2020-11-30 15:21:59 UTC
I had the same issues with gdm and console logins (not with ssh).

In a FreeIPA environment, the only host (FC33) with KCM hangs at login. I suppose it is the only host using KCM because it was the only one installed from scratch with FC32; all the others were upgraded from previous FC releases.
No GOA installed, KCM runs at 100%.

The logs show hundreds of GET_CRED_BY_UUID requests. klist shows indeed hundreds of tickets:

$ klist -A
Ticket cache: KCM:1172000026:72700
Default principal: host/host1.lan.example.com.COM

Valid starting       Expires              Service principal
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
01/01/1970 01:00:00  01/01/1970 01:00:00  Encrypted/Credentials/v1@X-GSSPROXY:
(hundreds of these)

As these are about a host principal, I suppose they were generated by freeIPA, not by user logins.
I don't quite understand in which situation they are generated, and why they are dated at epoch 0.

I do hope PR https://github.com/SSSD/sssd/pull/5375 will improve the situation, but I remain ignorant about what causes the accumulation of (this kind of) tickets in the first place.

Comment 69 Alexander Bokovoy 2020-11-30 15:37:48 UTC
The entries you see in the KCM ccache are normal entries generated by GSS-Proxy. They are encrypted, and the dates of the outer entry (the encrypted ticket) are fake; the real ones are inside these entries and are accessible to the gssproxy process.

Nothing in FreeIPA actually generates these KCM entries. It looks like you have some software that runs under UID 1172000026 and is configured to authenticate with a host keytab through GSS-Proxy. At least, this is not a default for any FreeIPA-related configuration.

Comment 70 Simo Sorce 2020-11-30 16:00:44 UTC
As Alexander says. However, those entries should not pile up: every new entry should replace the previous one, so we need to investigate why this is not happening and they are piling up instead.

Comment 71 Roberto Cornacchia 2020-11-30 16:07:30 UTC
Alexander, Simo, thanks for your clarifications, it helps.
I will look for any process that might be trying to authenticate.

Comment 72 Simo Sorce 2020-11-30 16:57:36 UTC
Roberto,
generally the process is initiated by an NFS client mount and operated by rpc.gssd intercepted by gss-proxy.

Comment 73 Roberto Cornacchia 2020-11-30 17:23:17 UTC
Simo, good point. 
Indeed we have automounted home folders, but they are CephFS mounts, not NFS.

Comment 74 Roberto Cornacchia 2020-11-30 18:15:53 UTC
About "piling up": I removed all existing entries with kdestroy -A (checked klist -A, it was empty).
After restarting sssd-kcm and gdm (gdm was confused), the login from gdm worked perfectly, but immediately after that all entries were back there. That is, they didn't pile up over time, one login is enough.

Comment 75 Roberto Cornacchia 2020-11-30 18:30:09 UTC
Please disregard my last comment. It actually isn't true. After removing all entries it looks good, also after a login. I had issued the command as root (which apparently also needs some cleaning up).

Comment 76 Pavel Březina 2020-12-04 10:49:33 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/5375

* `master`
    * 325de5a5bb97ba026be6d22492bea8ab2605f1b5 - secrets: remove base64 enctype
    * 39277cdadd317b0ab86cdd37de0616bc3eecbe6a - secrets: move attrs names to macros
    * 9c1b51d057390fb5b26151f814a480911cda4cc9 - secrets: default to "plaintext" if "enctype" attr is missing
    * bf127d4f3f42e5b2afe25e512211439bc12a9904 - secrets: fix may_payload_size exceeded debug message
    * c3b314db57c34f64aaca7d74e76a9a955288bb51 - kcm: store credentials list in hash table to avoid cache lookups
    * a370553c90c2ed6df3b94c169c4960a6f978031f - sss_ptr_hash: fix double free for circular dependencies
    * 241ee30da12f564803793ee2b14c1522aabd9235 - kcm: add per-connection data to be shared between requests
    * 194447d35c11eb914f54719491dc5cfaab01b9a1 - kcm: use binary format to store ccache instead of json
    * f17740d831e16449495fff4ec57cc4800aaac83d - kcm: add spaces around operators in kcmsrv_ccache_key.c
    * 15069a647ed6c7f1ead42baa1d421d953c9bc557 - kcm: avoid suppression of cppcheck warning
    * e63a15038ac9c186626e4fdf681a6492031d1e40 - kcm: move sec key parser to separate file so it can be shared
    * 9b1631defdcaa3ea7e87889eb136e7fa935ab4ce - kcm: add json suffix to existing searialization functions
    * b6cc661b9f4162e590137430e945aa321fc13121 - iobuf: add more iobuf functions
    * ed08ba0023e63024bf1c52ae3f6596b9d804d0a5 - secrets: accept binary data instead of string
    * 908c15af9a9f8f0556a588e368e4a0b2e24ace1b - secrets: allow to specify secret's data format
    * 74fdaa64b27e88a6e0f153f8cb59989c572d4294 - kcm: avoid multiple debug messages if sss_sec_put fails
    * b8f28d9aa9d862cf504691c9c3f92941a63fb0a4 - kcm: disable encryption
    * 8edcea8c377e85d037e83065c1904fa4b92c4a39 - kcm: avoid name confusion in GET_CRED_UUID_LIST handlers
    * 47a316c850107f12d406f27abb216e26383dfab7 - kcm: fix typos in debug messages

These patches improve performance significantly, removing bottlenecks in sssd-kcm. It is still rather slow compared to other ccache types, however; further improvements will require changes to the KCM protocol, which still need to be communicated to and accepted by the protocol owners.

I will make them available in Fedora shortly.

Comment 77 Pavel Březina 2020-12-04 10:51:30 UTC
(In reply to Roberto Cornacchia from comment #75)
> Please disregard my last comment. It actually isn't true. After removing all
> entries it looks good, also after a login. I had issued the command as root
> (which apparently also needs some cleaning up).

If you manage to reproduce it, please open a separate bug so we can investigate.

Comment 78 Mauricio Teixeira 2020-12-07 18:32:38 UTC
Pavel, if you could share the Fedora package version when it's available, I would be glad to help with testing.

Comment 79 Pavel Březina 2020-12-08 11:02:06 UTC
These patches are now available in Fedora Rawhide (sssd-2.4.0-5) and Fedora 33 (sssd-2.4.0-3; https://bodhi.fedoraproject.org/updates/FEDORA-2020-075cc71cb8). I will keep this BZ open to track further improvements in the KCM protocol.

Comment 80 Pavel Březina 2020-12-11 09:32:08 UTC
Additional patch.

* `master`
    * 18b98836ef8e337992f0ecb239a32b9c3cedb750 - kcm: decode base64 encoded secret on upgrade path

Comment 81 Pavel Březina 2020-12-11 13:54:58 UTC
New update: https://bodhi.fedoraproject.org/updates/FEDORA-2020-b61158b5ff

The previous update did not correctly handle having existing credentials in the ccache.

Comment 82 Mauricio Teixeira 2020-12-16 15:36:31 UTC
No FC22 package? I feel like if I upgrade to 33 it will not be an exact reproduction of the previous behavior.

Comment 83 Mauricio Teixeira 2020-12-16 15:37:43 UTC
Sorry, I meant to ask for an FC32 package (not 22).

Comment 84 Pavel Březina 2021-01-07 12:25:21 UTC
I didn't plan to include it in F32 because the changes are quite large. However, it builds fine; you can try this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=59111394 If it works well, I will create an update for it.

Comment 85 Ken Dreyer 2021-01-10 00:50:35 UTC
Thanks. I've saved this scratch build (with "koji download-task 59111394") to https://fedorapeople.org/~ktdreyer/bz1645624/koji-task-59111394/ in case koji-gc deletes it.

Comment 86 Johannes Pfau 2021-01-14 14:21:08 UTC
> People that want to revert to keyring for "performance" reasons, can simply dnf uninstall sssd-kcm

Is that supposed to work for AD-enrolled machines? We have some enrolled RHEL 8 multi-user (20-50 users) machines. Changing /etc/krb5.conf.d/kcm_default_ccache to use KEYRING has no effect (kdestroy && su USER && klist still shows a KCM keyring). If I 'dnf remove sssd-kcm', password authentication breaks completely: 'su USER' just fails when given the correct password. Is there anything else that needs to be done after uninstalling sssd-kcm? Clearing the sssd cache or anything similar?

I don't really mind that sssd-kcm burns 100% CPU on one core constantly, but after some uptime of the host it seems to be so overloaded that logins start to fail. I don't really know what in the login process depends on sssd-kcm, but when I tried to debug this problem, I saw endlessly respawning `krb5_child` processes. It seemed like the process was stuck (after "[krb5_child[21032]] [switch_creds] (0x0200): Switch user to ..."), sssd killed it due to a timeout, then restarted the process.

(In our case it's probably not the goa daemon causing the 100% CPU issue, though: we have already blacklisted it, as it spams the systemd log and makes the log unusable (https://bugzilla.redhat.com/show_bug.cgi?id=1601438).)

Comment 87 Pavel Březina 2021-01-14 14:35:51 UTC
(In reply to Johannes Pfau from comment #86)
> > People that want to revert to keyring for "performance" reasons, can simply dnf uninstall sssd-kcm
> 
> Is that supposed to work for AD-enrolled machines? We have some enrolled

Yes, definitely.

> RHEL 8 multi-user (20-50 users) machines. Changing
> /etc/krb5.conf.d/kcm_default_ccache to use KEYRING has no effect (kdestroy
> && su USER && klist still shows a KCM keyring). If I 'dnf remove sssd-kcm'

What do `echo $KRB5CCNAME` and `cat /etc/krb5.conf` say?
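For example, a quick way to check which ccache type is actually in effect (a sketch; run as the affected user):

echo "KRB5CCNAME=$KRB5CCNAME"                          # per-session override, if any
grep -r default_ccache_name /etc/krb5.conf /etc/krb5.conf.d/
klist | head -n 1                                      # "Ticket cache: ..." shows the type in use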

Comment 88 Johannes Pfau 2021-01-14 16:20:30 UTC
Thanks for the suggestions, here's some more information:


> /etc/krb5.conf
I didn't do the initial setup, so I don't know whether this was autogenerated by realm join or whether it has been altered manually...

--------------------------------------------
includedir /etc/krb5.conf.d/

includedir /var/lib/sss/pubconf/krb5.include.d/
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 rdns = false
 default_ccache_name = KEYRING:persistent:%{uid}
 default_realm = KIT.EDU

[realms]
 KIT.EDU = {
 }

[domain_realm]
 kit.edu = KIT.EDU
 .kit.edu = KIT.EDU
--------------------------------------------




> echo $KRB5CCNAME

With sssd-kcm installed:
As a normal user (SSH / public-key login): empty.
After su USER: "KCM:"
As root: "KCM:"

If I uninstall sssd-kcm, I can't really get that information, as logins now fail. But if I comment out the lines in "/etc/krb5.conf.d/kcm_default_ccache", restart sssd and relogin, I still get "KCM:".
Maybe rebooting the machine after uninstalling sssd-kcm would help, but I can't test that, as there are still some active, long-running sessions on that machine that I don't want to interrupt.


With sssd-kcm removed, /var/log/sssd/krb5_child.log contains this:
--------------------------------------------
(2021-01-14 16:55:30): [krb5_child[1841552]] [become_user] (0x0200): Trying to become user [790849740][287600513].
(2021-01-14 16:55:31): [krb5_child[1841552]] [k5c_send_data] (0x0200): Received error code 1432158209
--------------------------------------------

When enabling verbose logging:
--------------------------------------------
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): krb5_child started.
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x1000): total buffer size: [128]
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x0100): cmd [241] uid [790849740] gid [287600513] validate [true] enterprise principal [true] offline [false] UPN [nq5949]
(2021-01-14 16:59:53): [krb5_child[1844742]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:790849740] old_ccname: [KCM:] keytab: [/etc/krb5.keytab]
(2021-01-14 16:59:53): [krb5_child[1844742]] [check_use_fast] (0x0100): Not using FAST.
(2021-01-14 16:59:53): [krb5_child[1844742]] [switch_creds] (0x0200): Switch user to [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_open_ccache_as_user] (0x0400): ccache KCM: is missing or empty
(2021-01-14 16:59:53): [krb5_child[1844742]] [switch_creds] (0x0200): Switch user to [0][0].
(2021-01-14 16:59:53): [krb5_child[1844742]] [old_ccache_valid] (0x0400): Saved ccache KCM: doesn't exist, ignoring
(2021-01-14 16:59:53): [krb5_child[1844742]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(2021-01-14 16:59:53): [krb5_child[1844742]] [become_user] (0x0200): Trying to become user [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x2000): Running as [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(2021-01-14 16:59:53): [krb5_child[1844742]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): Will perform online auth
(2021-01-14 16:59:53): [krb5_child[1844742]] [tgt_req_child] (0x1000): Attempting to get a TGT
(2021-01-14 16:59:53): [krb5_child[1844742]] [get_and_save_tgt] (0x0400): Attempting kinit for realm [KIT.EDU]
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_krb5_expire_callback_func] (0x2000): exp_time: [525869292]
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x2000): Found keytab entry with the realm of the credential.
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x0400): TGT verified using key for [ITIV-WORK5$@KIT.EDU].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(2021-01-14 16:59:53): [krb5_child[1844742]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [nq5949\@KIT.EDU] might not be correct.
(2021-01-14 16:59:53): [krb5_child[1844742]] [get_and_save_tgt] (0x2000): Running as [790849740][287600513].
(2021-01-14 16:59:53): [krb5_child[1844742]] [sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [create_ccache] (0x0020): 991: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [map_krb5_error] (0x0020): 1833: [111][Connection refused]
(2021-01-14 16:59:53): [krb5_child[1844742]] [k5c_send_data] (0x0200): Received error code 1432158209
(2021-01-14 16:59:53): [krb5_child[1844742]] [pack_response_packet] (0x2000): response packet size: [20]
(2021-01-14 16:59:53): [krb5_child[1844742]] [main] (0x0400): krb5_child completed successfully
--------------------------------------------

Even with sssd-kcm removed, I can successfully run kinit (as user, after login with SSH / Public Key):
--------------------------------------------
kinit
Password for nq5949: 
nq5949@itiv-work5 ~ % klist
Ticket cache: KEYRING:persistent:790849740:790849740
Default principal: nq5949

Valid starting     Expires            Service principal
01/14/21 17:13:15  01/15/21 03:13:15  krbtgt/KIT.EDU
	renew until 01/21/21 17:13:12
--------------------------------------------

So kinit seems to accept these settings, but going through pam / su is broken for some reason.


Here's the sssd.conf, for reference:
--------------------------------------------
[sssd]
domains = kit.edu
config_file_version = 2
services = nss, pam

[nss]
filter_groups = root
filter_users = root 
enum_cache_timeout = 86400
entry_cache_nowait_percentage = 10

[domain/kit.edu]
ad_domain = kit.edu
krb5_realm = KIT.EDU
realmd_tags = manages-system joined-with-adcli 
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = False
override_homedir = /home/%u
fallback_homedir = /home/%u@%d
create_homedir = false
access_provider = simple
auth_provider = ad
simple_allow_groups = [...]
dyndns_update = False
ignore_group_members = False
ldap_use_tokengroups = False
ldap_group_nesting_level = 0
entry_cache_timeout = 86400
entry_cache_user_timeout = 86400
entry_cache_group_timeout = 86400
entry_cache_netgroup_timeout = 86400
refresh_expired_interval = 3600
lookup_family_order = ipv4_only
#debug_level = 0x200
debug_level = 0x3ff0
--------------------------------------------

Comment 89 Johannes Pfau 2021-01-18 11:21:58 UTC
So after uninstalling sssd-kcm and rebooting, the error message in krb5_child.log changed from `[sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [111][Connection refused]` to `[sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [-1750600181][No KCM server found]`.

After some more tests, it seems sssd caches the `default_ccache_name` somewhere, somehow. sss_cache -E did not seem to solve the problem, but after I deleted `/var/lib/sss/db/*`, password logins started to work again.

So to summarize (concrete commands are sketched below):
* Remove sssd-kcm
* Stop sssd
* sss_cache -E
* rm -rf /var/lib/sss/db/*
* Start sssd
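As shell commands, that is roughly (a sketch based on the steps above; note that deleting /var/lib/sss/db/* throws away all cached identities and credentials, so users may need to re-authenticate online afterwards):

dnf remove sssd-kcm
systemctl stop sssd
sss_cache -E
rm -rf /var/lib/sss/db/*
systemctl start sssd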


I think the reason logins started to fail with sssd-kcm is also the krb5_cc_cache_match call in krb5_child: if sssd-kcm is overloaded for some reason and does not respond early enough, the krb5_child task will time out and the login will fail. But that's just a guess, I didn't debug that case further.

Comment 90 Pavel Březina 2021-01-18 11:57:55 UTC
SSSD should be able to handle this change, at least when a new user session is created. If not, then it's a bug; I will test this.

Comment 91 Alexander Bokovoy 2021-03-05 07:47:23 UTC
I have the same problem coming back on F33 as well, with sssd 2.4.2-2.fc33.

I installed debuginfo and ran a few pstack runs:

Installed:
  sssd-debuginfo-2.4.2-2.fc33.x86_64                                                sssd-debugsource-2.4.2-2.fc33.x86_64                                                sssd-kcm-debuginfo-2.4.2-2.fc33.x86_64                                               

Complete!
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b0c659f5 in tdb_next_hash_chain () from /lib64/libtdb.so.1
#1  0x00007f70b0c69dc6 in tdb_next_lock () from /lib64/libtdb.so.1
#2  0x00007f70b0c69ffe in tdb_traverse_internal () from /lib64/libtdb.so.1
#3  0x00007f70b0c6a22d in tdb_traverse_read () from /lib64/libtdb.so.1
#4  0x00007f70a31c9cdd in ltdb_traverse_fn () from /usr/lib64/ldb/libldb-tdb-int.so
#5  0x00007f70a3001751 in ldb_kv_search () from /usr/lib64/ldb/libldb-key-value.so
#6  0x00007f70a3003288 in ldb_kv_callback () from /usr/lib64/ldb/libldb-key-value.so
#7  0x00007f70b12f9939 in tevent_common_invoke_timer_handler () from /lib64/libtevent.so.0
#8  0x00007f70b12f9adc in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#9  0x00007f70b12fa2c8 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b0cecc3b in ldb_wait () from /lib64/libldb.so.2
#13 0x00007f70b0cee114 in ldb_search () from /lib64/libldb.so.2
#14 0x00007f70b134215b in sss_sec_list (_num_keys=0x7ffeeb742dc0, _keys=0x7ffeeb742dc8, req=0x56195f13a050, mem_ctx=0x56195f13fc30) at src/util/secrets/secrets.c:964
#15 sss_sec_list (mem_ctx=0x56195f13fc30, req=0x56195f13a050, _keys=0x7ffeeb742dc8, _num_keys=0x7ffeeb742dc0) at src/util/secrets/secrets.c:940
#16 0x000056195ebd1737 in key_by_uuid (mem_ctx=0x56195f13fbc0, sctx=0x56195f13a860, client=<optimized out>, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&", _key=0x7ffeeb742e68) at src/responder/kcm/kcmsrv_ccache_secdb.c:362
#17 0x000056195ebd19b5 in ccdb_secdb_name_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x561960551710, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&") at src/responder/kcm/kcmsrv_ccache_secdb.c:1046
#18 0x000056195ebd5d02 in kcm_ccdb_name_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=0x56195f13a980, client=0x561960551710, uuid=0x56195fb58110 "\205\257\361%\244\066J\221\263=\356݄F\004QB\202\030\020~\303H!\233\063#ɃH^\353\020\310!\a\005\342N\004\202a\277\300\244\006O6\312\374\025\234\275lOY\271\065`\262\212\\m\372[\300\242\274\250jK\354\214.\222b\271\271\022\f\326\364b,J\332N\272\241\250wH\302&") at src/responder/kcm/kcmsrv_ccache.c:952
#19 0x000056195ebdbe47 in kcm_op_get_default_ccache_list_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:1787
#20 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#21 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#22 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#23 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#24 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#25 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#26 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#27 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#28 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b114d6ae in epoll_ctl () from /lib64/libc.so.6
#1  0x00007f70b12f57d7 in epoll_update_event () from /lib64/libtevent.so.0
#2  0x00007f70b12f61e4 in epoll_event_set_fd_flags () from /lib64/libtevent.so.0
#3  0x000056195ebcb65e in kcm_send_reply (req_ctx=0x56195fb5f3e0) at src/responder/kcm/kcmsrv_cmd.c:366
#4  kcm_cmd_request_done (req=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:415
#5  0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#6  0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#7  0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#8  0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#9  0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#11 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#12 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#13 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1  0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2  0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485760, capacity=10485760) at src/util/sss_iobuf.c:33
#3  0x000056195ebcd3dd in kcm_cmd_send (op=0x56195ebed200 <kcm_optable+480>, input=0x56195f139ed8, client=0x56195fb466e0, conn_data=0x56195fb5af70, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195f139e30) at src/responder/kcm/kcmsrv_ops.c:117
#4  kcm_cmd_dispatch (req_ctx=0x56195f139e30, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5  kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6  sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7  kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8  0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9  0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1  0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2  0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485752, capacity=10485752) at src/util/sss_iobuf.c:33
#3  0x000056195ebcd4b4 in kcm_cmd_send (op=<optimized out>, input=0x56195fb5f488, client=0x56195f13ff40, conn_data=<optimized out>, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195fb5f3e0) at src/responder/kcm/kcmsrv_ops.c:161
#4  kcm_cmd_dispatch (req_ctx=0x56195fb5f3e0, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5  kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6  sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7  kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8  0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9  0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b11b0666 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007f70b0c67ad1 in tdb_read () from /lib64/libtdb.so.1
#2  0x00007f70b0c6a041 in tdb_traverse_internal () from /lib64/libtdb.so.1
#3  0x00007f70b0c6a22d in tdb_traverse_read () from /lib64/libtdb.so.1
#4  0x00007f70a31c9cdd in ltdb_traverse_fn () from /usr/lib64/ldb/libldb-tdb-int.so
#5  0x00007f70a3001751 in ldb_kv_search () from /usr/lib64/ldb/libldb-key-value.so
#6  0x00007f70a3003288 in ldb_kv_callback () from /usr/lib64/ldb/libldb-key-value.so
#7  0x00007f70b12f9939 in tevent_common_invoke_timer_handler () from /lib64/libtevent.so.0
#8  0x00007f70b12f9adc in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#9  0x00007f70b12fa2c8 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b0cecc3b in ldb_wait () from /lib64/libldb.so.2
#13 0x00007f70b0cee114 in ldb_search () from /lib64/libldb.so.2
#14 0x00007f70b134215b in sss_sec_list (_num_keys=0x7ffeeb742d70, _keys=0x7ffeeb742d78, req=0x56195fb67ff0, mem_ctx=0x56195f13fc30) at src/util/secrets/secrets.c:964
#15 sss_sec_list (mem_ctx=0x56195f13fc30, req=0x56195fb67ff0, _keys=0x7ffeeb742d78, _num_keys=0x7ffeeb742d70) at src/util/secrets/secrets.c:940
#16 0x000056195ebd1737 in key_by_uuid (mem_ctx=0x56195f13fbc0, sctx=0x56195f13a860, client=<optimized out>, uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V", _key=0x7ffeeb742e18) at src/responder/kcm/kcmsrv_ccache_secdb.c:362
#17 0x000056195ebd2a6e in ccdb_secdb_getbyuuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x56195f13f710, uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V") at src/responder/kcm/kcmsrv_ccache_secdb.c:922
#18 0x000056195ebd9c05 in kcm_ccdb_getbyuuid_send (uuid=0x7ffeeb742e80 "\205\257\361%\244\066J\221\263=\356݄F\004Q\300\360\023_\031V", client=0x56195f13f710, db=0x56195f13a980, ev=0x56195f12bfd0, mem_ctx=<optimized out>) at src/responder/kcm/kcmsrv_ccache.c:859
#19 kcm_op_get_cache_by_uuid_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, op_ctx=0x56195fb4f1d0) at src/responder/kcm/kcmsrv_ops.c:1562
#20 0x000056195ebcc03a in kcm_cmd_queue_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:201
#21 0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#22 0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#23 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#24 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#25 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#26 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#27 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#28 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#29 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b1119dfb in geteuid () from /lib64/libc.so.6
#1  0x00007f70b1341c89 in sss_sec_new_req (_req=0x7ffeeb742dd0, client=0, url=<optimized out>, sec_ctx=0x56195f13a860, mem_ctx=0x56195f13f1f0) at src/util/secrets/secrets.c:865
#2  sss_sec_new_req (mem_ctx=0x56195f13f1f0, sec_ctx=0x56195f13a860, url=<optimized out>, client=0, _req=0x7ffeeb742dd0) at src/util/secrets/secrets.c:808
#3  0x000056195ebd128e in secdb_dfl_url_req (mem_ctx=0x56195f13f180, sctx=0x56195f13a860, client=<optimized out>, _sreq=0x7ffeeb742e28) at src/responder/kcm/kcmsrv_ccache_secdb.c:326
#4  0x000056195ebd13ce in ccdb_secdb_get_default_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=<optimized out>, client=0x56196054daa0) at src/responder/kcm/kcmsrv_ccache_secdb.c:775
#5  0x000056195ebd4cf3 in kcm_ccdb_get_default_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, db=0x56195f13a980, client=0x56196054daa0) at src/responder/kcm/kcmsrv_ccache.c:546
#6  0x000056195ebd7c75 in kcm_op_get_default_ccache_send (mem_ctx=<optimized out>, ev=0x56195f12bfd0, op_ctx=<optimized out>) at src/responder/kcm/kcmsrv_ops.c:1657
#7  0x000056195ebcc03a in kcm_cmd_queue_done (subreq=0x0) at src/responder/kcm/kcmsrv_ops.c:201
#8  0x00007f70b12f6c1a in tevent_common_invoke_immediate_handler () from /lib64/libtevent.so.0
#9  0x00007f70b12f6c3a in tevent_common_loop_immediate () from /lib64/libtevent.so.0
#10 0x00007f70b12fa2bc in epoll_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#13 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#15 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#16 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332
[root@pinega ~]# pstack $(pidof sssd_kcm)
#0  0x00007f70b11b0a71 in __memset_avx2_erms () from /lib64/libc.so.6
#1  0x00007f70b12e3584 in _talloc_zero () from /lib64/libtalloc.so.2
#2  0x000056195ebd96bd in sss_iobuf_init_empty (mem_ctx=<optimized out>, size=10485760, capacity=10485760) at src/util/sss_iobuf.c:33
#3  0x000056195ebcd3dd in kcm_cmd_send (op=0x56195ebed1e8 <kcm_optable+456>, input=0x56195fb68da8, client=0x56195fb6a7d0, conn_data=0x5619605487c0, kcm_data=<optimized out>, qctx=<optimized out>, ev=0x56195f12bfd0, mem_ctx=0x56195fb68d00) at src/responder/kcm/kcmsrv_ops.c:117
#4  kcm_cmd_dispatch (req_ctx=0x56195fb68d00, kctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:381
#5  kcm_recv (cctx=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:553
#6  sss_client_fd_handler (recv_fn=<optimized out>, send_fn=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/common/responder_common.c:1097
#7  kcm_fd_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, ptr=<optimized out>) at src/responder/kcm/kcmsrv_cmd.c:628
#8  0x00007f70b12f6a4d in tevent_common_invoke_fd_handler () from /lib64/libtevent.so.0
#9  0x00007f70b12fa4e7 in epoll_event_loop_once () from /lib64/libtevent.so.0
#10 0x00007f70b12f2f57 in std_event_loop_once () from /lib64/libtevent.so.0
#11 0x00007f70b12f6414 in _tevent_loop_once () from /lib64/libtevent.so.0
#12 0x00007f70b12f650b in tevent_common_loop_wait () from /lib64/libtevent.so.0
#13 0x00007f70b12f2fc7 in std_event_loop_wait () from /lib64/libtevent.so.0
#14 0x00007f70b13a9317 in server_loop () from /usr/lib64/sssd/libsss_util.so
#15 0x000056195ebc9bae in main (argc=<optimized out>, argv=<optimized out>) at src/responder/kcm/kcm.c:332

Comment 92 Pavel Březina 2021-04-06 09:01:29 UTC
Further performance improvements were just pushed upstream:
sssd: https://github.com/SSSD/sssd/pull/5546
krb5: https://github.com/krb5/krb5/pull/1165

I will coordinate with the Kerberos maintainer to get these changes in, since this requires changes in both sssd and krb5.

Comment 93 Pavel Březina 2021-04-08 13:20:23 UTC
*** Bug 1945966 has been marked as a duplicate of this bug. ***

Comment 94 Fedora Program Management 2021-04-29 17:12:32 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 95 Jeremy Whiting 2021-05-13 14:12:39 UTC
Created attachment 1782776 [details]
Strace attached to sssd_kcm on an idle system.

Idle system strace histogram.

Comment 96 Jeremy Whiting 2021-05-13 14:14:17 UTC
Created attachment 1782777 [details]
Strace attached to sssd_kcm on a system with ssh establishment.

strace histogram when x4 ssh sessions are being established.

Comment 97 Jeremy Whiting 2021-05-13 14:16:12 UTC
I came across this issue on F33 using sssd_kcm 2.4.2. I have been able to re-create the "single-core high consumption" issue repeatedly.

klist shows 1 expired KCM entry.
$ klist -l
 Principal name                 Cache name
--------------                 ----------
xxxxxx.xxxxx     KCM:1000 (Expired)
$

I noticed on an old but not slow system that CPU consumption by sssd_kcm was pinned at 100% while I was establishing an ssh session, which took 35 seconds. That is bad enough on its own, because this usually completes in under 3 seconds. This decrease in performance occurred recently, after installing and using kinit on a project.
 Then I ran this test:

In 4 separate terminal windows I started new ssh sessions to 4 separate hosts at the same time.

This test was revealing: ssh session establishment time increased to 1 minute 40 seconds.

Attached are 2 strace histograms, each covering a measurement period of approximately 1 minute 40 seconds. The first is sssd_kcm on an idle system; the second is while the 4 ssh sessions are being established.
Comparing the two shows a stark difference in the number of system calls.
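For reference, a syscall histogram like the attached ones can be collected with strace's summary mode (a sketch; stop strace after the measurement window to print the table):

strace -c -f -p "$(pidof sssd_kcm)"    # press Ctrl-C after ~1 min 40 s for the summary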

$ uname -a
Linux burtha-f33 5.8.15-301.fc33.x86_64 #1 SMP Thu Oct 15 16:58:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu|grep name
Model name:                      Intel(R) Core(TM) i7 CPU       X 990  @ 3.47GHz
$

Comment 98 Debarshi Ray 2021-05-20 18:23:08 UTC
*** Bug 1844982 has been marked as a duplicate of this bug. ***

Comment 99 Jeremy Whiting 2021-05-26 08:38:54 UTC
sssd v2.5.0 includes the patch. I have not found a release of krb5 yet that includes the merged patch; I could be wrong, of course.
@Pavel Březina, how is the discussion going with the maintainers to get the krb5 patch released?

Comment 100 Pavel Březina 2021-05-26 10:24:29 UTC
F34 and Rawhide: 1.19.1-6

The maintainer does not want to add it to Fedora 33.

Comment 101 Steven Ellis 2021-06-09 00:43:24 UTC
I'm seeing an issue similar to https://bugzilla.redhat.com/show_bug.cgi?id=1945966

This is filling up my logs on a regular basis with sssd_kcm backtrace dumps.

Looking at the process with strace, it appears to be interacting with "/usr/libexec/goa-identity-service" for user 500.

I do have a couple of expired logins in GNOME Online Accounts, but I wouldn't expect /var/log to be filled up with SSSD backtraces.


(2021-06-09 12:34:21): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Listing all ccaches
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
   *  (2021-06-09 12:34:21): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/500/ccache/]
   *  (2021-06-09 12:34:21): [kcm] [local_db_dn] (0x2000): Local path for [persistent/500/ccache/] is [cn=ccache,cn=500,cn=persistent,cn=kcm]
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/500/ccache/
   *  (2021-06-09 12:34:21): [kcm] [secdb_container_url_req] (0x2000): Created request for URL /kcm/persistent/500/ccache/
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x0400): Listing keys at [persistent/500/ccache/]
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=ccache,cn=500,cn=persistent,cn=kcm] with scope=subtree
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_list] (0x1000): No secrets found
   *  (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Found 0 ccaches
   *  (2021-06-09 12:34:21): [kcm] [ccdb_secdb_list_send] (0x2000): Listing all caches done
   *  (2021-06-09 12:34:21): [kcm] [kcm_op_get_default_ccache_reply_step] (0x2000): The default ccache is 500
   *  (2021-06-09 12:34:21): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_DEFAULT_CACHE returned [0]: Success
   *  (2021-06-09 12:34:21): [kcm] [kcm_send_reply] (0x2000): Sending a reply
   *  (2021-06-09 12:34:21): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 8 bytes of payload
   *  (2021-06-09 12:34:21): [kcm] [queue_removal_cb] (0x0200): Removed queue for 500 
   *  (2021-06-09 12:34:21): [kcm] [kcm_send] (0x2000): All data sent!
   *  (2021-06-09 12:34:21): [kcm] [get_client_cred] (0x4000): Client [0x563783a5e1b0][15] creds: euid[500] egid[500] pid[3704] cmd_line['krb5-auth-dialog'].
   *  (2021-06-09 12:34:21): [kcm] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x563783a5e1b0][15]
   *  (2021-06-09 12:34:21): [kcm] [accept_fd_handler] (0x0400): Client [0x563783a5e1b0][15] connected!
   *  (2021-06-09 12:34:21): [kcm] [kcm_input_parse] (0x1000): Received message with length 4
   *  (2021-06-09 12:34:21): [kcm] [kcm_get_opt] (0x2000): The client requested operation 20
   *  (2021-06-09 12:34:21): [kcm] [kcm_cmd_send] (0x0400): KCM operation GET_DEFAULT_CACHE
   *  (2021-06-09 12:34:21): [kcm] [kcm_cmd_send] (0x1000): 0 bytes on KCM input
   *  (2021-06-09 12:34:21): [kcm] [kcm_op_queue_send] (0x0200): Adding request by 500 to the wait queue
   *  (2021-06-09 12:34:21): [kcm] [kcm_op_queue_get] (0x1000): No existing queue for this ID
   *  (2021-06-09 12:34:21): [kcm] [kcm_op_queue_send] (0x1000): Queue was empty, running the request immediately
   *  (2021-06-09 12:34:21): [kcm] [kcm_op_get_default_ccache_send] (0x1000): Getting client's default ccache
   *  (2021-06-09 12:34:21): [kcm] [ccdb_secdb_get_default_send] (0x2000): Getting the default ccache
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
   *  (2021-06-09 12:34:21): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/500/default]
   *  (2021-06-09 12:34:21): [kcm] [local_db_dn] (0x2000): Local path for [persistent/500/default] is [cn=default,cn=500,cn=persistent,cn=kcm]
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/500/default
   *  (2021-06-09 12:34:21): [kcm] [secdb_dfl_url_req] (0x2000): Created request for URL /kcm/persistent/500/default
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x0400): Retrieving a secret from [persistent/500/default]
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=default,cn=500,cn=persistent,cn=kcm] with scope=base
   *  (2021-06-09 12:34:21): [kcm] [sss_sec_get] (0x1000): No secret found
   *  (2021-06-09 12:34:21): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
********************** BACKTRACE DUMP ENDS HERE *********************************



I did have an expired krb5 credential as user 500 which I've now removed with kdestroy.

Comment 102 Pavel Březina 2021-10-15 12:38:26 UTC
We just released sssd-2.6.0, which contains a KCM fix to make sure that the ccache does not pile up expired tickets. In particular this change:

"KCM now replace the old credential with new one when storing an updated credential that is however already present in the ccache to avoid unnecessary growth of the ccache."
https://github.com/SSSD/sssd/commit/55c5de2d5c84721ca733e2b593894ffe904cdfb4

I believe this is the last thing that led to the performance issues. It is currently in Rawhide, and the F35 update is https://bodhi.fedoraproject.org/updates/FEDORA-2021-360425682d. I will also backport it to F34 once it gets some testing.

I'm closing this BZ as RAWHIDE. Feel free to open a new BZ if you hit other performance issues.

Comment 103 James 2021-10-15 21:34:33 UTC
I've just installed sssd-kcm-2.6.0-1.fc35.x86_64, rebooted, and logged into GNOME, and in top sssd_kcm is sitting at 93% CPU, with goa-identity-service at 16%. Is there another element I need to upgrade to see the fix?

Comment 104 Pavel Březina 2021-10-18 10:00:35 UTC
How many tickets are there in the output of 'klist -A -C'? It is possible that your ccache is still large (with expired credentials piled up) and you need to kdestroy && kinit.
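For example (a sketch; the grep pattern assumes the date-first ticket lines klist prints here):

klist -A -C | grep -c '^[0-9]'    # rough count of cached tickets across all KCM ccaches
kdestroy -A && kinit              # wipe the whole collection and start fresh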

Comment 105 James 2021-10-22 21:02:54 UTC
I'm really not convinced this is fixed. I've cleaned the cache with kdestroy and logged back into GNOME. For the first few days things are fine, but now I've got (from top):

    3493 root      20   0  270396  40164   8220 R  97.3   0.3   1506:04 sssd_kcm  

goa-identity-service seems to be waking up frequently.

There's not much in klist -A -C:

Ticket cache: KCM:1367600004:74262
Default principal: james

Valid starting     Expires            Service principal
22/10/21 21:30:24  23/10/21 21:23:36  krbtgt/CB.ETTLE
22/10/21 21:30:53  23/10/21 21:23:36  host/sparky.cb.ettle

Ticket cache: KCM:1367600004
Default principal: james

Valid starting     Expires            Service principal
20/10/21 22:39:24  21/10/21 21:49:09  krbtgt/CB.ETTLE

Comment 106 Pavel Březina 2021-10-25 12:06:33 UTC
James, can you please attach KCM logs? See https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c41 for how to obtain them.

GOA will still poll KCM every 5 seconds, so there will still be some activity until we implement notifications. But 97% CPU seems strange with such a small ccache.
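One way to observe the polling rate directly (a sketch, following the strace approach from comment 95; it counts new client connections to sssd_kcm over ten seconds):

timeout 10 strace -f -e trace=accept,accept4 -p "$(pidof sssd_kcm)" 2>&1 | grep -c accept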

Comment 107 James 2021-10-25 21:43:11 UTC
Created attachment 1837017 [details]
Last 1000 lines of level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35

I reinstalled sssd-kcm and restarted with logging configured as described in comment 41. sssd-kcm was again running at 97-98% CPU, with klist -A -C showing the same number of entries. Within a few minutes the log file was ~170 MiB, so I've only attached the last 1000 lines.

Comment 108 Pavel Březina 2021-10-26 09:44:18 UTC
KCM is behaving correctly at this point and the overall performance is good, I think. However, your logs show that there are multiple goa requests within a single second (4-5 in this log), and each request triggers multiple KCM operations (which is expected). We definitely want to implement a notification mechanism that will prevent goa from constantly polling; however, in my opinion we should see only a single request every five seconds, not multiple requests each second. Please attach a full log; if you zip it, it should shrink significantly.

Comment 109 James 2021-10-26 13:17:07 UTC
Created attachment 1837187 [details]
level-0x3ff0 sssd_kcm.log for sssd-kcm-2.6.0-1.fc35

Full debug log attached.

Comment 110 Pavel Březina 2021-11-01 10:25:42 UTC
Thank you James.

22:32:22 - it looks like you logged in to the system, since krb5_child from sssd, which is responsible for obtaining a TGT for you, talks to kcm
(idle)
22:32:27 - goa kicks in; this is the first poll from goa, and it finishes fast and looks correct
(idle)
22:32:32 - another poll from goa, but this one starts an unending exchange between goa and kcm that continues to the end of the log at 22:34:27; goa sends more and more requests and kcm never becomes idle again. This generates the CPU load.
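The request rate can be pulled out of such a log by counting entries per timestamp (a sketch; it assumes the "(YYYY-MM-DD HH:MM:SS)" prefix format shown in comment 101):

grep -o '([0-9-]* [0-9:]*)' sssd_kcm.log | sort | uniq -c | sort -rn | head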

Debarshi, can you please advise how to get logs from goa? Can you point me to the relevant code in goa?

Comment 111 Debarshi Ray 2021-11-01 20:27:59 UTC
You might already be seeing the logs in the systemd journal.

To increase the logging level, you can set the G_MESSAGES_DEBUG environment variable to 'all', and restart the GOA daemons. See:
https://wiki.gnome.org/Projects/GnomeOnlineAccounts/Debugging

In this case, I suspect the goa-identity-service process.
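In practice that boils down to something like this (a sketch; goa-identity-service is normally D-Bus activated, so killing it and running it by hand in a terminal is the quickest way to see its debug output):

pkill -f goa-identity-service
G_MESSAGES_DEBUG=all /usr/libexec/goa-identity-service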

GOA is a relatively small code-base. You could start from here:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/tree/master/src/goaidentity

Comment 112 Pavel Březina 2021-11-02 13:40:18 UTC
Uhm, I see we get to refresh_identities from the poll timeout, but I can't really follow it without a deeper understanding of the code, since there are too many goa internals involved. James, can you please try to get the goa logs?

Comment 113 Sam Morris 2022-06-11 19:24:35 UTC
I have had this happen to me a couple of times. I filed https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79 against GOA and did a bit of investigation. It looks like GOA makes the assumption that you'll only ever have a single ccache for each principal in a ccache collection; when this assumption does not hold, GOA gets confused and starts spamming the KCM with requests.
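A quick way to check whether you are in that state (a sketch; it lists principals that own more than one ccache in the collection, based on the klist -l layout shown in comments 97 and 105):

klist -l | awk 'NR > 2 && NF >= 2 {print $1}' | sort | uniq -d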

Comment 114 Alberto Ruiz 2022-06-16 15:20:44 UTC
(In reply to Sam Morris from comment #113)
> I have had this happen to me a couple of times. I filed
> https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79 against GOA
> and did a bit of investigation. It looks like GOA makes the assumption that
> you'll only ever have a single ccache for each principal in a ccache
> collection; when this assumption does not hold, GOA gets confused and starts
> spamming the KCM with requests.

Well at least now we know who the culprit is, thanks for the pointer!

Comment 115 Pavel Březina 2022-06-17 10:08:05 UTC
Thank you for looking into it.

Comment 116 Sam Morris 2022-11-26 17:27:55 UTC
See https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79#note_1603310 for a reproducer. I suggest re-opening issue 1844982, or reopening this issue and reassigning it to gnome-online-accounts.

Comment 117 James 2023-05-12 13:40:03 UTC
I seem to be on a NEEDINFO for this, but for a few months now I've not been seeing any excessive CPU drain from sssd-kcm.

