Bug 2035496
| Summary: | High CPU utilization by the sssd_kcm process despite RHBA-2021:1666 being deployed | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian McCafferty <bdm> |
| Component: | sssd | Assignee: | Alejandro López <allopez> |
| Status: | CLOSED MIGRATED | QA Contact: | Jakub Vavra <jvavra> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.5 | CC: | aboscatt, atikhono, captainjava, chuchang, grajaiya, jstephen, lslebodn, mzidek, pbrezina, pdwyer, sebastiano.pomata, skudupud, swachira, tscherf |
| Target Milestone: | rc | Keywords: | MigratedToJIRA, Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-18 23:53:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Brian McCafferty
2021-12-24 09:25:40 UTC
(In reply to Brian McCafferty from comment #0)
> - Customer was able to get sssd_kcm running at 100% CPU again. Just "klist"
> seems to do the trick, listing entries very slowly, while sssd_kcm is 100%

Please provide `sssd_kcm.log` with debug_level=9 covering this moment.

```
* (2021-12-16 8:03:28): [kcm] [kcm_cmd_send] (0x0400): KCM operation GET_CRED_LIST
...
* (2021-12-16 8:03:28): [kcm] [sss_sec_get] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=2ce94712-4f77-4c58-939f-6fb4c6ca9b83-...,cn=ccache,cn=...,cn=persistent,cn=kcm] with scope=base
* (2021-12-16 8:03:28): [kcm] [secdb_get_cc] (0x2000): Fetched the ccache
* (2021-12-16 8:03:28): [kcm] [ccdb_secdb_getbyname_send] (0x2000): Got ccache by name
* (2021-12-16 8:03:28): [kcm] [kcm_cmd_done] (0x0040): op receive function failed [105]: No buffer space available
```

It seems the cache size exceeds:

```
#define KCM_REPLY_MAX 10*1024*1024
```

What is the size of /var/lib/sss/secrets/secrets.ldb (when the issue is observed)? Could you please provide (as a private attachment) the output of

```
ldbsearch -H /var/lib/sss/secrets/secrets.ldb "(|(type=simple)(type=binary))" -b "cn=2ce94...,cn=ccache,cn=...,cn=persistent,cn=kcm"
```

where -b "..." corresponds to "Searching for ..." in the log before "No buffer space available"? Plus the corresponding sssd_kcm.log. Merely the `ldbsearch -H /var/lib/sss/secrets/secrets.ldb` output also works for me, but it might be huge.

Thank you for the logs. I can confirm this is an issue: when the entire ccache grows over the default KCM protocol message limit (10 MB), the GET_CRED_LIST operation simply fails with ENOBUFS.

A potential solution could be to check the size of the ccache before reading it and return 'unsupported' in case of limit overflow: this should avoid the CPU consumed reading unused data and allow libkrb5 to fall back to GET_CRED_UUID_LIST gracefully.

Another thing is the *reason* why the ccache grows so huge. Looking at the comments in the case (a lot of "config: refresh_time = ..." lines in the `klist -A -C` output) and based on Pavel's comment, the reason is most probably https://github.com/SSSD/sssd/issues/5775. That particular issue is fixed in sssd-2.6, i.e. in RHEL 8.6. Most probably that patch will help to avoid hitting the issue described in comment 11.

Running kdestroy from time to time will help. Or they can switch to a different ccache type (e.g. keyring).

I can confirm this happens on sssd 2.5.2: sssd_kcm goes to 100% of one CPU and the secrets.ldb file is ~122 MB. The only workaround I've found so far is stopping sssd-kcm.socket, deleting secrets.ldb, and restarting.

Here's the relevant part of sssd_kcm.log:

```
(2022-02-07 10:01:27): [kcm] [kcm_cmd_request_done] (0x0040): KCM operation failed [105]: No buffer space available
(2022-02-07 10:01:27): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [105]: No buffer space available
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
* (2022-02-07 10:01:27): [kcm] [kcm_cmd_request_done] (0x0040): KCM operation failed [105]: No buffer space available
* (2022-02-07 10:01:27): [kcm] [kcm_reply_error] (0x0040): KCM operation returns failure [105]: No buffer space available
```

I can confirm the issue on sssd 2.5.2-2.
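A minimal sketch of the check behind that workaround, assuming POSIX sh; the helper name `kcm_db_oversized` is made up for illustration, and the 10 MiB threshold mirrors the `KCM_REPLY_MAX` define quoted in this thread:

```shell
#!/bin/sh
# Sketch of the workaround discussed above (helper name is hypothetical):
# a secrets.ldb larger than sssd's KCM_REPLY_MAX (10 MiB) can no longer fit
# in a single GET_CRED_LIST reply, which is when sssd_kcm starts failing
# with ENOBUFS and burning CPU.
KCM_REPLY_MAX=$((10 * 1024 * 1024))   # mirrors "#define KCM_REPLY_MAX 10*1024*1024"

# Succeeds (exit 0) when the given database file exceeds the reply limit.
kcm_db_oversized() {
    size=$(wc -c < "$1")
    [ "$size" -gt "$KCM_REPLY_MAX" ]
}

# If oversized, the reset used by commenters (run as root; this destroys all
# cached Kerberos tickets, so users have to kinit again):
if kcm_db_oversized /var/lib/sss/secrets/secrets.ldb 2>/dev/null; then
    systemctl stop sssd-kcm.socket sssd-kcm.service
    rm -f /var/lib/sss/secrets/secrets.ldb
    systemctl start sssd-kcm.socket
fi
```

Note that this only clears the symptom; with the refresh_time leak from issue 5775 unfixed, the database will grow back over time.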
Debug output:

```
* (2022-03-23 14:54:04): [kcm] [kcm_op_queue_get] (0x1000): No existing queue for this ID
* (2022-03-23 14:54:04): [kcm] [kcm_op_queue_send] (0x1000): Queue was empty, running the request immediately
* (2022-03-23 14:54:04): [kcm] [kcm_op_get_default_ccache_send] (0x1000): Getting client's default ccache
* (2022-03-23 14:54:04): [kcm] [ccdb_secdb_get_default_send] (0x2000): Getting the default ccache
* (2022-03-23 14:54:04): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
* (2022-03-23 14:54:04): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/1000/default]
* (2022-03-23 14:54:04): [kcm] [local_db_dn] (0x2000): Local path for [persistent/1000/default] is [cn=default,cn=1000,cn=persistent,cn=kcm]
* (2022-03-23 14:54:04): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/1000/default
* (2022-03-23 14:54:04): [kcm] [secdb_dfl_url_req] (0x2000): Created request for URL /kcm/persistent/1000/default
* (2022-03-23 14:54:04): [kcm] [sss_sec_get] (0x0400): Retrieving a secret from [persistent/1000/default]
* (2022-03-23 14:54:04): [kcm] [sss_sec_get] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=default,cn=1000,cn=persistent,cn=kcm] with scope=base
* (2022-03-23 14:54:04): [kcm] [local_decrypt] (0x2000): Decrypting with masterkey
* (2022-03-23 14:54:04): [kcm] [ccdb_secdb_get_default_send] (0x2000): Got the default ccache
* (2022-03-23 14:54:04): [kcm] [ccdb_secdb_name_by_uuid_send] (0x2000): Translating UUID to name
* (2022-03-23 14:54:04): [kcm] [sss_sec_map_path] (0x1000): Mapping prefix /kcm/
* (2022-03-23 14:54:04): [kcm] [kcm_map_url_to_path] (0x1000): User-specific KCM path is [/kcm/persistent/1000/ccache/]
* (2022-03-23 14:54:04): [kcm] [local_db_dn] (0x2000): Local path for [persistent/1000/ccache/] is [cn=ccache,cn=1000,cn=persistent,cn=kcm]
* (2022-03-23 14:54:04): [kcm] [sss_sec_new_req] (0x1000): Local DB path is persistent/1000/ccache/
* (2022-03-23 14:54:04): [kcm] [secdb_container_url_req] (0x2000): Created request for URL /kcm/persistent/1000/ccache/
* (2022-03-23 14:54:04): [kcm] [sss_sec_list] (0x0400): Listing keys at [persistent/1000/ccache/]
* (2022-03-23 14:54:04): [kcm] [sss_sec_list] (0x2000): Searching for [(|(type=simple)(type=binary))] at [cn=ccache,cn=1000,cn=persistent,cn=kcm] with scope=subtree
* (2022-03-23 14:54:04): [kcm] [local_dn_to_path] (0x2000): Secrets path for [cn=592860fe-e861-4f91-8537-e026138f66eb-1000,cn=ccache,cn=1000,cn=persistent,cn=kcm] is [592860fe-e861-4f91-8537-e026138f66eb-1000]
* (2022-03-23 14:54:04): [kcm] [sss_sec_list] (0x1000): Returning 1 secrets
* (2022-03-23 14:54:04): [kcm] [key_by_uuid] (0x2000): Found key 592860fe-e861-4f91-8537-e026138f66eb-1000
* (2022-03-23 14:54:04): [kcm] [ccdb_secdb_name_by_uuid_send] (0x2000): Got ccache by UUID
* (2022-03-23 14:54:04): [kcm] [kcm_op_get_default_ccache_reply_step] (0x2000): The default ccache is 1000
* (2022-03-23 14:54:04): [kcm] [kcm_cmd_done] (0x0400): KCM operation GET_DEFAULT_CACHE returned [0]: Success
* (2022-03-23 14:54:04): [kcm] [kcm_send_reply] (0x2000): Sending a reply
* (2022-03-23 14:54:04): [kcm] [kcm_output_construct] (0x1000): Sending a reply with 9 bytes of payload
* (2022-03-23 14:54:04): [kcm] [queue_removal_cb] (0x0200): Removed queue for 1000
* (2022-03-23 14:54:04): [kcm] [kcm_send] (0x2000): All data sent!
* (2022-03-23 14:54:04): [kcm] [kcm_input_parse] (0x1000): Received message with length 9
* (2022-03-23 14:54:04): [kcm] [kcm_get_opt] (0x2000): The client requested operation 8
* (2022-03-23 14:54:04): [kcm] [kcm_cmd_send] (0x0400): KCM operation GET_PRINCIPAL
* (2022-03-23 14:54:04): [kcm] [kcm_cmd_send] (0x1000): 5 bytes on KCM input
* (2022-03-23 14:54:04): [kcm] [kcm_op_queue_send] (0x0200): Adding request by 1000 to the wait queue
* (2022-03-23 14:54:04): [kcm] [kcm_op_queue_get] (0x1000): No existing queue for this ID
```

Hello everyone,

IHAC who faces the same issue (case: 03157166) and sees sssd_kcm taking up one CPU core almost all the time. This is causing problems on his host, which has only 2 vCPUs.

I have collected the debug logs, but they don't seem to contain anything other than the following.

```
(2022-04-07 8:40:03): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
(2022-04-07 8:40:25): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
(2022-04-07 8:43:10): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
(2022-04-07 8:43:33): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
(2022-04-07 8:45:03): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
(2022-04-07 8:45:25): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
```

Any suggestions or steps to work on the issue?

Regards
Sreekanth

(In reply to sreekanth from comment #21)
> Hello everyone,
>
> IHAC who faces the same issue (case: 03157166), sees sssd_kcm taking up 1
> CPU core almost all the time.
> This is causing issues with his host where he has only 2 vCPUs.
>
> I have collected the debug logs, but the logs don't seem to contain anything
> other than the following.
>
> (2022-04-07 8:40:03): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
> (2022-04-07 8:40:25): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
> (2022-04-07 8:43:10): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
> (2022-04-07 8:43:33): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
> (2022-04-07 8:45:03): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
> (2022-04-07 8:45:25): [kcm] [sec_get] (0x0040): Cannot retrieve the secret [2]: No such file or directory
>
> Any suggestions or steps to work on the issue?

What is the SSSD / RHEL version? Please enable 'debug_level = 9' for [kcm] and provide sssd_kcm.log.

rhel 9.1 using sssd-kcm 2.7.3 ... I was able to reproduce it consistently by starting a tiger-vnc session using a desktop manager (.vnc/xstartup). I have tried gnome, mate, and xfce4: as soon as you kill the vncserver, the CPU drops back down, but as soon as you start the vncserver using one of these desktop managers, it goes to 100% and stays there.

Hi,

(In reply to rhel_beast from comment #46)
> rhel 9.1 using sssd-kcm 2.7.3 ... I was able to reproduce it consistently by
> starting a tiger-vnc using a Desktop Manager (.vnc/xstartup).

I wonder if you hit the sssd_kcm issue described in comment 11 or https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79

Is there "No buffer space available" in /var/log/sssd/sssd_kcm.log? What is the size of /var/lib/sss/secrets/secrets.ldb when the issue occurs?

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.
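The 'debug_level = 9' request that recurs in this thread can be scripted roughly as follows. This is a sketch, not a supported tool: the function name `enable_kcm_debug` is made up for illustration, and only the simple case where sssd.conf has no [kcm] section yet is handled.

```shell
#!/bin/sh
# Hedged sketch: append a [kcm] section with debug_level = 9 to an sssd.conf
# that does not already contain one. Run against /etc/sssd/sssd.conf as root.
enable_kcm_debug() {
    conf="$1"
    if grep -q '^\[kcm\]' "$conf" 2>/dev/null; then
        # A [kcm] section already exists; adjust its debug_level by hand.
        echo "[kcm] section already present in $conf" >&2
    else
        printf '\n[kcm]\ndebug_level = 9\n' >> "$conf"
    fi
}
```

After applying it to /etc/sssd/sssd.conf, restart the service with `systemctl restart sssd-kcm.service` and reproduce the problem; the verbose output then lands in /var/log/sssd/sssd_kcm.log.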
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.