Bug 2213576
| Summary: | Obsolete entries in disk cache can interfere with getpwuid() lookups |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | sssd |
| Version: | 8.8 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | Dan Astoorian <djast> |
| Assignee: | Pavel Březina <pbrezina> |
| QA Contact: | sssd-qe |
| CC: | aboscatt, atikhono, pbrezina |
| Target Milestone: | rc |
| Whiteboard: | sync-to-jira |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Last Closed: | 2023-07-06 13:10:09 UTC |
Hi, could you please attach the entire sssd_nss.log covering "getent passwd 12345" for 'test2'?

Created attachment 1969791 [details]
(redacted) sssd_nss.log

I've redacted the domain and replaced the actual user's uid with "12345", but here is the relevant sssd_nss.log.
(In reply to Dan Astoorian from comment #2)
> Created attachment 1969791 [details]
> (redacted) sssd_nss.log
>
> I've redacted the domain, and replaced the actual user's uid with "12345",
> but here's the relevant sssd_nss.log.

Hm... I hoped it would tell us how the cache got into this situation, but in your log there are already 2 objects in the cache.

My guess is that after a delete/create and a lookup, "[old] object found but needs to be refreshed" fires, and then 'sssd_be[$domain]' adds a new record with a new DN (but the same UID). If that's the case, then the backend should probably search for and delete existing records with a matching uid (or the cache db should have a unique constraint on uid?), but frankly, deleting/creating an object with a new DN and a matching UID is quite a corner case.

Also note that while the cache entry isn't expired, 'getent' will keep returning the old (stale) record, so you would need to purge the local cache anyway.

Note that it is (unfortunately) generally possible to have multiple user records in AD with the same uidNumber field simultaneously, although it's difficult to view this as anything but a misconfiguration; in such cases, sssd_nss refusing to choose between them is arguably the correct behaviour. This suggests that simply introducing a unique constraint on uidNumber in the cache might have unintended consequences; invalidating cached records that are no longer present is probably the safer approach.

For what it's worth, in the environment in which I encountered the issue, user "test1" had been deleted more than 4 months before user "test2" was created re-using its uid, which is why I was surprised that the record had not expired from the disk cache.
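The "safer approach" discussed above - preferring a non-expired record when a uid lookup matches more than one cached entry, instead of failing outright - can be sketched roughly as follows. This is a hypothetical Python model, not SSSD's sysdb code; the record fields (`dn`, `uidNumber`, `dataExpireTimestamp`) merely mimic what ldbsearch shows in cache_DOMAIN.ldb.

```python
import time

def lookup_by_uid(records, uid, now=None):
    """Resolve a uid against a list of cached records.

    Sketch of the behaviour discussed above: when several records match,
    drop the expired ones before declaring the lookup ambiguous.
    """
    now = time.time() if now is None else now
    matches = [r for r in records if r["uidNumber"] == uid]
    if len(matches) <= 1:
        return matches[0] if matches else None
    # More than one match: ignore records whose cache entry has expired.
    live = [r for r in matches if r["dataExpireTimestamp"] > now]
    if len(live) == 1:
        return live[0]
    # Several live records (genuine duplicate uids in AD) stay ambiguous.
    raise LookupError("Multiple objects were found when only one was expected")

records = [
    {"dn": "name=test1@example.com,cn=users", "uidNumber": 12345,
     "dataExpireTimestamp": 100},      # stale: test1 was deleted in AD
    {"dn": "name=test2@example.com,cn=users", "uidNumber": 12345,
     "dataExpireTimestamp": 10_000},   # fresh record for test2
]
print(lookup_by_uid(records, 12345, now=5_000)["dn"])
# → name=test2@example.com,cn=users
```

Note that when both records are still live (the real-duplicates-in-AD case), this sketch still refuses to choose, matching the behaviour the reporter argues is correct.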
For another unrelated user that had been deleted from AD around the same time as "test1", I observed this:

```
# ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb '(uidNumber=22222)' |& grep returned
# returned 1 records
# getent passwd 22222
# ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb '(uidNumber=22222)' |& grep returned
# returned 0 records
```

I.e., searching for the uidNumber of a user who has been deleted from AD causes the record to be deleted from the disk cache, but only if the uid has not already been reassigned to another user. Note that "getent passwd 22222" correctly did not return the stale entry in this case.

I have not confirmed whether the same issue exists with groups/gids, but it might be worth investigating whether a corresponding fix is needed for getgrgid() as well.

Irrespective of any deficiencies in the caching strategy, I don't know whether it might be practical to add options to sss_cache to purge the disk cache(s) (rather than stopping the daemon and deleting the files manually, which is so far the only remedy I'm aware of), but that's beyond the scope of this bug report.

Hi,

(In reply to Dan Astoorian from comment #4)
> For what it's worth, in the environment in which I encountered the issue,
> user "test1" had been deleted more than 4 months before user "test2" was
> created re-using its uid, which is why I was surprised that the record had
> not expired from the disk cache.

It's actually "expired" but still kept in the cache (as expired) for various reasons (for example, to serve NSS calls in case the backend is offline; it also might be costly to actually delete records from the db). The `entry_cache_timeout` (and `entry_cache_*_timeout`) config options define when a record is *considered* expired, but they don't actually purge the record from the db. There is the `ldap_purge_cache_timeout` config option (disabled by default) that is somewhat relevant.
I'm not sure if it purges everything expired (probably not), but reading `man sssd-ldap` - "the cleanup task is required in order to detect entries removed from the server" - hints it could help in your specific use case.

> Irrespective of any deficiencies in the caching strategy, I don't know
> whether it might be practical to add options to sss_cache to purge the disk
> cache(s) (rather than stopping the daemon and deleting the files manually,
> which is so far the only remedy I'm aware of)

FWIW, there is `sssctl cache-remove -p -s`.

If the record is expired in the on-disk cache, does this imply that sssd_nss was not checking the expiration status of the records before issuing the "Multiple objects were found when only one was expected" error? Would that be an adequate fix?

(In reply to Dan Astoorian from comment #6)
> If the record is expired in the on-disk cache, does this imply that sssd_nss
> was not checking the expiration status of the records before issuing the
> "Multiple objects were found when only one was expected" error?

My guess is:

- 'sssd_nss' detects the entry is expired and asks 'sssd_be[$domain]' to refresh it
- 'sssd_be', instead of refreshing, adds a new entry (because of the new DN) and tells 'sssd_nss' "done"
- 'sssd_nss' discovers 2 entries and bails out with an error

Do you mean 'sssd_nss' could ignore expired entries at the last step?

I'm afraid I'm not intimately familiar with the protocol between sssd_nss and sssd_be, but when the traceback started, there would have been no records in the memory cache and two records in the disk cache, one of which would have been expired, since user "test1" had been deleted from AD months ago. I presume /var/lib/sss/db/cache_DOMAIN.ldb is sssd_be[DOMAIN]'s disk cache, outside the scope of sssd_nss?
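For reference, enabling the cleanup task mentioned above is a one-line configuration change. A minimal, hypothetical sssd.conf fragment - the domain name and the interval value are placeholders, and `man sssd-ldap` should be consulted for the exact semantics:

```ini
# Hypothetical sssd.conf fragment. ldap_purge_cache_timeout is disabled by
# default; the cleanup task it enables is what detects entries removed from
# the server. 86400 s (daily) is an illustrative value, not a recommendation.
[domain/example.com]
id_provider = ad
ldap_purge_cache_timeout = 86400
```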
If sssd_nss is asking sssd_be[$domain] for records matching uidNumber=12345, can/should sssd_be[$domain] ignore expired matching entries in its disk cache, or at least try to refresh them (which would presumably cause them to be removed from the disk cache if AD indicates the expired records have been deleted)?

> Steps to Reproduce:
> 1. Create a user "test1" in Active Directory with a specific UID, e.g., uidNumber: 12345
> 2. Use this user entry on a Linux client that has joined the realm which uses this AD; confirm (e.g., via ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb [...]) that the entry has been cached on disk.
> 3. Delete the "test1" user in Active Directory.
> 4. Create a new user "test2" in Active Directory which re-uses the same UID (uidNumber: 12345). Use this user entry on the same Linux client. Observe that "ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb '(uidNumber=12345)'" now returns 2 records.
> 5. On the Linux client, restart sssd and/or clear the memory cache via "sss_cache -E"
> 6. Attempt to look up the uid via "getent passwd 12345"

Hi Dan, I tried to reproduce it with the steps above, also with expiring the cached test1 between steps 2 and 3, but so far I have not been able to reproduce this issue. I always have only a single user cached; fetching test2 overrides test1.

- Are you able to reproduce this at will with the steps above, or was it a one-time event?
- If you can reproduce it, can you attach the domain log from the reproducer?
- Can you attach your sssd.conf?

We actually handle this situation in the code, and the last modifications were 10 and 7 years ago, so it is included in sssd-2.8:

https://github.com/SSSD/sssd/blob/2fd5374fdf78bc7330bd9e6f3b86bec86bdf592b/src/db/sysdb_ops.c#L1929-L1936
https://github.com/SSSD/sssd/blob/2fd5374fdf78bc7330bd9e6f3b86bec86bdf592b/src/db/sysdb_ops.c#L2666-L2686

We've reproduced the issue many times on multiple workstations with many different pairs of users.
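Loosely, the duplicate handling the linked sysdb_ops.c lines are said to implement can be modelled like this. This is a Python sketch under the assumption that storing a user first removes any cached record that collides on uidNumber; it is not the actual C code, and the record shape is invented for illustration.

```python
def store_user(cache, new_record):
    """Store a user record, evicting stale records that collide on uidNumber.

    Sketch of the expected cache behaviour: a reused uid replaces the old
    entry instead of accumulating a duplicate.
    """
    uid = new_record["uidNumber"]
    # Drop records that carry the same uidNumber under a different DN.
    cache[:] = [r for r in cache
                if not (r["uidNumber"] == uid and r["dn"] != new_record["dn"])]
    # Replace an existing record with the same DN, or append a new one.
    for i, r in enumerate(cache):
        if r["dn"] == new_record["dn"]:
            cache[i] = new_record
            break
    else:
        cache.append(new_record)

cache = [{"dn": "name=test1@example.com", "uidNumber": 12345}]
store_user(cache, {"dn": "name=test2@example.com", "uidNumber": 12345})
print(len(cache), cache[0]["dn"])
# → 1 name=test2@example.com
```

Under this model the two-records-for-one-uid state seen in the bug should be impossible, which is consistent with the developer's inability to reproduce it and suggests some code path bypasses this check.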
It may be relevant that our Active Directory instance is a Microsoft Windows domain, and users are added/deleted on the Windows side, not via sssd. I'm therefore not sure that the code from sysdb_ops.c you cite is relevant to our setup, as I presume that sysdb_store_new_user() and sysdb_add_user() are not invoked if sssd is not the entity making the changes to Active Directory. Can you reproduce the issue if you delete/create the users from a client machine distinct from the one where you look up the users/uids?

The clients were joined to the domain via "realm join"; the sssd.conf file on the clients contains:

```
[sssd]
domains = [redacted]
config_file_version = 2
services = nss, pam

[domain/[redacted]]
ad_domain = [redacted]
krb5_realm = [redacted]
realmd_tags = manages-system joined-with-adcli
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = False
use_fully_qualified_names = False
fallback_homedir = /home/%u@%d
access_provider = ad
```

The sssd_nss.log from the client is attached in Comment #2; if that isn't what you mean by "domain log from the reproducer," please let me know specifically which file you need, keeping in mind that the Active Directory backend is not managed by sssd in our environment.

Hi, the sysdb pieces are relevant; this is the code that performs the caching. I.e., sysdb_store_new_user/add_user is invoked to add a user found in AD to the SSSD cache - the file you searched with ldbsearch. So I tried to reproduce it precisely the way you do - i.e., by adding/deleting users in Active Directory (not through SSSD; we don't even have such functionality). I used a similar config to yours, also created by realmd; I just did not try it with cache_credentials. I will try that.

The domain log I requested is /var/log/sssd/sssd_$domainname.log. This log file contains information about the process that actually stores the users and writes to the cache.
sssd_nss.log comes from the process that only reads the cache but does not write to it. So I want a better understanding of how it is possible that two user entries were written instead of the old one being replaced - hence the request for the domain log file.

I have not been successful in reproducing the condition of multiple records for the same uid in the ldb cache from a standing start; the instances I've encountered have all involved real users in our production environment. It's likely that there's a factor I've overlooked - perhaps differences in the users' respective group memberships trigger the addition of cache entries for users during a group lookup that don't go through the duplicate-uid checks, or something similar. I'll report back if I'm able to reproduce the duplicate cache entries starting from a clean slate; my profound apologies for not confirming the methodology to reproduce the bug before submitting it.

Thank you. I did not reproduce it with your configuration either. I am afraid that I have no leads at this moment, since the code expects this case and handles it correctly. Please let us know when you are able to gather the log file so we can hopefully get some clues, or if you have a reliable reproducer. Removing the Triaged keyword until we have a way to reproduce the issue and confirm it is a bug.

Hi Dan, it just crossed my mind: do you prefer us to keep this BZ open with a needinfo+ flag set on you, or can we close it as NOTABUG, and once/if there is evidence about how to reproduce it, the BZ can be reopened? Kind regards

Bugzilla seems to send me a "Your Outstanding Requests" e-mail on a daily basis when the needinfo+ flag is set; it wasn't clear to me how to acknowledge the flag without clearing it. The BZ can be closed pending further information, but there is definitely a bug here - I've encountered it multiple times, but just have not determined the specific sequence of operations that triggers it - so INSUFFICIENT_DATA is probably a better resolution than NOTABUG.
Thanks. Yes, the nudge system runs on a daily basis to kindly remind you there is something waiting on you. The idea is to clear the flag once you provide the information, but there is no way (AFAIK) to set it to weekly/monthly. Agreed, I'll proceed with INSUFFICIENT_DATA; please come back to us once you figure it out. Kind regards
Description of problem:
If a uidNumber is reused for a new account (e.g., an entry for an account is removed from Active Directory and a different account is created with the same uidNumber value), obsolete cache entries in /var/lib/sss/db/cache_*.ldb may prevent getpwuid() (via sssd_nss) from returning the appropriate result.

Version-Release number of selected component (if applicable):
sssd-2.8.2-2.el8.x86_64

How reproducible:
?

Steps to Reproduce:
1. Create a user "test1" in Active Directory with a specific UID, e.g., uidNumber: 12345
2. Use this user entry on a Linux client that has joined the realm which uses this AD; confirm (e.g., via ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb [...]) that the entry has been cached on disk.
3. Delete the "test1" user in Active Directory.
4. Create a new user "test2" in Active Directory which re-uses the same UID (uidNumber: 12345). Use this user entry on the same Linux client. Observe that "ldbsearch -H /var/lib/sss/db/cache_DOMAIN.ldb '(uidNumber=12345)'" now returns 2 records.
5. On the Linux client, restart sssd and/or clear the memory cache via "sss_cache -E".
6. Attempt to look up the uid via "getent passwd 12345".

Actual results:
No results returned by "getent passwd 12345"; /var/log/sssd/sssd_nss.log contains log messages of the form:

```
(2023-06-05 12:23:51): [nss] [cache_req_search_ncache] (0x0400): [CID#115862] CR #235972: [UID:12345@DOMAIN] is not present in negative cache
(2023-06-05 12:23:51): [nss] [cache_req_search_cache] (0x0400): [CID#115862] CR #235972: Looking up [UID:12345@DOMAIN] in cache
(2023-06-05 12:23:51): [nss] [cache_req_search_cache] (0x0020): [CID#115862] CR #235972: Multiple objects were found when only one was expected!
```

Expected results:
The passwd entry for test2 should be returned by "getent passwd 12345".
Additional info:
Looking up the user by username will cause the correct uid to be cached in memory and returned by getpwuid(), but this is only a temporary remedy; e.g.:

```
# getent passwd 12345 || echo missing
missing
# getent passwd test2 || echo missing
test2:*:12345:12345:Test account 2:/home/test2:/bin/bash
# getent passwd 12345 || echo missing
test2:*:12345:12345:Test account 2:/home/test2:/bin/bash
# sss_cache -E
# getent passwd 12345 || echo missing
missing
#
```

A possible workaround for affected clients is to stop sssd, remove the cache file in /var/lib/sss/db/, and restart sssd.