Bug 1659498
Summary: | Re-setting the trusted AD domain fails due to wrong subdomain service name being used | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Sergey Orlov <sorlov> | ||||||||||||
Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> | ||||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Kaleem <ksiddiqu> | ||||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||||
Priority: | unspecified | ||||||||||||||
Version: | 8.0 | CC: | grajaiya, jhrozek, ksiddiqu, lslebodn, mzidek, pbrezina, sgoveas, sorlov, tscherf, wchadwic | ||||||||||||
Target Milestone: | rc | Keywords: | Regression | ||||||||||||
Target Release: | 8.0 | ||||||||||||||
Hardware: | Unspecified | ||||||||||||||
OS: | Unspecified | ||||||||||||||
Whiteboard: | |||||||||||||||
Fixed In Version: | sssd-2.0.0-41.el8 | Doc Type: | If docs needed, set a value | ||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||
Clone Of: | Environment: | ||||||||||||||
Last Closed: | 2019-06-14 02:05:40 UTC | Type: | Bug | ||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||
Documentation: | --- | CRM: | |||||||||||||
Verified Versions: | Category: | --- | |||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
Embargoed: | |||||||||||||||
Bug Depends On: | |||||||||||||||
Bug Blocks: | 1652753 | ||||||||||||||
Attachments: |
|
While I'm not going to dispute the Regression keyword in general, I think it is fair to add that the issue was added in 1.14, so RHEL-7.4. The fact that nobody hit the issue in 7.5 or 7.6 makes the issue a little less concerning to me. Upstream ticket: https://pagure.io/SSSD/sssd/issue/3911 Sergey, can you qa_ack the bug? There is a PR on review and I passed a test build to Kaleem and Anuja. * master: * aaaa9a3e836e7b7af1ff0fc5058ddddfeef120a8 * b3285f9f8a5eac3e4e70ed3bd6b74c15ad806e9e Still not working as expected: Case 1: * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true * id Administrator * id Vagrant => both users found immediately Case 2: * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true * reset_sss_cache * repeat "id Administrator" for 200 seconds => id: ‘Administrator’: no such user, Case 3: * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true * reset sssd cache and restart sssd * repeat "reset_sss_cache; id Administrator" for 200 seconds => id: ‘Administrator’: no such user, For reference: reset_sss_cache() { systemctl stop sssd rm -f /var/lib/sss/{db,mc}/* find /var/lib/sss/db -name '*.ldb' | xargs rm -fv systemctl start sssd echo Secret123 | kinit admin > /dev/null } All tests executed on clean VMs That's interesting because previously you said the issue was resolved by a test build with this fix. Anyway, I can't say more without seeing logs. Created attachment 1538423 [details]
ssd log for case1
Created attachment 1538424 [details]
ssd logs for case1
Created attachment 1538425 [details]
ssd logs for case3
I think I checked only case1 in that test. I attached logs for three cases. (sorry, I mistyped in comment for case2, but filename is correct) sssd version: sssd-common-2.0.0-43.el8.x86_64 This seems to be a different issue. In the second and third case, the is not fetched during the first-time setup: (Mon Feb 25 04:38:00 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_timeout] (0x0020): Timeout reached for retrieving keytab from IPA server (Mon Feb 25 04:38:00 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_recv] (0x2000): ipa-getkeytab status 110 Then the lookups always fail immediately: (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_srv_ad_acct_lookup_step] (0x0400): Looking up AD account (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [sss_domain_get_state] (0x1000): Domain testrealm.test is Active (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [sss_domain_get_state] (0x1000): Domain ad.test is Inactive (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [be_mark_dom_offline] (0x1000): Marking subdomain ad.test offline (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x55f91fc23570 (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x55f91fc0adc0 (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Running timer event 0x55f91fc23570 "ltdb_callback" (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Destroying timer event 0x55f91fc0adc0 "ltdb_timeout" (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Destroying timer event 0x55f91fc23570 "ltdb_callback" (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [be_mark_subdom_offline] (0x4000): Subdomain already inactive (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [22]: Invalid argument. (Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [22]: Invalid argument Looking at when does ipa_srv_ad_acct_lookup_step() return EINVAL, it's when the subdomain does not have the ad_id_ctx created at all. There are two issues: 1) the keytab is not re-fetched on this kind of an error 2) the ad_id_ctx is not created even if the keytab was then fetched sucessfully Note that this issue only happens in the first keytab retrieval fails and thus the ad_id_ctx is not created at all. This is different from the bug originally reported which was about not retrying a domain that was first marked as offline after being created. btw making sure that the ipa-getkeytab invocation does not time out would fix all these issues, at least on the IPA masters. Do you know why would the process time out? I added one more test case, where I simply restart sssd instead of resetting cache: Case 4: * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true * systemctl restart sssd * repeat "id Administrator" for 200 seconds => id: ‘Administrator’: no such user, Created attachment 1538455 [details]
case 4 sssd logs
(In reply to Sergey Orlov from comment #17) > I added one more test case, where I simply restart sssd instead of resetting > cache: > > Case 4: > * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad > --external=true > * systemctl restart sssd > * repeat "id Administrator" for 200 seconds > => id: ‘Administrator’: no such user, Still the same issue, it takes more than 5 seconds, for ipa-getkeytab to finish: (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_send] (0x0400): Retrieving keytab for HOST161$@AD.TEST from host161.testrealm.test into /var/lib/sss/keytabs/ad.test.keytab3WVRKr using ccache /var/lib/sss/db/ccache_T> (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [6693] (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [child_handler_setup] (0x2000): Signal handler set up for pid [6693] (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [sdap_process_result] (0x2000): Trace: sh[0x564b6ea96bf0], connected[1], ops[(nil)], ldap[0x564b6ea86060] (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [sdap_process_result] (0x2000): Trace: end of ldap_result list (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [Subdomains Refresh]: already enabled (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [SUDO Smart Refresh]: already enabled (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [SUDO Full Refresh]: already enabled (Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [delayed_online_authentication_callback] (0x0200): Backend is online, starting delayed online authentication. (Mon Feb 25 09:22:43 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_timeout] (0x0020): Timeout reached for retrieving keytab from IPA server (Mon Feb 25 09:22:43 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_recv] (0x2000): ipa-getkeytab status 110 Fix in this BZ is verified by Case 1 in https://bugzilla.redhat.com/show_bug.cgi?id=1659498#c9 New discovered issuued will be handled in https://bugzilla.redhat.com/show_bug.cgi?id=1681279 |
Created attachment 1514383 [details] reproducer script, sssd logs, sssd configs Description of problem: After establishing trust between IPA server and AD we need to few for several minutes before we can get result of "id user@ADOMAIN" command. Version-Release number of selected component (if applicable): sssd-common-2.0.0-23.el8.x86_64 How reproducible: Stable Steps to Reproduce: 1. have IPA server and AD server configured and prepared for establishing trust 2. ipa trust-add windows.domain --admin Administrator --type=ad --external=true 3. run "id administrator" 4. repeatedly execute same command for 1-2 minutes Actual results: for about one or two minutes we get "id: ‘administrator’: no such user" and then finally we get user id and group membership Expected results: user id and group membership displayed on first invocation of "id" Additional info: I attach: * reproducer script * output of script run * log files of sssd collected during this run * sssd config files Comment by Jakub Hrozek: For reasons I forgot we renamed all subdomain services to start with the "sd_" prefix: (Thu Dec 13 05:29:20 2018) [sssd[be[testrelm.test]]] [fo_new_service] (0x0400): Creating new service 'sd_ipaad2016.test' But what happens in your environment is that the first lookups fails: (Thu Dec 13 05:29:30 2018) [sssd[be[testrelm.test]]] [sdap_kinit_done] (0x0100): Could not get TGT: 14 [Bad address] (Thu Dec 13 05:29:30 2018) [sssd[be[testrelm.test]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [1432158226](Authentication Failed) And then when we want to re-set the domain status, we use the name without the "sd_" prefix apparently: (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [ipa_srv_ad_acct_retried] (0x0400): Subdomain re-set, will retry lookup (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [be_fo_reset_svc] (0x1000): Resetting all servers in service ipaad2016.test (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [be_fo_reset_svc] (0x0080): Cannot retrieve service [ipaad2016.test] So a domain is not re-set until an internal timeout expires.