Bug 1659498

Summary: Re-setting the trusted AD domain fails due to wrong subdomain service name being used
Product: Red Hat Enterprise Linux 8 Reporter: Sergey Orlov <sorlov>
Component: sssdAssignee: SSSD Maintainers <sssd-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Kaleem <ksiddiqu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.0CC: grajaiya, jhrozek, ksiddiqu, lslebodn, mzidek, pbrezina, sgoveas, sorlov, tscherf, wchadwic
Target Milestone: rcKeywords: Regression
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-2.0.0-41.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-14 02:05:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1652753    
Attachments:
Description Flags
reproducer script, sssd logs, sssd configs
none
ssd log for case1
none
ssd logs for case1
none
ssd logs for case3
none
case 4 sssd logs none

Description Sergey Orlov 2018-12-14 14:31:33 UTC
Created attachment 1514383 [details]
reproducer script, sssd logs, sssd configs

Description of problem:
After establishing trust between IPA server and AD we need to few for several minutes before we can get result of "id user@ADOMAIN" command.

Version-Release number of selected component (if applicable):
sssd-common-2.0.0-23.el8.x86_64


How reproducible:
Stable

Steps to Reproduce:
1. have IPA server and AD server configured and prepared for establishing trust
2. ipa trust-add windows.domain --admin Administrator --type=ad --external=true
3. run "id administrator"
4. repeatedly execute same command for 1-2 minutes

Actual results:
for about one or two minutes we get
"id: ‘administrator’: no such user"
and then finally we get user id and group membership

Expected results:
user id and group membership displayed on first invocation of "id"


Additional info:
I attach:
 * reproducer script
 * output of script run
 * log files of sssd collected during this run
 * sssd config files


Comment by Jakub Hrozek:

For reasons I forgot we renamed all subdomain services to start with the
"sd_" prefix:
    (Thu Dec 13 05:29:20 2018) [sssd[be[testrelm.test]]] [fo_new_service] (0x0400): Creating new service 'sd_ipaad2016.test'

But what happens in your environment is that the first lookups fails:
    (Thu Dec 13 05:29:30 2018) [sssd[be[testrelm.test]]] [sdap_kinit_done] (0x0100): Could not get TGT: 14 [Bad address]
    (Thu Dec 13 05:29:30 2018) [sssd[be[testrelm.test]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [1432158226](Authentication Failed)

And then when we want to re-set the domain status, we use the name
without the "sd_" prefix apparently:
    (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [ipa_srv_ad_acct_retried] (0x0400): Subdomain re-set, will retry lookup
    (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [be_fo_reset_svc] (0x1000): Resetting all servers in service ipaad2016.test
    (Thu Dec 13 05:29:31 2018) [sssd[be[testrelm.test]]] [be_fo_reset_svc] (0x0080): Cannot retrieve service [ipaad2016.test]

So a domain is not re-set until an internal timeout expires.

Comment 1 Jakub Hrozek 2018-12-17 10:20:58 UTC
While I'm not going to dispute the Regression keyword in general, I think it is fair to add that the issue was added in 1.14, so RHEL-7.4. The fact that nobody hit the issue in 7.5 or 7.6 makes the issue a little less concerning to me.

Comment 2 Jakub Hrozek 2018-12-20 14:22:04 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3911

Comment 3 Jakub Hrozek 2018-12-20 21:13:22 UTC
Sergey, can you qa_ack the bug? There is a PR on review and I passed a test build to Kaleem and Anuja.

Comment 7 Jakub Hrozek 2019-01-28 20:45:06 UTC
* master:
 * aaaa9a3e836e7b7af1ff0fc5058ddddfeef120a8
 * b3285f9f8a5eac3e4e70ed3bd6b74c15ad806e9e

Comment 9 Sergey Orlov 2019-02-25 12:29:45 UTC
Still not working as expected:

Case 1:
 * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true
 * id Administrator 
 * id Vagrant 
 => both users found immediately

Case 2:
 * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true
 * reset_sss_cache
 * repeat "id Administrator" for 200 seconds
  => id: ‘Administrator’: no such user, 

Case 3:
 * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true
 * reset sssd cache and restart sssd
 * repeat "reset_sss_cache; id Administrator" for 200 seconds
  => id: ‘Administrator’: no such user, 


For reference: 
reset_sss_cache() {
    systemctl stop sssd
    rm -f /var/lib/sss/{db,mc}/*
    find /var/lib/sss/db -name '*.ldb' | xargs rm -fv
    systemctl start sssd
    echo Secret123 | kinit admin > /dev/null
}


All tests executed on clean VMs

Comment 10 Jakub Hrozek 2019-02-25 12:36:25 UTC
That's interesting because previously you said the issue was resolved by a test build with this fix.

Anyway, I can't say more without seeing logs.

Comment 11 Sergey Orlov 2019-02-25 12:52:41 UTC
Created attachment 1538423 [details]
ssd log for case1

Comment 12 Sergey Orlov 2019-02-25 12:53:07 UTC
Created attachment 1538424 [details]
ssd logs for case1

Comment 13 Sergey Orlov 2019-02-25 12:53:33 UTC
Created attachment 1538425 [details]
ssd logs for case3

Comment 14 Sergey Orlov 2019-02-25 12:57:01 UTC
I think I checked only case1  in that test.


I attached logs for three cases. (sorry, I mistyped in comment for case2, but filename is correct)
sssd version: sssd-common-2.0.0-43.el8.x86_64

Comment 15 Jakub Hrozek 2019-02-25 13:41:36 UTC
This seems to be a different issue. In the second and third case, the is not fetched during the first-time setup:
(Mon Feb 25 04:38:00 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_timeout] (0x0020): Timeout reached for retrieving keytab from IPA server
(Mon Feb 25 04:38:00 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_recv] (0x2000): ipa-getkeytab status 110

Then the lookups always fail immediately:
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_srv_ad_acct_lookup_step] (0x0400): Looking up AD account
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [sss_domain_get_state] (0x1000): Domain testrealm.test is Active
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [sss_domain_get_state] (0x1000): Domain ad.test is Inactive
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [be_mark_dom_offline] (0x1000): Marking subdomain ad.test offline
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x55f91fc23570

(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x55f91fc0adc0

(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Running timer event 0x55f91fc23570 "ltdb_callback"

(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Destroying timer event 0x55f91fc0adc0 "ltdb_timeout"

(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ldb] (0x4000): Destroying timer event 0x55f91fc23570 "ltdb_callback"

(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [be_mark_subdom_offline] (0x4000): Subdomain already inactive
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: [22]: Invalid argument.
(Mon Feb 25 04:38:02 2019) [sssd[be[testrealm.test]]] [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [22]: Invalid argument

Looking at when does ipa_srv_ad_acct_lookup_step() return EINVAL, it's when the subdomain does not have the ad_id_ctx created at all. There are two issues:
1) the keytab is not re-fetched on this kind of an error
2) the ad_id_ctx is not created even if the keytab was then fetched sucessfully

Note that this issue only happens in the first keytab retrieval fails and thus the ad_id_ctx is not created at all. This is different from the bug originally reported which was about not retrying a domain that was first marked as offline after being created.

Comment 16 Jakub Hrozek 2019-02-25 13:42:17 UTC
btw making sure that the ipa-getkeytab invocation does not time out would fix all these issues, at least on the IPA masters. Do you know why would the process time out?

Comment 17 Sergey Orlov 2019-02-25 14:43:58 UTC
I added one more test case, where I simply restart sssd instead of resetting cache:

Case 4:
 * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad --external=true
 * systemctl restart sssd
 * repeat "id Administrator" for 200 seconds
  => id: ‘Administrator’: no such user,

Comment 18 Sergey Orlov 2019-02-25 14:44:28 UTC
Created attachment 1538455 [details]
case 4 sssd logs

Comment 19 Jakub Hrozek 2019-02-25 15:18:07 UTC
(In reply to Sergey Orlov from comment #17)
> I added one more test case, where I simply restart sssd instead of resetting
> cache:
> 
> Case 4:
>  * /usr/bin/ipa trust-add ad.test --admin Administrator --type=ad
> --external=true
>  * systemctl restart sssd
>  * repeat "id Administrator" for 200 seconds
>   => id: ‘Administrator’: no such user,

Still the same issue, it takes more than 5 seconds, for ipa-getkeytab to finish:
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_send] (0x0400): Retrieving keytab for HOST161$@AD.TEST from host161.testrealm.test into /var/lib/sss/keytabs/ad.test.keytab3WVRKr using ccache /var/lib/sss/db/ccache_T>
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [6693]
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [child_handler_setup] (0x2000): Signal handler set up for pid [6693]
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [sdap_process_result] (0x2000): Trace: sh[0x564b6ea96bf0], connected[1], ops[(nil)], ldap[0x564b6ea86060]
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [Subdomains Refresh]: already enabled
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [SUDO Smart Refresh]: already enabled
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_online_cb] (0x0400): Back end is online
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [be_ptask_enable] (0x0080): Task [SUDO Full Refresh]: already enabled
(Mon Feb 25 09:22:38 2019) [sssd[be[testrealm.test]]] [delayed_online_authentication_callback] (0x0200): Backend is online, starting delayed online authentication.
(Mon Feb 25 09:22:43 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_timeout] (0x0020): Timeout reached for retrieving keytab from IPA server
(Mon Feb 25 09:22:43 2019) [sssd[be[testrealm.test]]] [ipa_getkeytab_recv] (0x2000): ipa-getkeytab status 110

Comment 20 Sergey Orlov 2019-02-25 16:39:15 UTC
Fix in this BZ is verified by Case 1 in https://bugzilla.redhat.com/show_bug.cgi?id=1659498#c9

New discovered issuued will be handled in https://bugzilla.redhat.com/show_bug.cgi?id=1681279