Bug 1318654

Summary: some trusted users are not able to logon
Product: Red Hat Enterprise Linux 7 Reporter: Silvio Wanka <Silvio.Wanka>
Component: sssdAssignee: SSSD Maintainers <sssd-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Steeve Goveas <sgoveas>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2CC: grajaiya, jhrozek, lslebodn, mkosek, mzidek, pbrezina, preichl, Silvio.Wanka
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-29 19:50:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
debug_level = 0xfff0 in domain section none

Description Silvio Wanka 2016-03-17 13:01:46 UTC
Created attachment 1137385 [details]
debug_level = 0xfff0 in domain section

IPA domain with trust to AD.

4 AD users are allowed to logon to a server (hbac rule from anny source to this server any services).

2 of them are able to logon, the two others not. They are member of the same external group which is member of a posix user group which is used in the hbac rule.

Logon tried via SSH and on the console, always the same.

Comment 2 Jakub Hrozek 2016-03-17 13:30:52 UTC
Well, the answer is right there in the logs:

(Thu Mar 17 10:16:51 2016) [sssd[be[ipa.company.com]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [krb5_child_timeout] (0x0040): Timeout for child [23166] reached. In case KDC is distant or network is slow you may consider increasing value of krb5_auth_timeout.
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [krb5_auth_done] (0x0020): child timed out!
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_NOT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 836
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [fo_set_port_status] (0x0100): Marking port 0 of server 'dedcs1-idm02.ipa.company.com' as 'not working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [fo_set_port_status] (0x0400): Marking port 0 of duplicate server 'dedcs1-idm02.ipa.company.com' as 'not working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'IPA'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_server_status] (0x1000): Status of server 'dedcs1-idm01.ipa.company.com' is 'working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_port_status] (0x1000): Port status of port 389 for server 'dedcs1-idm01.ipa.company.com' is 'not working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_server_status] (0x1000): Status of server 'dedcs1-idm02.ipa.company.com' is 'name resolved'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_port_status] (0x1000): Port status of port 389 for server 'dedcs1-idm02.ipa.company.com' is 'not working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_server_status] (0x1000): Status of server 'dedcs1-idm02.ipa.company.com' is 'name resolved'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [get_port_status] (0x1000): Port status of port 0 for server 'dedcs1-idm02.ipa.company.com' is 'not working'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [be_mark_dom_offline] (0x1000): Marking subdomain ad-company.com offline
(Thu Mar 17 10:16:57 2016) [sssd[be[ipa.company.com]]] [be_mark_subdom_offline] (0x1000): Marking subdomain ad-company.com as inactive


Please try increasing the timeout.

Comment 3 Silvio Wanka 2016-03-17 14:34:08 UTC
> Please try increasing the timeout.

As I wrote in some posts the current IPA (sssd?) version uses the AD site informations. So there should be no timeouts because the AD DCs are in same datacenter only different subnetworks.

But I will try it and must ask which timeout in which config file?

TIA,
Silvio

Comment 4 Silvio Wanka 2016-03-17 14:35:01 UTC
(In reply to Silvio Wanka from comment #3)
> As I wrote in some posts the current IPA (sssd?) version uses the AD site

As I read in some ... / Sorry for this mistake

Comment 5 Jakub Hrozek 2016-03-17 14:52:20 UTC
What can also happen during the authentication is that the PAC responder might e contacted and start resolving SIDs that are not cached. You can try if ignore_group_members helps, see:
https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/

Comment 6 Silvio Wanka 2016-03-17 15:07:24 UTC
(In reply to Jakub Hrozek from comment #5)
> You can try if ignore_group_members helps, see:

This is already configured on both IdM Servers because otherwise it takes to much time at all, from sssd.conf on both IdM Server:

subdomain_inherit = ldap_user_principal, ignore_group_members
ldap_user_principal = nosuchattr
ignore_group_members = True
 

in domain section. And id "user@ad-domain" works without problems also for the user which can't logon.

Comment 7 Jakub Hrozek 2016-03-17 15:17:35 UTC
OK, so according to the krb5_child.log, the communication to AD is really slow:
(Thu Mar 17 10:16:51 2016) [[sssd[krb5_child[23166]]]] [sss_child_krb5_trace_cb] (0x4000): [23166] 1458206211.828116: Getting initial credentials for ste-die
(Thu Mar 17 10:16:51 2016) [[sssd[krb5_child[23166]]]] [sss_child_krb5_trace_cb] (0x4000): [23166] 1458206211.828200: FAST armor ccache: MEMORY:/var/lib/sss/db/fast_ccache_IPA.COMPANY.COM
(Thu Mar 17 10:16:51 2016) [[sssd[krb5_child[23166]]]] [sss_child_krb5_trace_cb] (0x4000): [23166] 1458206211.828243: Retrieving host/dedcs1-orat01.ipa.company.com.COM -> krb5_ccache_conf_data/fast_avail/krbtgt\/AD-COMPANY.COM\@AD-COMPANY.COM@X-CACHECONF: from MEMORY:/var/lib/sss/db/fast_ccache_IPA.COMPANY.COM with result: -1765328243/Matching credential not found
(Thu Mar 17 10:16:51 2016) [[sssd[krb5_child[23166]]]] [sss_child_krb5_trace_cb] (0x4000): [23166] 1458206211.828316: Sending request (174 bytes) to AD-COMPANY.COM
(Thu Mar 17 10:16:55 2016) [[sssd[krb5_child[23166]]]] [sss_child_krb5_trace_cb] (0x4000): [23166] 1458206215.240126: Resolving hostname tosdead02.ad-company.com.
(Thu Mar 17 10:16:57 2016) [[sssd[krb5_child[23167]]]] [main] (0x0400): krb5_child started.

10:16:57 is when sssd_be killed the child process and another one with a different PID started. As you can see from the log, just resolving  tosdead02.ad-company.com took 2 seconds..

Comment 8 Silvio Wanka 2016-03-17 15:41:19 UTC
(In reply to Jakub Hrozek from comment #7)
> OK, so according to the krb5_child.log, the communication to AD is really
> slow:...

Hi,

this is maybe a problem but then all users should not be able to logon and not only 2 of 4.

And the othe question is, these logs are from IPA client, why this client resolves the name from the AD DC, as I understud any KDC communication to AD DCs will be done by the IPA master servers not by the clients.

I have now simple tried "kinit ste-die" and it works (also very speedy):

# klist
Ticket cache: KEYRING:persistent:142603269:krb_ccache_Q1SntRv
Default principal: ste-die

Valid starting       Expires              Service principal
03/17/2016 16:31:00  03/18/2016 02:31:00  krbtgt/AD-COMPANY.COM
        renew until 03/18/2016 02:31:00

So what is the problem here?

Comment 9 Jakub Hrozek 2016-03-17 15:55:55 UTC
(In reply to Silvio Wanka from comment #8)
> (In reply to Jakub Hrozek from comment #7)
> > OK, so according to the krb5_child.log, the communication to AD is really
> > slow:...
> 
> Hi,
> 
> this is maybe a problem but then all users should not be able to logon and
> not only 2 of 4.
> 

Can you check if the authentications that time out are always against the same DC?

> And the othe question is, these logs are from IPA client, why this client
> resolves the name from the AD DC, as I understud any KDC communication to AD
> DCs will be done by the IPA master servers not by the clients.
> 

No, authentication is always done directly to the AD DCs, only identity lookups are proxied through IPA masters.

> I have now simple tried "kinit ste-die" and it works (also
> very speedy):
> 

Did 
> # klist
> Ticket cache: KEYRING:persistent:142603269:krb_ccache_Q1SntRv
> Default principal: ste-die
> 
> Valid starting       Expires              Service principal
> 03/17/2016 16:31:00  03/18/2016 02:31:00 
> krbtgt/AD-COMPANY.COM
>         renew until 03/18/2016 02:31:00
> 
> So what is the problem here?

I really don't know more than the logs tell me, sorry.. I guess it could help to prepend KRB5_TRACE=/dev/stderr before the kinit to see the same amount of debug data as we see in krb5_child. That way, maybe you could single out one DC that is acting up.

Also maybe grabbing tcpdump from a slow login might show where the communication lags..

Comment 10 Jakub Hrozek 2016-03-17 16:03:03 UTC
Another idea, stracing the krb5_child might show which syscall takes so long, see:
https://fedorahosted.org/sssd/wiki/DevelTips#UsingstracetotracktheSSSDprocesses

Comment 11 Silvio Wanka 2016-03-29 11:11:53 UTC
I have solved the problem in meantime by "sss_cache -E". IMO shows this that there is a bug in the sssd cache management. But because I can't deliver any helpfully logs anymore to find this bug you can decide to close this issue as unreproducable or what else.

THX

Comment 12 Jakub Hrozek 2016-03-29 19:50:49 UTC
Yes, I'm afraid with logs I need to close the bug. Thank you for reporting it nonetheless and I'm glad SSSD works for you now.