Bug 1379415 - SSSD always boots in Offline mode
Summary: SSSD always boots in Offline mode
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: rawhide
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Pavel Březina
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-26 16:23 UTC by Colin.Simpson
Modified: 2019-01-07 14:42 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-06 19:04:12 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Full log file for this domain (1.62 MB, text/plain), 2016-09-26 16:47 UTC, Colin.Simpson
sssd.log (46.41 KB, text/plain), 2016-09-28 14:39 UTC, Colin.Simpson
sssd.log Version 2 (38.62 KB, text/plain), 2016-09-29 14:53 UTC, Colin.Simpson
sssd_ukion.iongeo.com.log Version 2 (2.19 MB, text/plain), 2016-09-29 14:54 UTC, Colin.Simpson

Description Colin.Simpson 2016-09-26 16:23:58 UTC
(Bug template submitted empty; see comment 2 for the details.)

Comment 1 Lukas Slebodnik 2016-09-26 16:31:57 UTC
a) sssd does not boot anything
b) it's very common to provide log files.

Comment 2 Colin.Simpson 2016-09-26 16:46:26 UTC
Sorry, I don't know how I managed to submit this before updating the details.

Description of problem:
When I start F24 with sssd (1.14.1-2), it boots in offline mode from the beginning. The logs suggest it can't find the DNS servers. Now, I may be missing an item (an NM dispatcher script or something) in the setup; my test system here has been upgraded between versions.

Version-Release number of selected component (if applicable):
sssd (1.14.1-2)

How reproducible:
Every time

Steps to Reproduce:
1. Boot System
2. sssctl domain-status iongeo.lan -o
3. Check the status; it shows:
   Online status: Offline
4. Restarting sssd brings it back to "Online status: Online"


Additional info:
My reading of the log suggests it can't find the DNS servers; maybe the resolv.conf that NM has written doesn't get reread. I have tried with nscd running (configured just for name service caching) and not running. I'll post the full log.

(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [get_port_status] (0x1000): Port status of port 0 for server '(no name)' is 'neutral'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [resolve_srv_send] (0x0200): The status of SRV lookup is neutral
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [ad_srv_plugin_send] (0x0400): About to find domain controllers
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain ukion.iongeo.com
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'ukion.iongeo.com'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.ukion.iongeo.com'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [request_watch_destructor] (0x0400): Deleting request watch
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [resolve_srv_done] (0x0040): Unable to resolve SRV [1432158236]: SRV lookup error
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'not resolved'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned [1432158236]: SRV lookup error
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [get_port_status] (0x1000): Port status of port 0 for server '(no name)' is 'not working'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [fo_resolve_service_send] (0x0020): No available servers for service 'AD'
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [be_resolve_server_done] (0x1000): Server resolution failed: [5]: Input/output error
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [be_mark_offline] (0x2000): Going offline!
(Mon Sep 26 15:21:41 2016) [sssd[be[ukion.iongeo.com]]] [be_mark_offline] (0x2000): Enable check_if_online_ptask.

Comment 3 Colin.Simpson 2016-09-26 16:47:29 UTC
Created attachment 1204889 [details]
Full log file for this domain

Comment 4 Pavel Březina 2016-09-27 08:54:55 UTC
What happens if you manually signal sssd to try to connect, e.g.:

# kill -s SIGUSR2 $(pidof sssd)
# id not-cache-user

Does SSSD start after NM?

Comment 5 Colin.Simpson 2016-09-27 11:20:05 UTC
Looks like sssd is starting before NetworkManager

Sep 27 11:42:33 eden sssd[721]: Starting up

Sep 27 11:42:50 eden NetworkManager[742]: <info>  [1474972970.7890] NetworkManager (version 1.2.4-2.fc24) is starting...

Running the kill command and the id command, it still stays offline

Comment 6 Colin.Simpson 2016-09-27 13:37:42 UTC
When I force it to retry while offline, it still seems unable to find the DNS servers:

(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.ukion.iongeo.com'
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [request_watch_destructor] (0x0400): Deleting request watch
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [fo_set_port_status] (0x0100): Marking port 0 of server '(no name)' as 'not working'
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [resolve_srv_done] (0x0040): Unable to resolve SRV [1432158236]: SRV lookup error
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'AD' as 'not resolved'
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned [1432158236]: SRV lookup error
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Tue Sep 27 14:33:56 2016) [sssd[be[ukion.iongeo.com]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'

As I say, I'm happy to acknowledge there may be something kooky with this setup.

Comment 7 Sumit Bose 2016-09-27 13:55:26 UTC
Do you get the DNS name servers via DHCP? The glibc resolver code reads /etc/resolv.conf only at process start and is not aware of any later changes. It looks like the c-ares resolver SSSD uses might have the same limitation, so you should make sure SSSD starts after NM so that /etc/resolv.conf is up to date.

Comment 8 Colin.Simpson 2016-09-27 15:16:06 UTC
Yes, we do get DNS from DHCP on desktops. We have seen this issue with glibc only ever using the resolv.conf from when the program started. I was hoping sssd might not do this, as it's usually pretty dynamic.

But starting NM first won't necessarily help; it just depends on how quickly DHCP responds. For example, if switch ports aren't set to portfast, the machine may not get an IP for many seconds after NM comes up. This will still break SSSD, assuming the above.

Shouldn't sssd re-read resolv.conf, or maybe there should be an NM dispatcher script to force a resolv.conf reload?

I could maybe work around it by starting nscd (only doing host caching) first; then we can have a dispatcher (I think we already have this) that will tell nscd to reload its config (and resolv.conf).

Though as I say, unless nscd is already running when other programs start, their glibc resolvers won't use it and will resolve directly instead. So starting nscd first (pre-SSSD) may "fix" this.

Not sure if this issue is RFE stuff?

Comment 9 Colin.Simpson 2016-09-27 16:31:57 UTC
Making sssd start after nscd seems to work around this issue, but this also seems to force NetworkManager before sssd (not a big issue, just an observation).

Comment 10 Jakub Hrozek 2016-09-27 20:28:39 UTC
(In reply to Sumit Bose from comment #7)
> Do you get the DNS name servers via DHCP? The glibc resolver code will read
> /etc/resolv.conf only during the start of the process and will not be aware
> of any changes. It looks like the c-ares resolver SSSD uses might have the
> same limitation. 

In general yes, c-ares doesn't re-read resolv.conf, but we do (try to) work around this limitation in sssd. See for example this code in monitor.c:
    /* Watch for changes to the DNS resolv.conf */
    ret = monitor_config_file(ctx, ctx, RESOLV_CONF_PATH,
                              monitor_update_resolv, false);

which sets inotify watches for resolv.conf; if resolv.conf is modified, the offline state is reset. Please note I'm not claiming there is no bug in that code; I'm just stating that resolv.conf changes should reset the offline status.

Could we please also see a logfile from the [sssd] process itself?

Comment 11 Jakub Hrozek 2016-09-27 20:31:46 UTC
(In reply to Colin Simpson from comment #9)
> Making sssd start after nscd 

Is there a reason to use nscd together with sssd? Is nscd used for the maps that sssd serves (like passwd and group) or also others?

By the way in general starting sssd before NM (or before networking is up in general) shouldn't cause an issue since either the libnl integration or changes to resolv.conf should reset the offline status. But perhaps we have a bug..

Comment 12 Colin.Simpson 2016-09-28 13:46:42 UTC
(In reply to Jakub Hrozek from comment #11)

> Is there a reason to use nscd together with sssd? Is nscd used for the maps
> that sssd serves (like passwd and group) or also others?

As I said above, "I could maybe workaround by starting nscd (only doing host caching NO passwd, group, services or netgroup)". We need to use nscd for host caching, as we have some badly behaved third-party apps that, when certain processes fail, can get into a constant restart loop that hammers the nameservers.

I will generate the log for you.

Comment 13 Colin.Simpson 2016-09-28 14:37:48 UTC
Log file attached. 

Now I guess the interesting lines might be:

(Wed Sep 28 15:11:54 2016) [sssd] [monitor_config_file] (0x0080): Could not stat file [/etc/resolv.conf]. Error [2:No such file or directory]

Now, we do have a line in a dispatcher and startup script that we use to clobber resolv.conf if we detect no valid IPs. This was due to issues in earlier Fedora/RHEL versions where boot would become interminably slow if there were stale entries still in resolv.conf. This seemed to happen if the machine was shut down on the network and booted off the network. Maybe no such issue exists these days.

e.g.


	cat >/etc/resolv.conf <<EOF
EOF

But I can't see this or anything we would do that would cause the file to not exist at any point. I guess it may depend on how NetworkManager updates resolv.conf or more correctly what it's symlinked to:

/etc/resolv.conf -> /var/run/NetworkManager/resolv.conf

I guess the monitor_config_file is upset it doesn't exist (for unknown reasons) and never bothers to recheck it again?

Comment 14 Colin.Simpson 2016-09-28 14:39:16 UTC
Created attachment 1205499 [details]
sssd.log

Comment 15 Colin.Simpson 2016-09-28 14:39:59 UTC
Actually it does say later:
(Wed Sep 28 15:25:15 2016) [sssd] [monitor_update_resolv] (0x0040): Resolv.conf has been updated. Reloading.
(Wed Sep 28 15:25:38 2016) [sssd] [monitor_update_resolv] (0x0040): Resolv.conf has been updated. Reloading.

So I guess something notices....

Comment 16 Jakub Hrozek 2016-09-29 07:49:38 UTC
Sorry about not noticing your nscd use-case, it's certainly valid to use it as long as you don't cache the sssd maps.

The logs are interesting, mainly this part:
(Wed Sep 28 15:25:32 2016) [sssd] [link_msg_handler] (0x1000): netlink link message: iface idx 3 (eno1) flags 0x11043 (broadcast,multicast,up,running,lowerup)
(Wed Sep 28 15:25:32 2016) [sssd] [has_phy_80211_subdir] (0x1000): No phy80211 directory in sysfs, probably not a wireless interface
(Wed Sep 28 15:25:32 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status
(Wed Sep 28 15:25:32 2016) [sssd] [service_signal] (0x0020): Could not signal service [ukion.iongeo.com].

The failure comes from this code:
    if (!svc->conn) {
        /* Avoid a race condition where we are trying to
         * order a service to reload that hasn't started
         * yet.
         */
        DEBUG(SSSDBG_CRIT_FAILURE,
              "Could not signal service [%s].\n", svc->name);
        return EIO;
    }

So the sbus connection is not up yet, which is really strange after this long. Do you also have logs from the domain section that correspond to the same timeframe? I wonder if there is some issue bringing the service up.

I also don't see the service registering itself with the monitor. I do see it being queued for startup:
(Wed Sep 28 15:11:55 2016) [sssd] [start_service] (0x0100): Queueing service ukion.iongeo.com for startup

And responders do register themselves, like NSS here:
(Wed Sep 28 15:12:03 2016) [sssd] [sbus_message_handler] (0x2000): Received SBUS method org.freedesktop.sssd.monitor.RegisterService on path /org/freedesktop/sssd/monitor
(Wed Sep 28 15:12:03 2016) [sssd] [sbus_get_sender_id_send] (0x2000): Not a sysbus message, quit
(Wed Sep 28 15:12:03 2016) [sssd] [client_registration] (0x0100): Received ID registration: (nss,1)

I suspect something like a sudo renewal task or similar triggers a GSSAPI bind which blocks sssd_be for a long time, because the network might not be up at that time, and the service registration times out. But that's just a hunch.

Comment 17 Colin.Simpson 2016-09-29 14:53:04 UTC
I didn't have the domain log file to match the main daemon one, so I've attached two new ones from a new run of this issue.

Comment 18 Colin.Simpson 2016-09-29 14:53:39 UTC
Created attachment 1205970 [details]
sssd.log Version 2

Comment 19 Colin.Simpson 2016-09-29 14:54:59 UTC
Created attachment 1205971 [details]
sssd_ukion.iongeo.com.log Version 2

Comment 20 Pavel Březina 2016-10-06 10:38:34 UTC
1) If resolv.conf is missing, sssd tries to stat it every 10 seconds until it is found. So this error is not fatal.

2) The provider does not register with the monitor because it hangs while loading /usr/lib64/sssd/libsss_ad.so. Notice the timestamps: it takes 6 seconds, but the timeout for registration is 5 seconds.

(Thu Sep 29 14:14:45 2016) [sssd[be[ukion.iongeo.com]]] [dp_module_open_lib] (0x1000): Loading module [ad] with path [/usr/lib64/sssd/libsss_ad.so]
(Thu Sep 29 14:14:51 2016) [sssd[be[ukion.iongeo.com]]] [dp_module_run_constructor] (0x0400): Executing module [ad] constructor.

Why does it take so long? Can you run sssd with strace as described at [1]?

[1] https://fedorahosted.org/sssd/wiki/DevelTips

Comment 21 Jakub Hrozek 2016-10-06 12:10:52 UTC
(In reply to Pavel Březina from comment #20)
> 1) If resolv.conf is missing, sssd is trying to stat it every 10 seconds
> until it is found. So this error is not fatal.
> 
> 2) Provider does not register to the monitor because it hangs while loading
> /usr/lib64/sssd/libsss_ad.so. Notice the timestamps, it takes it 6 seconds
> but timeout for registration is 5 seconds. 
> 
> Thu Sep 29 14:14:45 2016) [sssd[be[ukion.iongeo.com]]] [dp_module_open_lib]
> (0x1000): Loading module [ad] with path [/usr/lib64/sssd/libsss_ad.so]
> (Thu Sep 29 14:14:51 2016) [sssd[be[ukion.iongeo.com]]]
> [dp_module_run_constructor] (0x0400): Executing module [ad] constructor.
> 

I suspect this has to do with some initial (enumeration? subdomains) task that might try to bind to LDAP. IIRC the default libldap bind timeout is 6 seconds.

If the strace confirms that, I wonder if we should raise the default timeout?

> Why does it take so long? Can you run sssd with strace as described at [1]?
> 
> [1] https://fedorahosted.org/sssd/wiki/DevelTips

Comment 22 Pavel Březina 2016-10-06 12:27:29 UTC
No, it takes 6 seconds to dlopen the module.

The associated code is:

    ret = dp_module_open_lib(module);
    if (ret != EOK) {
        goto done;
    }

    ret = dp_module_run_constructor(module, be_ctx, provider);
    if (ret != EOK) {
        goto done;
    }

At 14:14:45 we are in dp_module_open_lib, dlopening the library. At 14:14:51 we are in dp_module_run_constructor. There is no code from the module itself run in between.

Comment 23 Colin.Simpson 2016-10-06 13:46:38 UTC
It's not clear from the wiki page exactly what you'd like strace'd for this issue?

Comment 24 Jakub Hrozek 2016-10-06 14:04:35 UTC
(In reply to Colin Simpson from comment #23)
> It's not clear from the wiki page exactly what you'd like strace'd for this
> issue?

This section:
https://fedorahosted.org/sssd/wiki/DevelTips#UsingstracetotracktheSSSDprocesses
includes how to put a "command=" directive into sssd.conf that would enable us to strace the sssd_be process. Please note that SELinux must be in permissive mode for that to be permitted.

Comment 25 Jakub Hrozek 2016-10-06 14:05:32 UTC
(In reply to Pavel Březina from comment #22)
> No, 6 seconds takes to dlopen the module.
> 
> The associated code is:
> 
>     ret = dp_module_open_lib(module);
>     if (ret != EOK) {
>         goto done;
>     }
> 
>     ret = dp_module_run_constructor(module, be_ctx, provider);
>     if (ret != EOK) {
>         goto done;
>     }
> 
> At 14:14:45 we are in dp_module_open_lib, dlopening the library. At 14:14:51
> we are in dp_module_run_constructor. There is no code from the module itself
> run in between.

Then we really need the strace. I wonder if SELinux might be the culprit.

Comment 26 Colin.Simpson 2016-10-07 09:03:52 UTC
Sorry if I'm being dim here; the wiki says to put the strace line in the domain section, as below:

[sssd]
domains = ukion.iongeo.com
config_file_version = 2
services = nss, pam, ifp
debug_level = 10
##command = strace -ff -o /tmp/sssd_be_strace /usr/libexec/sssd/sssd_be --debug-level=10 --domain ukion.iongeo.com

[domain/ukion.iongeo.com]
ad_domain = ukion.iongeo.com
krb5_realm = UKION.IONGEO.COM
realmd_tags = manages-system joined-with-samba 
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
# Change Local
ldap_id_mapping = False
# Change Local
use_fully_qualified_names = False
fallback_homedir = /home/%u
access_provider = ad
# Change Local - For local users
#override_homedir = /home/%u
debug_level = 10
command = strace -ff -o /tmp/sssd_be_strace /usr/libexec/sssd/sssd_be --debug-level=10 --domain ukion.iongeo.com

I have tried both sections (swapping in the "command=" line that is commented out above). It starts with the setup above but doesn't work, i.e. I get:

# sssctl domain-status ukion.iongeo.com -o
Unable to get online status [3]: Communication error
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Check that SSSD is running and the InfoPipe responder is enabled. Make sure 'ifp' is listed in the 'services' option in sssd.conf.
Unable to get online status

Comment 27 Pavel Březina 2016-10-10 10:17:15 UTC
Hi, given that the provider is not registered with the monitor, I believe this is expected behaviour. Or does it work without the "command" option? Please remove the existing logs (/var/log/sssd/*) and then reboot your machine. SSSD should be offline after the reboot; there is no need to check it. We would like to see the generated logs (so we can match timestamps better) together with /tmp/sssd_be_strace. The configuration above looks good.

Comment 28 Colin.Simpson 2016-10-10 10:22:50 UTC
The "sssctl domain-status ukion.iongeo.com -o" works when the "command" option is commented out. That's my point, so I'm concerned my "command=" argument is incorrect.

Comment 29 Jakub Hrozek 2016-10-12 19:37:28 UTC
Does running with strace generate any logs, though? Could we see them if it does, please?

Comment 30 Lukas Slebodnik 2016-10-14 09:59:20 UTC
(In reply to Colin Simpson from comment #26)
> Sorry if I'm being dim here, the wiki says put the strace line in the domain
> section, as below:
> 
> [sssd]
> domains = ukion.iongeo.com
> config_file_version = 2
> services = nss, pam, ifp
> debug_level = 10
> ##command = strace -ff -o /tmp/sssd_be_strace /usr/libexec/sssd/sssd_be
> --debug-level=10 --domain ukion.iongeo.com
> 
> [domain/ukion.iongeo.com]
> ad_domain = ukion.iongeo.com
> krb5_realm = UKION.IONGEO.COM
> realmd_tags = manages-system joined-with-samba 
> cache_credentials = True
> id_provider = ad
> krb5_store_password_if_offline = True
> default_shell = /bin/bash
> # Change Local
> ldap_id_mapping = False
> # Change Local
> use_fully_qualified_names = False
> fallback_homedir = /home/%u
> access_provider = ad
> # Change Local - For local users
> #override_homedir = /home/%u
> debug_level = 10
> command = strace -ff -o /tmp/sssd_be_strace /usr/libexec/sssd/sssd_be
> --debug-level=10 --domain ukion.iongeo.com
> 
I think you're missing the options "--uid 0 --gid 0".

The best approach is to copy and paste the parameters from the output of "pgrep -af sssd_be" when the "command" option is not used in sssd.conf.

Comment 31 Colin.Simpson 2016-12-14 12:17:17 UTC
This is still present on F25.

I did email you directly about this on the 14th of November, but you must have missed it.

The issue is (of course) that this only happens at boot time. So I changed
the line in /etc/systemd/system/multi-user.target.wants/sssd.service to:

ExecStart=/usr/bin/strace -ff -tt -s256 -o /tmp/sssd_strace /usr/sbin/sssd -D -f


and

Type=simple

And it does come up and strace this, but argh, it comes up online when you do this!

If I switch it back to the original way, it again comes up offline...

Any idea how to trap this?

Comment 32 Pavel Březina 2016-12-14 12:56:22 UTC
11:19:18.011419 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 16
11:19:18.011442 stat("/var/lib/sss/pipes/private/sbus-dp_$domain", 0x7fff2be58770) = -1 ENOENT (No such file or directory)
11:19:18.011475 bind(16, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/private/sbus-dp_$domain"}, 56) = 0
11:19:28.595978 listen(16, 128)         = 0

This is the thing that is causing the initial delay so we should start from here. Why does it take ten seconds to bind? Does anyone have any idea?

I thought it may be because dbus is not yet started, but at this point we have only a simple unix socket so this won't be the case.

Comment 33 Jakub Hrozek 2016-12-14 21:13:36 UTC
Since this is a UNIX local socket, the bind actually creates the mapping between the filesystem name and the socket fd. But 10 seconds really seems excessive...I wonder if journald showed some errors on the system in general from around that time?

By the way, the internal "tech-list" mailing list might be a place to ask why a bind to AF_UNIX can take 10 seconds.

Comment 34 Lukas Slebodnik 2017-01-02 08:06:05 UTC
Colin,
is something suspicious in journald between 11:19:18 .. 11:19:28 ?

Comment 35 Jakub Hrozek 2017-01-02 08:11:33 UTC
(In reply to Lukas Slebodnik from comment #34)
> Colin,
> is something suspicious in journald between 11:19:18 .. 11:19:28 ?

btw I can forward you Colin's sssd logs he sent me privately. But I was never able to figure out anything suspicious from them..

Comment 36 Lukas Slebodnik 2017-01-02 09:50:10 UTC
(In reply to Jakub Hrozek from comment #35)
> (In reply to Lukas Slebodnik from comment #34)
> > Colin,
> > is something suspicious in journald between 11:19:18 .. 11:19:28 ?
> 
> btw I can forward you Colin's sssd logs he sent me privately. But I was
> never able to figure out anything suspicious from them..

I doubt I will find anything there; I trust you.

BTW, there is something suspicious on the system, because one time there is a six-second delay to dlopen a file:
(Thu Sep 29 14:14:45 2016) [sssd[be[ukion.iongeo.com]]] [dp_module_open_lib] (0x1000): Loading module [ad] with path [/usr/lib64/sssd/libsss_ad.so]
(Thu Sep 29 14:14:51 2016) [sssd[be[ukion.iongeo.com]]] [dp_module_run_constructor] (0x0400): Executing module [ad] constructor.

and other time there is a long delay to bind unix socket
11:19:18.011475 bind(16, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/private/sbus-dp_$domain"}, 56) = 0
11:19:28.595978 listen(16, 128)         = 0

Which are two different functions in sssd.

Colin,
do you have NFS or some other network file system on the machine?

If you boot the machine multiple times, is the 10-second delay in the same place every time?

Comment 37 Colin.Simpson 2017-01-16 13:27:24 UTC
This is with RH developers Jakub Hrozek and Lukas Slebodnik, who have some new log files.

Comment 38 Fedora End Of Life 2017-07-25 23:15:17 UTC
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 went end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 39 Colin.Simpson 2017-07-26 09:45:47 UTC
Not sure if RH developers saw anything from my F24 logs. 

But this still occurs for me on F26.

Comment 40 Fabiano Fidêncio 2018-04-11 15:38:33 UTC
Is this still happening with current Fedora?

Comment 41 Colin.Simpson 2018-04-12 10:13:45 UTC
Still happens on Fedora 27 on my test system. I will probably flatten this machine when F28 lands to see if some hangover from the many update cycles this machine has been through is causing this.

Comment 42 Colin.Simpson 2018-09-03 09:37:28 UTC
After a clean install of Fedora 28 this no longer occurs for me.

Comment 43 Jakub Hrozek 2018-10-08 07:01:22 UTC
I don't think this bug should be closed; we have similar (same?) reports from RHEL.

Comment 44 Lukas Slebodnik 2018-10-13 20:16:09 UTC
(In reply to Jakub Hrozek from comment #43)
> I don't think this bug should be closed, we have similar (same?) reports
> from RHEL.

Colin cannot reproduce it anymore.
If it is still reproducible, it will be better to track it upstream.

Comment 45 Alexander E. Patrakov 2019-01-07 14:42:36 UTC
(Ubuntu user here; it's important to note that they don't use nss_resolve, they just make /etc/resolv.conf a symlink to ../run/systemd/resolve/stub-resolv.conf.)

If anyone still sees this bug, here are the suggested additional steps:

1. Verify that, in your setup, /etc/resolv.conf is a symlink to somewhere in /run or /var/run.
2. If so, when sssd is in offline mode and you think it should be online, run this: tcpdump -pni lo port 53

For me, it shows some packets to 127.0.0.1:53 at the points in time when sssd tries to go online. In other words, sssd sends packets to 127.0.0.1:53 even though resolv.conf points to some other DNS server. What seems important is that the file the /etc/resolv.conf symlink points to does not exist when sssd starts up. So perhaps the resolv.conf reloading code needs to be audited for this specific case.

