Bug 1466934 - 5 seconds is *not* plenty of time to startup services on slow hardware
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.3
Hardware: All Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assigned To: SSSD Maintainers
QA Contact: sssd-qe
Reported: 2017-06-30 15:40 EDT by Orion Poplawski
Modified: 2017-09-15 03:24 EDT
CC List: 10 users

Last Closed: 2017-09-15 03:24:56 EDT
Type: Bug

Attachments (Terms of Use)
sssd.log (3.14 KB, text/plain), 2017-07-24 10:11 EDT, hdunkel
sssd_nss.log (2.48 KB, text/plain), 2017-07-24 10:12 EDT, hdunkel
sssd_domain.log (8.83 KB, text/plain), 2017-07-25 02:33 EDT, hdunkel
sssd logs (45.77 KB, application/x-gzip), 2017-08-24 18:38 EDT, Orion Poplawski

Description Orion Poplawski 2017-06-30 15:40:58 EDT
Description of problem:

I'm seeing this fairly regularly on some older hardware (Xeon X5450) and many VMs (startup can be slow when lots of VMs are booting together).

(Thu Jun 29 04:01:46 2017) [sssd] [start_service] (0x0100): Queueing service nwra.com for startup
(Thu Jun 29 04:01:51 2017) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services startup!

and on at least one occasion sssd failed to startup:

Jun 29 04:01:45 aspen sssd: Starting up
Jun 29 04:01:49 aspen sssd[be[nwra.com]]: Starting up
Jun 29 04:01:54 aspen sssd[ssh]: Starting up
Jun 29 04:01:56 aspen sssd[sudo]: Starting up
Jun 29 04:01:56 aspen sssd[pam]: Starting up
Jun 29 04:01:56 aspen sssd[autofs]: Starting up
Jun 29 04:01:56 aspen sssd[ssh]: Starting up
Jun 29 04:01:56 aspen sssd[nss]: Starting up
Jun 29 04:01:56 aspen sssd[pac]: Starting up
Jun 29 04:01:56 aspen sssd[pac]: Starting up
Jun 29 04:01:56 aspen sssd[sudo]: Starting up
Jun 29 04:01:56 aspen sssd[autofs]: Starting up
Jun 29 04:01:58 aspen sssd[ssh]: Starting up
Jun 29 04:01:58 aspen sssd[pac]: Starting up
Jun 29 04:01:58 aspen sssd[sudo]: Starting up
Jun 29 04:01:58 aspen sssd[autofs]: Starting up
Jun 29 04:01:59 aspen sssd[nss]: Starting up
Jun 29 04:02:01 aspen sssd[nss]: Starting up
Jun 29 04:02:02 aspen sssd[ssh]: Starting up
Jun 29 04:02:02 aspen sssd: Exiting the SSSD. Could not restart critical service [ssh].
Jun 29 04:02:04 aspen sssd[pam]: Starting up
Jun 29 04:02:08 aspen sssd[be[nwra.com]]: Shutting down
Jun 29 04:02:08 aspen systemd: sssd.service: control process exited, code=exited status=1

# cat /var/log/sssd/sssd_ssh.log
(Thu Jun 29 04:01:54 2017) [sssd[ssh]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Thu Jun 29 04:01:54 2017) [sssd[ssh]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Thu Jun 29 04:01:54 2017) [sssd[ssh]] [ssh_process_init] (0x0010): sss_process_init() failed
(Thu Jun 29 04:01:56 2017) [sssd[ssh]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Thu Jun 29 04:01:56 2017) [sssd[ssh]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Thu Jun 29 04:01:56 2017) [sssd[ssh]] [ssh_process_init] (0x0010): sss_process_init() failed
(Thu Jun 29 04:01:58 2017) [sssd[ssh]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Thu Jun 29 04:01:58 2017) [sssd[ssh]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Thu Jun 29 04:01:58 2017) [sssd[ssh]] [ssh_process_init] (0x0010): sss_process_init() failed
(Thu Jun 29 04:02:02 2017) [sssd[ssh]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Thu Jun 29 04:02:02 2017) [sssd[ssh]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Thu Jun 29 04:02:02 2017) [sssd[ssh]] [ssh_process_init] (0x0010): sss_process_init() failed

Version-Release number of selected component (if applicable):
sssd-1.14.0-43.el7_3.14.x86_64
Comment 2 Pavel Březina 2017-07-03 04:48:59 EDT
Unfortunately this timeout is currently hardcoded. But I wonder why we fail to start the ssh responder. Can you try it with a higher debug level (like 0x3ff0) and send us the monitor and ssh responder logs, please?
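As a sketch, the requested change to sssd.conf might look like this (the section names follow the usual sssd.conf layout; whether 0x3ff0 is the right mask for your case is up to the suggestion above):

```ini
# Sketch only: raise verbosity for the monitor and the ssh responder,
# per the suggestion above.
[sssd]
debug_level = 0x3ff0

[ssh]
debug_level = 0x3ff0
```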
Comment 3 hdunkel 2017-07-13 06:54:12 EDT
I wonder what this timeout is actually good for. If a service cannot be started (i.e., fork() and exec() fail), then it should try again or ignore it. Surely it's a bad idea to stop all of sssd just because there is a hiccup with ssh support, for example.

Please note that you might also get this problem on a SIGHUP (e.g. for logrotate), because sssd seems to stop and start all services again instead of just rereading the config files and creating new logfiles.
Comment 4 Jakub Hrozek 2017-07-13 14:52:20 EDT
Well, I'm perplexed about the behaviour for two reasons:
1) I thought we implemented a sort of an exponential backoff with respect to the timeout. IIRC we tried waiting for the timeout, then 2*timeout and finally 3*timeout before giving up. But maybe I'm confusing this with the timeout between the main sssd process and the back ends, not between the services and the back ends

2) it really is bad that sssd shut down. At the very least now when NSS is a systemd service, even if we gave up on that particular startup of NSS, the next request should restart NSS (or rather, systemd should start the NSS service). But also here I'm not sure if this holds even if NSS is explicitly configured in the config file.

(tl;dr this needs testing and code inspection)
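The escalating-timeout idea from point 1 can be sketched as follows. This is only an illustration of the proposal (wait the base timeout, then 2x, then 3x before giving up), not the actual sssd monitor code; the 5-second base is taken from the hardcoded value discussed in this bug.

```shell
# Sketch only: escalate the wait before giving up, instead of a single
# fixed 5-second timeout. Not the actual sssd monitor code.
base=5
for attempt in 1 2 3; do
  # each retry waits proportionally longer: 5s, 10s, 15s
  echo "attempt $attempt: waiting up to $((base * attempt)) seconds"
done
echo "giving up after 3 attempts"
```

In the real monitor this would wrap the existing services_startup_timeout handler rather than a shell loop.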
Comment 5 hdunkel 2017-07-14 07:40:04 EDT
Please don't create systemd-only code. That was the major concern against systemd.
Comment 6 Lukas Slebodnik 2017-07-17 05:28:15 EDT
(In reply to hdunkel from comment #5)
> Please don't create systemd-only code. That was the major concern against
> systemd.

We don't. Jakub should have written:
2) it really is bad that sssd shut down. At the very least now when NSS *CAN BE* a systemd service
Comment 7 hdunkel 2017-07-17 09:56:16 EDT
Now what is this supposed to mean? Is it less "really bad" if it's not a systemd service?
Comment 8 hdunkel 2017-07-24 07:27:48 EDT
sssd died again. This is a *huge* problem for an MDA. No users --> emails are bounced to the sender.

Please provide a fix asap.
Thanx
Comment 9 Lukas Slebodnik 2017-07-24 08:26:11 EDT
(In reply to hdunkel from comment #8)
> sssd died again. This is a *huge* problem for a MDA. No users --> EMails are
> bounced to the sender.
> 
> Please provide a fix asap.
> Thanx

I really appreciate a tone without providing any relevant info.
@see comment 2
https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html 

If you would like a fix ASAP, then I would recommend contacting Red Hat support and escalating it there. BTW, there is also an option to speed this up by providing a patch upstream if you do not have a RHEL subscription.
https://docs.pagure.org/SSSD.sssd/developers/contribute.html
Comment 10 hdunkel 2017-07-24 10:11 EDT
Created attachment 1303691 [details]
sssd.log

My apologies, I was really pissed. Attached you can find sssd.log. The debug level was set to 5. I will attach the nss log files as well.
Comment 11 hdunkel 2017-07-24 10:12 EDT
Created attachment 1303692 [details]
sssd_nss.log
Comment 12 Lukas Slebodnik 2017-07-24 11:37:13 EDT
(Mon Jul 24 13:09:32 2017) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Jul 24 13:09:32 2017) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Jul 24 13:09:32 2017) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed

Hmm, it is a little bit weird.

What kind of filesystem do you have for /var/lib/sss?

Could you increase debug_level to 9 in the "sssd" and "nss" sections of sssd.conf? The log files are quite minimal.
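For reference, the requested change could be scripted like this. The edit is shown against a throwaway sample file rather than the real /etc/sssd/sssd.conf, and the section names follow the default sssd.conf layout; after editing the real config, sssd would need a restart (e.g. `systemctl restart sssd`) to pick it up.

```shell
# Sketch only: set debug_level = 9 in the [sssd] and [nss] sections.
# A sample config stands in for /etc/sssd/sssd.conf here.
cat > /tmp/sssd.conf.sample <<'EOF'
[sssd]
services = nss, pam
domains = example.com

[nss]
EOF
# GNU sed: append a line after each matching section header
sed -i -e '/^\[sssd\]$/a debug_level = 9' \
       -e '/^\[nss\]$/a debug_level = 9' /tmp/sssd.conf.sample
grep -c 'debug_level = 9' /tmp/sssd.conf.sample   # prints 2
```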
Comment 13 Jakub Hrozek 2017-07-24 12:40:58 EDT
I think it would also be good to provide verbose logs from the domain section but also the sssd.log itself, because it would be interesting to see what the other processes were doing, especially if there was any slowdown in the sssd_be process.
Comment 14 hdunkel 2017-07-25 02:32:44 EDT
The sssd.log is already in; see the second-to-last attachment. I will attach the daemon log file.

/var/lib/sss is ext4 on a drbd partition (master-backup, "policy C", i.e. write-through). The host is an LXC container.

srvvm01:~# df /var/lib/sss/
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/drbd1     3714878556 1741319068 1784831400  50% /
srvvm01:~# mount | grep drbd1
/dev/drbd1 on / type ext4 (rw,noatime,data=ordered)

I am not sure about debug level 9. I have to make sure that no sensitive information appears in the log files before posting them here. Do you think debug_level 6 would be sufficient?
Comment 15 hdunkel 2017-07-25 02:33 EDT
Created attachment 1304034 [details]
sssd_domain.log
Comment 17 Fabiano Fidêncio 2017-08-15 14:43:39 EDT
I really fail to see what may be happening here with the debug level provided.

As suggested by Lukas in comment #12, having logs with the debug level set to 9 would be really appreciated here.

I understand that filtering out sensitive information is not that fun, but it would help us a lot to better understand the problem.

I'm setting the needinfo flag for hdunkel@aixigo.de.
Comment 18 Jakub Hrozek 2017-08-23 15:28:45 EDT
Orion,
could you please attach log files from one of the affected systems?

Harald,
if sanitizing the log files is too much work or error-prone, perhaps you can send the log files directly to us?
Comment 19 Fabiano Fidêncio 2017-08-24 11:37:08 EDT
Orion, Harald,

It would be nice to see the output of `journalctl -xe` as well when the failure happens.
Comment 20 Orion Poplawski 2017-08-24 18:38 EDT
Created attachment 1317929 [details]
sssd logs

Here are some sssd logs with debug = 9 for sssd, be, nss, and pam.  It hit the service timeouts, but managed to start up okay.

Of note:
Aug 24 16:11:00 aspen sssd: Starting up
Aug 24 16:11:00 aspen sssd[be[nwra.com]]: Starting up

The nwra.com log has as its first two entries:

(Thu Aug 24 16:11:00 2017) [sssd[be[nwra.com]]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Thu Aug 24 16:11:09 2017) [sssd[be[nwra.com]]] [dp_get_options] (0x0400): Option lookup_family_order has value ipv4_first

so it took 9 seconds to progress to the second line.

(Thu Aug 24 16:11:00 2017) [sssd] [start_service] (0x0100): Queueing service nwra.com for startup
(Thu Aug 24 16:11:05 2017) [sssd] [services_startup_timeout] (0x0400): Handling timeout
(Thu Aug 24 16:11:05 2017) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services startup!
(Thu Aug 24 16:11:05 2017) [sssd] [services_startup_timeout] (0x0100): Now starting services!
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service nss for startup
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service sudo for startup
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service pam for startup
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service ssh for startup
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service autofs for startup
(Thu Aug 24 16:11:05 2017) [sssd] [start_service] (0x0100): Queueing service pac for startup

Looks like this is when the other services are first started.  They keep failing until presumably the backend is up.  Perhaps this is expected behavior.  I'll try to see if I can get sssd to fail completely to startup again.
Comment 21 Fabiano Fidêncio 2017-08-25 02:09:16 EDT
(In reply to Orion Poplawski from comment #20)
[snip]
> 
> Looks like this is when the other services are first started.  They keep
> failing until presumably the backend is up.  Perhaps this is expected
> behavior.  I'll try to see if I can get sssd to fail completely to startup
> again.

Orion,

It's indeed the expected behaviour for cases where the data providers take more than 5 seconds to start. So, hitting this code is expected in those situations.

I do have a patch for making this timeout configurable, but I'm really not sure whether it's a good thing to have or not.
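Purely for illustration, such an option might look like the following in sssd.conf. The option name here is hypothetical; it is not an actual sssd option as of the 1.14 release discussed in this bug.

```ini
[sssd]
# Hypothetical option: how long the monitor waits for the data
# providers before forcing the responders to start (seconds).
service_startup_timeout = 30
```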
Comment 22 Jakub Hrozek 2017-08-27 15:53:32 EDT
(In reply to Fabiano Fidêncio from comment #21)
> (In reply to Orion Poplawski from comment #20)
> [snip]
> > 
> > Looks like this is when the other services are first started.  They keep
> > failing until presumably the backend is up.  Perhaps this is expected
> > behavior.  I'll try to see if I can get sssd to fail completely to startup
> > again.
> 
> Orion,
> 
> It's indeed the expected behaviour for cases where the data providers take
> more than 5 seconds to start. So, hitting this code is expected in those
> situations.
> 
> I do have a patch for making this timeout configurable, but I'm really not
> sure whether it's a good thing to have or not.

It is an option and if it fixes the bug, then we should explore it.
At the very least, we might want to give the build to Orion to confirm that increasing the timeout fixes the issue.

I would still prefer if the timeout was autoconfigurable and we tried X, then 2X, then 4X before giving up, but at the same time, I haven't dug into the code myself, so I'm not even sure if this makes sense.
Comment 23 Fabiano Fidêncio 2017-08-28 01:36:11 EDT
(In reply to Jakub Hrozek from comment #22)
> (In reply to Fabiano Fidêncio from comment #21)
> > (In reply to Orion Poplawski from comment #20)
> > [snip]
> > > 
> > > Looks like this is when the other services are first started.  They keep
> > > failing until presumably the backend is up.  Perhaps this is expected
> > > behavior.  I'll try to see if I can get sssd to fail completely to startup
> > > again.
> > 
> > Orion,
> > 
> > It's indeed the expected behaviour for cases where the data providers take
> > more than 5 seconds to start. So, hitting this code is expected in those
> > situations.
> > 
> > I do have a patch for making this timeout configurable, but I'm really not
> > sure whether it's a good thing to have or not.
> 
> It is an option and if it fixes the bug, then we should explore it.
> At the very least, we might want to give the build to Orion to confirm that
> increasing the timeout fixes the issue.
> 
> I would still prefer if the timeout was autoconfigurable and we tried X,
> then 2X, then 4X before giving up, but at the same time, I haven't dug into
> the code myself, so I'm not even sure if this makes sense..

What worries me a little bit about patching this issue in that way is:
- We may actually be hiding a real issue in the responders' code by just increasing this timeout in the monitor code;
- In case we don't handle this configurable option in a really nice way, we may end up hitting systemd's default timeout for starting the service (which is 90 seconds).

Personally, I'd prefer not to change anything in the code until we have logs of this very specific situation (which I tried to reproduce with a really slow VM, but I was only able to when using such a small amount of memory that it caused some OOM crashes in NetworkManager).
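For context on the systemd interaction mentioned above: systemd's own start timeout (TimeoutStartSec, 90 seconds by default) could itself be raised with a drop-in like the sketch below, though this only papers over the underlying issue rather than fixing it.

```ini
# Sketch only: /etc/systemd/system/sssd.service.d/startup-timeout.conf
# Apply with: systemctl daemon-reload && systemctl restart sssd
[Service]
TimeoutStartSec=5min
```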
Comment 24 hdunkel 2017-08-28 03:26:50 EDT
Please consider that it is not uncommon for a job to take more than 20 seconds just to be swapped in. IMHO 5 seconds is not sufficient, especially if you rely upon some IPC to a remote LDAP service.

I don't like the timeout approach. It is not an error if a service takes 30 minutes to start (indirectly waiting for a database check, for example). Such a long delay is surely unfortunate, but the infrastructure should be able to handle this.
Comment 25 Fabiano Fidêncio 2017-08-28 03:39:30 EDT
(In reply to hdunkel from comment #24)
> Please consider, it is not uncommon that a job takes more than 20 seconds
> just to be swapped in. IMHO 5 secs is not sufficient, esp. if you rely upon
> some ipc to a remote ldap service.

Yes, keep in mind that I'm considering this. On the other hand, blindly increasing the timeout may not be the best solution (thus I've asked you for some logs that will help us find the best solution for this case).

> 
> I don't like the timeout approach. It is not an error if a service takes 30
> minutes to start (indirectly waiting for a database check, for example).
> Such a long delay is surely unfortunate, but the infrastructure should be
> able to handle this.

Indeed. But, again, having logs of this situation when it happens is crucial for us to provide the most suitable fix. And just letting the user blindly increase the timeout may even introduce more issues/bug reports.

So, please, help us to help you and provide the logs asked for in comments #12, #17, #18, and #19.

As Jakub already mentioned, feel free to send the logs directly to me in case you feel removing sensitive data would be too painful for you.
Comment 26 Fabiano Fidêncio 2017-09-15 03:24:56 EDT
Due to the lack of information, this bug has been closed as INSUFFICIENT_DATA.

Please feel free to re-open this bug whenever it happens again and you're able to provide the logs asked for in comments #12, #17, #18, and #19.
