Bug 1364033

Summary: sssd exits if clock is adjusted backwards after boot

Product: Red Hat Enterprise Linux 7
Component: sssd
Version: 7.3
Hardware: All
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Patrik Kis <pkis>
Assignee: Pavel Březina <pbrezina>
QA Contact: Steeve Goveas <sgoveas>
CC: grajaiya, jbastian, jhrozek, lslebodn, mkosek, mzidek, pbrezina, pkis, sgadekar
Target Milestone: rc
Fixed In Version: sssd-1.14.0-30.el7
Type: Bug
Last Closed: 2016-11-04 07:20:13 UTC

Description Patrik Kis 2016-08-04 10:39:13 UTC
Description of problem:
The sssd daemon does not start on the aarch64 architecture after reboot, even though it is enabled. Other architectures are not affected.
Switching SELinux to permissive mode does not help. There are no AVC denials, by the way.
It started with the rebased version; sssd-1.13.0-40.el7 works fine.

Version-Release number of selected component (if applicable):
sssd-1.14.0-14.el7.aarch64

How reproducible:
always

Steps to Reproduce:
1. Configure sssd with realmd. Either IPA or AD; it does not matter for this test.

# realm -v join --user=admin <ipa.domain>

... snip ...

Client configuration complete.

 * /usr/bin/systemctl enable sssd.service
 * /usr/bin/systemctl restart sssd.service
 * /usr/bin/sh -c /usr/sbin/authconfig --update --enablesssd --enablesssdauth --enablemkhomedir --nostart && /usr/bin/systemctl enable oddjobd.service && /usr/bin/systemctl start oddjobd.service
 * Successfully enrolled machine in realm

#
# ll /etc/systemd/system/multi-user.target.wants/sssd.service 
lrwxrwxrwx. 1 root root 36 Aug  4 05:57 /etc/systemd/system/multi-user.target.wants/sssd.service -> /usr/lib/systemd/system/sssd.service

2. Check that sssd is running and reboot

3. After reboot check sssd status

# uptime
 06:20:25 up 7 min,  1 user,  load average: 0.00, 0.05, 0.05
# service sssd status
Redirecting to /bin/systemctl status  sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf
   Active: failed (Result: exit-code) since Thu 2016-08-04 06:13:59 EDT; 6min ago
  Process: 756 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 779 (code=exited, status=1/FAILURE)

# ausearch -m avc  -ts recent
<no matches>
#


4. If sssd is started manually, it just starts and works as expected

# service sssd start
Redirecting to /bin/systemctl start  sssd.service
# service sssd status
Redirecting to /bin/systemctl status  sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf
   Active: active (running) since Thu 2016-08-04 06:22:23 EDT; 5s ago
  Process: 2554 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 2556 (sssd)
   CGroup: /system.slice/sssd.service
           ├─2556 /usr/sbin/sssd -D -f
           ├─2557 /usr/libexec/sssd/sssd_be --domain ipa.baseos.qe --uid 0 --gid 0 --debug-to-files
           ├─2558 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
           ├─2559 /usr/libexec/sssd/sssd_sudo --uid 0 --gid 0 --debug-to-files
           ├─2560 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
           ├─2561 /usr/libexec/sssd/sssd_ssh --uid 0 --gid 0 --debug-to-files
           ├─2562 /usr/libexec/sssd/sssd_pac --uid 0 --gid 0 --debug-to-files
           └─2565 /usr/bin/nsupdate -g -d -D

Comment 2 Jakub Hrozek 2016-08-04 11:31:26 UTC
Looking at the logs, the IPA server is unreachable when sssd_be starts. Does the bug also happen if you start SSSD while the IPA server is up?

(I'm thinking this might be upstream #3111)

Comment 3 Patrik Kis 2016-08-04 12:10:29 UTC
The IPA server is up all the time; it's completely independent of the client. And I tested this a few times already, also with AD, and it was always the same.
So I don't think it's #3111, unless for some reason SSSD cannot communicate with IPA while the machine is booting and then ends up behaving as in #3111.

Note, it happens only on aarch64, which is quite a new platform.

Comment 5 Jakub Hrozek 2016-08-04 14:27:01 UTC
I still suspect #3111 because the aarch machine is presumably quite a bit slower than the other architectures. Because we know how to fix that ticket, I would prefer to try with test packages for #3111 when they are available (hopefully really soon now).

Comment 6 Patrik Kis 2016-08-05 07:06:56 UTC
Ok, I'll be happy to test it with a test package.

Comment 7 Lukas Slebodnik 2016-08-05 09:59:24 UTC
Pavel,
you wrote a patch.
Could you prepare a scratch build with your patch?

Comment 12 Pavel Březina 2016-08-09 09:53:17 UTC
This is different from #3111. The backend starts normally, but the network is unreachable during boot, so the backend goes offline (probably a wrong ordering in systemd?). However, both sssd and sssd_be are killed by the watchdog, which is interesting since, according to the logs, at least sssd_be is working. There is a long delay in sssd.log, but that may just be because we didn't write anything there.

Comment 13 Jakub Hrozek 2016-08-09 09:59:44 UTC
OK, then I guess we need to clone and investigate this as a new bug. Can you take a look at Patrik's machine?

Comment 14 Pavel Březina 2016-08-09 10:35:31 UTC
That may be problematic, since this occurs only on boot when the network is unreachable. I can't debug it during boot time, and I can't disable the network since I'd lose access to the machine. I will provide a test package to see when the watchdog ticks are happening.

Comment 18 Pavel Březina 2016-08-11 11:28:17 UTC
The watchdog tevent timer is not fired for some reason. It is created but never triggered, in all sssd processes. However, there is no reason for us to hang in the main loop.

(In reply to Jakub Hrozek from comment #5)
> I still suspect #3111 because the aarch machine is presumably quite a bit
> slower than the other architectures. Because we know how to fix that ticket,
> I would prefer to try with test packages for #3111 when they are available
> (hopefully really soon now).

Could it be that, because this architecture is "presumably slower", we can't rely on those timers? But a span of 30+ seconds is quite a long time for that to be the explanation...

Do we have any known tevent bug?

Comment 19 Jakub Hrozek 2016-08-11 11:37:28 UTC
(In reply to Pavel Březina from comment #18)
> Watchdog tevent timer is not fired for some reason. It is created but never
> triggered in all sssd processes. However there is no reason for us to hang
> in the main loop.
> 
> (In reply to Jakub Hrozek from comment #5)
> > I still suspect #3111 because the aarch machine is presumably quite a bit
> > slower than the other architectures. Because we know how to fix that ticket,
> > I would prefer to try with test packages for #3111 when they are available
> > (hopefully really soon now).
> 
> May it be because this architecture is "presumably slower" that we can't
> rely on those timers? But range of 30+ seconds is quite a long time to
> suspect this...
> 
> Do we have any tevent known bug?

No, we don't. Does this bug only happen the first time sssd starts during boot and never again?

Comment 20 Patrik Kis 2016-08-12 07:22:36 UTC
(In reply to Jakub Hrozek from comment #19)
> 
> No, we don't. Does this bug only happen the first time sssd starts during
> boot and never again?

Correct, the bug happens only during the first start at boot; any subsequent manual start with systemctl succeeds.

Comment 21 Pavel Březina 2016-08-12 10:47:06 UTC
I can confirm that. I can't reproduce it locally. How are we going to debug it during boot time?

I checked the tevent timer code and it's completely independent of any system call.

Comment 22 Jakub Hrozek 2016-08-12 11:10:52 UTC
The only idea I have is to use command=strace ... or similar to see what exactly is happening there.

Alternatively, we can wait until Fabiano finishes https://fedorahosted.org/sssd/ticket/3051 and then use diag_cmd to see why sssd is stuck.

Comment 24 Jeff Bastian 2016-08-17 21:49:58 UTC
I tried starting sssd through strace like this:

[root@bumblebee ~]# cat /etc/systemd/system/sssd.service.d/strace.conf
[Service]
ExecStart=
ExecStart=/usr/bin/strace -f -o /var/tmp/sssd-strace /usr/sbin/sssd -D -f


But it was too slow running under strace and it failed with:

[root@bumblebee ~]# systemctl restart sssd.service
Job for sssd.service failed because a timeout was exceeded. See "systemctl status sssd.service" and "journalctl -xe" for details.
[root@bumblebee ~]# systemctl status sssd.service -l
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf, strace.conf
   Active: failed (Result: timeout) since Wed 2016-08-17 16:45:31 CDT; 3min 43s ago
  Process: 9766 ExecStart=/usr/bin/strace -f -o /var/tmp/sssd-strace /usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 9690 (code=exited, status=0/SUCCESS)

Aug 17 16:44:01 bumblebee.localdomain sssd[nss][9778]: Starting up
Aug 17 16:45:31 bumblebee.localdomain systemd[1]: sssd.service start operation timed out. Terminating.
Aug 17 16:45:31 bumblebee.localdomain sssd[nss][9778]: Shutting down
Aug 17 16:45:31 bumblebee.localdomain sssd[autofs][9780]: Shutting down
Aug 17 16:45:31 bumblebee.localdomain sssd[pam][9779]: Shutting down
Aug 17 16:45:31 bumblebee.localdomain sssd[be[default]][9776]: Shutting down
Aug 17 16:45:31 bumblebee.localdomain sssd[be[REDHAT]][9777]: Shutting down
Aug 17 16:45:31 bumblebee.localdomain systemd[1]: Failed to start System Security Services Daemon.
Aug 17 16:45:31 bumblebee.localdomain systemd[1]: Unit sssd.service entered failed state.
Aug 17 16:45:31 bumblebee.localdomain systemd[1]: sssd.service failed.

Comment 25 Jakub Hrozek 2016-08-18 08:24:17 UTC
If you can reproduce the bug as well, can you please do the following instead?

1) start sssd manually and find out exactly how sssd_be is running
ps -ef | grep sssd_be
it would be something like:
/usr/libexec/sssd/sssd_be --domain redhat.com --uid 0 --gid 0 --debug-to-files
except the domain would be different

2) open the sssd.conf and add the following to the [domain] section:
command = strace -o /tmp/sssd_be_strace -ff /usr/libexec/sssd/sssd_be --domain redhat.com --uid 0 --gid 0 --debug-to-files

3) set SELinux to permissive (otherwise sssd won't be permitted to execute strace) and try restarting the sssd service to make sure the config is correct

4) reboot, then attach /tmp/sssd_be_strace_* here

Thank you!

Comment 26 Pavel Březina 2016-08-18 13:20:49 UTC
Lukas just suggested an interesting idea: is there a possibility that system time gets changed during the boot?

Comment 27 Jeff Bastian 2016-08-18 14:23:48 UTC
I'll try the steps in comment 25 in a few minutes.  First, I noticed that sssd does actually run for a few seconds on boot, and if you log in very quickly after reboot, you can catch it before it dies.

$ ssh bumblebee.example.com
Last login: Thu Aug 18 08:52:54 2016 from foo.example.com
[jbastian@bumblebee ~]$ ps ax | grep sss
 1173 ?        Ss     0:01 /usr/sbin/sssd -D -f
 1174 ?        S      0:00 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --debug-to-files
 1175 ?        S      0:00 /usr/libexec/sssd/sssd_be --domain FOO --uid 0 --gid 0 --debug-to-files
 1176 ?        S      0:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
 1177 ?        S      0:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
 1178 ?        S      0:00 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --debug-to-files
 3073 pts/0    S+     0:00 grep --color=auto sss
[jbastian@bumblebee ~]$ uptime
 09:19:45 up 1 min,  1 user,  load average: 1.61, 0.55, 0.20
[jbastian@bumblebee ~]$ ps ax | grep sss
 3098 pts/0    S+     0:00 grep --color=auto sss

Comment 28 Jeff Bastian 2016-08-18 14:27:44 UTC
(In reply to Pavel Březina from comment #26)
> Lukas just suggested an interesting idea: is there a possibility that system
> time gets changed during the boot?

It should not be changing (much) during boot.  The systems all have an RTC, so the clock is reasonably close during boot, and chronyd is enabled, so any adjustments should be very minor, maybe a few minutes if the RTC has drifted.

From the reboot in comment 27:

[jbastian@bumblebee ~]$ journalctl -b | egrep -i 'rtc|chrony'
Aug 18 09:19:40 bumblebee.localdomain kernel: rtc-efi rtc-efi: rtc core: registered rtc-efi as rtc0
Aug 18 09:19:40 bumblebee.localdomain kernel: rtc-efi rtc-efi: setting system clock to 2016-08-18 14:19:40 UTC (1471529980)
Aug 18 09:20:15 bumblebee.localdomain chronyd[1116]: chronyd version 2.1.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +DEBUG +ASYNCDNS +IPV6 +SECHASH)
Aug 18 09:20:15 bumblebee.localdomain chronyd[1116]: Frequency 26.946 +/- 0.075 ppm read from /var/lib/chrony/drift
Aug 18 09:20:29 bumblebee.localdomain chronyd[1116]: Selected source 10.5.26.10
Aug 18 09:20:29 bumblebee.localdomain chronyd[1116]: System clock wrong by -78.185003 seconds, adjustment started
Aug 18 09:19:11 bumblebee.localdomain chronyd[1116]: System clock was stepped by -78.185003 seconds

Is a difference of 1min 18sec enough to cause this problem?

Comment 29 Lukas Slebodnik 2016-08-18 14:55:34 UTC
(In reply to Jeff Bastian from comment #28)
> (In reply to Pavel Březina from comment #26)
> > Lukas just suggested an interesting idea: is there a possibility that system
> > time gets changed during the boot?
> 
> It should not be changing (much) during boot.  The systems all have an RTC
> so the clock is reasonable close during boot, and chronyd is enabled so any
> adjustments should be very minor, maybe a few minutes if the RTC has drifted.
> 
> From the reboot in comment 27:
> 
> [jbastian@bumblebee ~]$ journalctl -b | egrep -i 'rtc|chrony'
> Aug 18 09:19:40 bumblebee.localdomain kernel: rtc-efi rtc-efi: rtc core:
> registered rtc-efi as rtc0
> Aug 18 09:19:40 bumblebee.localdomain kernel: rtc-efi rtc-efi: setting
> system clock to 2016-08-18 14:19:40 UTC (1471529980)
> Aug 18 09:20:15 bumblebee.localdomain chronyd[1116]: chronyd version 2.1.1
> starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +DEBUG +ASYNCDNS +IPV6
> +SECHASH)
> Aug 18 09:20:15 bumblebee.localdomain chronyd[1116]: Frequency 26.946 +/-
> 0.075 ppm read from /var/lib/chrony/drift
> Aug 18 09:20:29 bumblebee.localdomain chronyd[1116]: Selected source
> 10.5.26.10
> Aug 18 09:20:29 bumblebee.localdomain chronyd[1116]: System clock wrong by
> -78.185003 seconds, adjustment started
> Aug 18 09:19:11 bumblebee.localdomain chronyd[1116]: System clock was
> stepped by -78.185003 seconds
> 
> Is a difference of 1min 18sec enough to cause this problem?

IMHO, the best approach would be to disable chronyd and try a reboot.
If the issue is gone, then the bug is caused by the time shift.

Comment 30 Jeff Bastian 2016-08-18 16:10:03 UTC
That fixed it!  With chronyd disabled, sssd is still running 11 minutes after boot:

[jbastian@bumblebee ~]$ uptime
 11:10:40 up 11 min,  1 user,  load average: 0.00, 0.17, 0.20
[jbastian@bumblebee ~]$ systemctl status chronyd.service sssd.service
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf, strace.conf
   Active: active (running) since Thu 2016-08-18 11:00:49 CDT; 9min ago
  Process: 2020 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 2057 (sssd)
   CGroup: /system.slice/sssd.service
           ├─2057 /usr/sbin/sssd -D -f
           ├─2061 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 ...
           ├─2062 /usr/libexec/sssd/sssd_be --domain REDHAT --uid 0 --gid 0 -...
           ├─2063 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files...
           ├─2064 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files...
           └─2065 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --debug-to-fi...

Aug 18 11:00:46 bumblebee.localdomain systemd[1]: Starting System Security Se...
Aug 18 11:00:48 bumblebee.localdomain sssd[2057]: Starting up
Aug 18 11:00:49 bumblebee.localdomain sssd[be[default]][2061]: Starting up
Aug 18 11:00:49 bumblebee.localdomain sssd[be[REDHAT]][2062]: Starting up
Aug 18 11:00:49 bumblebee.localdomain sssd[be[REDHAT]][2062]: Your configurat...
Aug 18 11:00:49 bumblebee.localdomain sssd[be[default]][2061]: Your configura...
Aug 18 11:00:49 bumblebee.localdomain sssd[pam][2064]: Starting up
Aug 18 11:00:49 bumblebee.localdomain sssd[nss][2063]: Starting up
Aug 18 11:00:49 bumblebee.localdomain sssd[autofs][2065]: Starting up
Aug 18 11:00:49 bumblebee.localdomain systemd[1]: Started System Security Ser...
Hint: Some lines were ellipsized, use -l to show in full.

Comment 32 Jeff Bastian 2016-08-18 16:39:28 UTC
Next, I synchronized the RTC to the wall time to minimize the amount of chronyd adjustment, and sssd stayed alive.

[root@bumblebee ~]# systemctl enable chronyd.service
[root@bumblebee ~]# systemctl start chronyd.service
[root@bumblebee ~]# hwclock --systohc
[root@bumblebee ~]# reboot
...
[root@bumblebee ~]# uptime
 11:37:59 up 5 min,  1 user,  load average: 0.04, 0.30, 0.19
[root@bumblebee ~]# journalctl -b0 | egrep -i 'rtc|chrony'
Aug 18 11:32:21 bumblebee.localdomain kernel: rtc-efi rtc-efi: rtc core: registered rtc-efi as rtc0
Aug 18 11:32:21 bumblebee.localdomain kernel: rtc-efi rtc-efi: setting system clock to 2016-08-18 16:32:21 UTC (1471537941)
Aug 18 11:32:51 bumblebee.localdomain chronyd[1087]: chronyd version 2.1.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +DEBUG +ASYNCDNS +IPV6 +SECHASH)
Aug 18 11:32:51 bumblebee.localdomain chronyd[1087]: Frequency 26.963 +/- 1.452 ppm read from /var/lib/chrony/drift
Aug 18 11:33:06 bumblebee.localdomain chronyd[1087]: Selected source 10.5.27.10
Aug 18 11:36:23 bumblebee.localdomain chronyd[1087]: Selected source 50.205.244.27
[root@bumblebee ~]# pgrep -laf sss
1148 /usr/sbin/sssd -D -f
1149 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --debug-to-files
1150 /usr/libexec/sssd/sssd_be --domain FOO --uid 0 --gid 0 --debug-to-files
1152 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
1153 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
1154 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --debug-to-files

Comment 33 Jeff Bastian 2016-08-18 17:31:04 UTC
Since the deciding factor seems to be an RTC that's out of sync, and not the aarch64 architecture, I tested on an x86_64 system with RHEL 7.3 Beta by changing the clock and rebooting, and I was able to reproduce the problem:

[root@localhost ~]# uname -r
3.10.0-493.el7.x86_64
[root@localhost ~]# systemctl enable sssd.service
[root@localhost ~]# systemctl stop chronyd.service
[root@localhost ~]# date
Thu Aug 18 13:22:46 EDT 2016
[root@localhost ~]# date 08181400
Thu Aug 18 14:00:00 EDT 2016
[root@localhost ~]# hwclock --systohc
[root@localhost ~]# hwclock
Thu 18 Aug 2016 02:00:10 PM EDT  -0.141171 seconds
[root@localhost ~]# reboot
...
...
[root@localhost ~]# journalctl -b0 | egrep -i 'rtc|chrony'
...
Aug 18 10:01:51 localhost.localdomain kernel: rtc_cmos 00:02: setting system clock to 2016-08-18 14:01:51 UTC (1471528911)
...
Aug 18 14:02:04 localhost.example.com chronyd[670]: System clock wrong by -2228.878225 seconds, adjustment started
Aug 18 13:24:56 localhost.example.com chronyd[670]: System clock was stepped by -2228.878225 seconds
...
[root@localhost ~]# systemctl status sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf
   Active: failed (Result: exit-code) since Thu 2016-08-18 13:25:28 EDT; 3min 34s ago
  Process: 666 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 693 (code=exited, status=1/FAILURE)

Comment 34 Jeff Bastian 2016-08-18 17:51:50 UTC
It seems sssd only dies if the clock is adjusted backwards.

In the comment above, I set the clock into the future, so chronyd had to adjust the clock in a negative direction.  I tested again, but this time set the clock into the past, and sssd stayed running.

[root@localhost ~]# systemctl stop chronyd
[root@localhost ~]# date
Thu Aug 18 13:33:20 EDT 2016
[root@localhost ~]# date 08181200
Thu Aug 18 12:00:00 EDT 2016
[root@localhost ~]# hwclock --systohc
[root@localhost ~]# hwclock
Thu 18 Aug 2016 12:00:08 PM EDT  -0.844269 seconds
[root@localhost ~]# reboot
...
...
[root@localhost ~]# journalctl -b0 | egrep -i 'rtc|chrony'
...
Aug 18 08:03:20 localhost.localdomain kernel: rtc_cmos 00:02: setting system clock to 2016-08-18 12:03:20 UTC (1471521800)
...
Aug 18 12:03:34 localhost.example.com chronyd[666]: System clock wrong by 5603.860154 seconds, adjustment started
Aug 18 13:36:58 localhost.example.com chronyd[666]: System clock was stepped by 5603.860154 seconds
...
[root@localhost ~]# uptime
 13:49:37 up 12 min,  1 user,  load average: 0.00, 0.01, 0.03
[root@localhost ~]# systemctl status sssd.service
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf
   Active: active (running) since Thu 2016-08-18 12:03:25 EDT; 1h 46min ago
  Process: 659 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 684 (sssd)
   CGroup: /system.slice/sssd.service
           ├─684 /usr/sbin/sssd -D -f
           ├─686 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 -...
           ├─715 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
           ├─716 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
           └─717 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --debug-to-fil...

Comment 35 Lukas Slebodnik 2016-08-18 20:13:19 UTC
Jeff,
thank you very much for testing and confirming our suspicion.

Comment 36 Jakub Hrozek 2016-08-19 08:10:42 UTC
Yes, thanks for testing.

Now I wonder how to fix this bug. Would it make sense to start sssd after chrony?  I would say it might be a good idea in general to make sure that the time is in sync for providers that rely on Kerberos.

Comment 37 Lukas Slebodnik 2016-08-19 08:30:44 UTC
It's not just about chronyd; there are also ntpd.service and systemd-timesyncd.service. Moreover, it will cause issues for local users (but that's slightly unrelated to this ticket).

Comment 38 Jakub Hrozek 2016-08-19 08:34:56 UTC
What issues for local users?

About chrony vs. ntpd, could we just use both? If one of them is not installed, I think the After= would be ignored.

I'm not sure how else to fix this; the issue seems to be that tevent timers use struct timeval with absolute time.
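
For illustration, the ordering idea could be a drop-in like the sketch below (the file name is hypothetical, this is not a shipped unit); as noted above, systemd simply ignores an After= entry for a unit that is not installed:

# cat /etc/systemd/system/sssd.service.d/order.conf
[Unit]
After=chronyd.service ntpd.service systemd-timesyncd.service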

Comment 39 Lukas Slebodnik 2016-08-19 09:36:19 UTC
(In reply to Jakub Hrozek from comment #38)
> What issues for local users?
> 
That's a pure upstream discussion, so let's avoid it here.

> About chrony vs ntpd, could we just use both? If one of them is not
> installed, I think the After would be ignored..
> 
> I'm not sure how else to fix this, the issue seems to be that tevent timers
> use struct timeval with absolute time..
Changing the order of starting services might help at boot time.
However, the time can also be changed later: a manually changed date, or a time synchronization daemon that failed to synchronize the date for various reasons (it was disabled at boot time, bugs, ...). How do we want to handle such use cases?

Comment 40 Jakub Hrozek 2016-08-19 09:40:37 UTC
OK, do we already know why the time jumps backwards on the test systems? Is it just a property of the test systems?

Unless sssd can get a notification about the system time changing, I don't think we can do anything for the general case.

Comment 41 Jeff Bastian 2016-08-19 12:57:37 UTC
This happens on all systems.  It's very common for the battery-backed RTC to drift; it may run slightly too fast or too slow.  If it runs fast, the RTC will drift into the future.  When the system boots, the kernel reads the RTC to initialize the system time, then chrony or ntp starts and makes any necessary adjustments.

Sometimes the RTC is just wildly wrong -- off by years or even decades -- because it was never set correctly in the first place.  And sometimes the battery dies, so when the system is powered off, the RTC resets to its internal start date (often some day in 1970, 1980, or 2000, I don't know why).

There was a proposal to add a notification mechanism to the kernel for system time changes, but it does not appear to have been accepted.
  https://lkml.org/lkml/2010/11/11/353

The normal method is to call gettimeofday() and make a note, then set an alarm or sleep for a set amount of time and call gettimeofday() again; if the time changed by more than the expected sleep time, the clock must have been adjusted.
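
For illustration, a minimal C sketch of that method; the interval and the two seconds of slack are arbitrary values chosen just for the example:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    const int interval = 10;     /* expected sleep time, in seconds */
    struct timeval before, after;

    gettimeofday(&before, NULL); /* make a note of the wall clock */
    sleep(interval);
    gettimeofday(&after, NULL);  /* read it again after the sleep */

    long elapsed = after.tv_sec - before.tv_sec;

    /* if the apparent elapsed time is negative, or much larger than
     * the sleep time, the clock must have been adjusted meanwhile */
    if (elapsed < 0 || elapsed > interval + 2) {
        printf("clock was adjusted (apparent elapsed: %ld s)\n", elapsed);
    }
    return 0;
}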

This wasn't a problem for sssd-1.13.0.  What changed in 1.14.0 to make it sensitive to time changes?

Comment 42 Jakub Hrozek 2016-08-19 13:28:51 UTC
(In reply to Jeff Bastian from comment #41)
> This happens on all systems.  It's very common for the battery-backed RTC to
> drift, it may run slightly too fast or too slow.  If it runs fast, the RTC
> will drift into the future.  When the system boots, the kernel reads the RTC
> to initialize the system time, then chrony or ntp starts and makes any
> necessary adjustments.
> 
> Sometimes the RTC is just wildly wrong -- off by years or even decades --
> because it was never set correctly in the first place.  And sometimes the
> battery dies, so when the systems is powered off, the RTC resets to its
> internal start date (often some day in 1970, 1980 or 2000, I don't know why).
> 
> There was a proposal to add a notification mechanism to the kernel for
> system time changes, but it does not appear to have been accepted.
>   https://lkml.org/lkml/2010/11/11/353
> 
> The normal method is to call gettimeofday() and make a note, then set an
> alarm or sleep for a set amount of time, and call gettimeofday() again, and
> if it changed by more than the expected sleep time, then the clock must have
> been adjusted.
> 

Thank you for the analysis. We might want to do something like this for our watchdog.

> This wasn't a problem for sssd-1.13.0.  What changed in 1.14.0 to make it
> sensitive to time changes?

We changed the way the "worker" processes like sssd_nss or sssd_be are kept alive and auto-restarted. Previously, the main sssd process sent 'pings' to the worker processes and they would reply with a 'pong'; when a worker failed to answer three pings in a row, the sssd process killed it and spawned a new one. But that was really complex and brittle, so in 1.14 we changed it so that the worker processes themselves set up a timer using timer_create() that sends a SIGRT signal to 'self'. When the SIGRT signal is received, a counter is increased, and when the counter goes over a threshold (3 by default), the process quits. There is also a timer event using sssd's event loop that periodically clears the counter -- this is done through sssd's event loop precisely to make sure events from the loop are actually being processed.
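
For illustration, a heavily simplified, self-contained sketch of that scheme -- this is not sssd's actual code; the tick interval and threshold are illustrative, and in sssd the reset is driven by a tevent timer rather than a bare main():

#include <signal.h>
#include <time.h>
#include <unistd.h>

#define WATCHDOG_TICK 10  /* seconds between SIGRT ticks (illustrative) */
#define WATCHDOG_MAX   3  /* missed resets tolerated before quitting */

static volatile sig_atomic_t wd_ticks;

static void watchdog_handler(int sig)
{
    (void)sig;
    /* every tick without an intervening reset brings us closer to exit */
    if (++wd_ticks > WATCHDOG_MAX) {
        _exit(1);  /* the event loop never cleared the counter */
    }
}

/* the event loop is supposed to call this periodically; if tevent
 * timers stall (e.g. after a backward clock step), it never runs
 * and the handler above eventually kills the process */
static void watchdog_reset(void)
{
    wd_ticks = 0;
}

int main(void)
{
    struct sigaction sa = { .sa_handler = watchdog_handler };
    struct sigevent sev = {
        .sigev_notify = SIGEV_SIGNAL,
        .sigev_signo  = SIGRTMIN,
    };
    struct itimerspec its = {
        .it_value.tv_sec    = WATCHDOG_TICK,
        .it_interval.tv_sec = WATCHDOG_TICK,
    };
    timer_t timer;

    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);
    timer_create(CLOCK_REALTIME, &sev, &timer);
    timer_settime(timer, 0, &its, NULL);

    watchdog_reset();  /* start with a clean counter */
    for (;;) {
        pause();       /* stand-in for the tevent main loop */
    }
}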

Comment 43 Pavel Březina 2016-08-22 10:04:52 UTC
We can detect a time shift in watchdog_handler(). We can store the current time in setup_watchdog(), and if the stored value is bigger than the current time when the handler runs, the clock was shifted backwards:

setup_watchdog()
    watchdogctx.current_time <- time

watchdog_handler()
    watchdogctx.current_time > time => time shift occurred, restart watchdog (and other tevent timers if possible)
    watchdogctx.current_time <- time
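
A minimal C sketch of that check, using hypothetical names (the real change has to live inside sssd's watchdog code and also rearm the tevent timers):

#include <stdio.h>
#include <time.h>

struct watchdog_ctx {
    time_t last_time;  /* wall-clock time seen at the previous tick */
};

static struct watchdog_ctx wdctx;

static void setup_watchdog(void)
{
    wdctx.last_time = time(NULL);
}

/* returns 1 if the wall clock moved backwards since the last tick */
static int watchdog_timeshift_detected(void)
{
    time_t now = time(NULL);
    int shifted = now < wdctx.last_time;

    wdctx.last_time = now;  /* remember this tick for the next check */
    return shifted;
}

int main(void)
{
    setup_watchdog();
    /* the real handler would call this on every tick and restart the
     * watchdog instead of counting a missed reset */
    if (watchdog_timeshift_detected()) {
        puts("time shift occurred, restart watchdog");
    }
    return 0;
}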

Comment 44 Jakub Hrozek 2016-08-22 10:24:31 UTC
OK, can you prepare a patch?

Comment 47 Lukas Slebodnik 2016-08-24 08:52:51 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/3154

Comment 49 Pavel Březina 2016-08-24 12:01:21 UTC
Thank you. I submitted a patch for review.

Comment 51 Patrik Kis 2016-08-24 13:00:36 UTC
The bug was found and verified by test: /CoreOS/realmd/Sanity/AD-join-and-reboot
It can be found here:
* git: http://pkgs.devel.redhat.com/cgit/tests/realmd/tree/Sanity/AD-join-and-reboot
* TCMS: https://tcms.engineering.redhat.com/case/269538/
* beaker: https://beaker.engineering.redhat.com/tasks/12975

Feel free to execute it. It basically joins the machine to our test AD via realmd and then performs a reboot to see if the changes made by realmd survive the reboot.
I have no problem verifying the bug as well, if you want.

But I think maybe a new test case should be created, because the test above found the issue only accidentally, thanks to the time change on the aarch64 machines. That can change in the future, and then this test case would not catch a regression. On the other hand, creating an automated test case where the time changes during boot is not trivial.

Comment 52 Jeff Bastian 2016-08-24 14:37:06 UTC
It shouldn't be too difficult to automate this; you just need to set the RTC a bit into the future and reboot, e.g.:

if [ "$REBOOTCOUNT" -eq 0 ] ; then
  systemctl stop chronyd.service
  date --set="now + 1 hour"
  hwclock --systohc
  rhts-reboot
else
  if systemctl --quiet is-active sssd.service ; then
    # report success
  else
    # report failure
  fi
fi

Comment 53 Jeff Bastian 2016-08-24 14:39:17 UTC
Oh, and maybe sleep a bit before checking if sssd.service is running:

  ...
else
  sleep 300
  if systemctl --quiet is-active sssd.service ; then
    ...

Comment 54 Lukas Slebodnik 2016-08-25 14:04:17 UTC
master:
* b8ceaeb80cffb00c26390913ea959b77f7e848b9

Comment 56 shridhar 2016-09-21 14:31:02 UTC
Verified with:

~]# rpm -qa|grep sssd
sssd-client-1.14.0-43.el7.aarch64
sssd-krb5-common-1.14.0-43.el7.aarch64
sssd-ad-1.14.0-43.el7.aarch64
sssd-ldap-1.14.0-43.el7.aarch64
sssd-proxy-1.14.0-43.el7.aarch64
python-sssdconfig-1.14.0-43.el7.noarch
sssd-common-1.14.0-43.el7.aarch64
sssd-common-pac-1.14.0-43.el7.aarch64
sssd-ipa-1.14.0-43.el7.aarch64
sssd-krb5-1.14.0-43.el7.aarch64
sssd-1.14.0-43.el7.aarch64

~]# uname -a
Linux apm-mustang-ev3-01.lab.eng.brq.redhat.com 4.5.0-11.el7.aarch64 #1 SMP Thu Sep 15 08:51:15 EDT 2016 aarch64 aarch64 aarch64 GNU/Linux

[root@apm-mustang-ev3-01 ~]# systemctl stop chronyd.service
[root@apm-mustang-ev3-01 ~]# date
Wed Sep 21 16:25:04 CEST 2016
[root@apm-mustang-ev3-01 ~]#  hwclock
Thu 18 Aug 2016 01:14:43 PM CEST  -0.441143 seconds
[root@apm-mustang-ev3-01 ~]# reboot 

[......]

[root@apm-mustang-ev3-01 ~]# journalctl -b0 | egrep -i 'rtc|chrony'
Aug 18 15:16:15 localhost.localdomain kernel: ACPI: SSDT 0x00000043FA818000 000032 (v02 REDHAT UARTCLKS 00000001      01000013)
Aug 18 15:16:16 localhost.localdomain kernel: rtc-efi rtc-efi: rtc core: registered rtc-efi as rtc0
Aug 18 15:16:16 localhost.localdomain kernel: rtc-efi rtc-efi: setting system clock to 2016-08-18 13:16:15 UTC (1471526175)
Aug 18 13:16:22 localhost.localdomain systemd[1]: RTC configured in localtime, applying delta of 120 minutes to system time.
Aug 18 13:16:29 localhost.localdomain chronyd[733]: chronyd version 2.1.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +DEBUG +ASYNCDNS +IPV6 +SECHASH)
Aug 18 13:16:29 localhost.localdomain chronyd[733]: Frequency 80.091 +/- 0.075 ppm read from /var/lib/chrony/drift
Aug 18 13:16:30 localhost.localdomain systemd[1]: Starting Wait for chrony to synchronize system clock...
Aug 18 13:16:49 apm-mustang-ev3-01.lab.eng.brq.redhat.com chronyd[733]: Selected source 10.34.255.7
Aug 18 13:16:49 apm-mustang-ev3-01.lab.eng.brq.redhat.com chronyd[733]: System clock wrong by 2949029.077709 seconds, adjustment started
Sep 21 16:27:18 apm-mustang-ev3-01.lab.eng.brq.redhat.com chronyd[733]: System clock was stepped by 2949029.077709 seconds
Sep 21 16:27:19 apm-mustang-ev3-01.lab.eng.brq.redhat.com systemd[1]: Started Wait for chrony to synchronize system clock.
[root@apm-mustang-ev3-01 ~]# uptime
 16:28:27 up 1 min,  1 user,  load average: 0.41, 0.23, 0.09
[root@apm-mustang-ev3-01 ~]#  systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sssd.service.d
           └─journal.conf
   Active: active (running) since Thu 2016-08-18 13:16:32 CEST; 1 months 3 days ago
  Process: 729 ExecStart=/usr/sbin/sssd -D -f (code=exited, status=0/SUCCESS)
 Main PID: 776 (sssd)
   CGroup: /system.slice/sssd.service
           ├─776 /usr/sbin/sssd -D -f
           ├─778 /usr/libexec/sssd/sssd_be --domain testad.qe --uid 0 --gid 0 --debug-to-files
           ├─808 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
           └─809 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files

Aug 18 13:16:29 localhost.localdomain systemd[1]: Starting System Security Services Daemon...
Aug 18 13:16:30 localhost.localdomain sssd[776]: Starting up
Aug 18 13:16:30 localhost.localdomain sssd[be[testad.qe]][778]: Starting up
Aug 18 13:16:32 localhost.localdomain sssd[nss][808]: Starting up
Aug 18 13:16:32 localhost.localdomain sssd[pam][809]: Starting up
Aug 18 13:16:32 localhost.localdomain systemd[1]: Started System Security Services Daemon.
Warning: sssd.service changed on disk. Run 'systemctl daemon-reload' to reload units.

Comment 58 errata-xmlrpc 2016-11-04 07:20:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2476.html