This is a follow-up from https://bugzilla.redhat.com/show_bug.cgi?id=1366403 . It seems the crash reported there didn't actually cause the login failure I thought it did; sssd-1.14.0-5.fc25 fixes the crash, but not the login failure.

So, even with sssd-1.14.0-5.fc25: in our openQA test which enrols a system as a FreeIPA domain client via a kickstart, the enrolment works, boot of the installed system works, and several tests of the environment work, but login as a domain user account does *not* work.

The test sets up a FreeIPA server on one VM - also using Fedora 25; I have the test modified at present to use updates-testing so we don't hit https://bugzilla.redhat.com/show_bug.cgi?id=1353054 - and sets up a one-time password for a client enrolment. It also creates a couple of user accounts, test1 and test2, and sets their passwords to 'batterystaple'. It then has another VM perform a kickstart install using this kickstart:

install
cdrom
bootloader --location=mbr
network --device=link --activate --bootproto=static --ip=10.0.2.101 --netmask=255.255.255.0 --gateway=10.0.2.2 --hostname=client001.domain.local --nameserver=10.0.2.100
lang en_US.UTF-8
keyboard us
timezone --utc America/New_York
clearpart --all
autopart
%packages
@^server-product-environment
%end
rootpw anaconda
reboot
realm join --one-time-password=monkeys ipa001.domain.local

It then boots the client VM, logs in as root, updates sssd to 1.14.0-5.fc25 and reboots. Then it does the following checks, which all pass:

1. Run 'realm list' and check the output is correct
2. Run 'getent passwd admin' and check it works
3. Run 'getent passwd test1' and check it works
4. Run 'klist -k' and check the output is correct
5. Run 'kinit -k host/(hostname)@DOMAIN.LOCAL' and check it works

It then attempts to log in to a tty as test1 with password 'batterystaple'. This fails.
There is a similar test which, instead of enrolling during install using a one-time enrolment password, enrols post-install by calling 'realm join --user=admin ipa001.domain.local' and giving the admin password. It then goes through all the same post-enrolment tests. When *this* test tries to log in as test1, it works.

I enabled sssd debugging in the [pam] and [domain/domain.local] sections of sssd.conf, and will attach all the non-zero-sized log files that produced.

I am transferring the blocker status from 1366403 to here, since we were really voting on the believed *consequence* of that bug - the login problem. Since the login problem doesn't actually seem to be caused by the crash, it seems reasonable to consider this a blocker without further review.
Created attachment 1191421 [details] krb5_child.log
Created attachment 1191422 [details] ldap_child.log
Created attachment 1191423 [details] sssd_domain.local.log
Created attachment 1191424 [details] sssd_pam.log
Created attachment 1191425 [details] journal (view with 'journalctl --file')
The most obvious bit of log that stands out is this, in krb5_child.log:

(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [main] (0x0400): Will perform online auth
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [get_and_save_tgt] (0x0400): Attempting kinit for realm [DOMAIN.LOCAL]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [get_and_save_tgt] (0x0020): 1296: [-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [map_krb5_error] (0x0020): 1365: [-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [k5c_send_data] (0x0200): Received error code 1432158209
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [main] (0x0400): krb5_child completed successfully

That's at exactly the time the login attempt is made (you can see it failing in the journal). You can download the entire contents of /var/log from an affected test run at https://openqa.stg.fedoraproject.org/tests/31702/file/freeipa_client_postinstall-var_log.tar.gz .
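As an aside on reading that excerpt: the [-1765328351] value is a libkrb5 com_err code, and you can recover the Kerberos protocol error number by subtracting the base of the krb5 error table. A minimal sketch (the base constant is KRB5KDC_ERR_NONE from MIT Kerberos's error tables):

```shell
#!/bin/sh
# Decode the com_err code from krb5_child.log into a Kerberos protocol
# error number. -1765328384 (KRB5KDC_ERR_NONE) is the base of the krb5
# error table in MIT Kerberos.
base=-1765328384
code=-1765328351         # value seen in the log above
echo $(( code - base ))  # prints 33, i.e. KRB_AP_ERR_TKT_NYV ("Ticket not yet valid")
```

So the log's "Ticket not yet valid" text matches the numeric code exactly.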
For comparison, /var/log from a similar run of the post-install enrolment test (where the login succeeds) can be found at https://openqa.stg.fedoraproject.org/tests/31710/file/freeipa_client_postinstall-var_log.tar.gz . I had to adjust the test to just restart sssd after updating it, rather than rebooting the system, for boring reasons related to openQA networking. But otherwise the test is very similar. It starts from a stock install of the Server DVD, then enrols into the domain using realm, then updates sssd, turns on debugging and restarts it before running through the same set of checks. I deliberately sabotaged the test to fail near the end so the logs would get uploaded, but the login as test1 succeeded.
So, it is a timing issue. SSSD gets the TGT for the host principal before the time is adjusted (by chrony?), and thus, when the time is wound back, that ticket becomes not yet valid, preventing any use of it.
hmm? I can't see that in the logs...in fact, it seems ntpd decides the clock is too massively wrong and refuses to adjust it:

Aug 16 11:32:50 client001.domain.local ntpd_intres[919]: DNS ipa001.domain.local -> 10.0.2.100
Aug 16 11:32:51 client001.domain.local ntpd[917]: Listen normally on 4 ens4 10.0.2.101 UDP 123
Aug 16 11:32:51 client001.domain.local ntpd[917]: new interface(s) found: waking up resolver
Aug 16 11:32:52 client001.domain.local ntpd[917]: 0.0.0.0 0617 07 panic_stop +17999 s; set clock manually within 1000 s.
Aug 16 11:32:52 client001.domain.local systemd[1]: ntpd.service: Main process exited, code=exited, status=255/n/a

Of course, that could be the problem, I guess - the guest clock being badly out of sync with the host? Comparing to the logs from the post-install enrolment, I see that in that case, ntpd succeeds in adjusting the system clock at the time of the enrolment... The two tests were run at nearly the same time, so it seems significant that the system time in the 'bad' case is shown as ~11:30 while the system time in the 'good' case is shown as ~18:30. The actual time the tests were run was around 23:30 UTC. I guess we can assume the server believed the time to be ~18:30, which would line up. I suppose now I get to investigate what's going on with these different clock settings...but this may not be a blocker...
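For reference, that panic_stop line is ntpd's default behaviour spelled out in the message itself: an initial offset beyond its 1000 s panic threshold makes it exit rather than step the clock. A trivial sketch of the comparison, using the values from the log above:

```shell
#!/bin/sh
# Mimic ntpd's panic check: offsets beyond the panic threshold make it
# bail out instead of stepping the clock. Values are taken from the
# 'panic_stop +17999 s; set clock manually within 1000 s' log line.
offset=17999   # seconds of clock error reported by ntpd
panic=1000     # ntpd's default panic threshold, as stated in the message
if [ "$offset" -gt "$panic" ]; then
    echo "panic_stop: set clock manually"
else
    echo "stepping clock"
fi
```

Which is why ntpd.service exits with status 255 immediately afterwards.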
Adam, this is visible in the logs:

[-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]

SSSD tries to use the host/<hostname> TGT to armor the acquisition of the user's TGT. The KDC rejects the armor ticket because it is not yet valid, hence the error messages in comment 6.
alex: yes, but there aren't any actual time *jumps* - it's not a race between sssd and something that should correct the time, exactly; it's just that the time is completely incorrect all along and nothing corrects it. I can see a significant difference between the 'bad' and 'good' cases involving chrony:

[adamw@adam tmp]$ journalctl --file bad/var/log/journal/*/system.journal | grep chron
Aug 16 11:29:24 client001.domain.local kernel: TSC synchronization [CPU#0 -> CPU#1]:

[adamw@adam tmp]$ journalctl --file good/var/log/journal/*/system.journal | grep chron
Aug 16 07:24:28 localhost.localdomain chronyd[956]: chronyd version 2.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +ASYNCDNS +IPV6 +DEBUG)
Aug 16 07:24:28 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 07:24:34 localhost.localdomain chronyd[956]: Selected source 108.59.2.24
Aug 16 07:24:34 localhost.localdomain chronyd[956]: System clock wrong by 17998.956725 seconds, adjustment started
Aug 16 12:24:33 localhost.localdomain chronyd[956]: System clock was stepped by 17998.956725 seconds
Aug 16 12:24:43 localhost.localdomain chronyd[956]: Selected source 131.107.13.100
Aug 16 12:25:01 localhost.localdomain chronyd[956]: chronyd exiting
Aug 16 12:25:01 localhost.localdomain audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:09:48 localhost.localdomain kernel: TSC synchronization [CPU#0 -> CPU#1]:
Aug 16 18:09:50 localhost.localdomain chronyd[916]: chronyd version 2.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +ASYNCDNS +IPV6 +DEBUG)
Aug 16 18:09:50 localhost.localdomain chronyd[916]: Frequency 7.941 +/- 476.655 ppm read from /var/lib/chrony/drift
Aug 16 18:09:50 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:23:01 client003.domain.local realmd[1972]: Synchronizing time with KDC...
Aug 16 18:23:01 client003.domain.local realmd[1972]: Synchronizing time with KDC...
Aug 16 18:23:10 client003.domain.local chronyd[916]: chronyd exiting
Aug 16 18:23:10 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:23:10 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

That is, chrony runs and corrects the clock (by a substantial margin) on first boot in the 'good' case.
It never runs in the 'bad' case...even though the anaconda logs seem to log it being installed and enabled:

[adamw@adam tmp]$ grep -i chrony bad/var/log/anaconda/*.log
bad/var/log/anaconda/journal.log:Aug 16 23:24:31 localhost NetworkManager[1699]: <debug> [1471389871.3004] dispatcher: (1) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:36 localhost NetworkManager[1699]: <debug> [1471389876.8315] dispatcher: (3) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:36 localhost NetworkManager[1699]: <debug> [1471389876.8914] dispatcher: (5) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:40 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
bad/var/log/anaconda/journal.log:Aug 16 23:28:14 client001.domain.local NetworkManager[1699]: <debug> [1471390094.2656] dispatcher: (6) 20-chrony succeeded
bad/var/log/anaconda/packaging.log:23:24:45,854 DEBUG dnf: Adding packages from group 'server-product': {'polkit', 'timedatex', 'fedora-release-server', 'PackageKit', 'chrony', 'realmd'}
bad/var/log/anaconda/packaging.log:18:25:08,305 DEBUG dnf: Adding packages from group 'server-product': {'polkit', 'timedatex', 'fedora-release-server', 'PackageKit', 'chrony', 'realmd'}
bad/var/log/anaconda/packaging.log:18:25:08,312 INFO packaging: selected package: 'chrony'
bad/var/log/anaconda/packaging.log:18:26:14,434 INFO packaging: Installed: chrony-2.4-2.fc25.x86_64 1467116162 4a373e86ba61bd4075b5124a25ffa3ba95f9f658bc829f3f70eee748520a761e
bad/var/log/anaconda/packaging.log:18:26:14,486 INFO dnf.rpm: Installed: chrony-2.4-2.fc25.x86_64
bad/var/log/anaconda/packaging.log:18:26:14,486 INFO dnf.rpm: Installed: chrony-2.4-2.fc25.x86_64
bad/var/log/anaconda/program.log:23:24:40,384 INFO program: Running... systemctl start chronyd
bad/var/log/anaconda/program.log:18:25:08,105 INFO program: Running... systemctl status chronyd
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: * chronyd.service - NTP client/server
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: Process: 1967 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: Process: 1954 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: Main PID: 1965 (chronyd)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: CGroup: /system.slice/chronyd.service
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: `-1965 /usr/sbin/chronyd
bad/var/log/anaconda/program.log:23:28:13,278 INFO program: Running... systemctl enable chronyd --root /mnt/sysimage

I wonder if this is because the domain enrolment that happens during install disables chrony in favour of ntpd - which refuses to make a large clock adjustment on initial boot? Of course, I don't know why the clocks in the VMs are off by such wild amounts in the first place (I'd have to look into the details of how os-autoinst sets up the VM clocks, I guess...), nor why this didn't happen in F24 tests...
The theory that enrollment disables chronyd is sound. We don't have chronyd support in FreeIPA yet, so we always disable it in favor of ntpd if ntpd is present and ipa-client-install was not told to ignore NTP entirely. You may want to add the '-g' option to ntpd so it can correct a time difference beyond 1000 s. FreeIPA uses a different set of options for IPA clients and IPA servers: servers are always started with '-g', while clients are not.
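To illustrate what that suggestion would look like (a hypothetical sketch, not what ipa-client-install currently writes): on Fedora, ntpd's startup flags normally come from the OPTIONS string in /etc/sysconfig/ntpd, so permitting one large initial correction would be something like:

```shell
# Hypothetical /etc/sysconfig/ntpd fragment (assumed location of ntpd's
# option string on Fedora). -g lets ntpd make a single initial correction
# larger than the 1000 s panic threshold instead of exiting.
OPTIONS="-g"
```

With that in place, the panic_stop exit seen in comment 8 would become a one-time clock step at startup instead.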
OK, so I think I spot the underlying bug here: after a completely vanilla install of Fedora Server 24, the system clock is set pretty much correctly. After a completely vanilla install of Fedora Server 25 it is not: it's set - in my timezone, at least - eight hours slow. We'll have to figure out why that is, I guess. Whether it's a blocker bug is an interesting question...but I think the result will be reproducible unless you happen to be in a timezone where the anaconda bug doesn't happen.

My first theory is that something in anaconda is attempting to compensate for timezones and getting it wrong, since the amount the clock is wrong by in my test (8 hours) is PST's offset from UTC (UTC-8)...though currently I'm in PDT. So if I'm right about that, you wouldn't notice the bug if your timezone happened to match UTC, and it *may* not affect you (I don't know how Kerberos behaves in that case) if your timezone is ahead of UTC rather than behind (and thus anaconda sets your clock in the future rather than the past). But this is all just conjecture so far; I'll have to do some more tests and investigation.
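One data point consistent with this conjecture: the ~17999 s skew the client VMs logged earlier is, within a second, five hours - America/New_York's standard (non-DST) offset from UTC, which is the timezone the kickstart sets - mirroring the PST-vs-PDT oddity noted above for my local installs. A quick arithmetic check (pure illustration):

```shell
#!/bin/sh
# Compare the skew chronyd/ntpd reported against the standard offset of
# the timezone the kickstart sets (America/New_York, UTC-5 standard).
skew=17999                    # seconds, from the logs in earlier comments
tz_offset=$(( 5 * 3600 ))     # 5-hour offset expressed in seconds
echo $(( tz_offset - skew ))  # prints 1: within a second of a whole 5-hour offset
```

That's a bit too neat to be a coincidence, which supports the "anaconda mangles the timezone compensation" theory.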
So I've started investigating the underlying issue in anaconda, and reported it as https://bugzilla.redhat.com/show_bug.cgi?id=1367647 , and marked that as blocking this bug.
We have a fix for the underlying anaconda bug now: https://bodhi.fedoraproject.org/updates/FEDORA-2016-8988869509 . I confirmed it makes anaconda behave itself with regard to time setting, and the installed system has the correct time from first boot. I couldn't directly confirm that it fixes this bug, for tedious reasons to do with openQA config, but I'm pretty sure it should; we'll find out when it gets pushed stable, if I don't fiddle about with openQA to test it sooner.
I am glad that the anaconda bug is fixed. However, I cannot see any bug in sssd, so I'm not sure why the state was changed to ON_QA. Feel free to reopen if you find a bug in sssd (or file a new one).
Please don't just unilaterally close release blocker bugs. They're important. I was keeping the bugs separate because we know this bug exists and we know there's an issue with anaconda's timekeeping, but we only *think* that bug causes this bug - we don't know for sure. This is the actual issue that is accepted as a blocker. I was just trying to avoid the same thing happening as with the other bug, where we assumed we knew the cause of the login failure but actually were wrong.
Then please change the component. You did not prove it is a bug in sssd. If you do not prove it's a bug in sssd, then I will keep closing this ticket.
...that's pretty rude.
(In reply to Adam Williamson from comment #19)
> ...that's pretty rude.

I'm sorry, but I do not want to have an assigned blocker bug against sssd if I cannot do anything about the bug. Log files do not lie (comment 6, comment 10). If it were not a blocker bug, I would not mind.
Woohoo, latest openQA tests confirm this is working. Update has gone stable.