This is a follow-up from https://bugzilla.redhat.com/show_bug.cgi?id=1366403 . It seems the crash reported there didn't actually cause the login failure I thought it did; sssd-1.14.0-5.fc25 fixes the crash, but not the login failure.

So, even with sssd-1.14.0-5.fc25: in our openQA test which enrols a system as a FreeIPA domain client via a kickstart, the enrolment works, boot of the installed system works, and several tests of the environment work, but login as a domain user account does *not* work.

The test sets up a FreeIPA server on one VM - also using Fedora 25; I have the test modified at present to use updates-testing so we don't hit https://bugzilla.redhat.com/show_bug.cgi?id=1353054 - and sets up a one-time password for a client enrolment. It also creates a couple of user accounts, test1 and test2, and sets their passwords to 'batterystaple'. It then has another VM perform a kickstart install using this kickstart:

install
cdrom
bootloader --location=mbr
network --device=link --activate --bootproto=static --ip=10.0.2.101 --netmask=255.255.255.0 --gateway=10.0.2.2 --hostname=client001.domain.local --nameserver=10.0.2.100
lang en_US.UTF-8
keyboard us
timezone --utc America/New_York
clearpart --all
autopart
%packages
@^server-product-environment
%end
rootpw anaconda
reboot
realm join --one-time-password=monkeys ipa001.domain.local

It then boots the client VM, logs in as root, updates sssd to 1.14.0-5.fc25 and reboots. Then it does the following checks, which all pass:

1. Run 'realm list' and check the output is correct
2. Run 'getent passwd admin' and check it works
3. Run 'getent passwd test1' and check it works
4. Run 'klist -k' and check the output is correct
5. Run 'kinit -k host/(hostname)@DOMAIN.LOCAL' and check it works

It then attempts to log in to a tty as test1 with password 'batterystaple'. This fails.
There is a similar test which, instead of enrolling during install using a one-time enrolment password, enrols post-install by calling 'realm join --user=admin ipa001.domain.local' and giving the admin password. It then goes through all the same post-enrolment tests. When *this* test tries to log in as test1, it works.

I enabled sssd debugging in the [pam] and [domain/domain.local] sections of sssd.conf, and will attach all the non-zero-sized log files that produced.

I am transferring the blocker status from 1366403 to here, since we were really voting on the believed *consequence* of that bug - the login problem. Since the login problem doesn't actually seem to be caused by the crash, it seems reasonable to consider this a blocker without further review.
Created attachment 1191421 [details] krb5_child.log
Created attachment 1191422 [details] ldap_child.log
Created attachment 1191423 [details] sssd_domain.local.log
Created attachment 1191424 [details] sssd_pam.log
Created attachment 1191425 [details] journal (view with 'journalctl --file')
The most obvious bit of log that stands out is this, in krb5_child.log:

(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [main] (0x0400): Will perform online auth
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [get_and_save_tgt] (0x0400): Attempting kinit for realm [DOMAIN.LOCAL]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [get_and_save_tgt] (0x0020): 1296: [-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [map_krb5_error] (0x0020): 1365: [-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [k5c_send_data] (0x0200): Received error code 1432158209
(Tue Aug 16 14:34:35 2016) [[sssd[krb5_child[1243]]]] [main] (0x0400): krb5_child completed successfully

That's at exactly the time the login attempt is made (you can see it failing in the journal). You can download the entire contents of /var/log from an affected test run at https://openqa.stg.fedoraproject.org/tests/31702/file/freeipa_client_postinstall-var_log.tar.gz .
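As an aside on reading that excerpt: the [-1765328351] value is a libkrb5 com_err code, and you can recover the Kerberos protocol error number by subtracting the base of the krb5 error table. A minimal sketch (the base constant is KRB5KDC_ERR_NONE from MIT Kerberos's error tables):

```shell
#!/bin/sh
# Decode the com_err code from krb5_child.log into a Kerberos protocol
# error number. -1765328384 (KRB5KDC_ERR_NONE) is the base of the krb5
# error table in MIT Kerberos.
base=-1765328384
code=-1765328351         # value seen in the log above
echo $(( code - base ))  # prints 33, i.e. KRB_AP_ERR_TKT_NYV ("Ticket not yet valid")
```

So the log's "Ticket not yet valid" text matches the numeric code exactly.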
For comparison, /var/log from a similar run of the post-install enrolment test (where the login succeeds) can be found at https://openqa.stg.fedoraproject.org/tests/31710/file/freeipa_client_postinstall-var_log.tar.gz . I had to adjust the test to just restart sssd after updating it, rather than rebooting the system, for boring reasons related to openQA networking. But otherwise the test is very similar. It starts from a stock install of the Server DVD, then enrols into the domain using realm, then updates sssd, turns on debugging and restarts it before running through the same set of checks. I deliberately sabotaged the test to fail near the end so the logs would get uploaded, but the login as test1 succeeded.
So, it is a timing issue. SSSD gets the TGT for the host principal before the time is adjusted (by chrony?), and thus, when the time is wound back, that ticket becomes not yet valid, preventing any use of it.
hmm? I can't see that in the logs...in fact, it seems ntpd decides the clock is too massively wrong and refuses to adjust it:

Aug 16 11:32:50 client001.domain.local ntpd_intres[919]: DNS ipa001.domain.local -> 10.0.2.100
Aug 16 11:32:51 client001.domain.local ntpd[917]: Listen normally on 4 ens4 10.0.2.101 UDP 123
Aug 16 11:32:51 client001.domain.local ntpd[917]: new interface(s) found: waking up resolver
Aug 16 11:32:52 client001.domain.local ntpd[917]: 0.0.0.0 0617 07 panic_stop +17999 s; set clock manually within 1000 s.
Aug 16 11:32:52 client001.domain.local systemd[1]: ntpd.service: Main process exited, code=exited, status=255/n/a

Of course, that could be the problem, I guess - the guest clock being badly out of sync with the host? Comparing to the logs from the post-install enrolment, I see that in that case, ntpd succeeds in adjusting the system clock at the time of the enrolment... The two tests were run at nearly the same time, so it seems significant that the system time in the 'bad' case is shown as ~11:30 while the system time in the 'good' case is shown as ~18:30. The actual time the tests were run was around 23:30 UTC. I guess we can assume the server believed the time to be ~18:30, which would line up. I suppose now I get to investigate what's going on with these different clock settings...but this may not be a blocker...
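For reference, that panic_stop line is ntpd's default behaviour spelled out in the message itself: an initial offset beyond its 1000 s panic threshold makes it exit rather than step the clock. A trivial sketch of the comparison, using the values from the log above:

```shell
#!/bin/sh
# Mimic ntpd's panic check: offsets beyond the panic threshold make it
# bail out instead of stepping the clock. Values are taken from the
# 'panic_stop +17999 s; set clock manually within 1000 s' log line.
offset=17999   # seconds of clock error reported by ntpd
panic=1000     # ntpd's default panic threshold, as stated in the message
if [ "$offset" -gt "$panic" ]; then
    echo "panic_stop: set clock manually"
else
    echo "stepping clock"
fi
```

Which is why ntpd.service exits with status 255 immediately afterwards.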
Adam, this is visible in the logs:

[-1765328351][Error constructing AP-REQ armor: Ticket not yet valid]

SSSD tries to use the host/<hostname> TGT to armor the acquisition of the user's TGT. The KDC rejects the armor ticket because it is not yet valid, hence the error messages in comment 6.
alex: yes, but there aren't any actual time *jumps* - it's not a race between sssd and something that should correct the time, exactly; it's just that the time is completely incorrect all along and nothing corrects it. I can see a significant difference between the 'bad' and 'good' cases involving chrony:

[adamw@adam tmp]$ journalctl --file bad/var/log/journal/*/system.journal | grep chron
Aug 16 11:29:24 client001.domain.local kernel: TSC synchronization [CPU#0 -> CPU#1]:

[adamw@adam tmp]$ journalctl --file good/var/log/journal/*/system.journal | grep chron
Aug 16 07:24:28 localhost.localdomain chronyd[956]: chronyd version 2.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +ASYNCDNS +IPV6 +DEBUG)
Aug 16 07:24:28 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 07:24:34 localhost.localdomain chronyd[956]: Selected source 108.59.2.24
Aug 16 07:24:34 localhost.localdomain chronyd[956]: System clock wrong by 17998.956725 seconds, adjustment started
Aug 16 12:24:33 localhost.localdomain chronyd[956]: System clock was stepped by 17998.956725 seconds
Aug 16 12:24:43 localhost.localdomain chronyd[956]: Selected source 131.107.13.100
Aug 16 12:25:01 localhost.localdomain chronyd[956]: chronyd exiting
Aug 16 12:25:01 localhost.localdomain audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:09:48 localhost.localdomain kernel: TSC synchronization [CPU#0 -> CPU#1]:
Aug 16 18:09:50 localhost.localdomain chronyd[916]: chronyd version 2.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +ASYNCDNS +IPV6 +DEBUG)
Aug 16 18:09:50 localhost.localdomain chronyd[916]: Frequency 7.941 +/- 476.655 ppm read from /var/lib/chrony/drift
Aug 16 18:09:50 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:23:01 client003.domain.local realmd[1972]: Synchronizing time with KDC...
Aug 16 18:23:01 client003.domain.local realmd[1972]: Synchronizing time with KDC...
Aug 16 18:23:10 client003.domain.local chronyd[916]: chronyd exiting
Aug 16 18:23:10 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 16 18:23:10 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

That is, chrony runs and corrects the clock (by a substantial margin) on first boot in the 'good' case.
It never runs in the 'bad' case...even though the anaconda logs seem to log it being installed and enabled:

[adamw@adam tmp]$ grep -i chrony bad/var/log/anaconda/*.log
bad/var/log/anaconda/journal.log:Aug 16 23:24:31 localhost NetworkManager[1699]: <debug> [1471389871.3004] dispatcher: (1) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:36 localhost NetworkManager[1699]: <debug> [1471389876.8315] dispatcher: (3) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:36 localhost NetworkManager[1699]: <debug> [1471389876.8914] dispatcher: (5) 20-chrony succeeded
bad/var/log/anaconda/journal.log:Aug 16 23:24:40 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
bad/var/log/anaconda/journal.log:Aug 16 23:28:14 client001.domain.local NetworkManager[1699]: <debug> [1471390094.2656] dispatcher: (6) 20-chrony succeeded
bad/var/log/anaconda/packaging.log:23:24:45,854 DEBUG dnf: Adding packages from group 'server-product': {'polkit', 'timedatex', 'fedora-release-server', 'PackageKit', 'chrony', 'realmd'}
bad/var/log/anaconda/packaging.log:18:25:08,305 DEBUG dnf: Adding packages from group 'server-product': {'polkit', 'timedatex', 'fedora-release-server', 'PackageKit', 'chrony', 'realmd'}
bad/var/log/anaconda/packaging.log:18:25:08,312 INFO packaging: selected package: 'chrony'
bad/var/log/anaconda/packaging.log:18:26:14,434 INFO packaging: Installed: chrony-2.4-2.fc25.x86_64 1467116162 4a373e86ba61bd4075b5124a25ffa3ba95f9f658bc829f3f70eee748520a761e
bad/var/log/anaconda/packaging.log:18:26:14,486 INFO dnf.rpm: Installed: chrony-2.4-2.fc25.x86_64
bad/var/log/anaconda/packaging.log:18:26:14,486 INFO dnf.rpm: Installed: chrony-2.4-2.fc25.x86_64
bad/var/log/anaconda/program.log:23:24:40,384 INFO program: Running... systemctl start chronyd
bad/var/log/anaconda/program.log:18:25:08,105 INFO program: Running... systemctl status chronyd
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: * chronyd.service - NTP client/server
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
bad/var/log/anaconda/program.log:18:25:08,124 INFO program: Process: 1967 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: Process: 1954 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: Main PID: 1965 (chronyd)
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: CGroup: /system.slice/chronyd.service
bad/var/log/anaconda/program.log:18:25:08,125 INFO program: `-1965 /usr/sbin/chronyd
bad/var/log/anaconda/program.log:23:28:13,278 INFO program: Running... systemctl enable chronyd --root /mnt/sysimage

I wonder if this is because the domain enrolment that happens during install disables chrony in favour of ntpd - which refuses to make a large clock adjustment on initial boot? Of course, I don't know why the clocks in the VMs are off by such wild amounts in the first place (I'd have to look into the details of how os-autoinst sets up the VM clocks, I guess...), nor why this didn't happen in F24 tests...
The theory that enrollment disables chronyd is sound. We don't have chronyd support in FreeIPA yet, so we always disable it in favor of ntpd if ntpd is present and ipa-client-install was not told to ignore NTP entirely. You may want to add the '-g' option to ntpd so it can correct a time difference beyond 1000 s. FreeIPA uses a different set of options for IPA clients and IPA servers: servers are always started with '-g', while clients are not.
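To illustrate what that suggestion would look like (a hypothetical sketch, not what ipa-client-install currently writes): on Fedora, ntpd's startup flags normally come from the OPTIONS string in /etc/sysconfig/ntpd, so permitting one large initial correction would be something like:

```shell
# Hypothetical /etc/sysconfig/ntpd fragment (assumed location of ntpd's
# option string on Fedora). -g lets ntpd make a single initial correction
# larger than the 1000 s panic threshold instead of exiting.
OPTIONS="-g"
```

With that in place, the panic_stop exit seen in comment 8 would become a one-time clock step at startup instead.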
OK, so I think I spot the underlying bug here: after a completely vanilla install of Fedora Server 24, the system clock is set pretty much correctly. After a completely vanilla install of Fedora Server 25 it is not: it's set - in my timezone, at least - eight hours slow. We'll have to figure out why that is, I guess. Whether it's a blocker bug is an interesting question...but I think the result will be reproducible unless you happen to be in a timezone where the anaconda bug doesn't happen.

My first theory is that something in anaconda is attempting to compensate for timezones and getting it wrong, since the amount the clock is wrong by in my test (8 hours) is PST's offset from UTC (UTC-8)...though currently I'm in PDT. So if I'm right about that, you wouldn't notice the bug if your timezone happened to match UTC, and it *may* not affect you (I don't know how Kerberos behaves in that case) if your timezone is ahead of UTC rather than behind (and thus anaconda sets your clock in the future rather than the past). But this is all just conjecture so far; I'll have to do some more tests and investigation.
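One data point consistent with this conjecture: the ~17999 s skew the client VMs logged earlier is, within a second, five hours - America/New_York's standard (non-DST) offset from UTC, which is the timezone the kickstart sets - mirroring the PST-vs-PDT oddity noted above for my local installs. A quick arithmetic check (pure illustration):

```shell
#!/bin/sh
# Compare the skew chronyd/ntpd reported against the standard offset of
# the timezone the kickstart sets (America/New_York, UTC-5 standard).
skew=17999                    # seconds, from the logs in earlier comments
tz_offset=$(( 5 * 3600 ))     # 5-hour offset expressed in seconds
echo $(( tz_offset - skew ))  # prints 1: within a second of a whole 5-hour offset
```

That's a bit too neat to be a coincidence, which supports the "anaconda mangles the timezone compensation" theory.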
So I've started investigating the underlying issue in anaconda, and reported it as https://bugzilla.redhat.com/show_bug.cgi?id=1367647 , and marked that as blocking this bug.
We have a fix for the underlying anaconda bug now: https://bodhi.fedoraproject.org/updates/FEDORA-2016-8988869509 . I confirmed it makes anaconda behave itself with regard to time setting, and the installed system has the correct time from first boot. I couldn't directly confirm that it fixes this bug, for tedious reasons to do with openQA config, but I'm pretty sure it should; we'll find out when it gets pushed stable, if I don't fiddle about with openQA to test it sooner.
I am glad that the anaconda bug is fixed. However, I cannot see any bug in sssd, so I'm not sure why the state was changed to ON_QA. Feel free to reopen if you find a bug in sssd (or file a new one).
Please don't just unilaterally close release blocker bugs. They're important. I was keeping the bugs separate because we know this bug exists and we know there's an issue with anaconda's timekeeping, but we only *think* that bug causes this bug - we don't know for sure. This is the actual issue that is accepted as a blocker. I was just trying to avoid the same thing happening as with the other bug, where we assumed we knew the cause of the login failure but actually were wrong.
Then please change the component. You did not prove it is a bug in sssd. If you do not prove it's a bug in sssd, then I will keep closing this ticket.
...that's pretty rude.
(In reply to Adam Williamson from comment #19)
> ...that's pretty rude.

I'm sorry, but I do not want to have an assigned blocker bug against sssd if I cannot do anything about the bug. Log files do not lie (comment 6, comment 10). If it were not a blocker bug, I would not mind.
Woohoo, latest openQA tests confirm this is working. Update has gone stable.