Bug 1695297 - Upgrading a FreeIPA server from F29 to F30 broke with 389-ds-base-1.4.1.2-2.fc30
Summary: Upgrading a FreeIPA server from F29 to F30 broke with 389-ds-base-1.4.1.2-2.fc30
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: freeipa
Version: 30
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: IPA Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa AcceptedBlocker
Depends On:
Blocks: F30FinalBlocker
 
Reported: 2019-04-02 19:16 UTC by Adam Williamson
Modified: 2019-04-11 02:14 UTC (History)
18 users (show)

Fixed In Version: freeipa-4.7.2-7.fc30 freeipa-4.7.2-8.fc30
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-11 02:14:03 UTC


Attachments
Full tarball of /var/log from the failed test (server) (759.65 KB, application/octet-stream)
2019-04-02 19:17 UTC, Adam Williamson
log tarball from a more recent failure (without the yacc errors) (4.84 MB, application/octet-stream)
2019-04-02 20:26 UTC, Adam Williamson
named.run with debug output from bind (819.70 KB, text/plain)
2019-04-05 00:07 UTC, Adam Williamson

Description Adam Williamson 2019-04-02 19:16:25 UTC
This update:

https://bodhi.fedoraproject.org/updates/FEDORA-2019-d2572d74ef

has broken the openQA test that checks FreeIPA server upgrade.

What the test basically does is deploy a FreeIPA server and client on Fedora 29, upgrade both to Fedora 30, and then check that everything works properly afterwards.

That test was passing fine for all Fedora 30 updates until FEDORA-2019-d2572d74ef, where it failed:

https://openqa.fedoraproject.org/tests/374534

I did not notice this quickly enough to prevent the update from being automatically pushed stable, unfortunately. Since it was pushed stable, the test is now failing for *all* updates.

I'll attach the /var/log tarball from the failed test; it contains the whole of /var/log. Hopefully that will be enough to sort out the problem.

Comment 1 Adam Williamson 2019-04-02 19:17:05 UTC
Created attachment 1551136 [details]
Full tarball of /var/log from the failed test (server)

Comment 2 Adam Williamson 2019-04-02 19:49:03 UTC
So, something I noticed: in the logs of this failure, we can see that python3-pycparser is not upgraded as part of the 29->30 upgrade. It should be upgraded from python3-pycparser-2.14-15.fc29 to python3-pycparser-2.14-18.fc30, but it simply is not.

Looking at the logs from the most recent time the test passed, just before the 389-ds-base update went stable, pycparser *does* get updated as part of the upgrade in those logs.

I cannot immediately see any reason why the 389-ds-base update would result in pycparser not getting upgraded, though, so that's rather odd. I'm currently fiddling with the test to try to figure this out.

Comment 3 Adam Williamson 2019-04-02 20:24:35 UTC
OK, so it seems the pycparser thing is a red herring.

The first failure of this test - the one for the 389-ds-base update itself, whose logs I was looking at - happened 4 days ago; that's 2 days before the pycparser update went stable. So of course pycparser wasn't included in the upgrade.

I just did a new run of the test and checked the logs; pycparser is in the upgrade now, and the 'WARNING: yacc table file version is out of date' errors are gone, but the test is still failing and we still see this error:

Apr 02 13:15:08 ipa001.domain.local ipactl[743]: IPA version error: data needs to be upgraded (expected version '4.7.2-6.fc30', current version '4.7.2-1.1.fc29')
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Automatically running upgrade, for details see /var/log/ipaupgrade.log
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Be patient, this may take a few minutes.
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Failed to start named Service
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Shutting down
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Hint: You can use --ignore-service-failure option for forced start in case that a non-critical service failed
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Aborting ipactl
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Starting Directory Service
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Starting krb5kdc Service
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Starting kadmin Service
Apr 02 13:15:08 ipa001.domain.local ipactl[743]: Starting named Service
Apr 02 13:15:09 ipa001.domain.local systemd[1]: ipa.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 13:15:09 ipa001.domain.local systemd[1]: ipa.service: Failed with result 'exit-code'.
Apr 02 13:15:09 ipa001.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=ipa comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Apr 02 13:15:09 ipa001.domain.local systemd[1]: Failed to start Identity, Policy, Audit.

so, there's definitely something else going on here.

Comment 4 Adam Williamson 2019-04-02 20:26:55 UTC
Created attachment 1551139 [details]
log tarball from a more recent failure (without the yacc errors)

Comment 5 Adam Williamson 2019-04-02 20:30:22 UTC
These seem to be the errors from bind:

Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: running as: named-pkcs11 -u named -c /etc/named.conf
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: compiled by GCC 9.0.1 20190209 (Red Hat 9.0.1-0.4)
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: compiled with libxml2 version: 2.9.9
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: linked to libxml2 version: 20909
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: compiled with libjson-c version: 0.13.1
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: linked to libjson-c version: 0.13.1
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: compiled with zlib version: 1.2.11
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: linked to zlib version: 1.2.11
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: threads support is enabled
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: ----------------------------------------------------
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: BIND 9 is maintained by Internet Systems Consortium,
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: corporation.  Support and training for BIND 9 are
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: available at https://www.isc.org/support
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: ----------------------------------------------------
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: adjusted limit on open files from 524288 to 1048576
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: found 2 CPUs, using 2 worker threads
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: using 1 UDP listener per interface
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: using up to 21000 sockets
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: Configuration.cpp(94): Missing log.level in configuration. Using default value: INFO
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: Configuration.cpp(122): Missing slots.removable in configuration. Using default value: false
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: loading configuration from '/etc/named.conf'
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: unable to open '/etc/named.iscdlv.key'; using built-in keys instead
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: initializing GeoIP Country (IPv4) (type 1) DB
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GEO-106FREE 20180327 Build 1 Copyright (c) 2018 MaxMind Inc All Rights Reserved
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP Country (IPv6) (type 12) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP City (IPv4) (type 2) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP City (IPv4) (type 6) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP City (IPv6) (type 30) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP City (IPv6) (type 31) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP Region (type 3) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP Region (type 7) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP ISP (type 4) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP Org (type 5) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP AS (type 9) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP Domain (type 11) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GeoIP NetSpeed (type 10) DB not available
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: using default UDP/IPv4 port range: [32768, 60999]
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: using default UDP/IPv6 port range: [32768, 60999]
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: listening on IPv6 interfaces, port 53
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: listening on IPv4 interface lo, 127.0.0.1#53
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: listening on IPv4 interface eth0, 10.0.2.100#53
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: generating session key for dynamic DNS
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: sizing zone task pool based on 6 zones
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: none:106: 'max-cache-size 90%' - setting to 1791MB (out of 1990MB)
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: set up managed keys zone for view _default, file '/var/named/dynamic/managed-keys.bind'
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: loading DynDB instance 'ipa' driver '/usr/lib64/bind/ldap.so'
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: bind-dyndb-ldap version 11.1 compiled at 14:39:32 Jan 31 2019, compiler 9.0.1 20190129 (Red Hat 9.0.1-0.2)
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GSSAPI client step 1
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: GSSAPI client step 1
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: LDAP error: Invalid credentials: bind to LDAP server failed
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: couldn't establish connection in LDAP connection pool: permission denied
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: dynamic database 'ipa' configuration failed: permission denied
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: loading configuration: permission denied
Apr 02 13:15:01 ipa001.domain.local named-pkcs11[2696]: exiting (due to fatal error)

That looks like the real problem, I guess. dirsrv access log has this:

[02/Apr/2019:16:15:01.711235030 -0400] conn=9 op=0 RESULT err=49 tag=97 nentries=0 etime=0.0001799901 - SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Permission denied)

Comment 6 Adam Williamson 2019-04-03 19:46:54 UTC
Proposing as an F30 Final blocker, this violates Beta criterion "It must be possible to successfully complete a direct upgrade from a fully updated installation of each of the last two stable Fedora Server releases with the system configured as a FreeIPA domain controller or postgresql server as specified in the relevant criteria. The upgraded system must meet all relevant release criteria, including criteria relating to functionality of the server software" - https://fedoraproject.org/wiki/Fedora_30_Beta_Release_Criteria#Server_upgrade_requirements

Comment 7 Adam Williamson 2019-04-03 19:49:16 UTC
As expected, now the update is stable this is affecting F30 nightly composes tests as well as update tests...

https://openqa.fedoraproject.org/tests/377169

Rawhide also seems to be affected.

Comment 8 mreynolds 2019-04-03 20:01:05 UTC
> 
> That looks like the real problem, I guess. dirsrv access log has this:
> 
> [02/Apr/2019:16:15:01.711235030 -0400] conn=9 op=0 RESULT err=49 tag=97
> nentries=0 etime=0.0001799901 - SASL(-1): generic failure: GSSAPI Error:
> Unspecified GSS failure.  Minor code may provide more information
> (Permission denied)

This just means the Kerberos authentication failed.  Permission denied, but for what?  It's a bit vague, and it could be an issue outside of 389-ds-base.  What was the previous version of 389-ds-base where the upgrade passed?  Can that version be tested against the latest IPA server?  It would be nice to rule out some of these variables.

Comment 9 Adam Williamson 2019-04-03 20:20:29 UTC
"It's a bit vague, and it could be an issue outside of 389-ds-base."

The reason I'm quite confident 389-ds-base is the culprit is update testing. We test updates as they are submitted to Bodhi; the test works by using the current stable repo plus a side repo which contains *only* the packages from the update. Here is the Fedora 30 history of the failing test:

https://openqa.fedoraproject.org/tests/377203#next_previous

Change it to show 100 entries. You will note that the test consistently passed (well, soft failed - that's because of AVCs) for all updates for several weeks, with two exceptions: FEDORA-2019-caf75e133e (the update for 389-ds-base-1.4.1.2-1.fc30, where the test failed because it caught the dependency issue in the package that prevented it actually being installed as part of the test), and FEDORA-2019-d2572d74ef, the update for 389-ds-base-1.4.1.2-2.fc30: https://openqa.fedoraproject.org/tests/374534

Note that *only* those two instances of the test failed until recently. *After* those two failures, there are 8 other instances of the test being run on other updates, which all pass (soft fail); the last of these ran at 2019-04-02 00:41 UTC. But then the test suddenly starts failing on *all* updates, from FEDORA-2019-fb4683a2ee onwards; that run occurred at 2019-04-02 08:29 UTC.

What happened between 00:41 UTC and 08:29 UTC on 2019-04-02? FEDORA-2019-d2572d74ef - that is, 389-ds-base-1.4.1.2-2.fc30 - was pushed stable:

https://bodhi.fedoraproject.org/updates/FEDORA-2019-d2572d74ef

OK, the 'has been pushed to stable' message is actually timestamped 00:02 UTC, but in practice there is some lag between that message appearing and the package actually showing up in the 'stable' repo that the openQA tests use. So this all matches up very precisely: at first the test failed *only* for the 389-ds-base update, passing for all other updates before and after; then, once the update was pushed stable, the test suddenly started failing on all other updates too.

There has been no FreeIPA update in Fedora 30 since Bodhi was enabled on March 5th, so there is no 'latest IPA server' to confuse things here - no FreeIPA update could be involved.

Comment 10 mreynolds 2019-04-03 21:26:23 UTC
I'm not saying it's not 389-ds-base, but all I've been told is that this particular error message is why everything is failing.  But that error message is just saying that a SASL bind failed - which is not necessarily a bug in Directory Server.  Binds can fail for a lot of reasons: bad client, bad credentials, invalid Kerberos cache, etc.  I'm not sure what the "permission denied" sub-error message is.  Maybe it's an SELinux issue or simply invalid credentials?  After the initial failure is there a "bad" behaviour that we can reproduce on that system?  What is that test doing exactly?  What is doing the SASL bind that is failing, since you said IPA was not involved?  How is the DS instance being installed?

I just feel like the root cause has not been found, so I don't know what is actually wrong, or how to reproduce it locally.  There was one fix done in 1.4.1 that addressed a SASL bind crash - I will remove that fix and provide a scratch build to test.  I'll let you know when it's ready...

Comment 11 Adam Williamson 2019-04-03 22:36:25 UTC
I mean, that's just the obvious error that leaps out at me from the logs. The entire tarball of /var/log is attached to this bug; you can download that and peruse all the available logs at your leisure.

The 'bad' behaviour and the reason for the test failure is simply that FreeIPA itself - ipa.service - does not start up, because the name server (this specific instance of bind that FreeIPA uses) is a critical component of FreeIPA and it's failing to start. The test is written to check, after the upgrade process completes and the upgraded system is booted, that ipa.service started successfully - the test fails at this point, because that service did *not* start successfully.

FreeIPA is certainly involved, this is a FreeIPA test - I'm just saying that in Fedora 30 FreeIPA itself has not *changed* for several weeks, so there isn't a possibility that a change in FreeIPA's behaviour is what triggered the bug.

Comment 12 Adam Williamson 2019-04-03 22:56:58 UTC
Note, if you need me to set config options to get debug output from things, I can do that. Just let me know what needs to be set and I can tweak the test to do it and re-run it.

Comment 13 Adam Williamson 2019-04-03 23:03:32 UTC
Oh, one other note, the bug here seems to be *specific to the upgrade test*. There is a test which just deploys a FreeIPA server on F30 and then tests it with clients and stuff, and that test passed fine. The test that fails involves deploying a server and client on F29, then upgrading them to F30 and checking they still work after the upgrade. That's the test that fails. So the issue is likely to be related to whatever migrations happen in configuration etc. between the version that was in F29 and this new 1.4.1 series, I guess?

Comment 14 thierry bordaz 2019-04-04 06:47:25 UTC
Looking at the attached logs, it looks like the failure is related to named being unable to start because of the LDAP bind failure. From /var/log/messages:
Apr  2 16:15:01 ipa001 named-pkcs11[2696]: LDAP error: Invalid credentials: bind to LDAP server failed
Apr  2 16:15:01 ipa001 named-pkcs11[2696]: couldn't establish connection in LDAP connection pool: permission denied
Apr  2 16:15:01 ipa001 named-pkcs11[2696]: dynamic database 'ipa' configuration failed: permission denied
Apr  2 16:15:01 ipa001 named-pkcs11[2696]: loading configuration: permission denied
Apr  2 16:15:01 ipa001 named-pkcs11[2696]: exiting (due to fatal error)
Apr  2 16:15:01 ipa001 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=named-pkcs11 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Apr  2 16:15:01 ipa001 systemd[1]: named-pkcs11.service: Control process exited, code=exited, status=1/FAILURE
Apr  2 16:15:01 ipa001 systemd[1]: named-pkcs11.service: Failed with result 'exit-code'.
Apr  2 16:15:01 ipa001 systemd[1]: Failed to start Berkeley Internet Name Domain (DNS) with native PKCS#11.

Could you get help from someone from the named side? Just to confirm that at the failing step named should be able to authenticate successfully.

During this failure, is it possible to run strace against DS (open syscall) to learn more about 'permission denied'?
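A minimal sketch of how that strace could be captured, assuming a standard Fedora layout where the 389-ds server process is ns-slapd (the log path here is arbitrary):

```shell
# Trace file-access syscalls of the running 389-ds process; EACCES results
# would be the likely source of the "permission denied" seen by named-pkcs11.
strace -f -e trace=open,openat,access -o /tmp/slapd-strace.log \
    -p "$(pidof ns-slapd)" &

# Reproduce the failure while the trace is running:
systemctl start named-pkcs11.service

# Then inspect the denied accesses:
grep EACCES /tmp/slapd-strace.log
```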

Comment 15 Petr Menšík 2019-04-04 09:38:16 UTC
I think there is some problem with the connection from named to LDAP: the LDAP server refuses the bind-dyndb-ldap connection with LDAP_INVALID_CREDENTIALS, which the plugin translates to ISC_R_NOPERM.
Can you verify that the Kerberos keytab used can be used successfully by the named user to connect to the LDAP server?

Would this work?:
$ sudo -u named /bin/bash -i
$ kinit -k -t /etc/named.keytab
$ ldapsearch 

Some more details could be turned on by adding the named-pkcs11 parameter -d 10 to OPTIONS in /etc/sysconfig/named. I am not sure if it would log to syslog or to /var/named/data/named.run - one of those places for sure.
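A sketch of enabling that debug output, under the assumption that the named-pkcs11 unit reads the same /etc/sysconfig/named file as plain named:

```shell
# Ask bind for verbose debugging (-d 10). This appends blindly and
# assumes no OPTIONS line already exists in the file.
echo 'OPTIONS="-d 10"' >> /etc/sysconfig/named

systemctl restart named-pkcs11.service

# The debug log should show up in syslog or in named's working directory:
tail -f /var/named/data/named.run
```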

Comment 16 Adam Williamson 2019-04-04 22:11:48 UTC
"Could you get help from someone from named ? Just to confirm that at the failure step named should authenticate successfully."

It should. Compare the log from the equivalent non-upgrade test, which passes:

https://openqa.fedoraproject.org/tests/377028/file/role_deploy_domain_controller_check-var_log.tar.gz

There, we see this:

Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: bind-dyndb-ldap version 11.1 compiled at 14:39:32 Jan 31 2019, compiler 9.0.1 20190129 (Red Hat 9.0.1-0.2)
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 2
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 1
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: GSSAPI client step 2
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 10.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 16.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 17.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 18.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 19.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 20.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 21.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 22.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 23.172.IN-ADDR.ARPA
Apr 03 06:15:54 ipa001.domain.local named-pkcs11[359]: automatic empty zone: 24.172.IN-ADDR.ARPA

whereas in the failure case, we see the error after the first two "GSSAPI client step 1" messages.

Comment 17 Adam Williamson 2019-04-04 23:17:48 UTC
"during this failure, is it possible to run strace againt DS (open syscall) to know more about 'permission denied'."

that might be a *bit* tricky, as it's the FreeIPA service that starts up all these other bits and has them talk to each other; I'd have to look at how to hack them up to run 389-ds via strace (and doing so might make it so slow that we run into timing issues, I guess...)

"Would this work?:
$ sudo -u named /bin/bash -i
$ kinit -k -t /etc/named.keytab
$ ldapsearch
"

I hacked up the test to try that (well, more or less - as the tests usually run as root, I used runuser, not sudo). The result is that the 'kinit' command returns "kinit: Keytab contains no suitable keys for host/ipa001.domain.local@DOMAIN.LOCAL while getting initial credentials", and ldapsearch returns "ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1)". Is that helpful?

Trying to get the extra log info from -d 10 now; not sure if this named-pkcs11 variant picks up the /etc/sysconfig option, though.
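One possible reading of the kinit error above: kinit -k without an explicit principal defaults to host/<fqdn>, while the keytab FreeIPA writes for named conventionally holds a DNS/ service principal. A hedged sketch to check that (the DNS/ principal name below is the conventional one, not confirmed from these logs):

```shell
# List the principals actually stored in named's keytab:
klist -kt /etc/named.keytab

# Retry kinit naming a principal the keytab really contains, e.g.:
runuser -u named -- kinit -k -t /etc/named.keytab \
    DNS/ipa001.domain.local@DOMAIN.LOCAL
```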

Comment 18 Adam Williamson 2019-04-05 00:07:32 UTC
Created attachment 1552157 [details]
named.run with debug output from bind

Ah, the debug output did show up in named.run, here it is.

Comment 19 mreynolds 2019-04-05 18:03:33 UTC
DS & krb5kdc.log analysis
-----------------------------------------------

First SASL bind in the DS log fails, but we get a "preauth pkinit failure" just before the bind:


[02/Apr/2019:15:50:35.610857758 -0400] conn=1 fd=64 slot=64 connection from ::1 to ::1

------> krb5 log:
    # Apr 02 15:50:39 ipa001.domain.local krb5kdc[29193](Error): preauth pkinit failed to initialize: PKINIT initialization failed: Cannot read certificate file '/var/kerberos/krb5kdc/kdc.crt': No such file or directory
<------

[02/Apr/2019:15:51:47.552330009 -0400] conn=115 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[02/Apr/2019:15:51:47.554081296 -0400] conn=115 op=0 RESULT err=49 tag=97 nentries=0 etime=0.0001817590 - SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Permission denied)
[02/Apr/2019:15:51:47.559604714 -0400] conn=115 op=-1 fd=104 closed - B1

Then in the krb5 log we see this when the bind fails:

Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19)}) 10.0.2.100: NEEDED_PREAUTH: host/ipa001.domain.local@DOMAIN.LOCAL for krbtgt/DOMAIN.LOCAL@DOMAIN.LOCAL, Additional pre-authentication required
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): closing down fd 11
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](Error): No default authorization data types available, no authorization data will be added.
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19)}) 10.0.2.100: ISSUE: authtime 1554234707, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, host/ipa001.domain.local@DOMAIN.LOCAL for krbtgt/DOMAIN.LOCAL@DOMAIN.LOCAL
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): closing down fd 11
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](Error): No default authorization data types available, no authorization data will be added.
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 10.0.2.100: ISSUE: authtime 1554234707, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, host/ipa001.domain.local@DOMAIN.LOCAL for ldap/ipa001.domain.local@DOMAIN.LOCAL
Apr 02 15:51:47 ipa001.domain.local krb5kdc[29195](info): closing down fd 11


Then the server is restarted, and now all the SASL binds work:

[02/Apr/2019:15:52:41.716465243 -0400] conn=1 fd=64 slot=64 connection from local to /var/run/slapd-DOMAIN-LOCAL.socket
...
[02/Apr/2019:15:52:43.077956409 -0400] conn=7 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[02/Apr/2019:15:52:43.088752711 -0400] conn=7 op=0 RESULT err=14 tag=97 nentries=0 etime=0.0010837884, SASL bind in progress
[02/Apr/2019:15:52:43.088947593 -0400] conn=7 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[02/Apr/2019:15:52:43.089528579 -0400] conn=7 op=1 RESULT err=14 tag=97 nentries=0 etime=0.0000609022, SASL bind in progress
[02/Apr/2019:15:52:43.089837381 -0400] conn=7 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[02/Apr/2019:15:52:43.090567725 -0400] conn=7 op=2 RESULT err=0 tag=97 nentries=0 etime=0.0000758520


But I still see krb5 error messages like this even though the binds seem to be working (so maybe this is a red herring?):

Apr 02 15:52:43 ipa001.domain.local krb5kdc[29195](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19)}) 10.0.2.100: NEEDED_PREAUTH: host/ipa001.domain.local@DOMAIN.LOCAL for krbtgt/DOMAIN.LOCAL@DOMAIN.LOCAL, Additional pre-authentication required
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29195](info): closing down fd 11
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29196](Error): No default authorization data types available, no authorization data will be added.
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29196](info): AS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19)}) 10.0.2.100: ISSUE: authtime 1554234763, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, host/ipa001.domain.local@DOMAIN.LOCAL for krbtgt/DOMAIN.LOCAL@DOMAIN.LOCAL
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29196](info): closing down fd 11
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29195](Error): No default authorization data types available, no authorization data will be added.
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29195](info): TGS_REQ (8 etypes {aes256-cts-hmac-sha1-96(18), aes128-cts-hmac-sha1-96(17), aes256-cts-hmac-sha384-192(20), aes128-cts-hmac-sha256-128(19), DEPRECATED:des3-cbc-sha1(16), DEPRECATED:arcfour-hmac(23), camellia128-cts-cmac(25), camellia256-cts-cmac(26)}) 10.0.2.100: ISSUE: authtime 1554234763, etypes {rep=aes256-cts-hmac-sha1-96(18), tkt=aes256-cts-hmac-sha1-96(18), ses=aes256-cts-hmac-sha1-96(18)}, host/ipa001.domain.local@DOMAIN.LOCAL for ldap/ipa001.domain.local@DOMAIN.LOCAL
Apr 02 15:52:43 ipa001.domain.local krb5kdc[29195](info): closing down fd 11



The server is restarted and I see this error message in the krb5 log, but the server is restarted again before another SASL bind is attempted:

    Apr 02 15:53:07 ipa001.domain.local krb5kdc[32442](Error): preauth pkinit failed to initialize: PKINIT initialization failed: Cannot open file '/var/kerberos/krb5kdc/cacert.pem': No such file or directory


After that restart the SASL binds are working correctly.  Then the server is restarted one more time, and I don't see any "preauth pkinit" error messages in the krb5 log, but it fails on first SASL bind attempt again:

[02/Apr/2019:16:15:00.902916822 -0400] conn=1 fd=64 slot=64 connection from ::1 to ::1
...
[02/Apr/2019:16:15:01.706762826 -0400] conn=9 fd=68 slot=68 connection from local to /var/run/slapd-DOMAIN-LOCAL.socket
[02/Apr/2019:16:15:01.709547501 -0400] conn=9 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[02/Apr/2019:16:15:01.711235030 -0400] conn=9 op=0 RESULT err=49 tag=97 nentries=0 etime=0.0001799901 - SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Permission denied)
[02/Apr/2019:16:15:01.714371581 -0400] conn=9 op=1 UNBIND
[02/Apr/2019:16:15:01.714385946 -0400] conn=9 op=1 fd=68 closed - U1



Well, I don't know if the krb5 errors are related to the randomly failing SASL binds, as they seem to be inconsistent between the good and bad SASL binds.  Do we know which failed bind broke the test?  Was it the first one at 02/Apr/2019:15:51:47 or the second one at 02/Apr/2019:16:15:01?

Anyway, I'm not sure what would have changed in DS to change how SASL binds work.  There was one fix done around SASL binds, but I don't see how that could explain these errors.  I might still do a scratch build so we can rule it out, but I'm not hopeful it is the root cause.

Also, from what I am seeing the "Permission denied" error can be related to the keytab, but that doesn't explain why it fails, then works, then fails.  Could IPA be updating the keytab file multiple times?  An "ls -l /etc/dirsrv" of the keytabs would be nice, to see when they were last modified.  Do we need to do another test run to get a fresh set of this data?  Or was it also tarred up?
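A sketch of the keytab checks being asked for here (the DS keytab path below is the conventional /etc/dirsrv/ds.keytab, an assumption):

```shell
# When were the keytabs last modified, relative to the failing binds?
ls -l /etc/dirsrv/

# Principals and key version numbers (KVNOs) in the DS keytab; a KVNO
# out of step with the KDC's copy would explain intermittent GSSAPI failures.
klist -kt /etc/dirsrv/ds.keytab
```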

Comment 21 Adam Williamson 2019-04-05 18:40:18 UTC
There's a run at https://openqa.stg.fedoraproject.org/tests/512457 where you should be able to correlate the log messages. See https://openqa.stg.fedoraproject.org/tests/512457#downloads for all the blobs - the named.run file is there, also the /var/log tarball, from which you should be able to figure out what the system was doing at the various times. This is *not* the same run as the named.run attached to this bug, note, I had to tweak things a bit to get it cleaner.

Also, you may find autoinst-log.txt useful; that's effectively the test execution log. It has a *lot* of stuff in it which will not make much sense to you, but if you want to find out when a certain command was actually typed in, just search that log file for the command and you should be able to find it, with a timestamp (UTC, I believe).

Comment 23 Lukas Slebodnik 2019-04-05 20:12:31 UTC
It is quite simple to explain why this is a problem only on upgrade
(especially if you check the state of the machine after an f29->f30 upgrade).

sh-5.0# systemctl --failed
  UNIT                               LOAD   ACTIVE SUB    DESCRIPTION                                            
● ipa.service                        loaded failed failed Identity, Policy, Audit                                
● named-pkcs11.service               loaded failed failed Berkeley Internet Name Domain (DNS) with native PKCS#11
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online                            

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

3 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

sh-5.0# find /etc/ -name "*.rpmsave"
/etc/sysconfig/dirsrv.systemd.rpmsave
/etc/sysconfig/dirsrv.rpmsave

sh-5.0# ipactl start
Existing service file detected!
Assuming stale, cleaning and proceeding
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Failed to start named Service
Shutting down
Hint: You can use --ignore-service-failure option for forced start in case that a non-critical service failed
Aborting ipactl


sh-5.0# find /etc/ -name "*.rpmsave"
/etc/sysconfig/dirsrv.systemd.rpmsave
/etc/sysconfig/dirsrv.rpmsave
sh-5.0# cp /etc/sysconfig/dirsrv.rpmsave /etc/sysconfig/dirsrv

sh-5.0# ipactl start
Existing service file detected!
Assuming stale, cleaning and proceeding
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting httpd Service
Starting ipa-custodia Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
Starting ipa-dnskeysyncd Service
ipa: INFO: The ipactl command was successful

Comment 24 Lukas Slebodnik 2019-04-05 20:14:26 UTC
sh# diff -u /etc/sysconfig/dirsrv.rpmsave /etc/sysconfig/dirsrv
--- /etc/sysconfig/dirsrv.rpmsave       2019-04-05 14:45:05.564230448 -0400
+++ /etc/sysconfig/dirsrv       2019-04-05 19:26:01.412897691 -0400
@@ -1,51 +1 @@
-# This file is sourced by dirsrv upon startup to set
-# the default environment for all directory server instances.
-# To set instance specific defaults, use the file in the same
-# directory called dirsrv-instance where "instance"
-# is the name of your directory server instance e.g.
-# dirsrv-localhost for the slapd-localhost instance.
-
-# This file is in systemd EnvironmentFile format - see man systemd.exec
-
-# In order to make more file descriptors available to the directory server,
-# first make sure the system hard limits are raised, then use ulimit -
-# uncomment out the following line and change the value to the desired value
-#ulimit -n 8192
-# note - if using systemd, ulimit won't work -  you must edit the systemd unit
-# file for directory server to add the LimitNOFILE option - see "man
-# systemd.exec" for more info
-
-# A per instance keytab does not make much sense for servers.  Kerberos clients
-# use the machine FQDN to obtain a ticket like ldap/FQDN, there is nothing that
-# can make a client understand how to get a per-instance ticket.  Therefore by
-# default a keytab should be considered a per server option.
-#
-# Also this file is sourced for all instances, so again all instances would
-# ultimately get the same keytab.
-#
-# Finally a keytab is normally named either krb5.keytab or <service>.keytab
-#
-# In order to use SASL/GSSAPI (Kerberos) the directory server needs to know
-# where to find its keytab file - uncomment the following line and set the
-# path and filename appropriately.
-# If using systemd, omit the "; export VARNAME" at the end.
-#KRB5_KTNAME=/etc/dirsrv/myname.keytab ; export KRB5_KTNAME
-
-# How many seconds to wait for the startpid file to show up before we assume
-# there is a problem and fail to start.
-# If using systemd, omit the "; export STARTPID_TIME" at the end.
-#STARTPID_TIME=10 ; export STARTPID_TIME
-
-# How many seconds to wait for the pid file to show up before we assume there
-# is a problem and fail to start.
-# If using systemd, omit the "; export PID_TIME" at the end.
-#PID_TIME=600 ; export PID_TIME
-
-# jemalloc is a general purpose malloc implementation that emphasizes
-# fragmentation avoidance and scalable concurrency support.  jemalloc
-# has been shown to have a significant positive impact on the Directory
-# Server's process size/growth.
-LD_PRELOAD=/usr/lib64/dirsrv/lib/libjemalloc.so.2
-ulimit -c unlimited
 KRB5CCNAME=/tmp/krb5cc_389
-KRB5_KTNAME=/etc/dirsrv/ds.keytab

^^^ you can see the issue on the last line.
It explains why bind-dyndb-ldap could not connect to dirsrv and thus why named-pkcs11 failed to start.
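A minimal sketch of the failure mode: the upgraded /etc/sysconfig/dirsrv no longer defines KRB5_KTNAME, so slapd cannot find its keytab and GSSAPI binds fail with "Permission denied". The check below uses a temp file simulating the post-upgrade contents (assumption: only KRB5CCNAME survived, as the diff above shows) so it runs anywhere:

```shell
# Sketch: check whether the dirsrv sysconfig file still exports the
# keytab location that GSSAPI binds depend on. A temp file simulating
# the post-upgrade contents stands in for /etc/sysconfig/dirsrv.
cfg=$(mktemp)
printf 'KRB5CCNAME=/tmp/krb5cc_389\n' > "$cfg"

if grep -q '^KRB5_KTNAME=' "$cfg"; then
    status="present"
else
    status="missing"   # the post-upgrade state that broke named-pkcs11
fi
echo "KRB5_KTNAME is $status"

rm -f "$cfg"
```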

Comment 25 Lukas Slebodnik 2019-04-05 20:24:45 UTC
I would say the issue is that you removed too many config entries:
- %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}
- %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}.systemd 


Especially considering that the file is still referenced by the systemd unit file:

sh# systemctl cat dirsrv@EXAMPLE-TEST.service | grep Env
EnvironmentFile=-/etc/sysconfig/dirsrv
EnvironmentFile=-/etc/sysconfig/dirsrv-%i

I assume that %{_sysconfdir}/sysconfig/%{pkgname}.systemd is not needed any more due to upstream changes,
but you still need to keep %{_sysconfdir}/sysconfig/%{pkgname}.

Mark,
Does it make sense to you?

Or do you want to do some scriptlet magic to cover the upgrade?
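For reference, keeping the file in the spec would look roughly like this (a hypothetical fragment reusing the %{pkgname} macro named above; the real 389-ds-base spec layout may differ):

```spec
# Hypothetical %files entry: keep shipping the sysconfig file so that
# an upgrade does not drop KRB5_KTNAME; noreplace preserves local edits.
%config(noreplace) %{_sysconfdir}/sysconfig/%{pkgname}
```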

Comment 26 mreynolds 2019-04-05 21:25:50 UTC
(In reply to Lukas Slebodnik from comment #25)
> I would say the issue is that you removed too many config entries:
> - %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}
> - %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}.systemd 
> 
> 
> Especially considering that the file is still referenced by the systemd unit file
> 
> sh# systemctl cat dirsrv@EXAMPLE-TEST.service | grep Env
> EnvironmentFile=-/etc/sysconfig/dirsrv
> EnvironmentFile=-/etc/sysconfig/dirsrv-%i
> 
> I assume that %{_sysconfdir}/sysconfig/%{pkgname}.systemd is not needed any
> more due to upstream changes, but you still need to keep
> %{_sysconfdir}/sysconfig/%{pkgname}.
> 
> Mark,
> Does it make sense to you?
> 
> Or do you want to do some scriptlet magic to cover the upgrade?

Lukas, thanks for the investigation! You found the root cause, but unfortunately it's not just a spec file change. It was a commit that happened pretty early on in 1.4.1, and we now have a lot of code written around it that cannot easily be undone.

See https://pagure.io/389-ds-base/pull-request/50213

It looks like FreeIPA has done its part, but the fix has not been released on Fedora yet:

https://pagure.io/freeipa/issue/7860

So once FreeIPA does its next build, this should all work.

Comment 27 Lukas Slebodnik 2019-04-05 22:57:49 UTC
(In reply to mreynolds from comment #26)
> (In reply to Lukas Slebodnik from comment #25)
> > I would say the issue is that you removed too many config entries:
> > - %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}
> > - %config(noreplace)%{_sysconfdir}/sysconfig/%{pkgname}.systemd 
> > 
> > 
> > Especially considering that the file is still referenced by the systemd unit file
> > 
> > sh# systemctl cat dirsrv@EXAMPLE-TEST.service | grep Env
> > EnvironmentFile=-/etc/sysconfig/dirsrv
> > EnvironmentFile=-/etc/sysconfig/dirsrv-%i
> > 
> > I assume that %{_sysconfdir}/sysconfig/%{pkgname}.systemd is not needed any
> > more due to upstream changes, but you still need to keep
> > %{_sysconfdir}/sysconfig/%{pkgname}.
> > 
> > Mark,
> > Does it make sense to you?
> > 
> > Or do you want to do some scriptlet magic to cover the upgrade?
> 
> Lukas, thanks for the investigation! You found the root cause, but
> unfortunately it's not just a spec file change. It was a commit that
> happened pretty early on in 1.4.1, and we now have a lot of code written
> around it that cannot easily be undone.
> 
> See https://pagure.io/389-ds-base/pull-request/50213
> 
> It looks like FreeIPA has done its part, but the fix has not been released
> on Fedora yet:
> 
> https://pagure.io/freeipa/issue/7860
> 
> So once FreeIPA does its next build, this should all work.

Then you should probably move this BZ to a different component.

Comment 28 Fedora Update System 2019-04-06 09:25:20 UTC
freeipa-4.7.2-7.fc30 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-540b1780f5

Comment 29 Alexander Bokovoy 2019-04-06 13:15:19 UTC
Looks like this problem is fixed with freeipa-4.7.2-7.fc30. While https://openqa.fedoraproject.org/tests/378554 and https://openqa.fedoraproject.org/tests/378555 both soft-failed, the server was deployed, upgraded, and started fine after the upgrade. The upgrade issue seen in the server test is a hang, and it is the same hang I experienced in a plain F29 to F30 workstation upgrade, so it is not related to FreeIPA/389-ds-base.

Comment 30 Adam Williamson 2019-04-06 14:23:44 UTC
Yep, both those results are in the expected range, so indeed the bug appears to be fixed. Thanks, everyone!

Comment 31 Fedora Update System 2019-04-07 02:44:35 UTC
freeipa-4.7.2-8.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-540b1780f5

Comment 32 Adam Williamson 2019-04-09 00:39:02 UTC
Discussed at 2019-04-08 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2019-04-08/ . Accepted as a blocker as a violation of Beta criterion "It must be possible to successfully complete a direct upgrade from a fully updated installation of each of the last two stable Fedora Server releases with the system configured as a FreeIPA domain controller or postgresql server as specified in the relevant criteria. The upgraded system must meet all relevant release criteria, including criteria relating to functionality of the server software."

Comment 33 Fedora Update System 2019-04-11 02:14:03 UTC
freeipa-4.7.2-8.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

