Bug 1323400 - freeipa fails to start correctly after pki-core update on upgraded system
Summary: freeipa fails to start correctly after pki-core update on upgraded system
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: pki-core
Version: 23
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Fraser Tweedale
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
Depends On:
Blocks: 1348976
TreeView+ depends on / blocked
 
Reported: 2016-04-02 14:28 UTC by Adam Williamson
Modified: 2020-10-04 21:08 UTC (History)
15 users (show)

Fixed In Version: pki-core-10.2.6-20.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1348976 (view as bug list)
Environment:
Last Closed: 2016-12-20 19:45:10 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
journal from the affected time period (96.86 KB, text/plain)
2016-04-02 14:30 UTC, Adam Williamson
no flags Details
server.xml from affected system (15.57 KB, text/plain)
2016-04-03 16:39 UTC, Adam Williamson
no flags Details
tomcat ca 'system' log (22.46 KB, text/plain)
2016-04-04 19:20 UTC, Adam Williamson
no flags Details
pki-tomcat ca debug log (93.67 KB, text/plain)
2016-04-14 19:24 UTC, Adam Williamson
no flags Details
certutil output (3.79 KB, text/plain)
2016-04-19 18:51 UTC, Adam Williamson
no flags Details
ldapsearch output (3.24 KB, text/plain)
2016-04-19 18:52 UTC, Adam Williamson
no flags Details
openssl output (4.60 KB, text/plain)
2016-04-19 18:53 UTC, Adam Williamson
no flags Details
ipaupgrade.log from the clean F21 install -> F23 upgrade reproducer (3.25 MB, text/plain)
2016-06-22 02:17 UTC, Adam Williamson
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github dogtagpki pki issues 2405 0 None closed freeipa fails to start correctly after pki-core update on upgraded system 2021-01-22 12:11:21 UTC
Red Hat Bugzilla 1289382 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 1289382

Description Adam Williamson 2016-04-02 14:28:42 UTC
I woke up this morning to find my FreeIPA server down. Looks like an automatic DNF update happened on the box and FreeIPA failed to recover afterwards. Attaching the journal for now; please let me know what else I can attach (maybe there's somewhere with more details on the pki-tomcat issue, but tomcat logging is insane so I've no idea where to look). The SSL cipher errors look obviously suspicious but I don't know how fatal they are. This is a system which has been upgraded all the way from F19.

Comment 1 Adam Williamson 2016-04-02 14:30:59 UTC
Created attachment 1142810 [details]
journal from the affected time period

Comment 2 Adam Williamson 2016-04-02 14:46:19 UTC
Trying to start IPA again fails in the same way. The TLS cipher issue seems the most likely culprit, I don't have time to start poking around at server.xml now, I'll try it later.

Comment 3 Adam Williamson 2016-04-03 16:38:05 UTC
So pki-tomcat last started successfully on March 28 at 14:18. It first failed to start on April 1 at 15:28. Between those times these packages were updated:

tzdata
pki-core (10.2.6-13.fc23 to 10.2.6-16.fc23)
freeipa (4.2.3-2.fc23 to 4.2.4-1.fc23)
sssd (1.13.3-5.fc23 to 1.13.3-6.fc23)
apache-commons-collections

the pki-core update did not trigger a pki-tomcat restart, which is why the problem only showed up after the freeipa update, but it's actually the cause of the problem. I just downgraded to 10.2.6-13.fc23 and IPA started up successfully after that.

Will attach my server.xml for inspection.

Comment 4 Adam Williamson 2016-04-03 16:39:09 UTC
Created attachment 1143048 [details]
server.xml from affected system

Comment 5 Ade Lee 2016-04-04 13:41:29 UTC
I don't see anything that necessarily indicates that the problem is ciphers.
Can you check the dogtag logs under /var/log/pki/pki-tomcat/ for any updates during the specified time?  We would particularly be interested in the ca/debug log, but sme exception/stack trace might be seen in the system log or others.

Comment 6 Adam Williamson 2016-04-04 15:47:38 UTC
I'll have a look, but I was looking at specifically these messages:

Apr 01 15:28:38 id.happyassassin.net server[6192]: WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'strictCiphers' to 'true' did not find a matching property.
Apr 01 15:28:38 id.happyassassin.net server[6192]: Apr 01, 2016 3:28:38 PM org.apache.catalina.startup.SetAllPropertiesRule begin
Apr 01 15:28:38 id.happyassassin.net server[6192]: WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'sslOptions' to 'ssl2=true,ssl3=true,tls=true' did not find a matching property.
Apr 01 15:28:38 id.happyassassin.net server[6192]: Apr 01, 2016 3:28:38 PM org.apache.catalina.startup.SetAllPropertiesRule begin
Apr 01 15:28:38 id.happyassassin.net server[6192]: WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'ssl2Ciphers' to '-SSL2_RC4_128_WITH_MD5,-SSL2_RC4_128_EXPORT40_WITH_MD5,-SSL2_RC2_128_CBC_WITH_MD5,-SSL2_RC2_128_CBC_EXPORT40_WITH_MD5,-SSL2_DES_64_CBC_WITH_MD5,-SSL2_DES_192_EDE3_CBC_WITH_MD5' did not find a matching property.
Apr 01 15:28:38 id.happyassassin.net server[6192]: Apr 01, 2016 3:28:38 PM org.apache.catalina.startup.SetAllPropertiesRule begin
Apr 01 15:28:38 id.happyassassin.net server[6192]: WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'ssl3Ciphers' to '-SSL3_FORTEZZA_DMS_WITH_NULL_SHA,-SSL3_FORTEZZA_DMS_WITH_RC4_128_SHA,+SSL3_RSA_WITH_RC4_128_SHA,-SSL3_RSA_EXPORT_WITH_RC4_40_MD5,+SSL3_RSA_WITH_3DES_EDE_CBC_SHA,+SSL3_RSA_WITH_DES_CBC_SHA,-SSL3_RSA_EXPORT_WITH_RC2_CBC_40_MD5,-SSL3_FORTEZZA_DMS_WITH_FORTEZZA_CBC_SHA,-SSL_RSA_FIPS_WITH_DES_CBC_SHA,+SSL_RSA_FIPS_WITH_3DES_EDE_CBC_SHA,-SSL3_RSA_WITH_NULL_MD5,-TLS_RSA_EXPORT1024_WITH_RC4_56_SHA,-TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA,+TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA' did not find a matching property.
Apr 01 15:28:38 id.happyassassin.net server[6192]: Apr 01, 2016 3:28:38 PM org.apache.catalina.startup.SetAllPropertiesRule begin
Apr 01 15:28:38 id.happyassassin.net server[6192]: WARNING: [SetAllPropertiesRule]{Server/Service/Connector} Setting property 'tlsCiphers' to '-TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA,-TLS_ECDH_ECDSA_WITH_3DES_EDE_CBC_SHA,+TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA,+TLS_ECDH_RSA_WITH_AES_128_CBC_SHA,+TLS_ECDH_RSA_WITH_AES_256_CBC_SHA,-TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA,+TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,+TLS_RSA_WITH_3DES_EDE_CBC_SHA,+TLS_RSA_WITH_AES_128_CBC_SHA,+TLS_RSA_WITH_AES_256_CBC_SHA,+TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA,+TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,-TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,-TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,-TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,+TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA,+TLS_DHE_DSS_WITH_AES_128_CBC_SHA,+TLS_DHE_DSS_WITH_AES_256_CBC_SHA,+TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA,+TLS_DHE_RSA_WITH_AES_128_CBC_SHA,+TLS_DHE_RSA_WITH_AES_256_CBC_SHA' did not find a matching property.

there are several other 'did not find a matching property' warnings indeed, but those seem to cover the entire TLS/SSL configuration. Still, you're the expert :)

Comment 7 Endi Sukma Dewata 2016-04-04 18:41:41 UTC
Hi, I think the above warnings are unrelated to the startup failure (see https://fedorahosted.org/pki/ticket/1352).

Could you attach the CA debug log file (i.e. /var/log/pki/pki-tomcat/ca/debug)?

Comment 8 Adam Williamson 2016-04-04 19:12:58 UTC
there doesn't seem to be an awful lot there, for the affected tries:

[01/Apr/2016:15:28:42][localhost-startStop-1]: ============================================
[01/Apr/2016:15:28:42][localhost-startStop-1]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[01/Apr/2016:15:28:42][localhost-startStop-1]: ============================================
[01/Apr/2016:15:34:04][localhost-startStop-1]: ============================================
[01/Apr/2016:15:34:04][localhost-startStop-1]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[01/Apr/2016:15:34:04][localhost-startStop-1]: ============================================
[02/Apr/2016:07:32:03][localhost-startStop-1]: ============================================
[02/Apr/2016:07:32:03][localhost-startStop-1]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[02/Apr/2016:07:32:03][localhost-startStop-1]: ============================================
netscape.ldap.LDAPException: Server or network error (81)
        at netscape.ldap.LDAPConnThread.networkError(Unknown Source)
        at netscape.ldap.LDAPConnThread.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:745)
[03/Apr/2016:09:36:34][localhost-startStop-1]: ============================================
[03/Apr/2016:09:36:34][localhost-startStop-1]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[03/Apr/2016:09:36:34][localhost-startStop-1]: ============================================
[03/Apr/2016:09:36:40][http-bio-8443-exec-2]: CMSServlet: curDate=Sun Apr 03 09:36:40 PDT 2016 id=caGetStatus time=27
[03/Apr/2016:15:46:24][localhost-startStop-2]: ============================================
[03/Apr/2016:15:46:24][localhost-startStop-2]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[03/Apr/2016:15:46:24][localhost-startStop-2]: ============================================

Note there are two or three startup attempts there - 01/Apr/2016:15:28:42 is where it initially went down - and only one apparently couldn't hit the LDAP server, that was probably while I was playing around trying to fix it. The other failures don't seem to log anything interesting. 03/Apr/2016:09:36:34 is when it started successfully after downgrade of pki-core packages.

Comment 9 Adam Williamson 2016-04-04 19:18:01 UTC
ca/system has some interesting (I think) lines:

0.profileChangeMonitor - [01/Apr/2016:15:28:12 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:13 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:14 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:15 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:16 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:17 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:18 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:19 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:20 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:21 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:22 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:23 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:24 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:25 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:26 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:27 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:28 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:29 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:30 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:31 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:32 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:33 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
0.profileChangeMonitor - [01/Apr/2016:15:28:34 PDT] [8] [3] In Ldap (bound) connection pool to host id.happyassassin.net port 636, Cannot connect to LDAP server. Error: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)

there is nothing after that in the log at all - no errors apparently appeared on the subsequent failed attempts to start IPA, or the successful one after the downgrade.

Comment 10 Adam Williamson 2016-04-04 19:20:14 UTC
Created attachment 1143433 [details]
tomcat ca 'system' log

Now I look through the whole system log, similar SSL socket errors appear quite regularly, so that's probably not the whole story...here's the whole file.

Comment 11 Matthew Harmsen 2016-04-04 23:50:52 UTC
(In reply to Adam Williamson from comment #3)
> So pki-tomcat last started successfully on March 28 at 14:18. It first
> failed to start on April 1 at 15:28. Between those times these packages were
> updated:
> 
> tzdata
> pki-core (10.2.6-13.fc23 to 10.2.6-16.fc23)
> freeipa (4.2.3-2.fc23 to 4.2.4-1.fc23)
> sssd (1.13.3-5.fc23 to 1.13.3-6.fc23)
> apache-commons-collections

What version of 'tomcat' are you using?

> 
> the pki-core update did not trigger a pki-tomcat restart, which is why the
> problem only showed up after the freeipa update, but it's actually the cause
> of the problem. I just downgraded to 10.2.6-13.fc23 and IPA started up
> successfully after that.
> 
> Will attach my server.xml for inspection.

Comment 12 Adam Williamson 2016-04-04 23:51:41 UTC
tomcat-8.0.32-5.fc23.noarch

Comment 13 Endi Sukma Dewata 2016-04-05 01:57:11 UTC
I'm unable to reproduce the problem on F23. I tried installing pki-core 10.2.6-13 and freeipa 4.2.3-2 then upgraded them to 10.2.6-16 and 4.2.4-1, respectively. The server seems to be working fine after upgrade and after restart.

It's possible there are remnants of F19 installation that are causing this problem that I cannot replicate with the above method, or there is something specific about your environment. Could you try increasing the debug level and try upgrading again? Thanks.

http://pki.fedoraproject.org/wiki/PKI_Server_Logs#Debug_Log

Comment 14 Adam Williamson 2016-04-14 19:24:16 UTC
sorry I got distracted :(

Here is the CA debug file; the bug is still happening. I don't see anything here, but maybe you will.

This is everything from after I enabled the debug settings. First I did `systemctl restart pki-tomcatd` , which appeared to go OK and the service showed as 'running' afterwards, but I didn't leave it very long. I instead did `systemctl restart ipa.service`, which takes a long time and eventually fails, with `systemctl --failed` showing ipa.service , kadmin.service , and pki-tomcatd as failed.

Comment 15 Adam Williamson 2016-04-14 19:24:46 UTC
Created attachment 1147363 [details]
pki-tomcat ca debug log

Comment 16 Adam Williamson 2016-04-14 19:28:07 UTC
I'm wondering if maybe tomcat is actually starting fine, but ipactl doesn't *think* it is, for some reason? I'm gonna take a quick look at how ipactl decides when something is done starting.

Comment 17 Adam Williamson 2016-04-14 19:35:02 UTC
aha, yes, it seems that's something like what's happening. ipactl -d output:

ipa: DEBUG: Waiting until the CA is running
ipa: DEBUG: Starting external process
ipa: DEBUG: args='/usr/bin/wget' '-S' '-O' '-' '--timeout=30' '--no-check-certificate' 'https://id.happyassassin.net:8443/ca/admin/ca/getStatus'
ipa: DEBUG: Process finished, return code=4
ipa: DEBUG: stdout=
ipa: DEBUG: stderr=--2016-04-14 12:32:51--  https://id.happyassassin.net:8443/ca/admin/ca/getStatus
Resolving id.happyassassin.net (id.happyassassin.net)... 192.168.1.12
Connecting to id.happyassassin.net (id.happyassassin.net)|192.168.1.12|:8443... connected.
Unable to establish SSL connection.

ipa: DEBUG: The CA status is: check interrupted due to error: Command ''/usr/bin/wget' '-S' '-O' '-' '--timeout=30' '--no-check-certificate' 'https://id.happyassassin.net:8443/ca/admin/ca/getStatus'' returned non-zero exit status 4
ipa: DEBUG: Waiting for CA to start...
ipa: DEBUG: Starting external process
ipa: DEBUG: args='/usr/bin/wget' '-S' '-O' '-' '--timeout=30' '--no-check-certificate' 'https://id.happyassassin.net:8443/ca/admin/ca/getStatus'
ipa: DEBUG: Process finished, return code=4
ipa: DEBUG: stdout=
ipa: DEBUG: stderr=--2016-04-14 12:33:22--  https://id.happyassassin.net:8443/ca/admin/ca/getStatus
Resolving id.happyassassin.net (id.happyassassin.net)... 192.168.1.12
Connecting to id.happyassassin.net (id.happyassassin.net)|192.168.1.12|:8443... connected.
Unable to establish SSL connection.

ipa: DEBUG: The CA status is: check interrupted due to error: Command ''/usr/bin/wget' '-S' '-O' '-' '--timeout=30' '--no-check-certificate' 'https://id.happyassassin.net:8443/ca/admin/ca/getStatus'' returned non-zero exit status 4
ipa: DEBUG: Waiting for CA to start...
ipa: DEBUG: Starting external process
ipa: DEBUG: args='/usr/bin/wget' '-S' '-O' '-' '--timeout=30' '--no-check-certificate' 'https://id.happyassassin.net:8443/ca/admin/ca/getStatus'

i.e. it's trying to hit a URL and failing. indeed if I try to hit the same URL manually, it also just times out.

Comment 18 Adam Williamson 2016-04-14 19:40:57 UTC
[adamw@id ~]$ curl -v https://localhost:8443/ca/admin/ca/getStatus
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
^C
[adamw@id ~]$ curl -v http://localhost:8443/ca/admin/ca/getStatus
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8443 (#0)
> GET /ca/admin/ca/getStatus HTTP/1.1
> Host: localhost:8443
> User-Agent: curl/7.43.0
> Accept: */*
> [hangs]

Comment 19 Adam Williamson 2016-04-14 19:43:00 UTC
ok, it gives up after a while:

[adamw@id ~]$ curl -v http://localhost:8443/ca/admin/ca/getStatus
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8443 (#0)
> GET /ca/admin/ca/getStatus HTTP/1.1
> Host: localhost:8443
> User-Agent: curl/7.43.0
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

Comment 20 Matthew Harmsen 2016-04-18 16:44:58 UTC
Upstream ticket:
https://fedorahosted.org/pki/ticket/2285

Comment 21 Fraser Tweedale 2016-04-18 22:29:58 UTC
Let us eliminate connection or certificate authentication issues.

Adam, can you confirm that DS is up and running on port 636, and that the certificate that it presents is not expired?

  $ echo | openssl s_client -showcerts -connect $HOSTNAME:636

Then check the certificate for the user that Dogtag uses to authenticate to DS:

  $ ldapsearch -D 'cn=Directory Manager' -w $DM_PASSWORD -b uid=pkidbuser,ou=people,o=ipaca

And check the details of the certificate that is actually being used by Dogtag:

  $ sudo certutil -d /etc/pki/pki-tomcat/alias -L -n "subsystemCert cert-pki-ca"

Comment 22 Adam Williamson 2016-04-19 18:51:31 UTC
Attaching results of each of those.

Comment 23 Adam Williamson 2016-04-19 18:51:58 UTC
Created attachment 1148714 [details]
certutil output

Comment 24 Adam Williamson 2016-04-19 18:52:58 UTC
Created attachment 1148716 [details]
ldapsearch output

Comment 25 Adam Williamson 2016-04-19 18:53:58 UTC
Created attachment 1148719 [details]
openssl output

Comment 26 Fraser Tweedale 2016-04-27 09:46:58 UTC
Thanks for the info Adam.  Dogtag 10.2.6-16 issue made profile
initialisation part of the startup process (it previously happened in
the background).  I guess the underlying problem is LDAP TLS connection,
and that the problem was already there, and simply exposed by the
startup changes in 10.2.6-16.  You may be able to start IPA with
v10.2.6-13, but are you able to issue certs or perform other operations
that hit the CA?

Let's eliminate authentication issues; the following command authenticates
to DS using Dogtag's client certificate:

  dnf install -y mozldap-tools

  /usr/lib64/mozldap/ldapsearch -Z \
    -P /etc/pki/pki-tomcat/alias \
    -W $(grep ^internal= /etc/pki/pki-tomcat/password.conf) \
    -N "subsystemCert cert-pki-ca"
    -b o=ipaca '(uid=pkidbuser)' seeAlso

If this command fails, we will focus on DS cert mapping.

If it succeeds, let's focus on cipher suites; please provide the contents
of /etc/dirsrv/slapd-HAPPYASSASSIN-NET/dse.ldif.

Comment 27 Adam Williamson 2016-04-27 09:53:55 UTC
"You may be able to start IPA with v10.2.6-13, but are you able to issue certs or perform other operations that hit the CA?"

not sure if I've done anything recently which would've caused a cert to be issued, indeed.

thanks for the next steps, will get to it when i wake up.

Comment 28 Adam Williamson 2016-05-03 18:25:53 UTC
Um. I don't know if I was supposed to substitute something for (uid=pkidbuser) (and if so, what?). I tried running the command as-is and this is what happens:

[root@id adamw]# /usr/lib64/mozldap/ldapsearch -Z \
>     -P /etc/pki/pki-tomcat/alias \
>     -W $(grep ^internal= /etc/pki/pki-tomcat/password.conf) \
>     -N "subsystemCert cert-pki-ca" \
> -b o=ipaca '(uid=pkidbuser)' seeAlso
Warning: Incorrect PIN may result in disabling the token
Enter PIN for NSS Certificate DB: 

I have no idea what the PIN for the NSS certificate DB would be. If I leave it blank I get:

ldapssl_enable_clientauth: Bad parameter to an ldap routine
ldapssl_enable_clientauth: additional info: bad key or key password
	SSL error -8177 (The security password entered is incorrect.)

Comment 29 Fraser Tweedale 2016-05-03 23:04:25 UTC
Adam, apologies - I made a mistake in the command.  The `-W' argument should be:

  -W $(grep ^internal= /etc/pki/pki-tomcat/password.conf | cut -d = -f 2)

Comment 30 Adam Williamson 2016-05-03 23:50:39 UTC
OK, with that change, I get:

[root@id adamw]# /usr/lib64/mozldap/ldapsearch -Z \
>     -P /etc/pki/pki-tomcat/alias \
> -W $(grep ^internal= /etc/pki/pki-tomcat/password.conf | cut -d = -f 2) \
> -N "subsystemCert cert-pki-ca" \
> -b o=ipaca '(uid=pkidbuser)' seeAlso
version: 1
dn: uid=pkidbuser,ou=people,o=ipaca
seeAlso: CN=CA Subsystem,O=HAPPYASSASSIN.NET

which Fraser tells me is fine. So we are now looking at cipher suites. So far I see this bit in dse.ldif:

allowWeakCipher: off
cn: encryption
createTimestamp: 20130923221704Z
creatorsName: cn=server,cn=plugins,cn=config
modifiersName: cn=Directory Manager
modifyTimestamp: 20141211014209Z
nsSSL2: off
nsSSL3: off
nsSSL3Ciphers: +all
nsSSLClientAuth: allowed
nsSSLSessionTimeout: 0
objectClass: top
objectClass: nsEncryptionConfig
numSubordinates: 1

I'm not sure what else you may need to see.

Comment 31 Fraser Tweedale 2016-05-04 04:14:37 UTC
On a fresh install of freeipa-server-4.2.4-1.fc23,
my dse.ldif has:

    dn: cn=encryption,cn=config
    allowWeakCipher: off
    cn: encryption
    createTimestamp: 20160504035626Z
    creatorsName: cn=server,cn=plugins,cn=config
    modifiersName: cn=directory manager
    modifyTimestamp: 20160504035907Z
    nsSSL3Ciphers: +all
    nsSSLClientAuth: allowed
    nsSSLSessionTimeout: 0
    objectClass: top
    objectClass: nsEncryptionConfig
    sslVersionMin: TLS1.0
    numSubordinates: 1

Perhaps a long shot but if you stop DS, update dse.ldif to remove the extra attributes:

    nsSSL2: off
    nsSSL3: off

And add the missing attribute:

    sslVersionMin: TLS1.0

Then restart DS... does Dogtag then come up properly?

Comment 32 Adam Williamson 2016-05-04 07:17:36 UTC
No, that doesn't seem to help.

Comment 33 Adam Williamson 2016-06-14 23:27:57 UTC
this is still happening, fwiw, even with latest pki-core (19).

Comment 34 Adam Williamson 2016-06-21 15:58:47 UTC
still happening after upgrade of server box to f24, and now I don't know if I can fix it with a package downgrade. argh.

Comment 35 Adam Williamson 2016-06-21 19:14:07 UTC
OK, luckily, can still get 10.2.6-13 installed on f24. So I've pinned this down a bit tighter: not just between 10.2.6-13 and 10.2.6-16, but 10.2.6-13 and 10.2.6-14. It works with 10.2.6-13.fc24.x86_64 , fails with 10.2.6-14.fc24.x86_64 . So one of the changes in -14 must be the culprit here.

Comment 36 Adam Williamson 2016-06-21 19:49:47 UTC
OK, so I've narrowed this down to a specific commit:

https://git.fedorahosted.org/cgit/pki.git/commit?id=5fae5826e5442d7266681d19f282dc7728062b89

that was included in -14.fc23 / -14.fc24 as a patch. I did a scratch build of 10.3.3 for F24 with that commit reverted:

http://koji.fedoraproject.org/koji/taskinfo?taskID=14596519

and with that, IPA starts up fine.

Of course, simply reverting that change doesn't look like the right 'fix', I guess the question is, why does this check for profile load completion break things in my setup? Is there a bug in the code, or do the profiles in question truly never load properly in my config, or what?

I guess I'll try poking at it with a big stick and spreading some debug statements around and see if I can work anything else, though I'm no Java or Tomcat expert.

Comment 37 Adam Williamson 2016-06-21 21:56:35 UTC
So I threw a couple of debugs into LDAPProfileSubsystem.java run():

        while (!stopped) {
            forgetAllProfiles();
            try {
                conn = dbFactory.getConn();
                LDAPSearchConstraints cons = conn.getSearchConstraints();
                cons.setServerControls(persistCtrl);
                cons.setBatchSize(1);
                cons.setServerTimeLimit(0 /* seconds */);
                String[] attrs = {"*", "entryUSN", "nsUniqueId", "numSubordinates"};
                LDAPSearchResults results = conn.search(
                    dn, LDAPConnection.SCOPE_SUB, "(objectclass=*)",
                    attrs, false, cons);
                CMS.debug("Got here #1");
                while (!stopped && results.hasMoreElements()) {
                    CMS.debug("Got here #2");

and we reach #1, but we never reach #2. The debug log hangs at this point:

[21/Jun/2016:14:47:31][profileChangeMonitor]: Profile change monitor: starting.
[21/Jun/2016:14:47:31][profileChangeMonitor]: In LdapBoundConnFactory::getConn()
[21/Jun/2016:14:47:31][profileChangeMonitor]: masterConn is connected: true
[21/Jun/2016:14:47:31][profileChangeMonitor]: getConn: conn is connected true
[21/Jun/2016:14:47:31][profileChangeMonitor]: getConn: mNumConns now 2
[21/Jun/2016:14:47:31][profileChangeMonitor]: Got here #1
[21/Jun/2016:14:47:31][CRLIssuingPoint-MasterCRL]: findNextUpdate:  fromLastUpdate: true  delta: false
[21/Jun/2016:14:47:31][CRLIssuingPoint-MasterCRL]: findNextUpdate:  Tue Jun 21 17:00:00 PDT 2016  delay: 7948646

I note that init() does this:

        monitor = new Thread(this, "profileChangeMonitor");
        monitor.start();
        try {
            initialLoadDone.await();
        } catch (InterruptedException e) {
            CMS.debug("LDAPProfileSubsystem: caught InterruptedException "
                    + "while waiting for initial load of profiles.");
        }
        CMS.debug("LDAPProfileSubsystem: finished init");
    }

So init() is never going to return until initialLoadDone - which is a CountDownLatch, see https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html - hits 0 (when that await() call will return). The *only* call that decrements initialLoadDone is in checkInitialLoadDone(), which is only ever called from within the LDAP search result loop.

So, basically, it's clear at one level what's going on: init() will never return if there are no results for the LDAPsearch near the start of run(), and on my system, there aren't any. I guess the question is, why not? And is it expected that the search may not return any results (in which case, the logic is broken and needs fixing), or should there *always* be results?

Comment 38 Adam Williamson 2016-06-21 22:44:42 UTC
Well, here's a thing. If I do this on my deployment:

/usr/lib64/mozldap/ldapsearch -Z -P /etc/pki/pki-tomcat/alias -W $(grep ^internal= /etc/pki/pki-tomcat/password.conf | cut -d = -f 2) -N "subsystemCert cert-pki-ca" -b o=ipaca '(objectclass=organizationalUnit) | grep "^ou"

I get the following ous:

ou: people
ou: groups
ou: requests
ou: ca
ou: ranges
ou: replica
ou: requests
ou: certificateRepository

on a clean F24 FreeIPA deployment, I get these:

ou: people
ou: groups
ou: requests
ou: ca
ou: ranges
ou: replica
ou: requests
ou: certificateRepository
ou: certificateProfiles
ou: authorities

note how my server has no certificateProfiles (or authorities). Could this be the problem here? My server was installed as F19 and has been upgraded from there (I think through 21 and 23 to 24), so perhaps this should have been added on upgrade at some point but was not?

Comment 39 Adam Williamson 2016-06-21 23:00:04 UTC
Note that pki_profiles_in_ldap is 'False' in /etc/pki/default.cfg on both my server and the fresh F24 deployment...

Comment 40 Adam Williamson 2016-06-21 23:30:13 UTC
I cannot find anything in the pki server upgrade scripts which would trigger import of profiles to LDAP:

[root@id upgrade]# pwd
/usr/share/pki/server/upgrade
[root@id upgrade]# grep -R -i profil *
10.3.0/02-AddAuthzRealmToRegistry:            'com.netscape.cms.profile.def.AuthzRealmDefault',
10.3.0/02-AddAuthzRealmToRegistry:            'com.netscape.cms.profile.constraint.AuthzRealmConstraint',

This commit added some code for importing profiles:

https://git.fedorahosted.org/cgit/pki.git/commit?id=e4869e62f432b510dc99eb7e00d16a23caa6ea64

but I just can't see anything suggesting that would get triggered on upgrade of a deployed system, it *seems* like that configure() stuff only gets hit on deployment, unless I'm misunderstanding that (it's not exactly easy for a Java noob to follow).

Comment 41 Fraser Tweedale 2016-06-22 01:56:24 UTC
Adam, this is super info, thanks for it!

(Sorry for lack of attention to this issue; very busy on IPA v4.4 work).

PKI is by default deployed to use the file-based profile subsystem.  FreeIPA
upgrade is responsible for switching Dogtag instances to the LDAP profile
subsystem, adding the out=certificateProfiles container and importing the
profiles to LDAP.  It seems that the instance was switched to 
LDAPProfileSubsystem but the container was not added and profiles not imported.

If you send me /var/log/ipaupgrade.log I'll have a look and try to
identify where things went wrong.

Could you please modify /etc/pki/pki-tomcat/ca/CS.cfg and replace:

   subsystem.1.class=com.netscape.cmscore.profile.LDAPProfileSubsystem

with:

   subsystem.1.class=com.netscape.cmscore.profile.ProfileSubsystem

Switch back to an official build (10.2.6-14 or later), then
run `ipa-server-upgrade`, and see how that goes.

Thanks!

Comment 42 Adam Williamson 2016-06-22 02:14:48 UTC
So I did a bit of testing to see when the certificateProfiles ou showed up. It's there in a fresh deployment on a clean network install of F22, but not of F21 - so it showed up either between F21 and F22, or during F22's lifetime. The F21 install has pki-server-10.2.1-3.fc21.noarch , F22 would be pki-server-10.2.6-12.fc22 , I think, so somewhere between 10.2.1 and 10.2.6 I guess.

I also successfully reproduced the bug by simply trying to upgrade the clean F21 freeipa deployment to F23. All I had to was net install F21, do 'ipa-server-install', then use dnf-plugin-system-upgrade to upgrade to F23.

The upgrade process takes a very long time, because as part of the scriptlets (after all the 'clean' tasks have apparently run), it winds up trying to start IPA, and this takes a long time and eventually fails, the failure I believe caused by this bug. After booting the upgraded system, ipa service start up again fails. The ldapsearch command shows the certificateProfiles ou is not present.

Will attach ipaupgrade.log from the test VM, then try the re-run of ipa-server-upgrade from #c41 and see what happens.

Comment 43 Adam Williamson 2016-06-22 02:17:35 UTC
Created attachment 1170513 [details]
ipaupgrade.log from the clean F21 install -> F23 upgrade reproducer

Comment 44 Fraser Tweedale 2016-06-22 02:46:32 UTC
Analysis indicdates that PKI restart fails because
LDAPProfileSubsystem startup fails due to ou=certificateProfiles
container not yet being added (race condition; the FreeIPA
upgrade behaviour that adds the container and imports profiles
waits on the restart, which fails because the container isn't there).

Proposed fix is the enhance LDAPProfileSubsystem to add the
ou=certificateProfiles container if it does not exist.
Fix should be backported to 10.2.x branch on f23.

What initially broke the upgrade is still unknown (the version
of pki was before the CountDownLatch was introduced).  The
exact versions were:

   12:37 < adamw> looks like i went from pki-core 10.2.1-3.fc21 to 10.2.6-11.fc23
   12:38 < adamw> and freeipa 4.1.4-1.fc21 to 4.2.2-1.fc23

Comment 45 Adam Williamson 2016-06-22 02:53:04 UTC
OK, so we figured out why it fails *now* in an F21 -> F23 upgrade. The upgrade code does this:

    if needs_update:
        installutils.set_directive(
            paths.CA_CS_CFG_PATH,
            directive,
            'com.netscape.cmscore.profile.LDAPProfileSubsystem',
            quotes=False,
            separator='=')

        ca.restart('pki-tomcat')

    root_logger.info('[Migrating certificate profiles to LDAP]')
    cainstance.migrate_profiles_to_ldap()

i.e. it changes the config file, then tries to restart pki-tomcat and migrate the profiles...only that's never going to work with pki-core -14 or higher, because pki-tomcat will fail to start as soon as the config file is changed and the whole thing then fails.

I confirmed that if I re-run the upgrade with pki-core downgraded to -13 , it works.

This doesn't explain why the upgrade failed *on my prod server*, however, because at the time I upgraded that, the pki-core change had not yet happened. Unfortunately ipaupgrade.log has been overwritten since that upgrade with later IPA upgrade attempts, and I can't find anything terribly useful in dnf history or system logs to explain why the ipa upgrade process failed on my initial F23 upgrade. It certainly *did* fail, though, we know that much. This is an extract from the journal during the system upgrade, right at the end of the 'Cleanup' step:

Nov 01 16:00:49 id.happyassassin.net dnf[302]: Cleanup     : libgcc-4.9.2-6.fc21.x86_64                            2304/2304
Nov 01 16:00:49 id.happyassassin.net dnf[302]: 1038 blocks
Nov 01 16:00:50 id.happyassassin.net systemd[1]: Reached target Network.
Nov 01 16:00:50 id.happyassassin.net systemd[1]: Starting Network.
Nov 01 16:00:50 id.happyassassin.net systemd[1]: Starting Identity, Policy, Audit...
Nov 01 16:00:51 id.happyassassin.net ipactl[16557]: Upgrade required: please run ipa-server-upgrade command
Nov 01 16:00:51 id.happyassassin.net ipactl[16557]: Aborting ipactl
Nov 01 16:00:51 id.happyassassin.net systemd[1]: ipa.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 16:00:51 id.happyassassin.net systemd[1]: Failed to start Identity, Policy, Audit.
Nov 01 16:00:51 id.happyassassin.net systemd[1]: ipa.service: Unit entered failed state.
Nov 01 16:00:51 id.happyassassin.net systemd[1]: ipa.service: Failed with result 'exit-code'.
Nov 01 16:00:51 id.happyassassin.net dnf[302]: IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command ipa-server-upgrade manually.
Nov 01 16:00:51 id.happyassassin.net dnf[302]: Certmonger is not running. Start certmonger and run upgrade again.

I'm not sure if that certmonger error is the real one. Looking at the logs, I do recall that the upgrade to F23 was a big mess at the time, it's obvious that I ran into https://bugzilla.redhat.com/show_bug.cgi?id=1277224 , and I think there were some other bugs around too , like https://bugzilla.redhat.com/show_bug.cgi?id=1274905 and https://fedorahosted.org/pki/ticket/1310 . So I guess we'd just have to chalk my production server failure up to an unfortunate incomplete upgrade process - I guess one of those bugs that I ran into happened to cause it to fail in between changing the config file and importing the profiles...

Comment 46 Adam Williamson 2016-06-22 18:19:20 UTC
Quick note, Fraser - the pki-core change did land on F22 as well, so the fix for this will need to go to all Fedora branches (22, 23, 24, Rawhide) I believe.

Comment 47 Matthew Harmsen 2016-07-01 04:53:29 UTC
On 6/30/2016, ftweedal checked the fix into DOGTAG_10_2_6_BRANCH:

* f0d036feb9604cc656b3b8ae46c822bec14e6ac8

Comment 48 Fedora Update System 2016-07-01 06:41:46 UTC
pki-core-10.2.6-20.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-188c172b10

Comment 49 Adam Williamson 2016-07-01 15:42:45 UTC
thanks! how about the other branches?

Comment 50 Matthew Harmsen 2016-07-01 23:08:50 UTC
It will be built in F24 and F25 next week.

Do not intend building it for F22 due to lateness of its life cycle.

Comment 51 Fedora Update System 2016-07-02 20:55:43 UTC
pki-core-10.2.6-20.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-188c172b10

Comment 52 Adam Williamson 2016-08-26 21:58:36 UTC
The F23 update for this has passed the wait time limit and can be pushed stable, but no-one has done it yet. The F24 update has just reached a karma level at which it *can* be pushed (but not its auto-push threshold yet). The F25 update is stable (at least, I'm assuming https://bodhi.fedoraproject.org/updates/FEDORA-2016-0a384ead60 included the fix).

Comment 53 Fedora End Of Life 2016-11-24 16:19:51 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 54 Fedora Update System 2016-11-29 23:52:14 UTC
pki-core-10.2.6-20.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 55 Fedora End Of Life 2016-12-20 19:45:10 UTC
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.


Note You need to log in before you can comment on or make changes to this bug.