Bug 1033273 - ipa cert automatic renew: cert automatically renewed, but renewed client cert not be recognized
Summary: ipa cert automatic renew: cert automatically renewed, but renewed client cert...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ipa
Version: 7.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: ---
Assignee: Martin Kosek
QA Contact: Michael Gregg
URL:
Whiteboard:
Depends On:
Blocks: 1060639
TreeView+ depends on / blocked
 
Reported: 2013-11-21 19:48 UTC by Yi Zhang
Modified: 2014-08-05 11:17 UTC (History)
3 users (show)

Fixed In Version: ipa-3.3.3-13.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1060639 (view as bug list)
Environment:
Last Closed: 2014-06-13 11:55:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1040018 0 medium CLOSED Automatic CA subsystem certificate renewal is broken on CA clones 2021-02-22 00:41:40 UTC

Internal Links: 1040018

Description Yi Zhang 2013-11-21 19:48:07 UTC
Description of problem:
  IPA cert do get automatic renewal when cert life time approached. However, the renewed cert is not being recognized by LDAP server. DS instance reports following error:

/var/log/dirsrv/slapd-<inst>/access
[03/Nov/2015:12:25:31 -0800] conn=9 SSL 256-bit AES; client CN=CA Subsystem,O=YZHANG.REDHAT.COM; issuer CN=Certificate Authority,O=YZHANG.REDHAT.COM
[03/Nov/2015:12:25:31 -0800] conn=9 SSL failed to map client certificate to LDAP DN (Could not matching certificate in User's LDAP entry)
[03/Nov/2015:12:25:31 -0800] conn=9 op=0 BIND dn="" method=sasl version=3 mech=EXTERNAL
[03/Nov/2015:12:25:31 -0800] conn=9 op=0 RESULT err=49 tag=97 nentries=0 etime=0

The direct result is that IPA server can not start due to pki-tomcatd server fails to restart, which is due to connection error (with LDAP server) as below:

/var/log/pki/pki-tomcat/ca/debug
[03/Nov/2015:12:25:32][localhost-startStop-1]: CMS:Caught EBaseException
Internal Database Error encountered: Could not connect to LDAP server host rh7a.yzhang.redhat.com port 636 Error netscape.ldap.LDAPException: Authentication failed (49)

which cause abortion of ipa start up.




Version-Release number of selected component (if applicable):
[root@rh7a (RH7.0-x86_64) slapd-YZHANG-REDHAT-COM] rpm -qa | grep "sss\|ipa"
sssd-ad-1.11.2-1.el7.x86_64
sssd-client-1.11.2-1.el7.x86_64
python-iniparse-0.4-8.el7.noarch
sssd-krb5-common-1.11.2-1.el7.x86_64
libipa_hbac-1.11.2-1.el7.x86_64
python-sssdconfig-1.11.2-1.el7.noarch
sssd-common-1.11.2-1.el7.x86_64
sssd-1.11.2-1.el7.x86_64
sssd-ldap-1.11.2-1.el7.x86_64
ipa-server-3.3.3-3.el7.x86_64
device-mapper-multipath-libs-0.4.9-58.el7.x86_64
libsss_idmap-1.11.2-1.el7.x86_64
libipa_hbac-python-1.11.2-1.el7.x86_64
sssd-common-pac-1.11.2-1.el7.x86_64
ipa-admintools-3.3.3-3.el7.x86_64
libsss_nss_idmap-1.11.2-1.el7.x86_64
sssd-proxy-1.11.2-1.el7.x86_64
ipa-python-3.3.3-3.el7.x86_64
ipa-client-3.3.3-3.el7.x86_64
device-mapper-multipath-0.4.9-58.el7.x86_64
sssd-krb5-1.11.2-1.el7.x86_64
sssd-ipa-1.11.2-1.el7.x86_64
[root@rh7a (RH7.0-x86_64) slapd-YZHANG-REDHAT-COM] uname -a
Linux rh7a.yzhang.redhat.com 3.10.0-48.el7.x86_64 #1 SMP Fri Nov 8 10:11:01 EST 2013 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: always


Steps to Reproduce:
1. install ipa server
2. find out when ipa cert would expire
3. stop all ipa services and pki-tomcat.target service
4. adjust system time 1 day before expiration date.
5. start all ipa service and pki-tomcat.target service
6. wait till all about-to-expire certs gets renewed
7. stop all ipa service and pki-tomcat.target service
8. adjust system time (again) 1 day AFTER expiration date
9. start ipa service and pki-tomcat.target service

--- expect all services start up but they all failed. 




Additional info:  I tried to increase the LDAP log level but I couldn't see more useful information. I have a VM ready for developers to login to dig more

Comment 3 Rob Crittenden 2013-11-22 15:00:35 UTC
The pkidbuser/CA subsystem certificate wasn't updated in uid=pkidbuser,ou=people,o=ipaca

This should happen in the renew_ca_cert, a post-save command for the subsystemCert.

I see this in the logs:

/var/log/messages-20151102:Nov  1 13:14:06 rh7a abrt: detected unhandled Python exception in '/usr/lib64/ipa/certmonger/renew_ca_cert'

The backtrace is

service.py:166:__wait_until_running:RuntimeError: CA did not start in 120s

Traceback (most recent call last):
  File "/usr/lib64/ipa/certmonger/renew_ca_cert", line 87, in <module>
    update_cert_config(nickname, cert)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/cainstance.py", line 
1732, in update_cert_config
    quotes=False, separator='=')
  File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/installutils.py", lin
e 822, in stopped_service
    ipaservices.knownservices[service].start(instance_name)
  File "/usr/lib/python2.7/site-packages/ipapython/platform/fedora16/service.py"
, line 172, in start
    self.__wait_until_running()
  File "/usr/lib/python2.7/site-packages/ipapython/platform/fedora16/service.py"
, line 166, in __wait_until_running
    raise RuntimeError('CA did not start in %ss' % timeout)
RuntimeError: CA did not start in 120s

Local variables in innermost frame:
status: 'check interrupted'
self: <ipapython.platform.fedora16.service.Fedora16CAService object at 0x2974c10
>
op_timeout: 1446412445.687159
use_proxy: True
timeout: 120

The debug log reports this:

[01/Nov/2015:13:12:17][ajp-bio-127.0.0.1-8009-exec-1]: according to ccMode, authorization for servlet: caGetStatus is LDAP based, not XML {1}, use default authz mgr: {2}.
[01/Nov/2015:13:14:41][localhost-startStop-1]: ============================================
[01/Nov/2015:13:14:41][localhost-startStop-1]: ============================================
[01/Nov/2015:13:14:41][localhost-startStop-1]: =====  DEBUG SUBSYSTEM INITIALIZED   =======
[01/Nov/2015:13:14:41][localhost-startStop-1]: ============================================
...
...
[01/Nov/2015:13:14:41][localhost-startStop-1]: CMS:Caught EBaseException
Internal Database Error encountered: Could not connect to LDAP server host rh7a.yzhang.redhat.com port 636 Error netscape.ldap.LDAPException: Authentication failed (49)

Could this mean we failed to update the ou=People entry because the CA failed to start because the ou=People entry was invalid?

Yi, I'd look at your times closely. Not all of the certs expire at the same time, and moving past the expiration on some and not others might cause something like this.

Comment 4 Yi Zhang 2013-11-22 18:51:55 UTC
Rob:

The test script understands that not all certs expires at same time. It has logic to sort the cert based on their life time, and adjust the system time to days before the first soon-to-be expire certs. 

Such logic has build into the script and it handles it well in the RHEL6.5 release. But I will double check the logic and post the related information  here.

Comment 5 Rob Crittenden 2013-11-22 19:44:51 UTC
Ok, this very well could be a bug on our logic. I'm not sure if can depend on CS coming back up after renewing that cert, so some ordering may need to change.

Knowing the ordering and timing of the renewals would help sort things out though.

Comment 6 Yi Zhang 2013-11-25 19:13:49 UTC
Rob, here is a failed beaker job. the log file shows the life time of certs and how the system time being adjusted.

https://beaker.engineering.redhat.com/jobs/547876

[step 1] initial cert status:
+-------------- all IPA certs (round 1) Days left to the CA Cert life limit [20 Year 3 Day(s) 23 h 59 m 54 s] ----------------------------+ ===== Summary: are all certs valid? [yes] 
current system date [Wed Nov 20 09:55:27 EST 2013], 
=== [valid certs]: 
caSSLServiceCert (Server-Cert cert-pki-ca) #3 : [Wed Nov 20 09:51:00 EST 2013]~~[Tue Nov 10 09:51:00 EST 2015] expires@(719 D 23 h 55 m 32 s) life [720 D 0 s] 

oscpSubsystemCert (ocspSigningCert cert-pki-ca) #2 : [Wed Nov 20 09:51:00 EST 2013]~~[Tue Nov 10 09:51:00 EST 2015] expires@(719 D 23 h 55 m 32 s) life [720 D 0 s] 

caSubsystemCert (subsystemCert cert-pki-ca) #4 : [Wed Nov 20 09:51:01 EST 2013]~~[Tue Nov 10 09:51:01 EST 2015] expires@(719 D 23 h 55 m 32 s) life [720 D 0 s] 

caAuditLogCert (auditSigningCert cert-pki-ca) #5 : [Wed Nov 20 09:51:02 EST 2013]~~[Tue Nov 10 09:51:02 EST 2015] expires@(719 D 23 h 55 m 33 s) life [720 D 0 s] 

ipaAgentCert (ipaCert) #7 : [Wed Nov 20 09:51:22 EST 2013]~~[Tue Nov 10 09:51:22 EST 2015] expires@(719 D 23 h 55 m 51 s) life [720 D 0 s] 

ipaServiceCert_ds (Server-Cert) #8 : [Wed Nov 20 09:51:27 EST 2013]~~[Sat Nov 21 09:51:27 EST 2015] expires@(730 D 23 h 55 m 55 s) life [731 D 0 s] 

ipaServiceCert_httpd (Server-Cert) #9 : [Wed Nov 20 09:53:02 EST 2013]~~[Sat Nov 21 09:53:02 EST 2015] expires@(730 D 23 h 57 m 28 s) life [731 D 0 s]


better picture:
#--------------- record NotAfter_sec of current certs --------------#
current date [Wed Nov 20 09:55:37 EST 2013]
caSSLServiceCert 1447167060 = Tue Nov 10 09:51:00 EST 2015
oscpSubsystemCert 1447167060 = Tue Nov 10 09:51:00 EST 2015
caSubsystemCert 1447167061 = Tue Nov 10 09:51:01 EST 2015
caAuditLogCert 1447167062 = Tue Nov 10 09:51:02 EST 2015
ipaAgentCert 1447167082 = Tue Nov 10 09:51:22 EST 2015
ipaServiceCert_ds 1448117487 = Sat Nov 21 09:51:27 EST 2015
ipaServiceCert_httpd 1448117582 = Sat Nov 21 09:53:02 EST 2015
#-------------------------------------------------------------------#

[step 2] calculation : what is proper time to adjust

     current date : Wed Nov 20 09:55:55 EST 2013 (1384959355)
     autorenew    : Tue Nov 10 09:06:00 2015  (1447164360)
     certExpire   : Tue Nov 10 09:51:00 2015  (1447167060)
     postExpire   : Wed Nov 11 09:51:00 2015  (1447253460)

Based on the above calculation, the system will be adjust to "autorenew" time, it is 45 minutes before the first certs (caSSLServiceCert) expires (it is certExpire time in above calculation)


[step 3] adjust system time and stop, start ipa service. 
this step is also being heavily under radar as the order of adjust system and stop start service is very critical. The order is below:

(1) stop tomcat.target; stop ipa services
(2) adjust system time to autorenew time
(3) start ipa services, tomcat.target
(4) give system time to react to the cert renewal operation by wait for 30 minutes (1800 s)
(5) stop tomcat.target; stop ipa service
(6) adjust system time to postExpire time
(7) start ipa service, tomcat.target
(8) check the cert status, verify currently loaded certs are valid (by check life time of certs), then check the ipa function: kinit as admin, do ldap search via ldaps, and create a new service, issue a new cert



After system time being adjust to post-cert-expire time, script will check whether soon-to-expire cers being renewed. It will produce a cert renew report as following

   +------------------- Cert Renew report (1)-----------------+
   |           caSSLServiceCert : renewed [ 1 ] times         |
   |          oscpSubsystemCert : renewed [ 1 ] times         |
   |            caSubsystemCert : renewed [ 1 ] times         |
   |             caAuditLogCert : renewed [ 1 ] times         |
   |               ipaAgentCert : renewed [ 1 ] times         |
   |          ipaServiceCert_ds : renewed [ 1 ] times         |
   |       ipaServiceCert_httpd : renewed [ 1 ] times         |

what the current status is: ipa server can not started, the above report is based on "certutil" result parsing. 

Yi

Comment 7 Martin Kosek 2013-11-27 08:17:21 UTC
Rob, can you please advise on Yi's findings? Do we have a flaw somewhere in our renew logic?

Comment 8 Rob Crittenden 2013-11-27 14:34:40 UTC
There are a couple of shell script errors being reported (looks to be non-fatal):

* Bareword found where operator expected at -e line 1, near "/etc/httpd"
	(Missing operator before httpd?)
syntax error at -e line 1, near "/etc/httpd"

* ./t.autorenewcert.sh: line 90: [: nicknames in /etc/httpd/alias
TESTRELM.COM IPA CA
ipaCert
Server-Cert
Signing-Cert: integer expression expected

You shouldn't need to restart dogtag during renewal. The renewal process itself should do that, unless you've seen specific failures when you haven't.

You display the audit cert post-renewal but not any of the other certs? Having the full output like earlier in the log would be nice, like:

caSSLServiceCert     (Server-Cert cert-pki-ca)      #3 : [Wed Nov 20 09:51:00 EST 2013]~~[Tue Nov 10 09:51:00 EST 2015] expires@(719 D 23 h 55 m 32 s) life [720 D 0 s]

I'm not sure that showing an entire certificate is all that helpful, it may just pollute the log. Then again, easier to skip over parts of a verbose log than to be left wanting.

It looks like certmonger is restarted again post-renewal. Is there a reason for that or am I misreading the log?

You should verify that DS and Apache certs are renewed as well as without those IPA will not operate properly. There are some references to it in some of the output but it is inconsistent. There appears to be three separate functions to display current cert validity status with varying levels of detail. For example, if the DS cert isn't renewed then it won't restart post-renewal and cause a cascade of failures but I'm not sure this would be immediately apparent at the point of failure.

It is failing to include the dogtag logs because it is still using /var/log/pki-ca rather than /var/log/pki/pki-tomcat/

The logs are very difficult to follow since it isn't always explicit what is being restarted by the script itself. I see the ipa and certmonger services being restarted over and over.

IPA looks like it has been shut down gracefully which is why a lot of the later tests are failing.

For example:

Nov 11 09:51:05 qe-blade-12 kernel: [ 2559.322963] perf samples too long (2532 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Nov 11 09:51:05 qe-blade-12 systemd: Started PKI Tomcat Server pki-tomcat.
Nov 11 09:51:26 qe-blade-12 logger: 2015-11-11 09:51:26 /usr/bin/rhts-test-runner.sh 632448649 2580 hearbeat...
Nov 11 09:52:26 qe-blade-12 logger: 2015-11-11 09:52:26 /usr/bin/rhts-test-runner.sh 632448649 2640 hearbeat...
Nov 11 09:53:07 qe-blade-12 systemd: Stopping Kerberos 5 KDC...

ipa is started, then shut down 2 minutes later. Is this because dogtag failed to start and we saw the 120 second timeout? I can't tell.

Some other things you may want to check/future enhancements:

Get a before and after snapshot of these values in LDAP:

uid=ipara,ou=People,o=ipaca
uid=pkidbuserou=People,o=ipaca

The first is the ipaCert user, the second is the subsystem user. ipaCert is the RA cert that IPA uses and is needed for us to bind and the second is used by dogtag itself to authenticate to the LDAP server. You should notice the userCertificate value changing to match the new cert and the description changed to reflect the updated serial number.

It would also be good to know that things are set up for clone renewal so display the contents of all entries in cn=ca_renewal,cn=ipa,cn=etc,$SUFFIX.

When a CA certificate is renewed a copy should be put into LDAP to be shared with other masters. This includes the ipaCert value.

Comment 9 Martin Kosek 2013-11-27 14:46:21 UTC
Setting needsinfo to Yi to analyze Rob's feedback and fix the tests to provide us with more information about the renewal + double check that this is really not an error in the test.

Comment 10 Yi Zhang 2013-12-02 14:48:55 UTC
Martin/Rob:

-- for the beaker job I posted above: --
* the shell script error is no harm. It has been fixed. The place where it reports error only relates to root ca checking. It is an isolated testcase.

* Restarting dogtag and certmonger are ideas inherent from last release. I never know I don't have to restart them in this release. It doesn't seen affect the test result. I will change the test logic. 

* The log file location has been corrected but not reflected in the above test. The modified test script has correct log location (/var/log/pki/pki-tomcat/)

* There are quite some ipa restart caused by itself (120 seconds timeout). 

-- the beaker job is more for the demonstration of cert lifetime listing AND how the test logic is. The script errors have been fixed later on. I can fetch a new beaker job to reflect the latest code. The best place to troubleshooting is the vm I setup.

-- I will setup two vms this morning so you can check the status of IPA server upon your following request:

"""
Get a before and after snapshot of these values in LDAP:

uid=ipara,ou=People,o=ipaca
uid=pkidbuserou=People,o=ipaca
"""

I will set these two vm in identical OS + bits. I will set one day before the first certs expires (autorenew time), another will be one day after certs expires (post expire time), so you can do some comparison check ups. 

I will send you via email once the environment is ready.

Comment 11 Yi Zhang 2013-12-02 19:31:06 UTC
it took me a while, but the vms are ready now! Email has been send.

one vm's system clock is 3 days before ipa certs expire
one vm's system clock is 1 day after ipa certs expire. As you can see, ipa certs do get renewed, but dirsrv can not start due to ssl error that relate to mapping

Comment 13 Martin Kosek 2013-12-11 08:46:36 UTC
I am wondering if Bug 1040018 is related in any way.

Comment 14 Jan Cholasta 2013-12-11 09:43:32 UTC
I don't think so, uid=pkidbuser,ou=people,o=ipaca is updated only on the master CA, it shouldn't be affected by anything that happens on clone CAs.

Comment 15 Dmitri Pal 2013-12-18 22:29:07 UTC
Upstream ticket:
https://fedorahosted.org/freeipa/ticket/4078

Comment 16 Martin Kosek 2014-01-15 16:53:09 UTC
master:
7ce3320996e9666581a2b4495b5063b5d99aa937 Do not start the service in stopped_service if it was not running before.
cb411ff94e71579e2a608319b1c7eab877b98019 ipaserver/install/installutils: clean up properly after yield

ipa-3-3:
8e874b1ede0ec76114a185f62c50ff71905e97f9 Do not start the service in stopped_service if it was not running before.
f4739bc5217713cc208db292053a5f5665ea2284 ipaserver/install/installutils: clean up properly after yield

Comment 18 Martin Kosek 2014-01-17 09:15:07 UTC
I forgot to include the original patch to increase default timeout:

master: be2c5d22b37b4673ac0bdeb8cab01be19fab288a
ipa-3-3: 876a00a8da99f71b2fe0c4420489e3d2813c0b5c

Moving to POST.

Comment 21 Michael Gregg 2014-02-08 01:15:38 UTC
Verified against ipa-server-3.3.3-17.el7.x86_64

Installed IPA.

Examine cert:

[root@nocp11 ~]# ipa cert-show 2
  Subject: CN=OCSP Subsystem,O=TESTRELM.TEST
  Issuer: CN=Certificate Authority,O=TESTRELM.TEST
  Not Before: Thu Feb 06 22:23:03 2014 UTC
  Not After: Wed Jan 27 22:23:03 2016 UTC
  Fingerprint (MD5): 4d:69:87:db:10:26:2c:22:f6:26:e9:71:e5:4f:4d:46
  Fingerprint (SHA1): e5:59:c6:9c:c8:f3:da:62:a7:94:60:c4:89:ff:33:22:09:0f:0e:ba
  Serial number (hex): 0x2
  Serial number: 2

[root@nocp11 ~]# ipactl stop
Stopping Directory Service
Stopping ipa-otpd Service
Stopping pki-tomcatd Service
Stopping httpd Service
Stopping ipa_memcached Service
Stopping named Service
Stopping kadmin Service
Stopping krb5kdc Service
ipa: INFO: The ipactl command was successful


[root@nocp11 ~]# date --set="26 Jan 2016 22:20:00"
Wed Jan 27 22:20:00 EST 2016

[root@nocp11 ~]# ipactl start
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting ipa_memcached Service
Starting httpd Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
ipa: INFO: The ipactl command was successful

<Wait 245 hours!>

[root@nocp11 ~]# date 
Thu Jan 27 22:20:00 EST 2016


[root@nocp11 ~]# ipactl stop
Stopping Directory Service
Stopping ipa-otpd Service
Stopping pki-tomcatd Service
Stopping httpd Service
Stopping ipa_memcached Service
Stopping named Service
Stopping kadmin Service
Stopping krb5kdc Service
ipa: INFO: The ipactl command was successful

[root@nocp11 ~]# date --set="28 Jan 2016 22:20:00"
Thu Jan 28 22:20:00 EST 2016
[root@nocp11 ~]# ipactl start
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting ipa_memcached Service
Starting httpd Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
ipa: INFO: The ipactl command was successful

Comment 22 Ludek Smid 2014-06-13 11:55:02 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.


Note You need to log in before you can comment on or make changes to this bug.