Bug 1956991

Summary: During ipa-server-upgrade, CA tries to connect to ns-slapd right after directory server is restarted, and fails
Product: Red Hat Enterprise Linux 8 Reporter: Paul Nickerson <pnickerson>
Component: ipaAssignee: Thomas Woerner <twoerner>
Status: NEW --- QA Contact: ipa-qe <ipa-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 8.3CC: abokovoy, fcami, fdc, ftrivino, ipa-maint, jcholast, jhrozek, mhjacks, pvoborni, rcritten, ssorce, tscherf, twoerner
Target Milestone: beta   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Logs and command outputs none

Description Paul Nickerson 2021-05-04 19:49:48 UTC
Created attachment 1779495 [details]
Logs and command outputs

Description of problem:
When ipaupgrade runs, PKI Tomcat's CA tries to connect to 389 Directory Server's ns-slapd (LDAP) and fails with connection refused because ns-slapd happens to be restarting in that 4 second time window. CA dies at this moment. Later on, ipaupgrade tries to connect to CA and fails with internal server error, and errors out.

Version-Release number of selected component (if applicable):
ipa --version: VERSION: 4.8.7, API_VERSION: 2.239
Red Hat Enterprise Linux 8.3 (Ootpa)

How reproducible:
Happens every time I try to run "ipactl restart". Haven't tried this on a fresh system.

Steps to Reproduce:
1. ipactl restart

Actual results:
Failure reported and all services are stopped.

Expected results:
Success reported and all services are running.

Additional info:

I had need to run "ipactl restart", and got this message:
IPA version error: data needs to be upgraded (expected version '4.8.7-16.module+el8.3.0+10289+b6566038', current version '4.8.7-13.module+el8.3.0+8376+0bba7131')
Automatically running upgrade, for details see /var/log/ipaupgrade.log
Be patient, this may take a few minutes.

After a little time, this completes with failure:
[Migrating certificate profiles to LDAP]
IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command ipa-server-upgrade manually.
Unexpected error - see /var/log/ipaupgrade.log for details:
RemoteRetrieveError: Failed to authenticate to CA REST API

/var/log/ipaupgrade.log
2021-05-04T18:17:39Z DEBUG SUCCESS: port: 8443
2021-05-04T18:17:39Z DEBUG Start of pki-tomcatd@pki-tomcat.service complete
2021-05-04T18:17:40Z DEBUG args=['/bin/systemctl', 'stop', 'dirsrv@SEMI-CASHSTAR-NET.service']
2021-05-04T18:17:42Z DEBUG args=['/bin/systemctl', 'start', 'dirsrv@SEMI-CASHSTAR-NET.service']
2021-05-04T18:17:45Z DEBUG args=['/bin/systemctl', 'is-active', 'dirsrv@SEMI-CASHSTAR-NET.service']
2021-05-04T18:17:55Z DEBUG request GET https://ipa01.semi.cashstar.net:8443/ca/rest/account/login
2021-05-04T18:17:55Z DEBUG request body ''
2021-05-04T18:17:55Z DEBUG response status 500
CA subsystem unavailable. Check CA debug log.

/var/log/pki/pki-tomcat/ca/debug.2021-05-04.log
2021-05-04 18:17:43 [main] INFO: Initializing CA subsystem
2021-05-04 18:17:43 [main] FINE: LdapBoundConnection: Connecting to ipa01.semi.cashstar.net:636 with client cert auth
2021-05-04 18:17:43 [main] SEVERE: Unable to create socket: java.net.ConnectException: Connection refused (Connection refused)
2021-05-04 18:17:43 [main] SEVERE: Shutting down.
2021-05-04 18:17:43 [main] INFO: Shutting down CA subsystem

journalctl --unit dirsrv@SEMI-CASHSTAR-NET.service
May 04 18:17:42 ipa01.semi.cashstar.net systemd[1]: Starting 389 Directory Server SEMI-CASHSTAR-NET....
May 04 18:17:44 ipa01.semi.cashstar.net ns-slapd[893792]: [04/May/2021:18:17:44.029650333 +0000] - INFO - slapd_extract_cert - CA CERT NAME: SEMI.CASHSTAR.NET IPA CA

I ran this to verify:
while true; do date; netstat -pan | egrep "636|389" | grep " LISTEN "; sleep 1; done
Tue May  4 18:17:40 UTC 2021
tcp6       0      0 :::636                  :::*                    LISTEN      893452/ns-slapd     
tcp6       0      0 :::389                  :::*                    LISTEN      893452/ns-slapd     
Tue May  4 18:17:41 UTC 2021
Tue May  4 18:17:42 UTC 2021
Tue May  4 18:17:43 UTC 2021
Tue May  4 18:17:44 UTC 2021
Tue May  4 18:17:45 UTC 2021
tcp6       0      0 :::636                  :::*                    LISTEN      893792/ns-slapd     
tcp6       0      0 :::389                  :::*                    LISTEN      893792/ns-slapd     

The LDAP directory server ns-slapd is not listening on port 636 at 18:17:43 because dirsrv@SEMI-CASHSTAR-NET.service is still starting up at that time. And at that moment, CA tries to connect to ns-slapd on port 636, the connection is refused, and CA shuts down. And then 12 seconds later, ipa-server-upgrade tries to connect to CA and gets an Internal Server Error response. I think PKI Tomcat, which got started by ipa-server-upgrade at 18:17:39, just so happened to start CA at 18:17:43, which is right in the middle of the wrong time. This looks like a race condition. A race condition which is reliably happening every time for me.

Comment 1 Paul Nickerson 2021-05-04 20:34:03 UTC
I figured out a workaround that works on my system.

Edit this file:
/usr/lib/python3.6/site-packages/ipaserver/install/server/upgrade.py

Near the top, add this:
import time

Find this line:
    logger.info('[Verifying that KDC configuration is using ipa-kdb backend]')
And add this right afterword:
    logger.info('But first, I sleep so that I don\'t stop 389 Directory Server LDAP ns-slapd while PKI Tomcat CA is trying to use it.')
    time.sleep(20)
    logger.info('I have woken.')

Save your changes, and run "ipactl restart" (or maybe ipa-server-upgrade would work).

On my system, 20 seconds worked, and 4 was not enough. The idea is that "systemctl stop dirsrv@SEMI-CASHSTAR-NET.service" should not run until "[main] INFO: CA subsystem started" shows up in the ca debug log. And waiting 20 seconds is one dirty way to get it done.