Bug 1956991

Summary: During ipa-server-upgrade, CA tries to connect to ns-slapd right after directory server is restarted, and fails
Product: Red Hat Enterprise Linux 8 Reporter: Paul Nickerson <pnickerson>
Component: ipaAssignee: Florence Blanc-Renaud <frenaud>
Status: CLOSED WONTFIX QA Contact: ipa-qe <ipa-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 8.3CC: abokovoy, fcami, frenaud, ftrivino, ipa-maint, jcholast, jhrozek, mhjacks, msauton, pvoborni, rcritten, sjansen, ssorce, tbordaz, tscherf, twoerner
Target Milestone: betaKeywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-11-04 07:27:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Logs and command outputs none

Description Paul Nickerson 2021-05-04 19:49:48 UTC
Created attachment 1779495 [details]
Logs and command outputs

Description of problem:
When ipaupgrade runs, PKI Tomcat's CA tries to connect to 389 Directory Server's ns-slapd (LDAP) and fails with connection refused because ns-slapd happens to be restarting in that 4 second time window. CA dies at this moment. Later on, ipaupgrade tries to connect to CA and fails with internal server error, and errors out.

Version-Release number of selected component (if applicable):
ipa --version: VERSION: 4.8.7, API_VERSION: 2.239
Red Hat Enterprise Linux 8.3 (Ootpa)

How reproducible:
Happens every time I try to run "ipactl restart". Haven't tried this on a fresh system.

Steps to Reproduce:
1. ipactl restart

Actual results:
Failure reported and all services are stopped.

Expected results:
Success reported and all services are running.

Additional info:

I had need to run "ipactl restart", and got this message:
IPA version error: data needs to be upgraded (expected version '4.8.7-16.module+el8.3.0+10289+b6566038', current version '4.8.7-13.module+el8.3.0+8376+0bba7131')
Automatically running upgrade, for details see /var/log/ipaupgrade.log
Be patient, this may take a few minutes.

After a little time, this completes with failure:
[Migrating certificate profiles to LDAP]
IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command ipa-server-upgrade manually.
Unexpected error - see /var/log/ipaupgrade.log for details:
RemoteRetrieveError: Failed to authenticate to CA REST API

/var/log/ipaupgrade.log
2021-05-04T18:17:39Z DEBUG SUCCESS: port: 8443
2021-05-04T18:17:39Z DEBUG Start of pki-tomcatd complete
2021-05-04T18:17:40Z DEBUG args=['/bin/systemctl', 'stop', 'dirsrv']
2021-05-04T18:17:42Z DEBUG args=['/bin/systemctl', 'start', 'dirsrv']
2021-05-04T18:17:45Z DEBUG args=['/bin/systemctl', 'is-active', 'dirsrv']
2021-05-04T18:17:55Z DEBUG request GET https://ipa01.semi.cashstar.net:8443/ca/rest/account/login
2021-05-04T18:17:55Z DEBUG request body ''
2021-05-04T18:17:55Z DEBUG response status 500
CA subsystem unavailable. Check CA debug log.

/var/log/pki/pki-tomcat/ca/debug.2021-05-04.log
2021-05-04 18:17:43 [main] INFO: Initializing CA subsystem
2021-05-04 18:17:43 [main] FINE: LdapBoundConnection: Connecting to ipa01.semi.cashstar.net:636 with client cert auth
2021-05-04 18:17:43 [main] SEVERE: Unable to create socket: java.net.ConnectException: Connection refused (Connection refused)
2021-05-04 18:17:43 [main] SEVERE: Shutting down.
2021-05-04 18:17:43 [main] INFO: Shutting down CA subsystem

journalctl --unit dirsrv
May 04 18:17:42 ipa01.semi.cashstar.net systemd[1]: Starting 389 Directory Server SEMI-CASHSTAR-NET....
May 04 18:17:44 ipa01.semi.cashstar.net ns-slapd[893792]: [04/May/2021:18:17:44.029650333 +0000] - INFO - slapd_extract_cert - CA CERT NAME: SEMI.CASHSTAR.NET IPA CA

I ran this to verify:
while true; do date; netstat -pan | egrep "636|389" | grep " LISTEN "; sleep 1; done
Tue May  4 18:17:40 UTC 2021
tcp6       0      0 :::636                  :::*                    LISTEN      893452/ns-slapd     
tcp6       0      0 :::389                  :::*                    LISTEN      893452/ns-slapd     
Tue May  4 18:17:41 UTC 2021
Tue May  4 18:17:42 UTC 2021
Tue May  4 18:17:43 UTC 2021
Tue May  4 18:17:44 UTC 2021
Tue May  4 18:17:45 UTC 2021
tcp6       0      0 :::636                  :::*                    LISTEN      893792/ns-slapd     
tcp6       0      0 :::389                  :::*                    LISTEN      893792/ns-slapd     

The LDAP directory server ns-slapd is not listening on port 636 at 18:17:43 because dirsrv is still starting up at that time. And at that moment, CA tries to connect to ns-slapd on port 636, the connection is refused, and CA shuts down. And then 12 seconds later, ipa-server-upgrade tries to connect to CA and gets an Internal Server Error response. I think PKI Tomcat, which got started by ipa-server-upgrade at 18:17:39, just so happened to start CA at 18:17:43, which is right in the middle of the wrong time. This looks like a race condition. A race condition which is reliably happening every time for me.

Comment 1 Paul Nickerson 2021-05-04 20:34:03 UTC
I figured out a workaround that works on my system.

Edit this file:
/usr/lib/python3.6/site-packages/ipaserver/install/server/upgrade.py

Near the top, add this:
import time

Find this line:
    logger.info('[Verifying that KDC configuration is using ipa-kdb backend]')
And add this right afterword:
    logger.info('But first, I sleep so that I don\'t stop 389 Directory Server LDAP ns-slapd while PKI Tomcat CA is trying to use it.')
    time.sleep(20)
    logger.info('I have woken.')

Save your changes, and run "ipactl restart" (or maybe ipa-server-upgrade would work).

On my system, 20 seconds worked, and 4 was not enough. The idea is that "systemctl stop dirsrv" should not run until "[main] INFO: CA subsystem started" shows up in the ca debug log. And waiting 20 seconds is one dirty way to get it done.

Comment 6 sjansen 2021-12-16 17:50:23 UTC
I can confirm this issue on several installations, after latest IPA server update (RHEL 8.5 and also on some clones) the CA is dead after an upgrade. Web GUI "show Unable to communicate with CMS (403)"

Comment 7 Florence Blanc-Renaud 2022-04-06 17:31:04 UTC
(In reply to sjansen from comment #6)
> I can confirm this issue on several installations, after latest IPA server
> update (RHEL 8.5 and also on some clones) the CA is dead after an upgrade.
> Web GUI "show Unable to communicate with CMS (403)"

In your specific case, the error may be related to a different issue. Can you check which version of pki-servlet-engine is installed on your system + the content of /etc/pki/pki-tomcat/server.xml? You will find more information in https://access.redhat.com/solutions/6632811, but in short the update to RHEL 8.5 may have updated the file /etc/pki/pki-tomcat/server.xml while it should not.
With pki-servlet-engine-9.0.30 and below: /etc/pki/pki-tomcat/server.xml should contain requiredSecret=xxx in the section defining the AJP connector.
With pki-servlet-engine-9.0.31 and above: /etc/pki/pki-tomcat/server.xml should contain secret=xxx
This is already tracked in BZ #2029023 and BZ #2061458.

The issue described in this bugzilla is different and related to the startup of services and timeouts.

Comment 8 sjansen 2022-04-22 10:10:12 UTC
(In reply to Florence Blanc-Renaud from comment #7)

> In your specific case, the error may be related to a different issue. Can
> you check which version of pki-servlet-engine is installed on your system +
> the content of /etc/pki/pki-tomcat/server.xml? You will find more
> information in https://access.redhat.com/solutions/6632811, but in short the
> update to RHEL 8.5 may have updated the file /etc/pki/pki-tomcat/server.xml
> while it should not.
> With pki-servlet-engine-9.0.30 and below: /etc/pki/pki-tomcat/server.xml
> should contain requiredSecret=xxx in the section defining the AJP connector.
> With pki-servlet-engine-9.0.31 and above: /etc/pki/pki-tomcat/server.xml
> should contain secret=xxx
> This is already tracked in BZ #2029023 and BZ #2061458.
> 
> The issue described in this bugzilla is different and related to the startup
> of services and timeouts.

Hi Florance,

i am suprised i got tagged for this issue because its quite some time ago. Looks like this issue got moved around a bit, my orignal issue was in 2029023. From my side the issue got solved in 2029023 by editing server.xml to remove auth, it never reappeared on any of my freeipa servers or i never got another pki update since i changed server.xml.

Comment 10 RHEL Program Management 2022-11-04 07:27:47 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.