Created attachment 1698514 [details] tarball of /var/log from the failed test We have a test in openQA which does this: 1. Boot the previous release to the one being tested (in this case, Fedora 32) and deploy the system as a FreeIPA server 2. Similarly boot a second system to the previous release and deploy it as a FreeIPA client 3. Upgrade the server system to the release being tested (in this case, Rawhide); check that IPA is still running after the upgrade 4. Upgrade the client as well 5. Run a series of checks to make sure everything works after the upgrade Up till Fedora-Rawhide-20200622.n.0 this test was passing. Since Fedora-Rawhide-20200623.n.0 today, it's failing. It fails at step 3: the system upgrade of the server fails, but ipa.service then fails to start. The system logs show that ipa-server-upgrade fails, apparently due to issues with the certificate server (dogtag) and/or the LDAP server (389-ds). I'm attaching a tarball of /var/log . We're filing this against Java because of all the things that changed in the Fedora-Rawhide-20200623.n.0 compose it seems like the most likely suspect. There were a *lot* of at least *potentially* related things that changed, though. Aside from java-1.8.0-openjdk, other things that changed include bind, glibc, the kernel and openssl. Having looked at them, we don't *think* those changes are likely to be the cause, though. This violates the Beta criteria: "It must be possible to successfully complete a direct upgrade from a fully updated installation of each of the last two stable Fedora Server releases with the system configured as a FreeIPA domain controller or postgresql server as specified in the relevant criteria. The upgraded system must meet all relevant release criteria, including criteria relating to functionality of the server software" - https://fedoraproject.org/wiki/Fedora_33_Beta_Release_Criteria#Server_upgrade_requirements , so proposing as a Beta blocker.
This seems to be caused by tomcat@pki failing to start. Exception I see in the log is: 2020-06-23 11:09:33 [https-jsse-nio-8443-exec-6] WARNING: Invalid base64: [ipara]: java.lang.IllegalArgumentException: Last unit does not have enough valid bits java.lang.IllegalArgumentException: Last unit does not have enough valid bits at java.util.Base64$Decoder.decode0(Base64.java:734) at java.util.Base64$Decoder.decode(Base64.java:526) at java.util.Base64$Decoder.decode(Base64.java:549) at org.mozilla.jss.netscape.security.util.Utils.base64decode(Utils.java:417) at org.dogtagpki.server.authentication.AuthToken.getInByteArray(AuthToken.java:113) at org.dogtagpki.server.authentication.AuthToken.getInStringArray(AuthToken.java:198) at com.netscape.cms.servlet.cert.CertProcessor.setAuthTokenIntoRequest(CertProcessor.java:132) at com.netscape.cms.servlet.cert.CertProcessor.populateRequests(CertProcessor.java:345) at com.netscape.cms.servlet.cert.EnrollmentProcessor.processEnrollment(EnrollmentProcessor.java:189) at com.netscape.cms.servlet.cert.EnrollmentProcessor.processEnrollment(EnrollmentProcessor.java:97) at com.netscape.cms.servlet.profile.ProfileSubmitServlet.processEnrollment(ProfileSubmitServlet.java:276) at com.netscape.cms.servlet.profile.ProfileSubmitServlet.process(ProfileSubmitServlet.java:130) at com.netscape.cms.servlet.base.CMSServlet.service(CMSServlet.java:493) at javax.servlet.http.HttpServlet.service(HttpServlet.java:741) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:282) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:279) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAsPrivileged(Subject.java:549) at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:314) at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:170) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:225) at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:47) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:149) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:145) at java.security.AccessController.doPrivileged(Native Method) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:282) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:279) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAsPrivileged(Subject.java:549) at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:314) at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:253) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:191) at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:47) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:149) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:145) at java.security.AccessController.doPrivileged(Native Method) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541) at com.netscape.cms.tomcat.ExternalAuthenticationValve.invoke(ExternalAuthenticationValve.java:82) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:690) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:373) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748)
From /var/log/messages: Jun 23 11:31:50 ipa001 ipactl[717]: CalledProcessError: CalledProcessError(Command ['/bin/systemctl', 'start', 'pki-tomcatd'] returned non-zero exit status 1: 'Job for pki-tomcatd failed because a timeout was exceeded.\nSee "systemctl status pki-tomcatd" and "journalctl -xe" for details.\n') And from /var/log/ipaupgrade.log 2020-06-23T15:28:46Z DEBUG Starting external process 2020-06-23T15:28:46Z DEBUG args=['/usr/sbin/tomcat', 'version'] 2020-06-23T15:28:47Z DEBUG Process finished, return code=0 2020-06-23T15:28:47Z DEBUG stdout=Server version: Apache Tomcat/9.0.36 Server built: Dec 12 1969 11:45:33 UTC Server number: 9.0.36.0 OS Name: Linux OS Version: 5.8.0-0.rc2.20200622git625d3449788f.1.fc33.x86_64 Architecture: amd64 JVM Version: 1.8.0_262-ea-b01 JVM Vendor: Oracle Corporation 2020-06-23T15:28:47Z DEBUG stderr= 2020-06-23T15:28:47Z DEBUG Starting external process 2020-06-23T15:28:47Z DEBUG args=['/bin/systemctl', 'start', 'pki-tomcatd'] 2020-06-23T15:31:50Z DEBUG Process finished, return code=1 2020-06-23T15:31:50Z DEBUG stdout= 2020-06-23T15:31:50Z DEBUG stderr=Job for pki-tomcatd failed because a timeout was exceeded. See "systemctl status pki-tomcatd" and "journalctl -xe" for details. 2020-06-23T15:31:50Z ERROR IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command ipa-server-upgrade manually. I also see this in /var/log/pki/pki-tomcat/ca/debug*.log: 2020-06-23 11:07:16 [https-jsse-nio-8443-exec-3] SEVERE: Object certificate not found. Error Certificate not found: caSigningCert cert-pki-ca: Certificate not found: caSigningCert cert-pki-ca Certificate not found: caSigningCert cert-pki-ca: Certificate not found: caSigningCert cert-pki-ca at com.netscape.ca.SigningUnit.init(SigningUnit.java:163) at com.netscape.ca.CertificateAuthority.initSigUnit(CertificateAuthority.java:1741) at com.netscape.ca.CertificateAuthority.init(CertificateAuthority.java:540) at com.netscape.cmscore.apps.CMSEngine.reinit(CMSEngine.java:1014) at org.dogtagpki.server.ca.CAConfigurator.reinitSubsystems(CAConfigurator.java:170) at com.netscape.cms.servlet.csadmin.Configurator.setupDatabase(Configurator.java:447) at org.dogtagpki.server.ca.CAConfigurator.setupDatabase(CAConfigurator.java:76) at org.dogtagpki.server.rest.SystemConfigService.setupDatabase(SystemConfigService.java:137) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:140) at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:295) at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:249) at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:236) at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:406) at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:213) at org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.service(ServletContainerDispatcher.java:228) at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:56) at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:51) at javax.servlet.http.HttpServlet.service(HttpServlet.java:741) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:282) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:279) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAsPrivileged(Subject.java:549) at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:314) at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:170) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:225) at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:47) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:149) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:145) at java.security.AccessController.doPrivileged(Native Method) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:282) at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:279) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAsPrivileged(Subject.java:549) at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:314) at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:253) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:191) at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:47) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:149) at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:145) at java.security.AccessController.doPrivileged(Native Method) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541) at com.netscape.cms.tomcat.ExternalAuthenticationValve.invoke(ExternalAuthenticationValve.java:82) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:690) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:373) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: org.mozilla.jss.crypto.ObjectNotFoundException: Certificate not found: caSigningCert cert-pki-ca at org.mozilla.jss.CryptoManager.findCertByNicknameNative(Native Method) at org.mozilla.jss.CryptoManager.findCertByNickname(CryptoManager.java:865) at com.netscape.ca.SigningUnit.init(SigningUnit.java:159) ... 72 more
I don't see anything attributing this to OpenJDK. There seem to be failures throughout the system e.g. Jun 23 11:09:46 ipa001 certmonger[10065]: 2020-06-23 11:09:46 [10065] error:0D06407A:asn1 encoding routines:a2d_ASN1_OBJECT:first num too large Jun 23 11:09:53 ipa001 certmonger[10105]: Error obtaining initial credentials: Generic error (see e-text). Jun 23 11:09:53 ipa001 certmonger[10105]: Error setting up ccache at the client: Generic error (see e-text). Jun 23 11:11:53 ipa001 dnf[11039]: - Curl error (6): Couldn't resolve host name for https://dl.fedoraproject.org/pub/fedora/linux/releases/32/Modular/x86_64/os/repodata/repomd.xml [Could not resolve host: dl.fedoraproject.org] Jun 23 11:16:37 ipa001 named-pkcs11[10722]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed Jun 23 11:16:37 ipa001 named-pkcs11[10722]: ldap_syncrepl will reconnect in 60 seconds Jun 23 11:28:15 ipa001 setroubleshootd[797]: error: Unable to open sqlite database /var/lib/rpm/rpmdb.sqlite: unable to open database file
Hum. The test passed again today. It's definitely strange that it failed twice in a row exactly the same way with yesterday's compose, but with today's it worked, at least. I'll keep an eye on it and investigate more if it breaks again.
FWIW, today's test did log a *ton* of AVCs, which may have been related to the failure. I'll file a separate selinux-policy bug for these, but you can see them at https://openqa.fedoraproject.org/tests/620050/file/_console_avc_crash-avcs.txt .
Well, since I last updated this, the test has failed again every time it's run: https://openqa.fedoraproject.org/tests/621392#next_previous ditto a nearly-identical test which upgrades from F31 (rather than F32): https://openqa.fedoraproject.org/tests/621355#next_previous so there is definitely still some kind of problem here :( Each test failure has its logs attached on the Logs & Assets tab, so we can poke through to try and find out some common thread. Note the logs will get garbage collected after a couple of weeks, if we don't have this figured out soon I'll try and remember to attach a few more samples.
I tried a run with SELinux in permissive mode, but still failed: https://openqa.fedoraproject.org/tests/621579
Really having trouble getting a handle on this. It seems like the problem is tomcat, but I can't really find anything to say *why*. It starts to start up, then...just...doesn't: [adamw@adam tmp]$ journalctl --file var-new/log/journal/153ded1a9d5a467b846b7bbd9d256fa9/system.journal -b-1 -u pki-tomcatd --no-pager -- Logs begin at Tue 2020-06-30 13:15:34 PDT, end at Tue 2020-06-30 13:50:01 PDT. -- Jun 30 13:40:55 ipa001.domain.local systemd[1]: Starting PKI Tomcat Server pki-tomcat... Jun 30 13:40:57 ipa001.domain.local server[13559]: Java virtual machine used: /usr/lib/jvm/jre-1.8.0-openjdk/bin/java Jun 30 13:40:57 ipa001.domain.local server[13559]: classpath used: /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/lib/java/commons-daemon.jar Jun 30 13:40:57 ipa001.domain.local server[13559]: main class used: org.apache.catalina.startup.Bootstrap Jun 30 13:40:57 ipa001.domain.local server[13559]: flags used: Jun 30 13:40:57 ipa001.domain.local server[13559]: options used: -Dcatalina.base=/var/lib/pki/pki-tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/lib/pki/pki-tomcat/temp -Djava.util.logging.config.file=/var/lib/pki/pki-tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.security.manager -Djava.security.policy==/var/lib/pki/pki-tomcat/conf/catalina.policy Jun 30 13:40:57 ipa001.domain.local server[13559]: arguments used: start Jun 30 13:40:57 ipa001.domain.local ipa-pki-wait-running[13560]: pki.client: /usr/libexec/ipa/ipa-pki-wait-running:60: The subsystem in PKIConnection.__init__() has been deprecated (https://www.dogtagpki.org/wiki/PKI_10.8_Python_Changes). Jun 30 13:40:57 ipa001.domain.local ipa-pki-wait-running[13560]: ipa-pki-wait-running: Created connection http://ipa001.domain.local:8080/ca Jun 30 13:40:57 ipa001.domain.local ipa-pki-wait-running[13560]: ipa-pki-wait-running: Connection failed: HTTPConnectionPool(host='ipa001.domain.local', port=8080): Max retries exceeded with url: /ca/admin/ca/getStatus (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2546d9caf0>: Failed to establish a new connection: [Errno 111] Connection refused')) Jun 30 13:40:58 ipa001.domain.local java[13559]: usr/lib/api/apiutil.c Could not open /run/lock/opencryptoki/LCK..APIlock Jun 30 13:40:58 ipa001.domain.local ipa-pki-wait-running[13560]: ipa-pki-wait-running: Connection failed: HTTPConnectionPool(host='ipa001.domain.local', port=8080): Max retries exceeded with url: /ca/admin/ca/getStatus (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2546d9cf70>: Failed to establish a new connection: [Errno 111] Connection refused')) Jun 30 13:40:59 ipa001.domain.local ipa-pki-wait-running[13560]: ipa-pki-wait-running: Connection failed: HTTPConnectionPool(host='ipa001.domain.local', port=8080): Max retries exceeded with url: /ca/admin/ca/getStatus (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2546d57370>: Failed to establish a new connection: [Errno 111] Connection refused')) eventually ipa-pki-wait-running gets bored and times out.
Hum, well, here's a thing: looking more closely at the history of the two relevant tests, I think this started happening earlier than I thought. The test failed twice in Fedora-Rawhide-20200601.n.1 in a way that looks very similar: https://openqa.fedoraproject.org/tests/609532 https://openqa.fedoraproject.org/tests/609584 and there are periodic failures since then. The F31 -> Rawhide test failed in the same module in Fedora-Rawhide-20200529.n.2 and even in Fedora-Rawhide-20200516.n.0 ; those logs have been garbage collected now so I can't tell for sure if they were the same. I'll attach the logs from the 2020-06-01 failure before they get GCed...
Created attachment 1699394 [details] tarball of /var/log from the failed test on 2020-06-01
so the test has now passed every day for the last week. I officially give up. If it keeps passing next week I guess I'll just close the bug again.
This kept passing for a while, now it's failing again, but it's because of a known bug https://bugzilla.redhat.com/show_bug.cgi?id=1857043 . So I'll close this again for now.