Upgrade logging from 4.5/4.6 released version to the latest 4.6 always fails

CSV versions: clusterlogging.4.6.0-202101090741.p0, elasticsearch-operator.4.6.0-202101090040.p0

Description of problem:
Deploy released logging 4.5/4.6, then upgrade logging to the latest 4.6; the elasticsearch containers can't start after the upgrade.

cl/instance:

spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 60h
      audit:
        maxAge: 1d
      infra:
        maxAge: 3h
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 1
    type: kibana

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-685cc5d7c4-s66g6       1/1     Running                 0          15m
elasticsearch-cdm-cizg2js1-1-5c99cdf7c9-9pqlv   1/2     Running                 0          14m
elasticsearch-cdm-cizg2js1-2-79d856d7f-f8x5t    2/2     Running                 0          25m
elasticsearch-cdm-cizg2js1-3-8d58b79d-z9jbg     2/2     Running                 0          25m
elasticsearch-delete-app-1610335800-lcgss       0/1     Completed               0          17m
elasticsearch-delete-app-1610336700-blwss       0/1     Error                   0          2m6s
elasticsearch-delete-audit-1610335800-mt9r2     0/1     Completed               0          17m
elasticsearch-delete-audit-1610336700-fjrps     0/1     Error                   0          2m6s
elasticsearch-delete-infra-1610335800-wgzj9     0/1     Completed               0          17m
elasticsearch-delete-infra-1610336700-2nhfm     0/1     Error                   0          2m6s
elasticsearch-rollover-app-1610335800-v2rrr     0/1     Completed               0          17m
elasticsearch-rollover-app-1610336700-24pns     0/1     Error                   0          2m6s
elasticsearch-rollover-audit-1610335800-f7qw4   0/1     Completed               0          17m
elasticsearch-rollover-audit-1610336700-dgt4s   0/1     Error                   0          2m5s
elasticsearch-rollover-infra-1610335800-bvb4k   0/1     Completed               0          17m
elasticsearch-rollover-infra-1610336700-mnn7r   0/1     Error                   0          2m5s
fluentd-9p8t6                                   1/1     Running                 0          25m
fluentd-dcnmb                                   1/1     Running                 0          25m
fluentd-lgqxf                                   1/1     Running                 0          25m
fluentd-rdmn7                                   1/1     Running                 0          25m
fluentd-rf9kq                                   0/1     Init:CrashLoopBackOff   7          14m
fluentd-x4pnr                                   1/1     Running                 0          25m
kibana-6d95c5bf74-znx7n                         2/2     Running                 0          14m

Lots of errors in the ES pod:

[2021-01-11T03:39:17,114][WARN ][o.e.t.OutboundHandler   ] [elasticsearch-cdm-cizg2js1-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.131.0.188:47828, remoteAddress=elasticsearch-cluster.openshift-logging.svc/172.30.178.14:9300}]
javax.net.ssl.SSLException: SSLEngine closed already
        at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2021-01-11T03:39:17,116][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [elasticsearch-cdm-cizg2js1-1] SSL Problem PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
        at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:350) ~[?:?]
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:293) ~[?:?]
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:288) ~[?:?]
        at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1356) ~[?:?]
        at sun.security.ssl.CertificateMessage$T13CertificateConsumer.onConsumeCertificate(CertificateMessage.java:1231) ~[?:?]
        at sun.security.ssl.CertificateMessage$T13CertificateConsumer.consume(CertificateMessage.java:1174) ~[?:?]
        at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392) ~[?:?]
        at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444) ~[?:?]
        at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1074) ~[?:?]
        at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1061) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1008) ~[?:?]
        at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1464) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1369) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:369) ~[?:?]
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:275) ~[?:?]
        at sun.security.validator.Validator.validate(Validator.java:264) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141) ~[?:?]
        at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1334) ~[?:?]
        ... 29 more
Caused by: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
        at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:158) ~[?:?]
        at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:84) ~[?:?]
        at java.security.cert.CertPathValidator.validate(CertPathValidator.java:309) ~[?:?]
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:364) ~[?:?]
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:275) ~[?:?]
        at sun.security.validator.Validator.validate(Validator.java:264) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276) ~[?:?]
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141) ~[?:?]
        at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1334) ~[?:?]
        ... 29 more
[2021-01-11T03:39:19,095][WARN ][o.e.d.z.ZenDiscovery    ] [elasticsearch-cdm-cizg2js1-1] not enough master nodes discovered during pinging (found [[Candidate{node={elasticsearch-cdm-cizg2js1-1}{P2Yxe5sqSRyMr6lTszJVng}{zz002tNNQ7qcwHPdRxKkFQ}{10.131.0.188}{10.131.0.188:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2021-01-11T03:39:19,115][WARN ][o.e.t.OutboundHandler   ] [elasticsearch-cdm-cizg2js1-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.131.0.188:47842, remoteAddress=elasticsearch-cluster.openshift-logging.svc/172.30.178.14:9300}]
javax.net.ssl.SSLException: SSLEngine closed already
        at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2021-01-11T03:39:19,116][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [elasticsearch-cdm-cizg2js1-1] SSL Problem PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors

The EO keeps repeating the following error messages:

{"level":"error","ts":1610336026.420805,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"Put \"https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336334.8442912,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"info","ts":1610336334.9022033,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:64","msg":"Unable to check if threshold is enabled","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336335.9737287,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"info","ts":1610336336.0337365,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:64","msg":"Unable to check if threshold is enabled","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336336.1035793,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:81","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_template\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
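For anyone triaging the same symptom, one quick check is to compare the signing CA the clients trust with the certificate the ES service is actually presenting. A minimal sketch, not taken from this report: it assumes the default openshift-logging namespace, the documented admin-ca key in the elasticsearch secret, and that port 9200 is reachable via a port-forward.

# CA that the operator and clients are configured to trust
$ oc extract secret/elasticsearch -n openshift-logging --keys=admin-ca --to=/tmp
$ openssl x509 -in /tmp/admin-ca -noout -subject -enddate

# certificate actually being served on the ES endpoint
$ oc port-forward -n openshift-logging svc/elasticsearch 9200:9200 &
$ openssl s_client -connect localhost:9200 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -enddate

If the issuer served on 9200 no longer chains to the extracted admin-ca, some pods are still running with stale certificates, which matches the PKIX errors above.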
Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy released logging 4.5/4.6 by subscribing from the catalogsource `redhat-operators` (see Additional info for a Subscription sketch)
2. Upgrade logging to the latest 4.6

Actual results:
After the upgrade, the elasticsearch containers fail to start; the ES pods log PKIX path validation failures and the EO repeats x509 "certificate signed by unknown authority" errors.

Expected results:
The logging stack upgrades cleanly and the ES cluster forms without certificate errors.

Additional info:
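As a sketch of step 1, a Subscription of roughly this shape pulls the CLO from `redhat-operators` (the channel and name/namespace values below are typical defaults, not copied from this environment; the EO gets an analogous Subscription, usually in openshift-operators-redhat):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "4.6"
  # Automatic approval is what lets the CLO and EO upgrade at the same time
  installPlanApproval: Automatic
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace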
Created attachment 1746137 [details]
must-gather

Hit the same issue when upgrading logging from released 4.6 (4.6.0-202011221454.p0) to the latest 4.7 (4.7.0-202101092121.p0); details are in the attachment.
We should be able to work around this issue by restarting the ES pods (`oc delete pods -l component=elasticsearch`) so that the operator and the ES cluster are using the same certificates.
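A sketch of that workaround in command form (assuming the default openshift-logging namespace; `oc wait` is just one way to confirm the restarted pods come back Ready):

$ oc delete pods -n openshift-logging -l component=elasticsearch
$ oc wait --for=condition=Ready pod -l component=elasticsearch -n openshift-logging --timeout=10m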
(In reply to Qiaoling Tang from comment #14)
> Tested in clusterlogging.4.6.0-202101162152.p0, here are the steps and
> results:
>
> 3. upgrade from 4.5 to 4.6: upgrade CLO and EO at the same time:
> need to do the workaround (oc delete pods -l component=elasticsearch) twice,
> then the upgrade can be successful
>
> 4. upgrade from 4.6 to 4.6, upgrade EO and CLO at the same time:
>
> upgrade from released version (clusterlogging.4.6.0-202011221454.p0) to
> latest 4.6: need to do the workaround (oc delete pods -l
> component=elasticsearch), then the logging stack could be upgraded

These issues are likely the same, and we may need to get the change into 4.5. I'm betting it is the lower-version CLO, which does not have the fix, recreating the certs; that causes issues when the EO tries to upgrade. The cases you identify as having issues are all ones going from a non-fixed version to a fixed one. The successful case is the one listed below, which goes from a fixed version to a fixed version.

> upgrade from non-released version to latest 4.6: succeeded
>
> @Jeff,
> my concern with verifying this bz is: if customers set `installPlanApproval:
> Automatic` in their subscriptions when deploying logging 4.6, then after we
> release a new logging 4.6, the CLO and EO will be upgraded at the same time
> and the customers will need to do the workaround. Will the customers accept
> the workaround?

I can't comment on whether customers will accept the workaround, though it would be desirable to resolve this for them. We likely need to get the change into 4.5 too, as I believe we backported earlier cert changes to 4.5.
I believe the correct solution is to release https://github.com/openshift/cluster-logging-operator/pull/858 first, which will bring all the cert changes into 4.5 and stop the CLO from prematurely regenerating the certs.
Scenario 1: upgrade logging from 4.6 released to 4.6 latest -- upgrade CLO first. Result: Pass
Scenario 2: upgrade logging from 4.6 released to 4.6 latest -- upgrade EO first. Result: Pass
Scenario 3: upgrade logging from 4.6 released to 4.6 latest -- upgrade CLO and EO at the same time. Result: Fail

pull/858 is for 4.5. I don't think it can fix Scenario 3.
(In reply to Anping Li from comment #18)
> Scenario 1: upgrade logging from 4.6 released to 4.6 latest -- upgrade CLO
> first. Result: Pass
> Scenario 2: upgrade logging from 4.6 released to 4.6 latest -- upgrade EO
> first. Result: Pass
> Scenario 3: upgrade logging from 4.6 released to 4.6 latest -- upgrade CLO
> and EO at the same time. Result: Fail
>
> pull/858 is for 4.5. I don't think it can fix Scenario 3.

You are correct. It will only fix an upgrade from 4.5 to the latest 4.6, assuming 4.5 has the cert change. For 4.6 to the latest 4.6 it will not, and the only options are to upgrade only the EO and allow it to settle before the CLO, or to manually intervene and remove the pods.
Given #c18, where it looks like we have no good options, we may still hold this PR until the 4.5 change lands to resolve those cases, but it might benefit users on 4.6. I defer to QE here.
*** Bug 1916911 has been marked as a duplicate of this bug. ***
*** Bug 1918441 has been marked as a duplicate of this bug. ***
@Anping, moving this back to ON_QA, as there are no good options to address your finding other than to document the workaround. The cert changes are partially in 4.6, and this change will correct them, but any earlier deployments will not have them. There is a Jira issue filed to guard against this, but it would need to be backported and would otherwise hold up 4.6 changes in general. I'm not certain how best to convey it, but upgrading should be:

1. Upgrade CLO followed by EO (or vice versa)
2. If there is an issue, `oc delete pod -l component=elasticsearch` to restart the ES pods and resolve it

A rough command sketch follows this list.
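One way to sequence the upgrade with OLM so the CLO and EO are not upgraded simultaneously (a sketch only; the subscription names and namespaces assume a default deployment, and <clo-installplan-name> is a placeholder you would read from `oc get installplan -n openshift-logging`):

# switch to manual approval so the two operators do not upgrade at once
$ oc patch subscription cluster-logging -n openshift-logging --type merge -p '{"spec":{"installPlanApproval":"Manual"}}'
$ oc patch subscription elasticsearch-operator -n openshift-operators-redhat --type merge -p '{"spec":{"installPlanApproval":"Manual"}}'

# approve the CLO install plan first, let it settle, then repeat for the EO
$ oc patch installplan <clo-installplan-name> -n openshift-logging --type merge -p '{"spec":{"approved":true}}'

# if the ES pods still hit certificate errors afterwards, apply the workaround
$ oc delete pods -n openshift-logging -l component=elasticsearch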
@Jeff, how can the customer/support team know the workaround before they hit this issue?
Moving to verified; the docs include it in the z-stream update for 4.6.13.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.13 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0173