Bug 1906641 - Elasticsearch cluster can't be fully upgraded after upgrading logging from 4.5 to the latest 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard: logging-core
Depends On: 1905910
Blocks: 1915840
 
Reported: 2020-12-11 02:33 UTC by Qiaoling Tang
Modified: 2024-03-25 17:30 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An earlier change to certificate generation caused the CLO to improperly regenerate certificates.
Consequence: The CLO could regenerate certificates while the EO was trying to restart the cluster, leaving the EO unable to communicate with the cluster and the individual nodes unable to form a cluster among themselves because of mismatched certs.
Fix: Properly store all certs in the master secret and properly extract them to the CLO's working directory.
Result: During reconciliation the CLO has all the certificates in its working directory and can properly evaluate whether they need to be regenerated. Since they should not have expired, the CLO does not regenerate them, which allows the EO to communicate with the ES cluster without certificate changes mid-upgrade.
Clone Of:
Environment:
Last Closed: 2021-01-25 20:21:05 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (6.95 MB, application/gzip)
2021-01-11 05:05 UTC, Qiaoling Tang


Links
Github openshift/cluster-logging-operator pull 849 (closed): Bug 1906641: Correctly extract master-certs to working directory (last updated 2021-02-16 16:26:50 UTC)
Red Hat Product Errata RHBA-2021:0173 (last updated 2021-01-25 20:21:12 UTC)

Description Qiaoling Tang 2020-12-11 02:33:57 UTC
Upgrading logging from a released 4.5/4.6 version to the latest 4.6 always fails:

CSV versions: clusterlogging.4.6.0-202101090741.p0, elasticsearch-operator.4.6.0-202101090040.p0 

Description of problem:
Deploy released logging 4.5/4.6, then upgrade logging to the latest 4.6; the elasticsearch container can't start after the upgrade.

cl/instance:
  spec:
    collection:
      logs:
        fluentd: {}
        type: fluentd
    logStore:
      elasticsearch:
        nodeCount: 3
        redundancyPolicy: SingleRedundancy
        resources:
          requests:
            memory: 2Gi
        storage:
          size: 20Gi
          storageClassName: standard
      retentionPolicy:
        application:
          maxAge: 60h
        audit:
          maxAge: 1d
        infra:
          maxAge: 3h
      type: elasticsearch
    managementState: Managed
    visualization:
      kibana:
        replicas: 1
      type: kibana


$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-685cc5d7c4-s66g6       1/1     Running                 0          15m
elasticsearch-cdm-cizg2js1-1-5c99cdf7c9-9pqlv   1/2     Running                 0          14m
elasticsearch-cdm-cizg2js1-2-79d856d7f-f8x5t    2/2     Running                 0          25m
elasticsearch-cdm-cizg2js1-3-8d58b79d-z9jbg     2/2     Running                 0          25m
elasticsearch-delete-app-1610335800-lcgss       0/1     Completed               0          17m
elasticsearch-delete-app-1610336700-blwss       0/1     Error                   0          2m6s
elasticsearch-delete-audit-1610335800-mt9r2     0/1     Completed               0          17m
elasticsearch-delete-audit-1610336700-fjrps     0/1     Error                   0          2m6s
elasticsearch-delete-infra-1610335800-wgzj9     0/1     Completed               0          17m
elasticsearch-delete-infra-1610336700-2nhfm     0/1     Error                   0          2m6s
elasticsearch-rollover-app-1610335800-v2rrr     0/1     Completed               0          17m
elasticsearch-rollover-app-1610336700-24pns     0/1     Error                   0          2m6s
elasticsearch-rollover-audit-1610335800-f7qw4   0/1     Completed               0          17m
elasticsearch-rollover-audit-1610336700-dgt4s   0/1     Error                   0          2m5s
elasticsearch-rollover-infra-1610335800-bvb4k   0/1     Completed               0          17m
elasticsearch-rollover-infra-1610336700-mnn7r   0/1     Error                   0          2m5s
fluentd-9p8t6                                   1/1     Running                 0          25m
fluentd-dcnmb                                   1/1     Running                 0          25m
fluentd-lgqxf                                   1/1     Running                 0          25m
fluentd-rdmn7                                   1/1     Running                 0          25m
fluentd-rf9kq                                   0/1     Init:CrashLoopBackOff   7          14m
fluentd-x4pnr                                   1/1     Running                 0          25m
kibana-6d95c5bf74-znx7n                         2/2     Running                 0          14m

Lots of errors in the ES pod:

[2021-01-11T03:39:17,114][WARN ][o.e.t.OutboundHandler    ] [elasticsearch-cdm-cizg2js1-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.131.0.188:47828, remoteAddress=elasticsearch-cluster.openshift-logging.svc/172.30.178.14:9300}]
javax.net.ssl.SSLException: SSLEngine closed already
	at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2021-01-11T03:39:17,116][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [elasticsearch-cdm-cizg2js1-1] SSL Problem PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
	at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:350) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:293) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:288) ~[?:?]
	at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1356) ~[?:?]
	at sun.security.ssl.CertificateMessage$T13CertificateConsumer.onConsumeCertificate(CertificateMessage.java:1231) ~[?:?]
	at sun.security.ssl.CertificateMessage$T13CertificateConsumer.consume(CertificateMessage.java:1174) ~[?:?]
	at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392) ~[?:?]
	at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444) ~[?:?]
	at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1074) ~[?:?]
	at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1061) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
	at sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1008) ~[?:?]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1464) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1369) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
	at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:369) ~[?:?]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:275) ~[?:?]
	at sun.security.validator.Validator.validate(Validator.java:264) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141) ~[?:?]
	at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1334) ~[?:?]
	... 29 more
Caused by: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
	at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:158) ~[?:?]
	at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:84) ~[?:?]
	at java.security.cert.CertPathValidator.validate(CertPathValidator.java:309) ~[?:?]
	at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:364) ~[?:?]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:275) ~[?:?]
	at sun.security.validator.Validator.validate(Validator.java:264) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141) ~[?:?]
	at sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1334) ~[?:?]
	... 29 more
[2021-01-11T03:39:19,095][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch-cdm-cizg2js1-1] not enough master nodes discovered during pinging (found [[Candidate{node={elasticsearch-cdm-cizg2js1-1}{P2Yxe5sqSRyMr6lTszJVng}{zz002tNNQ7qcwHPdRxKkFQ}{10.131.0.188}{10.131.0.188:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2021-01-11T03:39:19,115][WARN ][o.e.t.OutboundHandler    ] [elasticsearch-cdm-cizg2js1-1] send message failed [channel: Netty4TcpChannel{localAddress=/10.131.0.188:47842, remoteAddress=elasticsearch-cluster.openshift-logging.svc/172.30.178.14:9300}]
javax.net.ssl.SSLException: SSLEngine closed already
	at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2021-01-11T03:39:19,116][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [elasticsearch-cdm-cizg2js1-1] SSL Problem PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors


The EO keeps repeating the following error messages:
{"level":"error","ts":1610336026.420805,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"Put \"https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336334.8442912,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"info","ts":1610336334.9022033,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:64","msg":"Unable to check if threshold is enabled","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336335.9737287,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"info","ts":1610336336.0337365,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:64","msg":"Unable to check if threshold is enabled","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}
{"level":"error","ts":1610336336.1035793,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:81","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_template\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"openshift-cluster-logging-signer\")"}


Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy released logging 4.5/4.6 by subscribing from the catalogsource `redhat-operators` (a sketch of such a subscription follows below)
2. Upgrade logging to the latest 4.6
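
For step 1, a minimal sketch of the kind of Subscription used; the channel value and the namespace are assumptions and need to match the release actually being installed:

$ oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: "4.5"                  # assumed channel for the released operator
  installPlanApproval: Automatic
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
$ # For step 2, switch the subscription to the latest 4.6 channel, e.g.:
$ oc -n openshift-logging patch subscription cluster-logging --type merge -p '{"spec":{"channel":"4.6"}}'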

Actual results:


Expected results:


Additional info:

Comment 3 Anping Li 2020-12-18 06:37:34 UTC
Sorry, big change -> big chance.

Comment 4 Qiaoling Tang 2021-01-11 05:05:06 UTC
Created attachment 1746137 [details]
must-gather

Hit the same issue when upgrading logging from released 4.6 (4.6.0-202011221454.p0) to the latest 4.7 (4.7.0-202101092121.p0); details are in the attachment.

Comment 12 Jeff Cantrill 2021-01-13 16:21:30 UTC
We should be able to work around this issue by restarting the ES pods (oc delete pods -l component=elasticsearch) so that the operator and the ES cluster are using the same certificates.
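
A sketch of the workaround, assuming the default openshift-logging namespace:

$ # Restart the ES pods so they pick up the same certificates the operator is using
$ oc -n openshift-logging delete pods -l component=elasticsearch
$ # Watch the pods come back and reach 2/2 Ready
$ oc -n openshift-logging get pods -l component=elasticsearch -w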

Comment 16 Jeff Cantrill 2021-01-20 14:29:44 UTC
(In reply to Qiaoling Tang from comment #14)
> Tested in clusterlogging.4.6.0-202101162152.p0, here are the steps and
> 
> 3. upgrade from 4.5 to 4.6: upgrade CLO and EO at the same time:
> need to do the workaround (oc delete pods -l component=elasticsearch) twice,
> then the upgrade can be successful
> 
> 4. upgrade from 4.6 to 4.6, upgrade EO and CLO at the same time
> 
> upgrade from released version(clusterlogging.4.6.0-202011221454.p0) to
> latest 4.6: need to do the workaround(oc delete pods -l
> component=elasticsearch), then the logging stack could be upgraded

These issues are likely the same, and we may need to get the change into 4.5. I'm betting the older CLO, which does not have the fix, is recreating the certs, which causes issues when the EO tries to upgrade. The cases you identify as having the issue are all ones going from a non-fixed to a fixed version. The successful case is the one listed below, which goes from a fixed to a fixed version.

> 
> upgrade from non-released version to latest 4.6: succeeded
> 
> 
> @Jeff,
> my concern to verify this bz is: if customers set `installPlanApproval:
> Automatic` in their subscriptions when deploy logging 4.6, after we release
> new logging 4.6, the CLO and EO will be upgraded at the same time, then the
> customers need to do the workaround, will the customers accept the
> workaround?

I can't comment on whether customers will accept the workaround, though it would be desirable to resolve this for them. We likely need to get the change into 4.5 too, as I believe we backported the earlier cert changes to 4.5.

Comment 17 Jeff Cantrill 2021-01-20 14:46:29 UTC
I believe the correct solution is to release https://github.com/openshift/cluster-logging-operator/pull/858 first, which will bring all the cert changes into 4.5 and prevent the CLO from prematurely regenerating the certs.

Comment 18 Anping Li 2021-01-20 16:04:06 UTC
Scenario 1: upgrade logging from the 4.6 release to the latest 4.6 -- upgrade CLO first.   Result: Pass
Scenario 2: upgrade logging from the 4.6 release to the latest 4.6 -- upgrade EO first.    Result: Pass
Scenario 3: upgrade logging from the 4.6 release to the latest 4.6 -- upgrade CLO and EO at the same time.   Result: Fail

Pull 858 is for 4.5. I don't think it can fix Scenario 3.

Comment 19 Jeff Cantrill 2021-01-20 20:12:34 UTC
(In reply to Anping Li from comment #18)
> Scenarios 1: upgrade logging from 4.6 release to 4.6 latest -- Upgrade CLO
> First   Result: Pass
> Scenarios 2: upgrade logging from 4.6 release to 4.6 latest -- Upgrade EO
> First    Result: Pass
> Scenarios 3: upgrade logging from 4.6 release to 4.6 latest -- Upgrade CLO
> and EO at same time   Result: Fail
> 
> The pull/858 is for 4.5. I don't think that can fix the Scenarios 3.

You are correct. It will only fix a 4.5 to latest 4.6 upgrade, assuming 4.5 has the cert change. For a 4.6 to latest 4.6 upgrade it will not, and the only options are to upgrade only the EO and let it settle before the CLO, or to manually intervene and remove the pods.
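
For the 4.6 to latest 4.6 case, a rough sketch of letting the EO upgrade and settle before the CLO by using Manual install plan approval; the namespaces follow the documented defaults (openshift-operators-redhat for the EO, openshift-logging for the CLO) and the install plan names are placeholders:

$ # With installPlanApproval: Manual on both subscriptions, approve the EO's plan first
$ oc -n openshift-operators-redhat get installplan
$ oc -n openshift-operators-redhat patch installplan <eo-install-plan> --type merge -p '{"spec":{"approved":true}}'
$ # Wait for the EO upgrade and the ES cluster to settle, then approve the CLO's plan
$ oc -n openshift-logging get installplan
$ oc -n openshift-logging patch installplan <clo-install-plan> --type merge -p '{"spec":{"approved":true}}'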

Comment 20 Jeff Cantrill 2021-01-20 20:14:31 UTC
Given #c18, where it looks like we have no good options, we may still hold this PR until the 4.5 change lands to resolve those cases, but it might benefit users on 4.6. I defer to QE here.

Comment 21 Jeff Cantrill 2021-01-20 20:16:37 UTC
*** Bug 1916911 has been marked as a duplicate of this bug. ***

Comment 22 Jeff Cantrill 2021-01-20 20:19:08 UTC
*** Bug 1918441 has been marked as a duplicate of this bug. ***

Comment 23 Jeff Cantrill 2021-01-21 22:02:45 UTC
@Anping moving this back to ON_QA as there are no good options to address your finding other than to document the workaround. The cert changes are partially in 4.6 and this change will correct them, but any earlier deployments will not have them. There is a Jira issue filed to guard against this, but it would need to be backported and would otherwise hold up 4.6 changes in general. I'm not certain how to best convey that the upgrade should be to:

1. Upgrade the CLO followed by the EO (or vice versa)
2. If there is an issue, run oc delete pod -l component=elasticsearch to restart the ES pods and resolve it (a quick verification sketch follows below)
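
If step 2 is needed, a quick way to confirm the cluster recovered afterwards; this assumes the es_util helper that ships in the OpenShift Elasticsearch image:

$ oc -n openshift-logging delete pod -l component=elasticsearch
$ # Once the pods are back to 2/2 Ready, check cluster health from inside one of them
$ ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
$ oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query='_cluster/health?pretty'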

Comment 24 Anping Li 2021-01-22 03:47:18 UTC
@Jeff, how can the customer/support team know about the workaround before they hit this issue?

Comment 25 Anping Li 2021-01-22 14:06:21 UTC
Moving to VERIFIED; the docs include the workaround in the z-stream update for 4.6.13.

Comment 27 errata-xmlrpc 2021-01-25 20:21:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.13 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0173

