Description of problem:
While deploying logging in an OpenShift cluster we get an error while Elasticsearch is initializing. Info is attached.

[2019-02-05 07:13:43,200][INFO ][container.run ] Begin Elasticsearch startup script
~/config ~
~
[2019-02-05 07:13:43,333][INFO ][container.run ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2019-02-05 07:13:43,335][INFO ][container.run ] Inspecting the maximum RAM available...
[2019-02-05 07:13:43,339][INFO ][container.run ] ES_HEAP_SIZE: '2048m'
[2019-02-05 07:13:43,349][INFO ][container.run ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2019-02-05 07:13:43,353][INFO ][container.run ] Checking if Elasticsearch is ready on https://localhost:9200
[2019-02-05 07:19:32,580][ERROR][container.run ] Timed out waiting for Elasticsearch to be ready
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

Version-Release number of selected component (if applicable):
OCP 3.9

How reproducible:
Always on the customer side

Steps to Reproduce:
1. Deploy logging using the inventory attached
2.
3.

Actual results:
ES fails to initialize with a 503

Expected results:
ES initializes successfully and does not fail with a 503

Additional info:
Details added in next private comment
(In reply to Miheer Salunke from comment #1)
> From events also we don't get much clue.
> logging-es-data-master-x8jvzxet-6-69pk9.158066640ae5a1b7 Pod Warning FailedAttachVolume attachdetach-controller
> Multi-Attach error for volume "pvc-e91be814-235f-11e9-b00c-000c29699361"
> Volume is already exclusively attached to one node and can't be attached to another

Is this an AWS-backed OpenShift cluster? The problem could be that the data volume from the previous deployment is not being released so it can be attached to the new pod. We have seen issues in dedicated clusters on AWS where it can take upwards of 20 minutes for AWS to release the volume. This causes additional problems because the deploymentconfig rollback timeout is only 5 minutes, so the pod is rolled back before the volume can be attached, leaving the pods in various states.

I recommend editing the deployment config to increase the rollback timeout to 1800 and then rolling out the latest deployment [1].

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/docs/troubleshooting.md#elasticsearch-deployments-never-succeed-and-rollback-to-previous-version
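For reference, a minimal sketch of the change described above, assuming the default 'logging' namespace, the Recreate strategy used by the logging ES deployment configs, and a DC name taken from the event above (substitute your own from 'oc get dc'):

  # List the Elasticsearch deployment configs (component=es is the usual label; verify in your cluster)
  oc get dc -n logging -l component=es

  # Raise the rollback/rollout timeout to 1800 seconds on one DC
  oc patch dc/logging-es-data-master-x8jvzxet -n logging \
    -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1800}}}}'

  # Trigger a new rollout so the change takes effect
  oc rollout latest dc/logging-es-data-master-x8jvzxet -n logging

Repeat for each logging-es-* DC.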
This is an on-premise OpenShift cluster. I changed the timeout to 1800 and rolled out all the ES deployments, but the problem persists with the same error.

[2019-02-06 19:54:19,509][INFO ][container.run ] Begin Elasticsearch startup script
~/config ~
~
[2019-02-06 19:54:19,643][INFO ][container.run ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2019-02-06 19:54:19,645][INFO ][container.run ] Inspecting the maximum RAM available...
[2019-02-06 19:54:19,650][INFO ][container.run ] ES_HEAP_SIZE: '2048m'
[2019-02-06 19:54:19,657][INFO ][container.run ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2019-02-06 19:54:19,660][INFO ][container.run ] Checking if Elasticsearch is ready on https://localhost:9200
[2019-02-06 20:00:10,647][ERROR][container.run ] Timed out waiting for Elasticsearch to be ready
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

There is also a log in /elasticsearch/persistent/logging-es/logs/logging-es.log (not shown on the console as stdout) that contains the following. I think it will be important input for the engineers:

[2019-02-06 19:57:29,180][DEBUG][netty.handler.ssl.SslHandler] [id: 0xa0cfa7ae, /0:0:0:0:0:0:0:1:35350 => /0:0:0:0:0:0:0:1:9200] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[2019-02-06 19:57:29,187][DEBUG][netty.handler.ssl.SslHandler] [id: 0xfccc1473, /0:0:0:0:0:0:0:1:44858 => /0:0:0:0:0:0:0:1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,187][DEBUG][netty.handler.ssl.SslHandler] [id: 0x75b71563, /0:0:0:0:0:0:0:1:44858 => /0:0:0:0:0:0:0:1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,196][DEBUG][netty.handler.ssl.SslHandler] [id: 0x13a07f96, /127.0.0.1:55122 => /127.0.0.1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,196][DEBUG][netty.handler.ssl.SslHandler] [id: 0x44ddbf7e, /127.0.0.1:55122 => /127.0.0.1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:30,340][DEBUG][netty.handler.ssl.SslHandler] [id: 0xea867f82, /0:0:0:0:0:0:0:1:35392 => /0:0:0:0:0:0:0:1:9200] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[2019-02-06 19:57:30,681][WARN ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [logging-es-data-master-35ny545c] Exception caught during discovery: An error has occurred.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:57)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:385)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:57)
	at io.fabric8.elasticsearch.cloud.kubernetes.KubernetesAPIServiceImpl.pods(KubernetesAPIServiceImpl.java:47)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.readNodes(KubernetesUnicastHostsProvider.java:190)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.lambda$buildDynamicNodes$0(KubernetesUnicastHostsProvider.java:92)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.buildDynamicNodes(KubernetesUnicastHostsProvider.java:91)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:335)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2.doRun(UnicastZenPing.java:249)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
	at com.squareup.okhttp.internal.io.RealConnection.connectTls(RealConnection.java:188)
	at com.squareup.okhttp.internal.io.RealConnection.connectSocket(RealConnection.java:145)
	at com.squareup.okhttp.internal.io.RealConnection.connect(RealConnection.java:108)
	at com.squareup.okhttp.internal.http.StreamAllocation.findConnection(StreamAllocation.java:184)
	at com.squareup.okhttp.internal.http.StreamAllocation.findHealthyConnection(StreamAllocation.java:126)
	at com.squareup.okhttp.internal.http.StreamAllocation.newStream(StreamAllocation.java:95)
	at com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:281)
	at com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:224)
	at com.squareup.okhttp.Call.getResponse(Call.java:286)
	at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:243)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$3.intercept(HttpClientUtils.java:110)
	at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:232)
	at com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:205)
	at com.squareup.okhttp.Call.execute(Call.java:80)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:210)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:383)
	... 12 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
	at sun.security.validator.Validator.validate(Validator.java:262)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
	... 35 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)
From Eric W: "I think it's k8s related; we don't generate that cert, so that would be something for them to test... can they curl ES using certs from one of the other component pods? If they can't, then ansible would need to regen the certs... if they can, then I'd wager it's k8s, meaning we need to restart the pod so ES picks up the latest cert there."

You can use [1] to check connectivity from the Kibana certs to ES, which may help confirm what Eric suggested above. Depending on the outcome, you may need to re-run ansible to regenerate your certs; I think there may be a playbook to help here. If you need to cycle the pods, scale each ES deployment config to zero and then back to one. Assuming this works, you will probably need to do additional work to get the Elasticsearch cluster back to a green state.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity
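As a rough illustration of the "curl ES using certs from one of the other component pods" check, here is a sketch assuming the default 'logging' namespace, the component=kibana pod label, and the standard cert mount paths in the Kibana container (/etc/kibana/keys/ca, cert, key); verify the actual paths in your pod before running:

  # Pick a Kibana pod
  kibana_pod=$(oc get pods -n logging -l component=kibana -o jsonpath='{.items[0].metadata.name}')

  # Curl the Elasticsearch service using the client certs mounted in the Kibana container
  oc exec -n logging -c kibana "$kibana_pod" -- \
    curl -s --cacert /etc/kibana/keys/ca \
         --cert /etc/kibana/keys/cert \
         --key /etc/kibana/keys/key \
         "https://logging-es:9200/_cluster/health?pretty"

If this returns cluster health JSON, the certs are fine and the problem is likely on the k8s side as Eric suggested; if it fails with a TLS error, the certs likely need to be regenerated.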
Hi Team,

Our Red Hat Services team is performing an OpenShift installation at a strategic client in Peru. Three weeks ago we opened a ticket (02295613) for an EFK issue, but the issue persists. The client expects us to finish the cluster implementation this week to achieve high availability for their application in the production environment with 2 OCP clusters in Active/Active. The project timeline is being compromised by this problem.

Do you have any workaround available, or what other alternative do you suggest until the RCA of the issue is obtained and we have a definitive solution?

Thanks,
Best Regards
(In reply to elvisdom from comment #11)
> Hi Team
>
> Do you have any WA available or what other alternative do you suggest until
> the RCA of the Issue is obtained and we can have a definitive solution?

One suggestion is to edit each Elasticsearch DC and remove the block that defines the readiness probe. This allows the pods to appear as ready to the platform and keeps kubernetes from either rolling them back or restarting them. It looks like you are unable to resolve the logging-es service, which may clear up once the pods appear ready and their endpoints are published (if they are not already). Once the pods start, we can further evaluate whether a cluster has formed and work to get it into a green state.
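A minimal sketch of removing the probe, assuming the elasticsearch container is the first container in the DC pod template (check with 'oc get dc <name> -o yaml' first) and an illustrative DC name:

  # Remove the readinessProbe from the (assumed) first container of an ES DC
  oc patch dc/logging-es-data-master-x8jvzxet -n logging --type=json \
    -p '[{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'

  # If the DC does not roll out automatically on the config change, trigger it
  oc rollout latest dc/logging-es-data-master-x8jvzxet -n logging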
(In reply to Miheer Salunke from comment #14)
> I commented out the readiness probe part in the deploymentConfig and the pods
> started. But the logs are still showing the error.
> Elasticsearch is ready on https://localhost:9200
> [2019-02-11 20:03:13,351][ERROR][container.run ] Timed out
> waiting for Elasticsearch to be ready
> HTTP/1.1 503 Service Unavailable
> Content-Type: text/plain; charset=UTF-8
> Content-Length: 0

This still looks like some networking issue.

* The elasticsearch services should have affiliated endpoints; 'oc get endpoints -o wide' will tell us whether the service is mapped to the pods.
* Additionally, you can run [1] to verify the elasticsearch pods are able to see one another.
* Finally, the logs were moved off to a persistent file, so you may need to further investigate what is happening using 'oc exec -c elasticsearch $pod -- logs'. The logs should clearly indicate whether the nodes are able to see each other and which one was elected master.

Once we are able to determine what is happening, it will probably be necessary to reseed the cluster: 'oc exec -c elasticsearch $pod -- es_seed_acl'

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-es-cluster-connectivity
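A short sketch of those checks in sequence, assuming the default 'logging' namespace and the component=es label on the ES pods; the 'logs' and 'es_seed_acl' helpers are the ones shipped in the logging Elasticsearch image, as referenced above:

  # 1. Confirm the logging-es services have endpoints mapped to pods
  oc get endpoints -n logging -o wide | grep logging-es

  # 2. Inspect the persistent Elasticsearch log inside each ES pod
  for pod in $(oc get pods -n logging -l component=es -o jsonpath='{.items[*].metadata.name}'); do
    echo "=== $pod ==="
    oc exec -n logging -c elasticsearch "$pod" -- logs | tail -n 50
  done

  # 3. Once the nodes see each other and a master is elected, reseed the cluster
  oc exec -n logging -c elasticsearch "$pod" -- es_seed_acl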
Please tell me to shut up if I'm going off on a tangent from this bug, but I think the exception that Vitor posted in c#4 is the next issue to be understood.

My understanding is that the logging-es pods reach out to the Master API (kubernetes.default.svc by default) to ask about other logging-es pods they can cluster with. Like this:
https://github.com/fabric8io/elasticsearch-cloud-kubernetes/blob/2.4.x/src/main/java/io/fabric8/elasticsearch/discovery/kubernetes/KubernetesUnicastHostsProvider.java#L190

Relevant portion of the stack trace:

[2019-02-06 19:57:30,681][WARN ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [logging-es-data-master-35ny545c] Exception caught during discovery: An error has occurred.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:57)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:385)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:57)
	at io.fabric8.elasticsearch.cloud.kubernetes.KubernetesAPIServiceImpl.pods(KubernetesAPIServiceImpl.java:47)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.readNodes(KubernetesUnicastHostsProvider.java:190)
	...

The cause of the stack trace above:

Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
	...

My understanding is that this means the truststore Elasticsearch uses does *not* contain their kubernetes.default.svc cert or CA cert.

Questions:
- What truststore does Elasticsearch use when speaking with the Master API (kubernetes.default.svc)?
- Is Elasticsearch indeed using kubernetes.default.svc, or could it be using a public URL?

Lastly, I'm clearing Miheer's needinfo but I'll sync with him on chat to make sure he's aware.
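One way to sanity-check those questions from inside an ES pod is to query kubernetes.default.svc with the serviceaccount token and CA that kubernetes mounts at the standard path; this exercises the same trust path the discovery plugin needs. A sketch, assuming curl is available in the elasticsearch container and $es_pod holds one of the ES pod names:

  oc exec -n logging -c elasticsearch "$es_pod" -- bash -c '
    SA=/var/run/secrets/kubernetes.io/serviceaccount
    # List pods in this namespace the same way the discovery plugin would
    curl -s --cacert $SA/ca.crt \
         -H "Authorization: Bearer $(cat $SA/token)" \
         "https://kubernetes.default.svc/api/v1/namespaces/$(cat $SA/namespace)/pods" | head -n 20
  '

If this succeeds, the serviceaccount CA does trust the API server cert and the problem is more likely in how the Java client builds its truststore; if it fails with a certificate error, the API server cert is not signed by the mounted CA.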
Another finding while researching the customer case linked to this bug:

I believe the KUBERNETES_TRUST_CERT environment variable needs to be changed to KUBERNETES_TRUST_CERTIFICATES.

Here's where it is set in the Ansible template:
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_logging_elasticsearch/templates/es.j2#L89-L91

Here's where KUBERNETES_TRUST_CERTIFICATES is used in the kubernetes-client code:
https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L41
(NOTE: The system property mentioned above is converted from kubernetes.trust.certificates to KUBERNETES_TRUST_CERTIFICATES per https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/Utils.java#L42-L44)

KUBERNETES_TRUST_CERTIFICATES is what determines the return value of Config.isTrustCerts(). Later on, this is used to build a TrustManager that effectively bypasses certificate verification. This SSLUtils class is used by the clients in the same project and by the elasticsearch-cloud-kubernetes project:
https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/internal/SSLUtils.java#L87-L104
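As a workaround on a live cluster (until the Ansible fix is in place), the corrected variable could be applied directly to the ES deployment configs; a sketch assuming the default 'logging' namespace and the component=es label, and using the trailing '-' syntax of 'oc set env' to remove the misnamed variable:

  # Set the correctly named variable and drop the misnamed one on each ES DC
  for dc in $(oc get dc -n logging -l component=es -o name); do
    oc set env -n logging "$dc" KUBERNETES_TRUST_CERTIFICATES=true KUBERNETES_TRUST_CERT-
  done

Note that, per the kubernetes-client code linked above, trusting certificates this way bypasses certificate verification for the Master API connection, so the proper fix is the template change in openshift-ansible.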
Hi,

The pod started after changing the variable KUBERNETES_TRUST_CERT to KUBERNETES_TRUST_CERTIFICATES, but it seems something else still needs to be done for the cluster to reach a green state. This is good progress. Can we do a remote session with our on-site consultant to finish reviewing and get the cluster to green?

We are waiting to follow up on the test in a remote session at the customer's office.

Best regards,
Elvis Dominguez
Submitted PR against release-3.11 branch https://github.com/openshift/openshift-ansible/pull/11207
The fix isn't in openshift-ansible-3.9.71. Wait for another build.
Verified in ose-ansible/images/v3.9.73-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0619