Bug 1672847 - While deploying logging in openshift cluster we are getting error while elasticsearch is initiating
Summary: While deploying logging in openshift cluster we are getting error while elasticsearch is initiating
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard:
Depends On:
Blocks: 1678441 1678442
 
Reported: 2019-02-06 03:05 UTC by Miheer Salunke
Modified: 2022-03-13 16:56 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The ENV var was misspelled. Consequence: Certificate evaluation issue. Fix: Correct the name of the ENV var so it is honored by the dependent library. Result:
Clone Of:
: 1678441 1678442 (view as bug list)
Environment:
Last Closed: 2019-04-09 14:20:17 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible pull 11207 0 'None' closed Bug 1672847. Corrected KUBERNETES_TRUST_CERTIFICATES reference. 2021-01-26 08:01:50 UTC
Github openshift openshift-ansible pull 11217 0 'None' closed [release-3.9] Fix https://bugzilla.redhat.com/show_bug.cgi?id=1678442 2021-01-26 08:01:50 UTC
Red Hat Knowledge Base (Solution) 3921041 0 Troubleshoot None OpenShift Logging ElasticSearch cluster not forming due to: PKIX path building failed: sun.security.provider.certpath.Su... 2019-02-18 23:57:22 UTC
Red Hat Product Errata RHBA-2019:0619 0 None None None 2019-04-09 14:20:25 UTC

Description Miheer Salunke 2019-02-06 03:05:22 UTC
Description of problem:

While deploying logging in an OpenShift cluster, we are getting an error while Elasticsearch is initializing. Info is attached.

[2019-02-05 07:13:43,200][INFO ][container.run            ] Begin Elasticsearch startup script
~/config ~
~
[2019-02-05 07:13:43,333][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2019-02-05 07:13:43,335][INFO ][container.run            ] Inspecting the maximum RAM available...
[2019-02-05 07:13:43,339][INFO ][container.run            ] ES_HEAP_SIZE: '2048m'
[2019-02-05 07:13:43,349][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2019-02-05 07:13:43,353][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
[2019-02-05 07:19:32,580][ERROR][container.run            ] Timed out waiting for Elasticsearch to be ready
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=UTF-8
Content-Length: 0


Version-Release number of selected component (if applicable):
OCP 3.9

How reproducible:
Always on customer side

Steps to Reproduce:
1.Deploy logging using the inventory attached
2.
3.

Actual results:
ES fails with 503

Expected results:
ES should not fail to initialize

Additional info:
Details added in next private comment

Comment 3 Jeff Cantrill 2019-02-06 16:07:32 UTC
(In reply to Miheer Salunke from comment #1)
> From events also we don't get much clue.
> logging-es-data-master-x8jvzxet-6-69pk9.158066640ae5a1b7     Pod            
> Warning   FailedAttachVolume      attachdetach-controller                   
> Multi-Attach error for volume "pvc-e91be814-235f-11e9-b00c-000c29699361"
> Volume is already exclusively attached to one node and can't be attached to
> another

Is this an AWS-backed OpenShift cluster? The problem could be that the data volume from the previous deployment is not being released so it can be attached to the new pod.  We have seen issues in the dedicated clusters on AWS where it can take upwards of 20+ minutes for AWS to release the volume.  This causes additional problems because the deploymentconfig rollback timeout is only set to 5 minutes, so the pod is rolled back before the volume can be attached, leaving the pods in various states.  I recommend editing the deployment config to increase the rollback timeout to 1800 seconds and then rolling out latest [1]; a sketch of the commands follows the link below.

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/docs/troubleshooting.md#elasticsearch-deployments-never-succeed-and-rollback-to-previous-version
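
A minimal sketch of that change, assuming the logging namespace, the Recreate strategy (which the logging ES DCs normally use), and an illustrative DC name; substitute your own:

  # List the Elasticsearch deployment configs
  oc get dc -n logging -l component=es

  # Raise the recreate-strategy timeout to 1800 seconds on one DC
  oc patch dc/logging-es-data-master-x8jvzxet -n logging \
    -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1800}}}}'

  # Roll out the latest version of that DC
  oc rollout latest dc/logging-es-data-master-x8jvzxet -n logging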

Comment 4 Vítor Corrêa 2019-02-06 20:13:42 UTC
This is an on-premise OpenShift cluster. I changed the timeout to 1800 and rolled out all of the ES deployments, but the problem persists with the same error.
[2019-02-06 19:54:19,509][INFO ][container.run            ] Begin Elasticsearch startup script
~/config ~
~
[2019-02-06 19:54:19,643][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2019-02-06 19:54:19,645][INFO ][container.run            ] Inspecting the maximum RAM available...
[2019-02-06 19:54:19,650][INFO ][container.run            ] ES_HEAP_SIZE: '2048m'
[2019-02-06 19:54:19,657][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2019-02-06 19:54:19,660][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
[2019-02-06 20:00:10,647][ERROR][container.run            ] Timed out waiting for Elasticsearch to be ready
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=UTF-8
Content-Length: 0


There is also a log in /elasticsearch/persistent/logging-es/logs/logging-es.log (not shown on the console as stdout) that contains the following. I think it will be important input for the engineers:

[2019-02-06 19:57:29,180][DEBUG][netty.handler.ssl.SslHandler] [id: 0xa0cfa7ae, /0:0:0:0:0:0:0:1:35350 => /0:0:0:0:0:0:0:1:9200] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[2019-02-06 19:57:29,187][DEBUG][netty.handler.ssl.SslHandler] [id: 0xfccc1473, /0:0:0:0:0:0:0:1:44858 => /0:0:0:0:0:0:0:1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,187][DEBUG][netty.handler.ssl.SslHandler] [id: 0x75b71563, /0:0:0:0:0:0:0:1:44858 => /0:0:0:0:0:0:0:1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,196][DEBUG][netty.handler.ssl.SslHandler] [id: 0x13a07f96, /127.0.0.1:55122 => /127.0.0.1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:29,196][DEBUG][netty.handler.ssl.SslHandler] [id: 0x44ddbf7e, /127.0.0.1:55122 => /127.0.0.1:9300] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[2019-02-06 19:57:30,340][DEBUG][netty.handler.ssl.SslHandler] [id: 0xea867f82, /0:0:0:0:0:0:0:1:35392 => /0:0:0:0:0:0:0:1:9200] HANDSHAKEN: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[2019-02-06 19:57:30,681][WARN ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [logging-es-data-master-35ny545c] Exception caught during discovery: An error has occurred.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:57)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:385)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:57)
	at io.fabric8.elasticsearch.cloud.kubernetes.KubernetesAPIServiceImpl.pods(KubernetesAPIServiceImpl.java:47)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.readNodes(KubernetesUnicastHostsProvider.java:190)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.lambda$buildDynamicNodes$0(KubernetesUnicastHostsProvider.java:92)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.buildDynamicNodes(KubernetesUnicastHostsProvider.java:91)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:335)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2.doRun(UnicastZenPing.java:249)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
	at com.squareup.okhttp.internal.io.RealConnection.connectTls(RealConnection.java:188)
	at com.squareup.okhttp.internal.io.RealConnection.connectSocket(RealConnection.java:145)
	at com.squareup.okhttp.internal.io.RealConnection.connect(RealConnection.java:108)
	at com.squareup.okhttp.internal.http.StreamAllocation.findConnection(StreamAllocation.java:184)
	at com.squareup.okhttp.internal.http.StreamAllocation.findHealthyConnection(StreamAllocation.java:126)
	at com.squareup.okhttp.internal.http.StreamAllocation.newStream(StreamAllocation.java:95)
	at com.squareup.okhttp.internal.http.HttpEngine.connect(HttpEngine.java:281)
	at com.squareup.okhttp.internal.http.HttpEngine.sendRequest(HttpEngine.java:224)
	at com.squareup.okhttp.Call.getResponse(Call.java:286)
	at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:243)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$3.intercept(HttpClientUtils.java:110)
	at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:232)
	at com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:205)
	at com.squareup.okhttp.Call.execute(Call.java:80)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:210)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:383)
	... 12 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
	at sun.security.validator.Validator.validate(Validator.java:262)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
	... 35 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)

Comment 5 Jeff Cantrill 2019-02-06 21:41:31 UTC
From Eric W:

"
i think its k8s related, we don't generate that cert

that would be something for them to test... can they curl ES using certs from one of the other component pods?
if they cant, then ansible would need to regen the certs... if they can then i'd wager its k8s meaning we need to restart the pod so ES picks up the latest cert there
"
You can use [1] to check connectivity from the Kibana certs to ES, which may confirm what Eric suggested above.  Depending on the outcome, you may need to re-run ansible to regenerate your certs; I think there may be a playbook to help here.  If you need to cycle the pods, for each ES deployment config, scale it to zero and then back to one (see the sketch after the link below).  Assuming this works, you will probably need to do additional work to get the Elasticsearch cluster back to a green state.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity
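
A minimal sketch of cycling the pods as suggested, assuming the logging namespace and the component=es label on the ES deployment configs (both illustrative):

  # Scale each ES deployment config to zero, give the old pod time to terminate, then scale back to one
  for dc in $(oc get dc -n logging -l component=es -o name); do
    oc scale "$dc" -n logging --replicas=0
    sleep 60   # adjust as needed, or watch 'oc get pods -n logging' until the old pod is gone
    oc scale "$dc" -n logging --replicas=1
  done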

Comment 11 elvisdom 2019-02-08 18:15:02 UTC
Hi Team

Actually, our Red Hat Services team is performing an OpenShift installation for a strategic client in Peru. Three weeks ago we opened a ticket (02295613) for an EFK issue, but the issue persists. The client is expecting us to finish the cluster implementation this week to achieve high availability for their application in the production environment with 2 OCP clusters in Active/Active. The project timeline is being compromised by this problem.
Do you have any WA available, or what other alternative do you suggest, until the RCA of the issue is obtained and we can have a definitive solution?
Thanks
Best Regards

Comment 12 Jeff Cantrill 2019-02-11 14:46:17 UTC
(In reply to elvisdom from comment #11)
> Hi Team
> 

> Do you have any WA available or what other alternative do you suggest until
> the RCA of the Issue is obtained and we can have a definitive solution?

One suggestion might be to edit each Elasticsearch DC and remove the block that defines the readiness probe; a sketch follows below.  This would allow the pods to appear ready to the platform and keep Kubernetes from either rolling them back or restarting them.  It looks like you are unable to resolve the logging-es service, which may be fixed once the pods appear ready, since the endpoints would then be published (if they are not already).

When the pods start, we can then further evaluate that a cluster has formed and work to get it into a green state.
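
A minimal sketch of removing the probe, assuming the logging namespace, an illustrative DC name, and that the elasticsearch container is the first container in the pod template:

  # Drop the readinessProbe block from the first container of the ES DC
  oc patch dc/logging-es-data-master-x8jvzxet -n logging --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'

The same edit can be made by hand with 'oc edit dc/<name> -n logging' and deleting the readinessProbe: section.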

Comment 15 Jeff Cantrill 2019-02-12 14:19:00 UTC
(In reply to Miheer Salunke from comment #14)
> I commented the readiness probe part in the deploymentConfig and the pods
> started. But the logs still showing the error.

> Elasticsearch is ready on https://localhost:9200
> [2019-02-11 20:03:13,351][ERROR][container.run            ] Timed out
> waiting for Elasticsearch to be ready
> HTTP/1.1 503 Service Unavailable
> Content-Type: text/plain; charset=UTF-8
> Content-Length: 0

This seems to still be some networking issue.  
* The Elasticsearch services should have affiliated endpoints; 'oc get endpoints -o wide' would tell us if the service is mapped to the pods. 
* Additionally, you can run [1] to verify the Elasticsearch pods are able to see one another.  
* Finally, the logs were moved to a persistent file, so you may need to investigate further what is happening using 'oc exec -c elasticsearch $pod -- logs'.

The logs should clearly indicate whether the nodes are able to see each other and which node was elected master.  Once we are able to determine what is happening, it will probably be necessary to reseed the cluster: 'oc exec -c elasticsearch $pod -- es_seed_acl' (a consolidated sketch follows the link below).

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-es-cluster-connectivity
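
A consolidated sketch of those checks, assuming the logging namespace and the component=es pod label (illustrative); 'logs' and 'es_seed_acl' are the helper commands mentioned above, run inside the elasticsearch container:

  # Confirm the logging-es services have endpoints mapped to pods
  oc get endpoints -n logging -o wide

  # Pick one ES pod and read its persistent log file
  pod=$(oc get pods -n logging -l component=es -o jsonpath='{.items[0].metadata.name}')
  oc exec -n logging -c elasticsearch "$pod" -- logs

  # Once the nodes can see each other, reseed the cluster
  oc exec -n logging -c elasticsearch "$pod" -- es_seed_acl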

Comment 16 Robert Bost 2019-02-15 00:23:16 UTC
Please tell me to shut up if I'm going off on a tangent from this bug, but I think the exception that Vitor posted in c#4 is the next issue to be understood. My understanding is that the logging-es pods reach out to the Master API (kubernetes.default.svc by default) to ask about other logging-es pods they can cluster with. Like this:

 https://github.com/fabric8io/elasticsearch-cloud-kubernetes/blob/2.4.x/src/main/java/io/fabric8/elasticsearch/discovery/kubernetes/KubernetesUnicastHostsProvider.java#L190

Relevant portion of stack trace:

  [2019-02-06 19:57:30,681][WARN ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider] [logging-es-data-master-35ny545c] Exception caught during discovery: An error has occurred.
  io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:57)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:385)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:57)
	at io.fabric8.elasticsearch.cloud.kubernetes.KubernetesAPIServiceImpl.pods(KubernetesAPIServiceImpl.java:47)
	at io.fabric8.elasticsearch.discovery.kubernetes.KubernetesUnicastHostsProvider.readNodes(KubernetesUnicastHostsProvider.java:190)
        ...

The cause for the stack trace above looks like this:

  Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
        ...

My understanding is that this means the truststore Elasticsearch uses does *not* contain their kubernetes.default.svc cert or CA cert.

Questions:

- What truststore does Elasticsearch use when speaking with the Master API (kubernetes.default.svc)?
- Is Elasticsearch indeed using kubernetes.default.svc, or would it be using a public URL? (A quick check from inside an ES pod is sketched below.)


Lastly, I'm clearing Miheer's needinfo but I'll sync with him on chat to make sure he's aware.
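
A quick way to probe the second question from inside an ES pod, assuming curl is present in the image and using the standard in-pod service account CA path (pod selection as in the earlier sketch is illustrative):

  # Can the ES pod reach the master API and validate its certificate with the service account CA?
  oc exec -n logging -c elasticsearch "$pod" -- \
    curl -sv --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    https://kubernetes.default.svc/api

If this fails with a PKIX/certificate error, the CA available to the client inside the pod does not trust the master API certificate.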

Comment 17 Robert Bost 2019-02-15 19:50:27 UTC
Another finding while researching the customer case linked to this bug:

I believe the KUBERNETES_TRUST_CERT environment variable needs to be changed to KUBERNETES_TRUST_CERTIFICATES. Here's a link to where it is set in the Ansible template:

 https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_logging_elasticsearch/templates/es.j2#L89-L91

Here's where KUBERNETES_TRUST_CERTIFICATES is used from kubernetes-client code:

 https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L41
 (NOTE: The system property mentioned above, kubernetes.trust.certificates, is converted to KUBERNETES_TRUST_CERTIFICATES per https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/utils/Utils.java#L42-L44)

KUBERNETES_TRUST_CERTIFICATES is what determines the return value of Config.isTrustCerts(). Later on, this is used to build a TrustManager that effectively bypasses certificate verification. This SSLUtils class is used by the clients in the same project and by the elasticsearch-cloud-kubernetes project:

  https://github.com/fabric8io/kubernetes-client/blob/v1.3.83/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/internal/SSLUtils.java#L87-L104
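
Until an openshift-ansible fix lands, a minimal sketch of applying the rename directly to a running DC (the DC name and the "true" value are illustrative, mirroring what the template sets; -n logging assumed):

  # Remove the misspelled variable and add the one the fabric8 client actually reads
  oc set env dc/logging-es-data-master-x8jvzxet -n logging \
    KUBERNETES_TRUST_CERT- KUBERNETES_TRUST_CERTIFICATES=true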

Comment 18 elvisdom 2019-02-18 17:26:52 UTC
Hi,
The pod started after changing the variable KUBERNETES_TRUST_CERT to KUBERNETES_TRUST_CERTIFICATES, but it seems something else still needs to be done for the cluster to reach a green state.


This is good progress. Can we do a remote session with our on-site consultant to finish reviewing and get the cluster into a green state?

We are waiting to follow up on the test in a remote session at the customer's office.

Best regards,

Elvis Dominguez

Comment 20 Robert Bost 2019-02-18 17:46:53 UTC
Submitted PR against release-3.11 branch https://github.com/openshift/openshift-ansible/pull/11207

Comment 22 Qiaoling Tang 2019-02-28 08:32:07 UTC
The fix isn't in openshift-ansible-3.9.71. Wait for another build.

Comment 27 Qiaoling Tang 2019-03-19 01:54:50 UTC
Verified in ose-ansible/images/v3.9.73-1

Comment 29 errata-xmlrpc 2019-04-09 14:20:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0619

