Bug 1659848

Summary: Fluentd, Curator and Kibana are unable to connect to Elasticsearch.
Product: OpenShift Container Platform Reporter: Qiaoling Tang <qitang>
Component: LoggingAssignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA QA Contact: Anping Li <anli>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.0CC: aos-bugs, rmeggins, wsun
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:41:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Qiaoling Tang 2018-12-17 02:24:11 UTC
Description of problem:
Fluentd, Curator and Kibana are unable to connect to Elasticsearch, no other indices in ES pod except .searchguard, curator pod start failed.

$ oc exec elasticsearch-clientdatamaster-0-1-84d764899d-wl4l5 -- indices
Mon Dec 17 02:06:20 UTC 2018
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .searchguard rLx5fPSdTXCfIcX0vSQHrQ   1   0          5            0          0              0

$ oc logs curator-1545012660-cbrhd
Was not able to connect to Elasticearch at elasticsearch:9200 within 60 attempts

Fluentd pod logs:
2018-12-17 01:41:51 +0000 [warn]: [elasticsearch-apps] failed to flush the buffer. retry_time=8 next_retry_seconds=2018-12-17 01:43:53 +0000 chunk="57d2dd72fde8b7ccb8167bdbc2835b05" error_class=Faraday::SSLError error="SSL_connect returned=1 errno=0 state=error: certificate verify failed (unable to get local issuer certificate) (OpenSSL::SSL::SSLError) Unable to verify certificate. This may be an issue with the remote host or with Excon. Excon has certificates bundled, but these can be customized:\n\n            `Excon.defaults[:ssl_ca_path] = path_to_certs`\n            `ENV['SSL_CERT_DIR'] = path_to_certs`\n            `Excon.defaults[:ssl_ca_file] = path_to_file`\n            `ENV['SSL_CERT_FILE'] = path_to_file`\n            `Excon.defaults[:ssl_verify_callback] = callback`\n                (see OpenSSL::SSL::SSLContext#verify_callback)\nor:\n            `Excon.defaults[:ssl_verify_peer] = false` (less secure).\n"
  2018-12-17 01:41:51 +0000 [warn]: suppressed same stacktrace

Kibana logs:
$ oc logs -c kibana kibana-675b587dfd-prf28
{"type":"log","@timestamp":"2018-12-17T01:39:11Z","tags":["warning","elasticsearch","admin"],"pid":222,"message":"Unable to revive connection: https://elasticsearch:9200/"}
{"type":"log","@timestamp":"2018-12-17T01:39:11Z","tags":["warning","elasticsearch","admin"],"pid":222,"message":"No living connections"}
{"type":"log","@timestamp":"2018-12-17T01:39:11Z","tags":["status","plugin:elasticsearch.13","error"],"pid":222,"state":"red","message":"Status changed from red to red - Unable to connect to Elasticsearch at https://elasticsearch:9200.","prevState":"red","prevMsg":"Request Timeout after 3000ms"}
{"type":"log","@timestamp":"2018-12-17T01:39:14Z","tags":["warning","elasticsearch","admin"],"pid":222,"message":"Unable to revive connection: https://elasticsearch:9200/"}
{"type":"log","@timestamp":"2018-12-17T01:39:14Z","tags":["warning","elasticsearch","admin"],"pid":222,"message":"No living connections"}
{"type":"response","@timestamp":"2018-12-17T01:39:15Z","tags":[],"pid":222,"method":"head","statusCode":200,"req":{"url":"/","method":"head","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":200,"responseTime":6,"contentLength":9},"message":"HEAD / 200 6ms - 9.0B"}

ES pod logs:
[2018-12-17T01:57:49,792][ERROR][c.f.s.h.SearchGuardHttpServerTransport] [elasticsearch-clientdatamaster-0-1] SSL Problem General SSLEngine problem
javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1521) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:528) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:802) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:766) ~[?:?]
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) ~[?:1.8.0_191]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:255) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1162) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1709) ~[?:?]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:318) ~[?:?]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310) ~[?:?]
	at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1983) ~[?:?]
	at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:232) ~[?:?]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:970) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:967) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1459) ~[?:?]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1301) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1214) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	... 18 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397) ~[?:?]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302) ~[?:?]
	at sun.security.validator.Validator.validate(Validator.java:262) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:279) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(X509TrustManagerImpl.java:130) ~[?:?]
	at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1970) ~[?:?]
	at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:232) ~[?:?]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:970) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:967) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1459) ~[?:?]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1301) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1214) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	... 18 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141) ~[?:?]
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126) ~[?:?]
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280) ~[?:1.8.0_191]
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392) ~[?:?]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302) ~[?:?]
	at sun.security.validator.Validator.validate(Validator.java:262) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:279) ~[?:?]
	at sun.security.ssl.X509TrustManagerImpl.checkClientTrusted(X509TrustManagerImpl.java:130) ~[?:?]
	at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1970) ~[?:?]
	at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:232) ~[?:?]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:970) ~[?:?]
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:967) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1459) ~[?:?]
	at io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1301) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1214) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
	... 18 more


Version-Release number of selected component (if applicable):
# ./bin/openshift-install version
./bin/openshift-install v0.7.0-master-4-ga4e426ee762c20019bbb90fe35d33c9b26d23393


How reproducible:
Always

Steps to Reproduce:
1. Deploy logging EFK stack according to https://github.com/openshift/cluster-logging-operator#full-deploy
2.
3.

Actual results:


Expected results:


Additional info:
The test env is deployed by NextGen installer.

$ oc exec -c kibana kibana-675b587dfd-prf28 -- curl -s -k --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key https://elasticsearch:9200 -v
* About to connect() to elasticsearch port 9200 (#0)
*   Trying 172.30.63.185...
* Connected to elasticsearch (172.30.63.185) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate from file
* 	subject: CN=system.logging.kibana,OU=OpenShift,O=Logging
* 	start date: Dec 17 01:37:40 2018 GMT
* 	expire date: Dec 16 01:37:40 2020 GMT
* 	common name: system.logging.kibana
* 	issuer: CN=openshift-cluster-logging-signer
* NSS error -12224 (SSL_ERROR_CERTIFICATE_UNKNOWN_ALERT)
* SSL peer had some unspecified issue with the certificate it received.
* Closing connection 0
command terminated with exit code 35

Comment 1 Jeff Cantrill 2018-12-19 20:31:07 UTC
You may have been in between commits.  Deployed a cluster today followed by logging using:

Installer: $ ./bin/openshift-install version
./bin/openshift-install v0.7.0-master
cluster-logging-operator: HEAD - 741444c81037b9a4ce26011ef6361dd9b28a9161
using the latest logging images as of 2018-dec-19

Comment 2 Qiaoling Tang 2018-12-20 02:53:36 UTC
Tested with:
# ./bin/openshift-install version
./bin/openshift-install v0.7.0-master-50-gded0735f92077f85de3d6a4dbc69b471b6683c3d
$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.alpha-2018-12-20-000118   True        False         34m       Cluster version is 4.0.0-0.alpha-2018-12-20-000118

The logging EFK works.

Comment 5 errata-xmlrpc 2019-06-04 10:41:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758