Description of problem:
This is an offshoot of bug 1541429. While investigating that issue, the following exception was seen in the ES pod's /elasticsearch/logging-es/logs/logging-es.log. It pops every minute:

[2018-02-08 00:00:24,544][ERROR][com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] [logging-es-data-master-uv8mlpse] SSL Problem null cert chain
javax.net.ssl.SSLHandshakeException: null cert chain
    at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1529)
    at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
    at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
    at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
    at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1219)
    at org.jboss.netty.handler.ssl.SslHandler.decode(SslHandler.java:852)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1666)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:330)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:318)
    at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1935)
    at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:237)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1052)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:992)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:989)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1467)
    at org.jboss.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1393)
    at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1256)
    ... 18 more

Version-Release number of selected component (if applicable):
Logging 3.9.0-0.36.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.9.0-0.36.0 (see bug 1541429 for configuration and inventory)
2. oc rsh -c elasticsearch $ES_POD
3. Check the contents of /elasticsearch/logging-es/logs/logging-es.log

Additional info:
Will collect logs with the following dc ENV setting and attach here:
  name: ES_JAVA_OPTS
  value: -Djavax.net.debug=all
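For anyone reproducing this, a minimal sketch of the commands involved. The 'logging' project, the component=es label, and the dc name are assumptions based on a default deployment; adjust to your environment:

    # Find the ES pod and inspect its log
    oc project logging
    ES_POD=$(oc get pods -l component=es -o name | head -n 1)
    oc rsh -c elasticsearch $ES_POD tail -n 200 /elasticsearch/logging-es/logs/logging-es.log

    # One way to set the dc ENV mentioned above to capture JSSE debug output
    # (dc name here is an example taken from the log excerpt)
    oc set env dc/logging-es-data-master-uv8mlpse ES_JAVA_OPTS='-Djavax.net.debug=all'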
Created attachment 1393483 [details]
-Djavax.net.debug=all

Captured about 2 minutes after pod start - it looks like the log has already wrapped. Let me know if you need it closer to pod start.
Verified this error does not occur on 3.7 GA (3.7.9 logging images from registry.access.redhat.com)
3.9 is registry.reg-aws.openshift.com:443/openshift3/logging-fluentd   v3.9.0-0.38.0   9b4c38e83328   3 days ago     275.6 MB
3.7 is registry.access.redhat.com/openshift3/logging-fluentd           v3.7.9-21       90b8b015ec6d   2 months ago   235 MB
In the scalability lab with 200 compute nodes and a 3-node ES cluster, I also see the null cert chain message at 60-second intervals. It does not seem to be correlated with the number of fluentd nodes. I scaled kibana and curator to 0 and the messages still pop every 60 seconds.
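For reference, a sketch of how kibana and curator were scaled down to rule them out (the dc names and the 'logging' project are the defaults and may differ in your inventory):

    # Scale the other logging components to zero while watching the ES log
    oc scale dc/logging-kibana --replicas=0 -n logging
    oc scale dc/logging-curator --replicas=0 -n logging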
The issue is related to the readiness probe being periodically called by the cluster. I changed the probe's user-agent and added the thread name to the log output at trace level:

[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Handling Request... /
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request headers: [User-Agent=curl/readiness-probe, Host=localhost:9200, Accept=*/*]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request context: [_sg_ssl_principal=>CN=system.admin,OU=OpenShift,O=Logging, _sg_ssl_peer_certificates=>[Ljava.security.cert.X509Certificate;@ac38d0e, _sg_ssl_cipher=>TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, _sg_ssl_protocol=>TLSv1.2]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Evaluating request for user '' with a empty token
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Cache has user: false
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Returing EMPTY request context; either was provided client cert or empty token.
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.KibanaUserReindexFilter] user: ''/ requested index: ''/ kibana index: ''
[2018-02-14 19:52:54,967][elasticsearch[logging-es-data-master-uv8mlpse][[http_server_worker.default]][T#4]{New I/O worker #13}][ERROR][com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] [logging-es-data-master-uv8mlpse] SSL Problem null cert chain
javax.net.ssl.SSLHandshakeException: null cert chain
    at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1529)
Ignore comment 7; I did not notice that the failure occurs on a different thread.

My investigation has led me to believe that some process, either part of this particular environment or part of the OpenShift cluster logic, is hitting the ES endpoint on 9200 and causing this error: the connecting client returns empty certificates, as shown in the attached logs. This causes ES to disconnect and produces the error log.

During investigation, I removed or changed a number of suspects in the deployment:

* readiness probe
* seeding function that gets backgrounded from run.sh
* removed all services
* scaled down all other logging pods
* disabled openshift-elasticsearch-plugin
* modified the listening port
* ES node count is 1

The error message still appears with all of these changes, which leaves essentially a plain ES deployment with SSL enabled and the Kubernetes discovery type.

When I modify the elasticsearch config [1] to bind to the loopback interface by setting network.host: _local_, the error goes away - there are zero occurrences of the message. If I return the value to 0.0.0.0, the error message returns.

My conclusion: if this were some process internal to ES conflicting with the SSL plugin or the discovery plugin, I should still see the error messages, and I do not.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-network.html
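A sketch of the binding change described above, assuming the ES config is delivered through the logging-elasticsearch configmap (key elasticsearch.yml) in the 'logging' project, as in a default deployment:

    # Edit the elasticsearch.yml key in the configmap and flip the bind address,
    # then redeploy the ES dc so the pod picks up the new config.
    oc edit configmap/logging-elasticsearch -n logging
    #   network.host: _local_      <- binds loopback only; error disappears
    #   network.host: 0.0.0.0      <- binds all interfaces; error returns
    oc rollout latest dc/logging-es-data-master-uv8mlpse   # dc name is an example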
@Seth, is this something someone from your team could comment on, based on comment 8?
Reevaluating this issue, we have determined that Prometheus is the cause, though I don't know specifically what it is doing. When the Prometheus server is scaled down, the error goes away.
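For completeness, a sketch of what "scaling down the Prometheus server" looked like, assuming the default openshift_prometheus deployment (a StatefulSet named prometheus in the openshift-metrics project; adjust to your inventory):

    # Scale the Prometheus server to zero; the SSL errors should stop appearing in the ES log
    oc scale statefulset/prometheus --replicas=0 -n openshift-metrics
    # Scale it back up to confirm the errors return
    oc scale statefulset/prometheus --replicas=1 -n openshift-metrics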
*** Bug 1551509 has been marked as a duplicate of this bug. ***
The workaround is to disable scraping of the endpoint by setting the prometheus.io/scrape annotation on the logging-es-prometheus service to false:

prometheus.io/scrape=false

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/tasks/main.yaml#L374
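A sketch of the workaround as a single command (the logging-es-prometheus service name comes from the ansible role linked above; the 'logging' project is an assumption):

    # Stop Prometheus from scraping the ES endpoint by overriding the scrape annotation
    oc annotate service/logging-es-prometheus prometheus.io/scrape=false --overwrite -n logging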
This problem no longer occurs in OCP 3.10.
Perhaps in 3.10 it was fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1572541. Shall we retest for 3.9 and backport if applicable?
Any progress backporting and/or verifying the fix from #1572541? We have consultants on site encountering this issue, but disabling the Prometheus scrape didn't seem to help.
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed. [1]: https://access.redhat.com/support/policy/updates/openshift
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days