Bug 1543684 - Elasticsearch logs show SSLHandshakeException: null cert chain
Summary: Elasticsearch logs show SSLHandshakeException: null cert chain
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.9.z
Assignee: Lukas Vlcek
QA Contact: Anping Li
URL:
Whiteboard: aos-scalability-39
Duplicates: 1551509
Depends On:
Blocks:
 
Reported: 2018-02-09 01:15 UTC by Mike Fiedler
Modified: 2023-09-15 00:06 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-20 18:50:45 UTC
Target Upstream Version:
Embargoed:


Attachments
-Djavax.net.debug=all (2.70 MB, application/x-gzip)
2018-02-09 01:40 UTC, Mike Fiedler

Description Mike Fiedler 2018-02-09 01:15:14 UTC
Description of problem:

This is an offshoot of bug 1541429.   While investigating that issue, the following exception was seen in the ES pod's /elasticsearch/logging-es/logs/logging-es.log.  It pops every minute:

[2018-02-08 00:00:24,544][ERROR][com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] [logging-es-data-master-uv8mlpse] SSL Problem null cert chain
javax.net.ssl.SSLHandshakeException: null cert chain
        at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1529)
        at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
        at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
        at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
        at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
        at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1219)
        at org.jboss.netty.handler.ssl.SslHandler.decode(SslHandler.java:852)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: null cert chain
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1666)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:330)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:318)
        at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1935)
        at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:237)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1052)
        at sun.security.ssl.Handshaker$1.run(Handshaker.java:992)
        at sun.security.ssl.Handshaker$1.run(Handshaker.java:989)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1467)
        at org.jboss.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1393)
        at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1256)
        ... 18 more



Version-Release number of selected component (if applicable): Logging 3.9.0-0.36.0


How reproducible: Always


Steps to Reproduce:
1.  Deploy logging 3.9.0-0.36.0 (see bug 1541429 for configuration and inventory)
2.  oc rsh -c elasticsearch $ES_POD
3.  Check the contents of /elasticsearch/logging-es/logs/logging-es.log
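
For example, a minimal check (a sketch; assumes the default "logging" namespace and the component=es pod label used by the 3.9 installer):

ES_POD=$(oc get pods -n logging -l component=es -o jsonpath='{.items[0].metadata.name}')
oc exec -n logging -c elasticsearch $ES_POD -- \
  grep -c 'SSLHandshakeException: null cert chain' /elasticsearch/logging-es/logs/logging-es.log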


Additional info:

Will collect logs with the following dc ENV setting and attach here:

name: ES_JAVA_OPTS
value: -Djavax.net.debug=all
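
One way to set that (a sketch; the dc name comes from the log lines above and will differ per deployment):

oc set env -n logging dc/logging-es-data-master-uv8mlpse ES_JAVA_OPTS='-Djavax.net.debug=all'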

Comment 1 Mike Fiedler 2018-02-09 01:40:59 UTC
Created attachment 1393483 [details]
-Djavax.net.debug=all

Captured about 2 minutes after pod start - it looks like the log already wrapped. Let me know if you need a capture closer to pod start.

Comment 2 Mike Fiedler 2018-02-09 19:00:32 UTC
Verified this error does not occur on 3.7 GA (3.7.9 logging images from registry.access.redhat.com)

Comment 3 Mike Fiedler 2018-02-09 19:15:09 UTC
3.9 is registry.reg-aws.openshift.com:443/openshift3/logging-fluentd                                                                   v3.9.0-0.38.0       9b4c38e83328        3 days ago          275.6 MB

3.7 is registry.access.redhat.com/openshift3/logging-fluentd   v3.7.9-21           90b8b015ec6d        2 months ago        235 MB
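
Both listings are docker images output from a node; to repeat the check:

docker images | grep openshift3/logging-fluentd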

Comment 6 Mike Fiedler 2018-02-14 16:52:30 UTC
In the scalability lab with 200 compute nodes and a 3-node ES cluster, I also see the null cert chain message at 60-second intervals. It does not appear to be correlated with the number of fluentd nodes.

I scaled kibana and curator to 0 and the messages still pop every 60 seconds.
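
The scale-down was along these lines (assuming the default 3.9 dc names; adjust to your deployment):

oc scale -n logging dc/logging-kibana --replicas=0
oc scale -n logging dc/logging-curator --replicas=0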

Comment 7 Jeff Cantrill 2018-02-14 20:01:24 UTC
The issue is related to the readiness probe being called periodically by the cluster. I changed the user-agent and added the thread name to the trace-level log output:

[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Handling Request... /
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request headers: [User-Agent=curl/readiness-probe, Host=localhost:9200, Accept=*/*]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request context: [_sg_ssl_principal=>CN=system.admin,OU=OpenShift,O=Logging, _sg_ssl_peer_certificates=>[Ljava.security.cert.X509Certificate;@ac38d0e, _sg_ssl_cipher=>TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, _sg_ssl_protocol=>TLSv1.2]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Evaluating request for user '' with a empty token
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Cache has user: false
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Returing EMPTY request context; either was provided client cert or empty token.
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.KibanaUserReindexFilter] user: ''/ requested index: ''/ kibana index: ''
[2018-02-14 19:52:54,967][elasticsearch[logging-es-data-master-uv8mlpse][[http_server_worker.default]][T#4]{New I/O worker #13}][ERROR][com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] [logging-es-data-master-uv8mlpse] SSL Problem null cert chain
javax.net.ssl.SSLHandshakeException: null cert chain
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1529)

Comment 8 Jeff Cantrill 2018-02-15 21:55:44 UTC
Ignore comment 7; I completely failed to notice that the failure occurs on a different thread.

My investigation has led me to believe that some process, either in this particular environment or in the OpenShift cluster logic, is hitting the ES endpoint on port 9200 and causing this error: the connecting client presents empty certificates, as shown in the attached logs, which causes ES to disconnect and log the error.

During investigation, I removed or changed a number of suspects in the deployment:
* readiness probe
* seeding function that gets backgrounded from run.sh
* removed all services
* scaled down all other logging pods
* disabled openshift-elasticsearch-plugin
* modified the listening port
* set the ES node count to 1

The error message still appears with all of these changes, which leave essentially a plain ES deployment with SSL enabled and the Kubernetes discovery type.

When I modify the elasticsearch config [1] to bind to the loopback interface by setting network.host: _local_, the error goes away; there are zero occurrences of the message. If I return the value to 0.0.0.0, the error message returns.

My conclusion is that if this were caused by some process internal to ES conflicting with the SSL plugin or the discovery plugin, I should still see the error messages after binding to loopback, and I do not.


[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-network.html
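
For reference, the toggle described above as an elasticsearch.yml excerpt (only the values tested in this comment; the rest of the config is unchanged):

network.host: _local_      # bind to loopback only - no error messages
# network.host: 0.0.0.0    # bind to all interfaces - the error returns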

Comment 9 Jeff Cantrill 2018-02-15 21:56:58 UTC
@Seth, is this something someone from your team could comment on, based on comment 8?

Comment 10 Jeff Cantrill 2018-02-28 21:10:57 UTC
Reevaluating this issue, we have determined that Prometheus is the cause, though I don't know specifically what it is doing. When the Prometheus server is scaled down, the error goes away.

Comment 11 Jeff Cantrill 2018-03-05 14:02:23 UTC
*** Bug 1551509 has been marked as a duplicate of this bug. ***

Comment 12 Jeff Cantrill 2018-03-12 19:44:20 UTC
The workaround is to disable scraping of the endpoint by setting the annotation on the logging-es-prometheus service to false:

prometheus.io/scrape=false

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/tasks/main.yaml#L374
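
As a one-liner (assuming the default "logging" namespace):

oc annotate -n logging service logging-es-prometheus prometheus.io/scrape=false --overwrite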

Comment 13 Mike Fiedler 2018-05-06 11:28:50 UTC
This problem no longer occurs in OCP 3.10.

Comment 14 Lukas Vlcek 2018-08-14 13:37:50 UTC
Perhaps in 3.10 it was fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1572541

Shall we retest for 3.9 and backport if applicable?

Comment 16 stbain 2018-09-20 18:25:36 UTC
Any progress backporting and/or verifying the fix from #1572541? We have consultants on site encountering this issue, but disabling the Prometheus scrape didn't seem to help.

Comment 17 Stephen Cuppett 2019-11-20 18:50:45 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

Comment 19 Red Hat Bugzilla 2023-09-15 00:06:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

