Bug 1543684
| Summary: | Elasticsearch logs show SSLHandshakeException: null cert chain | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Logging | Assignee: | Lukas Vlcek <lvlcek> |
| Status: | CLOSED DEFERRED | QA Contact: | Anping Li <anli> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | CC: | anli, aos-bugs, jeder, jpullen, lvlcek, mifiedle, pportant, rmeggins, wsun |
| Version: | 3.9.0 | Keywords: | Regression |
| Target Milestone: | --- | Target Release: | 3.9.z |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | aos-scalability-39 | Type: | Bug |
| Last Closed: | 2019-11-20 18:50:45 UTC | | |
Description (Mike Fiedler, 2018-02-09 01:15:14 UTC)
Created attachment 1393483 [details]: -Djavax.net.debug=all

Captured about 2 minutes after pod start; it looks like the log had already wrapped. Let me know if you need a capture closer to pod start.
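For reference, JSSE debug output like the attached capture is enabled by passing `-Djavax.net.debug=all` to the JVM. How that flag is wired into the logging image is an assumption here, but with a stock Elasticsearch 2.x start script it would look like:

```shell
# Hedged sketch: ES_JAVA_OPTS is honored by Elasticsearch's standard start
# scripts; whether the OpenShift logging image passes it through unchanged
# is an assumption on my part.
ES_JAVA_OPTS="-Djavax.net.debug=all" ./bin/elasticsearch
```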
Verified this error does not occur on 3.7 GA (3.7.9 logging images from registry.access.redhat.com):

```
3.9: registry.reg-aws.openshift.com:443/openshift3/logging-fluentd  v3.9.0-0.38.0  9b4c38e83328  3 days ago    275.6 MB
3.7: registry.access.redhat.com/openshift3/logging-fluentd          v3.7.9-21      90b8b015ec6d  2 months ago  235 MB
```

In the scalability lab (200 compute nodes, 3-node ES cluster), I also see the null cert chain message at 60-second intervals. It does not appear to be correlated with the number of fluentd nodes: I scaled kibana and curator to 0 and the messages still appear every 60 seconds.

The issue is related to the readiness probe being called periodically by the cluster. I changed the user agent and added the thread name to the log output at trace level:
```
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Handling Request... /
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request headers: [User-Agent=curl/readiness-probe, Host=localhost:9200, Accept=*/*]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][TRACE][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Request context: [_sg_ssl_principal=>CN=system.admin,OU=OpenShift,O=Logging, _sg_ssl_peer_certificates=>[Ljava.security.cert.X509Certificate;@ac38d0e, _sg_ssl_cipher=>TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, _sg_ssl_protocol=>TLSv1.2]
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Evaluating request for user '' with a empty token
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Cache has user: false
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.OpenshiftRequestContextFactory] Returing EMPTY request context; either was provided client cert or empty token.
[2018-02-14 19:52:53,260][elasticsearch[logging-es-data-master-uv8mlpse][http_server_worker][T#7]{New I/O worker #25}][DEBUG][io.fabric8.elasticsearch.plugin.KibanaUserReindexFilter] user: ''/ requested index: ''/ kibana index: ''
[2018-02-14 19:52:54,967][elasticsearch[logging-es-data-master-uv8mlpse][[http_server_worker.default]][T#4]{New I/O worker #13}][ERROR][com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] [logging-es-data-master-uv8mlpse] SSL Problem null cert chain
javax.net.ssl.SSLHandshakeException: null cert chain
        at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1529)
```
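The exception above is what a JSSE server raises when it requires a client certificate and the peer presents none. A minimal sketch of that failure mode, using Python's `ssl` module and a throwaway self-signed certificate (nothing here is taken from the actual deployment, and it assumes the `openssl` CLI is available):

```python
import os
import socket
import ssl
import subprocess
import tempfile
import threading

def make_self_signed(tmpdir):
    # Assumption: the openssl CLI is installed; generates a throwaway cert/key.
    cert = os.path.join(tmpdir, "cert.pem")
    key = os.path.join(tmpdir, "key.pem")
    subprocess.run(
        ["openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
         "-keyout", key, "-out", cert, "-days", "1", "-subj", "/CN=localhost"],
        check=True, capture_output=True)
    return cert, key

def handshake_fails_without_client_cert():
    """True if a server that requires client certs rejects a certless client."""
    with tempfile.TemporaryDirectory() as tmp:
        cert, key = make_self_signed(tmp)

        srv_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        srv_ctx.load_cert_chain(cert, key)
        srv_ctx.verify_mode = ssl.CERT_REQUIRED  # mutual TLS: client cert mandatory

        listener = socket.socket()
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(10)
        port = listener.getsockname()[1]
        server_errors = []

        def serve():
            conn, _ = listener.accept()
            try:
                # The handshake runs inside wrap_socket and aborts because
                # the client never sends a certificate.
                with srv_ctx.wrap_socket(conn, server_side=True):
                    pass
            except ssl.SSLError as exc:
                server_errors.append(exc)
            finally:
                conn.close()

        t = threading.Thread(target=serve)
        t.start()

        cli_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        cli_ctx.check_hostname = False
        cli_ctx.verify_mode = ssl.CERT_NONE  # trust the server, send NO client cert
        try:
            with cli_ctx.wrap_socket(socket.create_connection(
                    ("127.0.0.1", port), timeout=10)) as s:
                s.recv(1)  # force the handshake/alert exchange to finish
        except (ssl.SSLError, OSError):
            pass  # the client sees the aborted handshake as an error too

        t.join()
        listener.close()
        return bool(server_errors)

if __name__ == "__main__":
    print(handshake_fails_without_client_cert())
```

On the server side the handshake aborts with an `ssl.SSLError`, the analogue of the `SSLHandshakeException: null cert chain` that SearchGuard logs here.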
Ignore c#7; I completely missed that the failure occurs on a different thread.

My investigation leads me to believe that some process in this particular environment, or in the OpenShift cluster logic, is hitting the ES endpoint on 9200 and causing this error: the connecting client returns empty certificates, as shown in the attached logs, which causes ES to disconnect and log the error. During investigation I removed or changed a number of suspects in the deployment:

* the readiness probe
* the seeding function that gets backgrounded from run.sh
* removed all services
* scaled down all other logging pods
* disabled the openshift-elasticsearch-plugin
* modified the listening port
* set the ES node count to 1

The error message persists through all of these changes, which amount to an essentially plain ES deployment with SSL enabled and the Kubernetes discovery type. When I modify the elasticsearch config [1] to bind to the loopback interface by setting network.host: _local_, the error goes away: there are zero occurrences of the message. If I return the value to 0.0.0.0, the error message returns. My conclusion: if this were some process internal to ES conflicting with the SSL plugin or the discovery plugin, I should still see the error messages, and I do not.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-network.html

@Seth, is this something someone from your team could comment on, based on c#8?

Reevaluating this issue, we have determined that Prometheus is the cause, though I don't know specifically why. When the Prometheus server is scaled down, the error goes away.

*** Bug 1551509 has been marked as a duplicate of this bug. ***
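For reference, the binding change described above is a single setting in elasticsearch.yml (Elasticsearch 2.4 syntax, per the linked reference); the two values mirror the configurations tested:

```yaml
# Bind only to the loopback interface: external clients cannot reach
# port 9200 and the "null cert chain" messages stop.
network.host: _local_

# Binding to all interfaces (the shipped configuration) exposes 9200
# to external clients and the messages return:
# network.host: 0.0.0.0
```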
The workaround is to disable scraping of the endpoint by setting the annotation on the logging-es-prometheus service to false: prometheus.io/scrape=false

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_elasticsearch/tasks/main.yaml#L374

This problem no longer occurs in OCP 3.10. Perhaps it was fixed there by https://bugzilla.redhat.com/show_bug.cgi?id=1572541. Shall we retest for 3.9 and backport if applicable?

Any progress backporting and/or verifying the fix from bug 1572541? We have consultants on site encountering this issue, but disabling the Prometheus scrape did not seem to help.

OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception, or have reproduced this on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version.

[1]: https://access.redhat.com/support/policy/updates/openshift

The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.