Bug 1399388
| Summary: | Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch after start logging-fluentd pod for a while | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Takayoshi Tanaka <tatanaka> |
| Component: | Logging | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Peng Li <penli> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.3.0 | CC: | anli, aos-bugs, drettori, erich, jcantril, rmeggins, tdawson, xiazhao |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-04-12 19:17:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1427378 | | |
| Bug Blocks: | | | |
Description (Takayoshi Tanaka, 2016-11-29 00:23:05 UTC)
Add PRs:
* https://github.com/openshift/openshift-docs/pull/3333
* https://github.com/openshift/origin-aggregated-logging/pull/303

Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/e6d2eab7aa53bc7330e9857f73e98d103b90feb2

Bug 1399388 - Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch after start logging-fluentd pod for a while
https://bugzilla.redhat.com/show_bug.cgi?id=1399388

Now that fluent-plugin-elasticsearch supports setting `reload_connections false` and `reload_on_failure false`, set these so that fluentd will never attempt to reload connections to Elasticsearch.

Fixed in the 3.5.0 fluentd image.

Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1418911 and https://bugzilla.redhat.com/show_bug.cgi?id=1418912

Bug #1421563 is resolved, but this is still blocked by bug #1420219.

Tested with:
/openshift3/logging-fluentd 3.5.0 47057624ecab 4 weeks ago 233.1 MB
Under fluentd/configs.d/openshift/, the 4 conf files contain
reload_connections false
reload_on_failure false
so that fluentd will never attempt to reload connections to Elasticsearch.
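For reference, this is roughly how those two settings sit inside a fluent-plugin-elasticsearch output section. It is only a minimal sketch: the match pattern, endpoint, and port are placeholders, not the exact configuration shipped in the 3.5.0 image.

```
<match **>
  @type elasticsearch
  # placeholder endpoint and port; the deployed conf files use their own values
  host logging-es
  port 9200
  scheme https
  # do not re-query the _nodes API to rediscover cluster members; AWS
  # Elasticsearch does not expose that admin API, which is what eventually
  # exhausted the connection pool
  reload_connections false
  reload_on_failure false
</match>
```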
@rich, QE can't create an AWS Elasticsearch instance; our account doesn't have access to it. To verify this bug, can we use a standalone Elasticsearch instead?
@Takayoshi - any ideas about how to reproduce without using AWS Elasticsearch?

@Peng Li - not sure - you could try having fluentd run for 3 days under a constant load.

Will try on OpenStack for 3 days running; currently we have an OCP installation blocker, will do the test once it's fixed.

@Rich It's difficult to reproduce this issue with anything other than AWS Elasticsearch. One of the causes is quoted below, as I commented in the first description.

```
AWS Elasticsearch is a managed service. It does not support any administration related API requests. Users can not directly connect/access elastic search nodes.
sniffer.hosts tries to retrieve nodes from HTTP API endpoint _nodes. In above snippet, you can remove sniffer.hosts and provide AWS elastic search endpoint. In this case, your Ruby elastic search client will always make connection using same AWS elastic search endpoint which will solve your problem.
```

A standard Elasticsearch won't hit this issue even if it runs for a long time. One possibility for reproducing it is to set up an Elasticsearch cluster behind a proxy and make the proxy not respond to the "_nodes" request. However, I have no idea how to set up such a cluster. Again, if you can use AWS Elasticsearch, the way to reproduce the issue is:
- send logs not to a standard Elasticsearch but to AWS Elasticsearch, AND
- keep sending logs for about a day (in my case 18~24 hours is enough to reproduce)

Then, for QE, can we just confirm there are no regressions and pass with a SanityCheck?

@Rich, can you help to check whether the test below is proper? The following test was executed:
1. Install 3.5.0 Logging using ansible; check that logs can be gathered and shown on Kibana.
2. Test the ES-COPY feature. Note that there is no 'logging-fluentd-template' template in 3.5, so the steps differ slightly from the doc https://docs.openshift.com/container-platform/3.4/install_config/aggregate_logging.html. I modify the daemonset directly (# oc edit daemonset/logging-fluentd), set the ES_COPY related parameters, and verify that each item is copied (sent to the same ES twice).

Version info:

```
# docker images | grep logging
logging-curator          3.5.0   8cfcb23f26b6   12 hours ago   211.1 MB
logging-elasticsearch    3.5.0   d715f4d34ad4   2 weeks ago    399.2 MB
logging-kibana           3.5.0   e0ab09c2cbeb   5 weeks ago    342.9 MB
logging-fluentd          3.5.0   47057624ecab   5 weeks ago    233.1 MB
logging-auth-proxy       3.5.0   139f7943475e   6 weeks ago    220 MB
```

After modifying the daemonset, I delete all existing fluentd pods so that the newly created fluentd pods pick up the change.

There is an upstream test for test-es-copy.sh: https://github.com/openshift/origin-aggregated-logging/blob/master/hack/testing/test-es-copy.sh
And as part of the 3.5 CI effort, we are adding back the fluentd template because it is needed for other tests: https://github.com/openshift/origin-aggregated-logging/pull/336/files#diff-b87fae4d61b8fe601653fe74e8caa472R250

@Peng Li - are you asking if testing the ES copy feature is sufficient to do a SanityCheck on this bz? I don't know. Do you have some sort of logging regression test suite you usually run?

Regression test passed, thus setting the status to Verified; version info is mentioned in Comment 39.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884