Description of problem:

When setting up our logging-fluentd to ship logs to an external AWS Elasticsearch, it can ship logs just after the pod starts, using the workaround referred to in another Bugzilla [1]. But after a while it continuously fails to ship logs; in my test environment, this happened around 1 day after starting the pod. This issue is particular to AWS Elasticsearch, and I found some discussions about similar issues [2], [3], [4]. Some people can work around it by adding parameters to fluent.conf, but neither I nor the customer behind this issue can apply that workaround.

[1] https://bugzilla.redhat.com/1395448
[2] https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get-new-connection-from-pool-error/36252/11
[3] https://github.com/uken/fluent-plugin-elasticsearch/issues/182
[4] https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254868104

Version-Release number of selected component (if applicable):
oc v3.3.1.3
kubernetes v1.3.0+52492b4
openshift v3.3.1.3
kubernetes v1.3.0+52492b4
openshift3/logging-fluentd:3.3.1

How reproducible:
Yes. The customer found this issue first, and I have reproduced it.

Steps to Reproduce:
1. Set up AWS Elasticsearch.
2. Set up the logging stack to send logs to the external Elasticsearch. [5]
3.
Edit the configmap to overwrite fluent.conf as a workaround [1]:

$ oc edit configmap logging-fluentd
data:
  fluent.conf: |
    @include configs.d/openshift/system.conf
    @include configs.d/openshift/input-pre-*.conf
    @include configs.d/dynamic/input-docker-*.conf
    @include configs.d/dynamic/input-syslog-*.conf
    @include configs.d/openshift/input-post-*.conf
    <label @INGRESS>
      @include configs.d/openshift/filter-pre-*.conf
      @include configs.d/openshift/filter-retag-journal.conf
      @include configs.d/openshift/filter-k8s-meta.conf
      @include configs.d/openshift/filter-kibana-transform.conf
      @include configs.d/openshift/filter-k8s-flatten-hash.conf
      @include configs.d/openshift/filter-k8s-record-transform.conf
      @include configs.d/openshift/filter-syslog-record-transform.conf
      @include configs.d/openshift/filter-post-*.conf
      ##
      ## matches
      @include configs.d/openshift/output-pre-*.conf
      <match journal.system** system.var.log** **_default_** **_openshift_** **_openshift-infra_**>
        @type elasticsearch_dynamic
        host "#{ENV['OPS_HOST']}"
        port "#{ENV['OPS_PORT']}"
        scheme https
        index_name .operations.${record['time'].nil? ? Time.at(time).getutc.strftime(@logstash_dateformat) : Time.parse(record['time']).getutc.strftime(@logstash_dateformat)}
        ca_file "#{ENV['OPS_CA']}"
        flush_interval 5s
        max_retry_wait 300
        disable_retry_limit
        resurrect_after 5s
        reload_connections false
      </match>
      <match **>
        @type elasticsearch_dynamic
        host "#{ENV['ES_HOST']}"
        port "#{ENV['ES_PORT']}"
        scheme https
        index_name ${record['kubernetes_namespace_name']}.${record['kubernetes_namespace_id']}.${Time.at(time).getutc.strftime(@logstash_dateformat)}
        ca_file "#{ENV['ES_CA']}"
        flush_interval 5s
        max_retry_wait 300
        disable_retry_limit
        resurrect_after 5s
        reload_connections false
      </match>
    </label>
kind: ConfigMap
<snip>

4. Delete and recreate logging-fluent-template.
5. Wait one day or so.
[5] https://docs.openshift.com/container-platform/3.3/install_config/aggregate_logging.html#sending-logs-to-an-external-elasticsearch-instance

Actual results:

Suddenly the errors below appear. Once this error occurs, every retry request fails.

2016-11-28 09:17:30 -0500 [warn]: plugin/in_tail.rb:123:block (2 levels) in expand_paths: /var/log/containers/java-1-d7asb_logging_java-1413a22e918c020ec2fc32aca0714b5cdfd1310eb2532f712b0bc88916fced16.log unreadable. It is excluded and would be examined next time.
2016-11-28 13:47:38 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:39 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/base.rb:179:in `perform_request'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/client.rb:115:in `perform_request'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-api-1.0.7/lib/elasticsearch/api/actions/bulk.rb:80:in `bulk'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:179:in `send'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:171:in `block in write'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:170:in `each'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:170:in `write'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/buffer.rb:345:in `write_chunk'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/buffer.rb:324:in `pop'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/output.rb:329:in `try_flush'
2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/output.rb:140:in `run'
2016-11-28 13:47:39 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:41 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
2016-11-28 13:47:39 -0500 [warn]: suppressed same stacktrace
2016-11-28 13:47:41 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:45 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
2016-11-28 13:47:41 -0500 [warn]: suppressed same stacktrace

Expected results:
Logs are shipped without error.

Additional info:
Add PRs: * https://github.com/openshift/openshift-docs/pull/3333 * https://github.com/openshift/origin-aggregated-logging/pull/303
https://github.com/openshift/origin-aggregated-logging/pull/309
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e6d2eab7aa53bc7330e9857f73e98d103b90feb2

Bug 1399388 - Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch a while after starting the logging-fluentd pod
https://bugzilla.redhat.com/show_bug.cgi?id=1399388

Now that fluent-plugin-elasticsearch supports setting `reload_connections false` and `reload_on_failure false`, set these so that fluentd will never attempt to reload connections to Elasticsearch.
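In fluent.conf terms, the fix amounts to adding two options to each Elasticsearch output block. A minimal sketch (the surrounding match block mirrors the workaround config earlier in this bug; env-var names are from that config):

```
<match **>
  @type elasticsearch_dynamic
  host "#{ENV['ES_HOST']}"
  port "#{ENV['ES_PORT']}"
  scheme https
  ca_file "#{ENV['ES_CA']}"
  # Never sniff the cluster via GET /_nodes; AWS Elasticsearch does not
  # expose that API, and a failed sniff empties the connection pool.
  reload_connections false
  # Likewise, do not try to rediscover nodes after a request failure.
  reload_on_failure false
</match>
```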
Fixed in 3.5.0 fluentd image
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1418911 and https://bugzilla.redhat.com/show_bug.cgi?id=1418912
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420219
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421563
Bug #1421563 is resolved, but this is still blocked by #1420219.
Tested with openshift3/logging-fluentd 3.5.0 (image 47057624ecab, built 4 weeks ago, 233.1 MB).

Under fluentd/configs.d/openshift/, the 4 conf files contain:

reload_connections false
reload_on_failure false

so that fluentd will never attempt to reload connections to Elasticsearch.

@Rich, QE can't create an AWS Elasticsearch instance; our account doesn't have access to that service. To verify this bug, can we use a standalone Elasticsearch instead?
@Takayoshi - any ideas about how to reproduce without using AWS Elasticsearch? @Peng Li - not sure - you could try having fluentd run for 3 days under a constant load
Will try a 3-day run on OpenStack. We currently have an OCP installation blocker; I will do the test once it's fixed.
@Rich It's difficult to reproduce this issue with anything other than AWS Elasticsearch. One of the causes is quoted below, as I noted in the first description:

```
AWS Elasticsearch is a managed service. It does not support any administration related API requests. Users can not directly connect/access elastic search nodes. sniffer.hosts tries to retrieve nodes from HTTP API endpoint _nodes. In above snippet, you can remove sniffer.hosts and provide AWS elastic search endpoint. In this case, your Ruby elastic search client will always make connection using same AWS elastic search endpoint which will solve your problem.
```

A standard Elasticsearch is not affected by this issue, even if it runs for a long time. One way to reproduce it would be to set up an Elasticsearch cluster behind a proxy and make the proxy refuse to respond to "_nodes" requests; however, I have no idea how to set up such a cluster.

Again, if you can use AWS Elasticsearch, the way to reproduce the issue is:
- send logs not to a standard Elasticsearch but to AWS Elasticsearch, AND
- keep sending logs for about a day (in my case, 18-24 hours was enough to reproduce it)
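The quoted explanation can be illustrated with a small simulation. This is not the real elasticsearch-transport code, only a hedged sketch of the failure mode: with connection reloading enabled, the client periodically rebuilds its pool from whatever GET /_nodes returns, and since AWS Elasticsearch does not serve that API, the pool eventually ends up empty.

```ruby
# Illustrative sketch only -- class name and behavior are simplified,
# not the actual elasticsearch-transport implementation.
class SniffingPool
  def initialize(hosts)
    @hosts = hosts
  end

  # Mimics `reload_connections`: replace the pool with whatever the
  # periodic _nodes "sniff" discovered.
  def reload!(sniffed_hosts)
    @hosts = sniffed_hosts
    self
  end

  def get_connection
    raise "Cannot get new connection from pool." if @hosts.empty?
    @hosts.first
  end
end

# Hostname below is a made-up example endpoint.
pool = SniffingPool.new(["https://my-domain.us-east-1.es.amazonaws.com:443"])
pool.get_connection  # works fine right after startup

# AWS Elasticsearch rejects administrative APIs such as GET /_nodes, so a
# sniff discovers zero nodes; once the periodic reload runs, the working
# endpoint is replaced with an empty list.
pool.reload!([])

begin
  pool.get_connection
rescue RuntimeError => e
  puts e.message  # the same error fluentd logs
end
```

Setting `reload_connections false` (and `reload_on_failure false`) corresponds to never calling the reload, so the pool keeps the originally configured endpoint.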
Then, for QE, can we just confirm no regressions, and pass with a SanityCheck?
@Rich, can you help check whether the test below is proper?

The following test was executed:
1. Install 3.5.0 Logging using ansible; check that logs can be gathered and shown in Kibana.
2. Test the ES-COPY feature. Note that there is no 'logging-fluentd-template' template in 3.5, so the steps differ slightly from the doc at https://docs.openshift.com/container-platform/3.4/install_config/aggregate_logging.html. I modified the daemonset directly:

# oc edit daemonset/logging-fluentd

I set the ES_COPY related parameters and verified that each item is copied (sent to the same ES twice).

Version info:
# docker images | grep logging
logging-curator          3.5.0   8cfcb23f26b6   12 hours ago   211.1 MB
logging-elasticsearch    3.5.0   d715f4d34ad4   2 weeks ago    399.2 MB
logging-kibana           3.5.0   e0ab09c2cbeb   5 weeks ago    342.9 MB
logging-fluentd          3.5.0   47057624ecab   5 weeks ago    233.1 MB
logging-auth-proxy       3.5.0   139f7943475e   6 weeks ago    220 MB
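For the record, step 2 above was done by adding the ES_COPY environment variables to the fluentd daemonset. A sketch of the edit (variable names follow the aggregate_logging documentation linked above; the container name and values here are illustrative assumptions, not taken from this bug):

```yaml
# oc edit daemonset/logging-fluentd -- excerpt of the container env
spec:
  template:
    spec:
      containers:
      - name: fluentd-elasticsearch   # name may differ in your deployment
        env:
        - name: ES_COPY              # enable the copy output
          value: "true"
        - name: ES_COPY_HOST         # same ES as the primary output,
          value: "logging-es"        # so each item is sent twice
        - name: ES_COPY_PORT
          value: "9200"
        - name: ES_COPY_SCHEME
          value: "https"
```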
After modifying the daemonset, I deleted all existing fluentd pods so that the newly created fluentd pods pick up the change.
There is an upstream test for test-es-copy.sh:
https://github.com/openshift/origin-aggregated-logging/blob/master/hack/testing/test-es-copy.sh

And as part of the 3.5 CI effort, we are adding back the fluentd template because it is needed for other tests:
https://github.com/openshift/origin-aggregated-logging/pull/336/files#diff-b87fae4d61b8fe601653fe74e8caa472R250

@Peng Li - are you asking if testing the ES copy feature is sufficient to do a SanityCheck on this bz? I don't know. Do you have some sort of logging regression test suite you usually run?
Regression test passed, so setting the status to Verified; version info is mentioned in Comment 39.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884