Bug 1399388 - Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch after start logging-fluentd pod for a while
Summary: Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch after start logging-fluentd pod for a while
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Rich Megginson
QA Contact: Peng Li
URL:
Whiteboard:
Depends On: 1427378
Blocks:
 
Reported: 2016-11-29 00:23 UTC by Takayoshi Tanaka
Modified: 2020-05-14 15:26 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-12 19:17:28 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHBA-2017:0884 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory. Last Updated: 2017-04-12 22:50:07 UTC

Description Takayoshi Tanaka 2016-11-29 00:23:05 UTC
Description of problem:
When we set up logging-fluentd to ship logs to an external AWS Elasticsearch instance, it ships logs fine just after the pod starts, using the workaround described in another Bugzilla [1]. After running for a while, however, it starts failing continuously to ship logs. In my test environment this happened roughly one day after the pod started.
This problem is specific to AWS Elasticsearch, and I found several discussions of similar issues [2], [3], [4]. Some people work around it by adding parameters to fluent.conf, but neither I nor the customer behind this issue is able to apply that workaround.

[1] https://bugzilla.redhat.com/1395448
[2] https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get-new-connection-from-pool-error/36252/11
[3] https://github.com/uken/fluent-plugin-elasticsearch/issues/182
[4] https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254868104

Version-Release number of selected component (if applicable):
oc v3.3.1.3
kubernetes v1.3.0+52492b4
openshift v3.3.1.3
kubernetes v1.3.0+52492b4

openshift3/logging-fluentd:3.3.1

How reproducible:
Yes. The customer found this issue first and I have reproduced it.

Steps to Reproduce:
1. Set up AWS Elasticsearch.
2. Set up the logging stack to send logs to the external Elasticsearch [5].
3. Edit the configmap to overwrite fluent.conf as a workaround [1]:
$ oc edit configmap logging-fluentd
data:
  fluent.conf: |
    @include configs.d/openshift/system.conf

    @include configs.d/openshift/input-pre-*.conf
    @include configs.d/dynamic/input-docker-*.conf
    @include configs.d/dynamic/input-syslog-*.conf
    @include configs.d/openshift/input-post-*.conf

    <label @INGRESS>
      @include configs.d/openshift/filter-pre-*.conf
      @include configs.d/openshift/filter-retag-journal.conf
      @include configs.d/openshift/filter-k8s-meta.conf
      @include configs.d/openshift/filter-kibana-transform.conf
      @include configs.d/openshift/filter-k8s-flatten-hash.conf
      @include configs.d/openshift/filter-k8s-record-transform.conf
      @include configs.d/openshift/filter-syslog-record-transform.conf
      @include configs.d/openshift/filter-post-*.conf
    ##

    ## matches
      @include configs.d/openshift/output-pre-*.conf
      <match journal.system** system.var.log** **_default_** **_openshift_** **_openshift-infra_**>
        @type elasticsearch_dynamic
        host "#{ENV['OPS_HOST']}"
        port "#{ENV['OPS_PORT']}"
        scheme https
        index_name .operations.${record['time'].nil? ? Time.at(time).getutc.strftime(@logstash_dateformat) : Time.parse(record['time']).getutc.strftime(@logstash_dateformat)}

        ca_file "#{ENV['OPS_CA']}"

        flush_interval 5s
        max_retry_wait 300
        disable_retry_limit
        resurrect_after 5s
        reload_connections false
      </match>
      <match **>
        @type elasticsearch_dynamic
        host "#{ENV['ES_HOST']}"
        port "#{ENV['ES_PORT']}"
        scheme https
        index_name ${record['kubernetes_namespace_name']}.${record['kubernetes_namespace_id']}.${Time.at(time).getutc.strftime(@logstash_dateformat)}

        ca_file "#{ENV['ES_CA']}"

        flush_interval 5s
        max_retry_wait 300
        disable_retry_limit
        resurrect_after 5s
        reload_connections false
      </match>
    </label>
kind: ConfigMap
<snip>

4. Delete and recreate logging-fluentd-template so the fluentd pods pick up the new configmap (a sketch of this step follows the reference below).
5. Wait a day or so.

[5] https://docs.openshift.com/container-platform/3.3/install_config/aggregate_logging.html#sending-logs-to-an-external-elasticsearch-instance
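
One way step 4 might look, assuming the deployer created the fluentd daemonset from logging-fluentd-template (the exact commands can differ per environment):

$ oc delete daemonset logging-fluentd
$ oc process logging-fluentd-template | oc create -f -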

Actual results:
Fluentd suddenly starts emitting the errors below. Once this error occurs, every retry request fails as well.

2016-11-28 09:17:30 -0500 [warn]: plugin/in_tail.rb:123:block (2 levels) in expand_paths: /var/log/containers/java-1-d7asb_logging_java-1413a22e918c020ec2fc32aca0714b5cdfd1310eb2532f712b0bc88916fced16.log unreadable. It is excluded and would be examined next time.
2016-11-28 13:47:38 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:39 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/base.rb:179:in `perform_request'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/client.rb:115:in `perform_request'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/elasticsearch-api-1.0.7/lib/elasticsearch/api/actions/bulk.rb:80:in `bulk'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:179:in `send'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:171:in `block in write'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:170:in `each'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.3.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb:170:in `write'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/buffer.rb:345:in `write_chunk'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/buffer.rb:324:in `pop'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/output.rb:329:in `try_flush'
  2016-11-28 13:47:38 -0500 [warn]: /usr/share/gems/gems/fluentd-0.12.20/lib/fluent/output.rb:140:in `run'
2016-11-28 13:47:39 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:41 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
  2016-11-28 13:47:39 -0500 [warn]: suppressed same stacktrace
2016-11-28 13:47:41 -0500 [warn]: fluent/output.rb:362:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2016-11-28 13:47:45 -0500 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:158b398"
  2016-11-28 13:47:41 -0500 [warn]: suppressed same stacktrace


Expected results:
Shipping logs without error.

Additional info:

Comment 28 openshift-github-bot 2016-12-14 20:48:40 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e6d2eab7aa53bc7330e9857f73e98d103b90feb2
Bug 1399388 - Failed to ship logs by "Cannot get new connection from pool." to AWS Elasticsearch after start logging-fluentd pod for a while

https://bugzilla.redhat.com/show_bug.cgi?id=1399388
Now that fluent-plugin-elasticsearch supports setting `reload_connections false` and `reload_on_failure false`, set these so that fluentd will never attempt to reload its connections to Elasticsearch.
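
For illustration, the relevant part of an output stanza with both options in place looks roughly like this (a minimal sketch modeled on the configmap shown in the description, not the exact content of the shipped configs.d/openshift/output-*.conf files):

<match **>
  @type elasticsearch_dynamic
  host "#{ENV['ES_HOST']}"
  port "#{ENV['ES_PORT']}"
  scheme https
  ca_file "#{ENV['ES_CA']}"
  flush_interval 5s
  max_retry_wait 300
  disable_retry_limit
  resurrect_after 5s
  # Never sniff the cluster for node addresses; always use the configured
  # endpoint. AWS Elasticsearch does not expose its nodes via _nodes, so
  # sniffing empties the connection pool and triggers
  # "Cannot get new connection from pool."
  reload_connections false
  reload_on_failure false
</match>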

Comment 29 Rich Megginson 2017-01-25 00:27:44 UTC
Fixed in 3.5.0 fluentd image

Comment 31 Xia Zhao 2017-02-09 10:24:15 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1420219

Comment 32 Xia Zhao 2017-02-13 05:52:47 UTC
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1421563

Comment 33 Xia Zhao 2017-02-20 07:12:34 UTC
bug #1421563 is resolved, but still blocked by #1420219

Comment 34 Peng Li 2017-02-27 09:57:03 UTC
Tested with
/openshift3/logging-fluentd  3.5.0  47057624ecab  4 weeks ago 233.1 MB

Under fluentd/configs.d/openshift/, the 4 conf files contain

      reload_connections false
      reload_on_failure false

so that fluentd will never attempt to reload connections to Elasticsearch.

@rich, QE can't create an AWS Elasticsearch instance; our account doesn't have access to it. To verify this bug, can we use a standalone Elasticsearch instead?

Comment 35 Rich Megginson 2017-02-28 01:33:15 UTC
@Takayoshi - any ideas about how to reproduce without using AWS Elasticsearch?

@Peng Li - not sure - you could try having fluentd run for 3 days under a constant load

Comment 36 Peng Li 2017-02-28 10:08:28 UTC
Will try a 3-day run on OpenStack. We currently have an OCP installation blocker; I will do the test once it's fixed.

Comment 37 Takayoshi Tanaka 2017-02-28 23:10:20 UTC
@Rich It's difficult to reproduce this issue with anything other than AWS Elasticsearch. One of the causes is quoted below, as I mentioned in the original description.

```
AWS Elasticsearch is a managed service. It does not support any administration related API requests. Users can not directly connect/access elastic search nodes. 

sniffer.hosts tries to retrieve nodes from HTTP API endpoint _nodes. In above snippet, you can remove sniffer.hosts and provide AWS elastic search endpoint. In this case, your Ruby elastic search client will always make connection using same AWS elastic search endpoint which will solve your problem.
```

A standard Elasticsearch is not affected by this issue even if it runs for a long time. One possible way to reproduce it would be to set up an Elasticsearch cluster behind a proxy and make the proxy stop responding to "_nodes" requests. However, I have no idea how to set up such a cluster.
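
To make the failure mode concrete, here is a minimal Ruby sketch (not taken from the plugin source) of how the underlying elasticsearch-transport client behaves; the option names match the plugin settings shown in the configmap above:

```
require 'elasticsearch'

# With connection reloading enabled (the sniffing behaviour), the client
# periodically calls GET /_nodes and rebuilds its connection pool from the
# addresses in the response. AWS Elasticsearch does not expose usable node
# addresses there, so the rebuilt pool ends up empty and every later request
# raises "Cannot get new connection from pool." Disabling reloading keeps the
# client pinned to the configured endpoint.
client = Elasticsearch::Client.new(
  hosts: [{ host: ENV['ES_HOST'], port: ENV['ES_PORT'], scheme: 'https' }],
  reload_connections: false,  # never re-sniff the node list
  reload_on_failure:  false,  # do not re-sniff after a failed request either
  resurrect_after:    5       # retry a connection marked dead after 5 seconds
)

client.ping
```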

Again, if you can use AWS Elasticsearch, the way to reproduce the issue is:
- send logs to AWS Elasticsearch rather than to a standard Elasticsearch,
AND
- keep sending logs for about a day (in my case 18~24 hours was enough to reproduce it).

Comment 38 Rich Megginson 2017-02-28 23:53:24 UTC
Then, for QE, can we just confirm no regressions, and pass with a SanityCheck?

Comment 39 Peng Li 2017-03-01 10:10:05 UTC
@Rich, can you help check whether the test below is adequate?

The following test was executed:

1. Install 3.5.0 Logging using ansible; check that logs can be gathered and shown in Kibana.
2. Test the ES-COPY feature. Note that there is no 'logging-fluentd-template' template in 3.5, so the steps differ slightly from the doc:
https://docs.openshift.com/container-platform/3.4/install_config/aggregate_logging.html

I modified the daemonset directly:
# oc edit daemonset/logging-fluentd

Then I set the ES_COPY-related parameters and verified that each item is copied (sent to the same ES twice); see the sketch below.
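
A rough sketch of the env entries added to the fluentd container of the daemonset; the ES_COPY_* variable names are assumed from the 3.4 aggregate logging documentation, and the values here are illustrative only:

        env:
        - name: ES_COPY
          value: "true"
        - name: ES_COPY_HOST
          value: "logging-es"   # copy target; here the same ES as the primary output
        - name: ES_COPY_PORT
          value: "9200"
        - name: ES_COPY_SCHEME
          value: "https"
        # ...plus the ES_COPY_CLIENT_CERT / ES_COPY_CLIENT_KEY / ES_COPY_CA
        # entries pointing at the same certificates the primary output uses.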


Version info:
# docker images | grep logging
logging-curator                                      3.5.0               8cfcb23f26b6        12 hours ago        211.1 MB
logging-elasticsearch                                3.5.0               d715f4d34ad4        2 weeks ago         399.2 MB
logging-kibana                                       3.5.0               e0ab09c2cbeb        5 weeks ago         342.9 MB
logging-fluentd                                      3.5.0               47057624ecab        5 weeks ago         233.1 MB
logging-auth-proxy                                   3.5.0               139f7943475e        6 weeks ago         220 MB

Comment 40 Peng Li 2017-03-01 10:11:25 UTC
After modifying the daemonset, I deleted all existing fluentd pods so that the newly created fluentd pods pick up the change, for example:
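
(The label selector below is an assumption based on the standard logging component labels; adjust it to the labels actually on the fluentd pods.)

# oc delete pod -l component=fluentd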

Comment 42 Rich Megginson 2017-03-01 22:16:21 UTC
There is an upstream test for test-es-copy.sh:

https://github.com/openshift/origin-aggregated-logging/blob/master/hack/testing/test-es-copy.sh

And as part of the 3.5 CI effort, we are adding back the fluentd template because it is needed for other tests:

https://github.com/openshift/origin-aggregated-logging/pull/336/files#diff-b87fae4d61b8fe601653fe74e8caa472R250

@Peng Li - are you asking if testing the ES copy feature is sufficient to do a SanityCheck on this bz?  I don't know.  Do you have some sort of logging regression test suite you usually run?

Comment 43 Peng Li 2017-03-02 01:28:34 UTC
Regression tests passed, so I am setting the status to Verified. Version info is given in Comment 39.

Comment 45 errata-xmlrpc 2017-04-12 19:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

