Description of problem:

While investigating on starter-ca-central-1 I noticed every fluentd pod was stuck and unable to connect to Elasticsearch. Each pod logged the following warnings over and over:

    2018-07-12 22:34:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-07-12 22:34:13 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3fe35368fd64"
    2018-07-12 22:34:10 +0000 [warn]: suppressed same stacktrace

Curling the endpoint from the pod succeeded. After rolling the fluentd images back to the latest 3.9, logs started flowing again.

I believe the issue is the difference in the faraday gem versions:

    3.9  has rubygem-faraday-0.13.1-1.el7.noarch.rpm
    3.10 has rubygem-faraday-0.15.1-1.el7.noarch.rpm

The newer one is bad and should be reverted for 3.10.
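For anyone trying to reproduce this outside of fluentd, here is a minimal Ruby sketch that exercises the same gem stack the plugin uses (elasticsearch-transport on top of faraday). The cert paths under /etc/fluent/keys/ are an assumption based on what the logging-fluentd pod mounts; adjust as needed:

    require 'openssl'
    require 'elasticsearch'  # pulls in elasticsearch-transport, which uses faraday

    client = Elasticsearch::Client.new(
      hosts: [{ host: 'logging-es', port: 9200, scheme: 'https' }],
      transport_options: {
        ssl: {
          ca_file:     '/etc/fluent/keys/ca',
          client_cert: OpenSSL::X509::Certificate.new(File.read('/etc/fluent/keys/cert')),
          client_key:  OpenSSL::PKey::RSA.new(File.read('/etc/fluent/keys/key'))
        }
      }
    )

    # fluent-plugin-elasticsearch raises ConnectionFailure when ping returns
    # false, which produces the "Can not reach Elasticsearch cluster" warning above
    puts client.ping ? 'reachable' : 'NOT reachable'

Running this in a pod with the 3.10 gem set versus the 3.9 gem set should show whether the gem stack alone reproduces the failure.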
@jeff, can we attach the stuck buffer files, so we can use them during reproduction and verification?
I just hit the issue from comment 2 on my long-running test. The pipeline was stalled with no new entries in the ES indices and /var/lib/fluentd full with 33 buffers (the max for my buffer size/message size). Restarting fluentd caused the buffers to be drained and the indices were updated. I need to restart the test to try to catch the moment the buffers start accumulating; see the watcher sketch below. I did not see ConnectionFailure mentioned in the description of this bz.
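To catch the onset of buffer accumulation, I plan to run something like this watcher inside the pod. The glob under /var/lib/fluentd is an assumption based on the default buffer_path in our config:

    # Print buffer-file count and total size every 30s so the stall onset
    # shows up in the pod log; the path/glob is an assumption
    loop do
      files = Dir.glob('/var/lib/fluentd/*').select { |f| File.file?(f) }
      bytes = files.sum { |f| File.size(f) }
      puts "#{Time.now.utc} buffers=#{files.size} bytes=#{bytes}"
      sleep 30
    end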
I'm now uncertain that this gem alone is the culprit. Diff between the dependency lists:

    # diff gems3933 gems3109
    7,12c7,12
    < domain_name-0.5.20170404
    < elasticsearch-2.0.2
    < elasticsearch-api-2.0.2
    < elasticsearch-transport-2.0.2
    < excon-0.60.0
    < faraday-0.13.1
    ---
    > domain_name-0.5.20180417
    > elasticsearch-5.0.5
    > elasticsearch-api-5.0.5
    > elasticsearch-transport-5.0.5
    > excon-0.62.0
    > faraday-0.15.1
    24,25c24,25
    < fluent-plugin-systemd-0.0.9
    < fluent-plugin-viaq_data_model-0.0.13
    ---
    > fluent-plugin-systemd-0.0.10
    > fluent-plugin-viaq_data_model-0.0.14
    38c38
    < msgpack-1.2.2
    ---
    > msgpack-1.2.4
    55c55
    < tzinfo-data-1.2018.3
    ---
    > tzinfo-data-1.2018.5
    59c59
    < yajl-ruby-1.3.1
    ---
    > yajl-ruby-1.4.0

The other outlier is elasticsearch-5.0.5, which was added to support ES 5.x, which we are no longer supporting for the 3.10 release.
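One way to narrow this down would be to bisect the dependency set: pin one gem at a time back to its 3.9 version in a scratch Gemfile and re-run a connectivity check like the one sketched in the description. A sketch, with versions taken from the diff above:

    # Scratch Gemfile for bisecting: start from the 3.10 set and revert
    # one gem at a time to its 3.9 version, then re-test connectivity
    source 'https://rubygems.org'

    gem 'faraday',       '0.13.1'  # 3.10 ships 0.15.1
    gem 'elasticsearch', '2.0.2'   # 3.10 ships 5.0.5
    gem 'excon',         '0.60.0'  # 3.10 ships 0.62.0

If reverting faraday alone does not restore connectivity, the elasticsearch 5.0.5 client is the next candidate.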
Something doesn't add up. We've never seen this issue in 3.10 upstream or downstream in our developer, QE, CI, or performance testing. I'm trying to reproduce it with several different versions of the v3.10.x logging-fluentd image, and all of them work correctly. The problem must involve more than just the faraday version. How can we reproduce this issue so that we can debug it and identify the root cause?
Per my conversation with Harrison, moving to 3.10.z since this is not a test blocker.
Verified with logging 3.10.28. The issue is no longer seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2376