Bug 1600740 - Logging fluentd image has bad faraday gem
Summary: Logging fluentd image has bad faraday gem
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-12 22:41 UTC by Jeff Cantrill
Modified: 2018-08-31 06:18 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-31 06:18:10 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2376 0 None None None 2018-08-31 06:18:51 UTC

Description Jeff Cantrill 2018-07-12 22:41:28 UTC
Description of problem:

While investigating on starter-ca-central-1, I noticed every fluentd pod was stuck and unable to connect to Elasticsearch. Each pod logged the following stack trace over and over:

2018-07-12 22:34:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-07-12 22:34:13 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3fe35368fd64"
  2018-07-12 22:34:10 +0000 [warn]: suppressed same stacktrace


Curling the endpoint from the pod showed success. After rolling back the fluentd images to the latest 3.9, logs started flowing again. I believe the issue is the difference in the faraday gem:

3.9 has rubygem-faraday-0.13.1-1.el7.noarch.rpm   
3.10 has rubygem-faraday-0.15.1-1.el7.noarch.rpm   

The newer one is bad and should be reverted for 3.10.
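For reference, a minimal version of that connectivity check looks like the following; the pod name and the client-cert paths under /etc/fluent/keys are assumptions about a typical OpenShift logging deployment, not values captured from this cluster.

# connectivity check from inside a fluentd pod (pod name and cert paths are assumptions)
oc exec -n logging <logging-fluentd-pod> -- \
  curl -sv --cacert /etc/fluent/keys/ca \
       --cert /etc/fluent/keys/cert \
       --key /etc/fluent/keys/key \
       https://logging-es:9200/_cat/health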

Comment 1 Anping Li 2018-07-13 09:26:38 UTC
@jeff, can we attach the stuck buffer files so we can use them to reproduce and verify?

Comment 3 Mike Fiedler 2018-07-13 12:03:09 UTC
I just hit comment 2 on my long-running test. The pipeline was stalled with no new entries in the ES indices and /var/lib/fluentd full with 33 buffers (the max for my buffer size/message size).

Restarting fluentd caused the buffers to be drained and the indices to be updated. I need to restart the test to try to catch the point at which the buffers start accumulating.
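Roughly, the check and the workaround above look like this (pod name and namespace are placeholders):

# count the buffer chunks sitting in /var/lib/fluentd, then restart the pod;
# the daemonset recreates it and the buffers drain on startup
oc exec -n logging <logging-fluentd-pod> -- sh -c 'ls /var/lib/fluentd | wc -l'
oc delete pod -n logging <logging-fluentd-pod>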

I did not see the ConnectionFailure error that is mentioned in the description of this bz.

Comment 5 Jeff Cantrill 2018-07-13 19:00:52 UTC
I'm now uncertain whether this gem is entirely the culprit. Diff between the dependency sets:

# diff gems3933 gems3109
7,12c7,12
< domain_name-0.5.20170404
< elasticsearch-2.0.2
< elasticsearch-api-2.0.2
< elasticsearch-transport-2.0.2
< excon-0.60.0
< faraday-0.13.1
---
> domain_name-0.5.20180417
> elasticsearch-5.0.5
> elasticsearch-api-5.0.5
> elasticsearch-transport-5.0.5
> excon-0.62.0
> faraday-0.15.1
24,25c24,25
< fluent-plugin-systemd-0.0.9
< fluent-plugin-viaq_data_model-0.0.13
---
> fluent-plugin-systemd-0.0.10
> fluent-plugin-viaq_data_model-0.0.14
38c38
< msgpack-1.2.2
---
> msgpack-1.2.4
55c55
< tzinfo-data-1.2018.3
---
> tzinfo-data-1.2018.5
59c59
< yajl-ruby-1.3.1
---
> yajl-ruby-1.4.0
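
To regenerate lists like these, something along the following lines should work; the image references and the gem directory path are assumptions, not the exact builds compared above.

# dump the installed gem directories from a 3.9 and a 3.10 image and diff them
docker run --rm <logging-fluentd-3.9-image> ls /usr/share/gems/gems | sort > gems39
docker run --rm <logging-fluentd-3.10-image> ls /usr/share/gems/gems | sort > gems310
diff gems39 gems310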


The other outlier is elasticsearch-5.0.5, which was added to support Elasticsearch 5.x, which we are no longer supporting for the 3.10 release.

Comment 7 Rich Megginson 2018-07-13 21:06:42 UTC
Something doesn't add up.

We've never seen this issue in 3.10 upstream or downstream in our developer, QE, CI, or performance testing.

I'm trying to reproduce this with several different versions of the v3.10.x logging-fluentd image; all work correctly. The problem cannot be only the faraday version; something else must also be involved.

How can we reproduce this issue so that we can debug it and identify the root cause?
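
One quick check when cycling through image versions is to confirm which faraday build each one actually ships (pod name is a placeholder):

# report the installed faraday rpm inside a running fluentd pod
oc exec -n logging <logging-fluentd-pod> -- rpm -q rubygem-faraday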

Comment 12 Jeff Cantrill 2018-07-19 13:10:57 UTC
Per my conversation with Harrison, moving to 3.10.z since this is not a test blocker.

Comment 14 Mike Fiedler 2018-08-21 13:16:33 UTC
Verified with logging 3.10.28. The issue is no longer seen.

Comment 16 errata-xmlrpc 2018-08-31 06:18:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2376

