Bug 1600740

Summary: Logging fluentd image has bad faraday gem
Product: OpenShift Container Platform
Reporter: Jeff Cantrill <jcantril>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: urgent
Priority: unspecified
Version: 3.10.0
CC: anli, aos-bugs, jcantril, mifiedle, rmeggins
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2018-08-31 06:18:10 UTC
Type: Bug

Description Jeff Cantrill 2018-07-12 22:41:28 UTC
Description of problem:

While investigating starter-ca-central-1, I noticed every fluentd pod was stuck and unable to connect to Elasticsearch. Each pod repeated the following warning and stack trace over and over:

2018-07-12 22:34:10 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-07-12 22:34:13 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3fe35368fd64"
  2018-07-12 22:34:10 +0000 [warn]: suppressed same stacktrace
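
For reference, a minimal sketch of the manual connectivity check referenced below, run from inside a fluentd pod. The pod name is hypothetical, and /etc/fluent/keys is assumed to be the usual secret mount path for openshift logging:

  # check the ES service with the client certs mounted into the fluentd pod
  oc exec logging-fluentd-abc12 -- curl -sv \
    --cert /etc/fluent/keys/cert \
    --key /etc/fluent/keys/key \
    --cacert /etc/fluent/keys/ca \
    https://logging-es:9200/_cat/health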


Curling the endpoint from the pod succeeded. After rolling the fluentd images back to the latest 3.9, logs started flowing again. I believe the issue is the difference in the faraday gem:

3.9 has rubygem-faraday-0.13.1-1.el7.noarch.rpm
3.10 has rubygem-faraday-0.15.1-1.el7.noarch.rpm

The newer version is bad and should be reverted for 3.10.
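
For completeness, a quick way to confirm which faraday build an image ships, as a sketch; the pod name and image pullspec are assumptions:

  # from a running pod
  oc exec logging-fluentd-abc12 -- rpm -q rubygem-faraday

  # or directly against the image
  docker run --rm --entrypoint rpm \
    registry.access.redhat.com/openshift3/logging-fluentd:v3.10 -q rubygem-faraday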

Comment 1 Anping Li 2018-07-13 09:26:38 UTC
@jeff, can you attach the stuck buffer files, so we can use them to reproduce and verify?
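
A minimal sketch of how the stuck buffers could be captured, assuming the default buffer path /var/lib/fluentd and a hypothetical pod name:

  # archive the on-disk buffer chunks from a stuck pod
  oc exec logging-fluentd-abc12 -- tar -C /var/lib/fluentd -czf - . > stuck-buffers.tar.gz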

Comment 3 Mike Fiedler 2018-07-13 12:03:09 UTC
I just hit the issue from comment 2 on my long-running test. The pipeline was stalled, with no new entries in the ES indices and /var/lib/fluentd full with 33 buffers (the max for my buffer size/message size).

Restarting fluentd caused the buffers to drain and the indices to be updated. I need to restart the test to try to catch the moment the buffers start accumulating.
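
One way to catch the onset, as a sketch; the pod name is hypothetical and the polling interval arbitrary:

  # poll the buffer directory and log the chunk count over time
  while true; do
    echo "$(date -u)  $(oc exec logging-fluentd-abc12 -- ls /var/lib/fluentd | wc -l) chunks"
    sleep 60
  done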

I did not see the ConnectionFailure error that is mentioned in the description of this bz.

Comment 5 Jeff Cantrill 2018-07-13 19:00:52 UTC
I'm now uncertain whether this gem is entirely the culprit. Diff between the dependency sets:

# diff gems3933 gems3109
7,12c7,12
< domain_name-0.5.20170404
< elasticsearch-2.0.2
< elasticsearch-api-2.0.2
< elasticsearch-transport-2.0.2
< excon-0.60.0
< faraday-0.13.1
---
> domain_name-0.5.20180417
> elasticsearch-5.0.5
> elasticsearch-api-5.0.5
> elasticsearch-transport-5.0.5
> excon-0.62.0
> faraday-0.15.1
24,25c24,25
< fluent-plugin-systemd-0.0.9
< fluent-plugin-viaq_data_model-0.0.13
---
> fluent-plugin-systemd-0.0.10
> fluent-plugin-viaq_data_model-0.0.14
38c38
< msgpack-1.2.2
---
> msgpack-1.2.4
55c55
< tzinfo-data-1.2018.3
---
> tzinfo-data-1.2018.5
59c59
< yajl-ruby-1.3.1
---
> yajl-ruby-1.4.0
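
For reference, listings like the ones diffed above can be produced roughly as follows; the gem path is the usual RHEL system location, and the pod names are hypothetical:

  # dump the installed gem directories from a pod of each version
  oc exec logging-fluentd-39 -- ls /usr/share/gems/gems | sort > gems3933
  oc exec logging-fluentd-310 -- ls /usr/share/gems/gems | sort > gems3109
  diff gems3933 gems3109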


The other outlier is elasticsearch-5.0.5, which was added to support Elasticsearch 5.x, which we are no longer supporting for the 3.10 release.

Comment 7 Rich Megginson 2018-07-13 21:06:42 UTC
Something doesn't add up.

We've never seen this issue in 3.10 upstream or downstream in our developer, QE, CI, or performance testing.

I'm trying to reproduce this with several different versions of the v3.10.x logging-fluentd image; all work correctly. The problem cannot be the faraday version alone; something else must also be involved.

How can we reproduce this issue so that we can debug it and identify the root cause?

Comment 12 Jeff Cantrill 2018-07-19 13:10:57 UTC
Per my conversation with Harrison, moving to 3.10.z since this is not a test blocker.

Comment 14 Mike Fiedler 2018-08-21 13:16:33 UTC
Verified with logging 3.10.28. The issue is no longer seen.

Comment 16 errata-xmlrpc 2018-08-31 06:18:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2376