Bug 1456490 - Wrong version of ruby gem elasticsearch-api configured with fluentd
Summary: Wrong version of ruby gem elasticsearch-api configured with fluentd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: All
OS: All
unspecified
medium
Target Milestone: ---
: 3.6.z
Assignee: Rich Megginson
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1500548 1500549
TreeView+ depends on / blocked
 
Reported: 2017-05-29 13:27 UTC by Peter Portante
Modified: 2019-11-04 16:49 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Dependencies in rubygem were wrong Consequence: Older versions of packages were being used Fix: Override the rubygem dependencies and force the use of the correct versions in the rpm packages Result: fluentd uses the correct version of the libraries
Clone Of:
: 1500548 1500549 (view as bug list)
Environment:
Last Closed: 2017-08-10 05:26:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Peter Portante 2017-05-29 13:27:35 UTC
It looks like the wrong version of the elasticsearch-api Ruby gem is configured with fluentd: elasticsearch-api-1.0.18

According to the elasticearch-api README (see https://github.com/elastic/elasticsearch-ruby/tree/master/elasticsearch-api), it looks like we should be using elasticsearch-api-2.x instead.

We see these errors consistently across a number of OpenShift dedicated instances:

2017-05-18 22:54:55 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-05-18 22:54:56 +0000 error_class="NoMethodError" error="undefined method `status' for nil:NilClass" plugin_id="object:f23708"
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:265:in `rescue in perform_request'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:247:in `perform_request'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:201:in `send_bulk'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:193:in `block in write_objects'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `each'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `write_objects'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:480:in `write'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:338:in `try_flush'
  2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:149:in `run'

Comment 1 Rich Megginson 2017-05-31 02:54:15 UTC
3.4 is using rubygem-fluent-plugin-elasticsearch-1.9.1:

$ brew list-tagged --latest --inherit rhaos-3.4-rhel-7-candidate|grep fluent-plugin-elasticsearch
rubygem-fluent-plugin-elasticsearch-1.9.1-1.el7  rhaos-3.4-rhel-7-candidate  rmeggins

This has the following dependencies: https://rubygems.org/gems/fluent-plugin-elasticsearch/versions/1.9.1

Runtime Dependencies (3):
elasticsearch < 1.1
excon >= 0
fluentd >= 0.10.43

1.0.18 is the highest version < 1.1:

https://rubygems.org/gems/elasticsearch/versions/1.0.18

Runtime Dependencies (2):
elasticsearch-api = 1.0.18
elasticsearch-transport = 1.0.18

These are our packages:

$ brew list-tagged --latest --inherit rhaos-3.4-rhel-7-candidate|grep rubygem-*-elasticsearch
rubygem-elasticsearch-1.0.18-1.el7        rhaos-3.4-rhel-7-candidate  rmeggins
rubygem-elasticsearch-api-1.0.18-1.el7    rhaos-3.4-rhel-7-candidate  rmeggins
rubygem-elasticsearch-extensions-0.0.15-2.el7aos  rhaos-3.4-rhel-7-candidate  tdawson
rubygem-elasticsearch-transport-1.0.18-1.el7  rhaos-3.4-rhel-7-candidate  rmeggins


I seem to recall this is a bug in the library:
https://github.com/elastic/elasticsearch-ruby/blob/v1.0.18/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L265

 if @retry_on_status.include?(response.status)

The problem is that the code assumes response can never be `nil`.  But if you look in the code block that precedes this, response can be `nil` if https://github.com/elastic/elasticsearch-ruby/blob/v1.0.18/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L249, which means `get_connection` failed.

Comment 2 Peter Portante 2017-05-31 10:33:14 UTC
Rich, do you read something different on the elasticsearch-api README file?

We can assume you are correct about the dependency chain, but it seems pretty clear to me what is documented in the elasticsearch-api README.

My experience with direct clients of ES 1.x and 2.x shares a similar story. I had to change my error handling for 2.x, as my 1.x client can send to ES successfully, but when an error occurs I get stack traces because the handling changes in the responses.

Comment 3 Rich Megginson 2017-05-31 15:30:05 UTC
(In reply to Peter Portante from comment #2)
> Rich, do you read something different on the elasticsearch-api README file?

That README says use elasticsearch api version 2.x for ES 2.x, which is in conflict with what the actual gem spec says.

> 
> We can assume you are correct about the dependency chain, but it seems
> pretty clear to me what is documented in the elasticsearch-api README.

yes

> 
> My experience with direct clients of ES 1.x and 2.x shares a similar story.
> I had to change my error handling for 2.x, as my 1.x client can send to ES
> successfully, but when an error occurs I get stack traces because the
> handling changes in the responses.

Ok, but just to let you know, it might not be as easy as upgrading packages to use a newer version.  That usually pulls in additional version dependencies, which can cause conflicts with other dependencies, as well as requiring newer versions of ruby . . .

Comment 4 Rich Megginson 2017-06-02 21:56:19 UTC
Is there a reproducer?

Comment 5 Rich Megginson 2017-06-03 02:20:38 UTC
I mean, just killing elasticsearch is not enough to make fluentd cause this error.  I see a different error.

Is this bug related to https://bugzilla.redhat.com/show_bug.cgi?id=1399388?  In both cases, the problem is triggered when `get_connection` fails or returns nil.

Comment 6 Rich Megginson 2017-06-03 02:32:16 UTC
reading through https://bugzilla.redhat.com/show_bug.cgi?id=1399388 - I was able to reproduce - looks like we need to backport that fix to 3.4

Comment 7 Peter Portante 2017-06-04 12:29:49 UTC
Yes, we need this commit [1] to remove the "< 1.1" version dependency so that we can provide the versions that work with 2.x of Elasticsearch.

This appears to be v1.9.3 and later for the fluentd-plugin-elasticsearch gem.

[1] https://github.com/uken/fluent-plugin-elasticsearch/commit/b73cc0e21d8558d536c7819adcc3eb3b89ec5368

Comment 8 Rich Megginson 2017-06-05 13:33:11 UTC
(In reply to Peter Portante from comment #7)
> Yes, we need this commit [1] to remove the "< 1.1" version dependency so
> that we can provide the versions that work with 2.x of Elasticsearch.
> 
> This appears to be v1.9.3 and later for the fluentd-plugin-elasticsearch gem.
> 
> [1]
> https://github.com/uken/fluent-plugin-elasticsearch/commit/
> b73cc0e21d8558d536c7819adcc3eb3b89ec5368

This is risky.

The simpler fix is to backport the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1399388

Comment 9 Peter Portante 2017-06-05 13:58:39 UTC
We are using the wrong version of the ruby elasticsearch API to talk to a v2 elasticsearch image, do you agree with that statement?

If you do, how does that fix [1] solve the problem statement above?

If you don't agree with that statement, then we need to be sure we all understand the core issue being discussed here.

[1] https://github.com/openshift/origin-aggregated-logging/pull/309/files

Comment 10 Rich Megginson 2017-06-05 14:42:37 UTC
(In reply to Peter Portante from comment #9)
> We are using the wrong version of the ruby elasticsearch API to talk to a v2
> elasticsearch image, do you agree with that statement?

The README says use elasticsearch api version 2.x for ES 2.x, which is in conflict with what the actual gem spec says.

> 
> If you do, how does that fix [1] solve the problem statement above?

It solves the problem where the code cannot properly handle a `get_connection` exception where `response` is `nil`.  It does this by making it so that we never reload connections, and erase the connection list, causing the situation where `get_connection` can return an error.

If you read through https://bugzilla.redhat.com/show_bug.cgi?id=1399388 you can see that I was able to reproduce this bz:

https://bugzilla.redhat.com/show_bug.cgi?id=1399388#c17

and QE verified that it fixed the problem.

> 
> If you don't agree with that statement, then we need to be sure we all
> understand the core issue being discussed here.
> 
> [1] https://github.com/openshift/origin-aggregated-logging/pull/309/files

I'm already confident that the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1399388 will fix this problem, that it is low risk to backport it, and that it can be done fairly quickly.

Upgrading the libraries to 2.x - I'm not confident that it will solve the problem, I think it is risky, and I think that it will not be done in a timely manner.

Peter, if you can prove that upgrading to 2.x will definitively solve this bz, then I am willing to do all of the packaging, CI, and documentation work.

Comment 11 Rich Megginson 2017-06-05 16:11:54 UTC
Ok.  After a discussion, it was determined that the purpose of this bz is not to solve any particular issue, but to ensure that we are using the correct version of the libraries.

Comment 15 Junqi Zhao 2017-06-29 01:09:50 UTC
@rmeggins

The status will be changed to ON_QA once the defect is added to Errata, you can change to other status if this defect is not ready for testing

Comment 16 Xia Zhao 2017-06-29 01:59:24 UTC
Set back to assigned based on comment #14, thanks Rich for the info.

Comment 17 Rich Megginson 2017-07-06 15:34:16 UTC
The new fluentd image has the following packages:

rubygem-elasticsearch-api-2.0.2-1.el7.noarch
rubygem-elasticsearch-transport-2.0.2-1.el7.noarch
rubygem-elasticsearch-2.0.2-1.el7.noarch
rubygem-fluent-plugin-elasticsearch-1.9.5-2.el7.noarch

new images are

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=571148
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-28803-20170706152514
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.136
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.136-2

Comment 18 Xia Zhao 2017-07-07 05:31:04 UTC
Passed verification on openshift v3.6.136:

$ oc rsh logging-fluentd-4fndr
sh-4.2# rpm -qa | grep rubygem | grep elasticsearch
rubygem-elasticsearch-transport-2.0.2-1.el7.noarch
rubygem-elasticsearch-api-2.0.2-1.el7.noarch
rubygem-elasticsearch-2.0.2-1.el7.noarch
rubygem-fluent-plugin-elasticsearch-1.9.5-2.el7.noarch

Image tested with:
logging-fluentd         v3.6                420c8b54bdf8        14 hours ago        231.5 MB

Comment 19 Peter Portante 2017-08-01 03:43:25 UTC
We need this backported to 3.5 and 3.4, as this is masking log loss in those environments.

Comment 20 Rich Megginson 2017-08-01 12:58:23 UTC
(In reply to Peter Portante from comment #19)
> We need this backported to 3.5 and 3.4, as this is masking log loss in those
> environments.

My concern is that we need to fix the log loss problem at the same time, or we are going to be hammered by 3.4 and 3.5 customer calls when their fluentds start outputting lots of "error status=429" bulk index rejection messages, just as we are seeing in 3.6.  Do we have a plan for that?

Comment 21 Peter Portante 2017-08-02 01:53:15 UTC
(In reply to Rich Megginson from comment #20)
> (In reply to Peter Portante from comment #19)
> > We need this backported to 3.5 and 3.4, as this is masking log loss in those
> > environments.
> 
> My concern is that we need to fix the log loss problem at the same time, or
> we are going to be hammered by 3.4 and 3.5 customer calls when their
> fluentds start outputting lots of "error status=429" bulk index rejection
> messages, just as we are seeing in 3.6.

Agreed.

> Do we have a plan for that?

We are working on that ...

Comment 23 errata-xmlrpc 2017-08-10 05:26:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Comment 24 Peter Portante 2017-10-11 01:29:30 UTC
This only made it to 3.6.z apparently.  It does not appear to be backported to 3.4 and 3.5.


Note You need to log in before you can comment on or make changes to this bug.