It looks like the wrong version of the elasticsearch-api Ruby gem is configured with fluentd: elasticsearch-api-1.0.18 According to the elasticearch-api README (see https://github.com/elastic/elasticsearch-ruby/tree/master/elasticsearch-api), it looks like we should be using elasticsearch-api-2.x instead. We see these errors consistently across a number of OpenShift dedicated instances: 2017-05-18 22:54:55 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-05-18 22:54:56 +0000 error_class="NoMethodError" error="undefined method `status' for nil:NilClass" plugin_id="object:f23708" 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:265:in `rescue in perform_request' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:247:in `perform_request' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:201:in `send_bulk' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:193:in `block in write_objects' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `each' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.1/lib/fluent/plugin/out_elasticsearch_dynamic.rb:192:in `write_objects' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:480:in `write' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:354:in `write_chunk' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/buffer.rb:333:in `pop' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:338:in `try_flush' 2017-05-18 22:54:55 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.29/lib/fluent/output.rb:149:in `run'
3.4 is using rubygem-fluent-plugin-elasticsearch-1.9.1: $ brew list-tagged --latest --inherit rhaos-3.4-rhel-7-candidate|grep fluent-plugin-elasticsearch rubygem-fluent-plugin-elasticsearch-1.9.1-1.el7 rhaos-3.4-rhel-7-candidate rmeggins This has the following dependencies: https://rubygems.org/gems/fluent-plugin-elasticsearch/versions/1.9.1 Runtime Dependencies (3): elasticsearch < 1.1 excon >= 0 fluentd >= 0.10.43 1.0.18 is the highest version < 1.1: https://rubygems.org/gems/elasticsearch/versions/1.0.18 Runtime Dependencies (2): elasticsearch-api = 1.0.18 elasticsearch-transport = 1.0.18 These are our packages: $ brew list-tagged --latest --inherit rhaos-3.4-rhel-7-candidate|grep rubygem-*-elasticsearch rubygem-elasticsearch-1.0.18-1.el7 rhaos-3.4-rhel-7-candidate rmeggins rubygem-elasticsearch-api-1.0.18-1.el7 rhaos-3.4-rhel-7-candidate rmeggins rubygem-elasticsearch-extensions-0.0.15-2.el7aos rhaos-3.4-rhel-7-candidate tdawson rubygem-elasticsearch-transport-1.0.18-1.el7 rhaos-3.4-rhel-7-candidate rmeggins I seem to recall this is a bug in the library: https://github.com/elastic/elasticsearch-ruby/blob/v1.0.18/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L265 if @retry_on_status.include?(response.status) The problem is that the code assumes response can never be `nil`. But if you look in the code block that precedes this, response can be `nil` if https://github.com/elastic/elasticsearch-ruby/blob/v1.0.18/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L249, which means `get_connection` failed.
Rich, do you read something different on the elasticsearch-api README file? We can assume you are correct about the dependency chain, but it seems pretty clear to me what is documented in the elasticsearch-api README. My experience with direct clients of ES 1.x and 2.x shares a similar story. I had to change my error handling for 2.x, as my 1.x client can send to ES successfully, but when an error occurs I get stack traces because the handling changes in the responses.
(In reply to Peter Portante from comment #2) > Rich, do you read something different on the elasticsearch-api README file? That README says use elasticsearch api version 2.x for ES 2.x, which is in conflict with what the actual gem spec says. > > We can assume you are correct about the dependency chain, but it seems > pretty clear to me what is documented in the elasticsearch-api README. yes > > My experience with direct clients of ES 1.x and 2.x shares a similar story. > I had to change my error handling for 2.x, as my 1.x client can send to ES > successfully, but when an error occurs I get stack traces because the > handling changes in the responses. Ok, but just to let you know, it might not be as easy as upgrading packages to use a newer version. That usually pulls in additional version dependencies, which can cause conflicts with other dependencies, as well as requiring newer versions of ruby . . .
Is there a reproducer?
I mean, just killing elasticsearch is not enough to make fluentd cause this error. I see a different error. Is this bug related to https://bugzilla.redhat.com/show_bug.cgi?id=1399388? In both cases, the problem is triggered when `get_connection` fails or returns nil.
reading through https://bugzilla.redhat.com/show_bug.cgi?id=1399388 - I was able to reproduce - looks like we need to backport that fix to 3.4
Yes, we need this commit [1] to remove the "< 1.1" version dependency so that we can provide the versions that work with 2.x of Elasticsearch. This appears to be v1.9.3 and later for the fluentd-plugin-elasticsearch gem. [1] https://github.com/uken/fluent-plugin-elasticsearch/commit/b73cc0e21d8558d536c7819adcc3eb3b89ec5368
(In reply to Peter Portante from comment #7) > Yes, we need this commit [1] to remove the "< 1.1" version dependency so > that we can provide the versions that work with 2.x of Elasticsearch. > > This appears to be v1.9.3 and later for the fluentd-plugin-elasticsearch gem. > > [1] > https://github.com/uken/fluent-plugin-elasticsearch/commit/ > b73cc0e21d8558d536c7819adcc3eb3b89ec5368 This is risky. The simpler fix is to backport the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1399388
We are using the wrong version of the ruby elasticsearch API to talk to a v2 elasticsearch image, do you agree with that statement? If you do, how does that fix [1] solve the problem statement above? If you don't agree with that statement, then we need to be sure we all understand the core issue being discussed here. [1] https://github.com/openshift/origin-aggregated-logging/pull/309/files
(In reply to Peter Portante from comment #9) > We are using the wrong version of the ruby elasticsearch API to talk to a v2 > elasticsearch image, do you agree with that statement? The README says use elasticsearch api version 2.x for ES 2.x, which is in conflict with what the actual gem spec says. > > If you do, how does that fix [1] solve the problem statement above? It solves the problem where the code cannot properly handle a `get_connection` exception where `response` is `nil`. It does this by making it so that we never reload connections, and erase the connection list, causing the situation where `get_connection` can return an error. If you read through https://bugzilla.redhat.com/show_bug.cgi?id=1399388 you can see that I was able to reproduce this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1399388#c17 and QE verified that it fixed the problem. > > If you don't agree with that statement, then we need to be sure we all > understand the core issue being discussed here. > > [1] https://github.com/openshift/origin-aggregated-logging/pull/309/files I'm already confident that the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1399388 will fix this problem, that it is low risk to backport it, and that it can be done fairly quickly. Upgrading the libraries to 2.x - I'm not confident that it will solve the problem, I think it is risky, and I think that it will not be done in a timely manner. Peter, if you can prove that upgrading to 2.x will definitively solve this bz, then I am willing to do all of the packaging, CI, and documentation work.
Ok. After a discussion, it was determined that the purpose of this bz is not to solve any particular issue, but to ensure that we are using the correct version of the libraries.
https://github.com/openshift/origin-aggregated-logging/commit/59d3e59f3759e855367024a36cb75e1ee0aefb49
@rmeggins The status will be changed to ON_QA once the defect is added to Errata, you can change to other status if this defect is not ready for testing
Set back to assigned based on comment #14, thanks Rich for the info.
The new fluentd image has the following packages: rubygem-elasticsearch-api-2.0.2-1.el7.noarch rubygem-elasticsearch-transport-2.0.2-1.el7.noarch rubygem-elasticsearch-2.0.2-1.el7.noarch rubygem-fluent-plugin-elasticsearch-1.9.5-2.el7.noarch new images are koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=571148 repositories: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-28803-20170706152514 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.136 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.136-2
Passed verification on openshift v3.6.136: $ oc rsh logging-fluentd-4fndr sh-4.2# rpm -qa | grep rubygem | grep elasticsearch rubygem-elasticsearch-transport-2.0.2-1.el7.noarch rubygem-elasticsearch-api-2.0.2-1.el7.noarch rubygem-elasticsearch-2.0.2-1.el7.noarch rubygem-fluent-plugin-elasticsearch-1.9.5-2.el7.noarch Image tested with: logging-fluentd v3.6 420c8b54bdf8 14 hours ago 231.5 MB
We need this backported to 3.5 and 3.4, as this is masking log loss in those environments.
(In reply to Peter Portante from comment #19) > We need this backported to 3.5 and 3.4, as this is masking log loss in those > environments. My concern is that we need to fix the log loss problem at the same time, or we are going to be hammered by 3.4 and 3.5 customer calls when their fluentds start outputting lots of "error status=429" bulk index rejection messages, just as we are seeing in 3.6. Do we have a plan for that?
(In reply to Rich Megginson from comment #20) > (In reply to Peter Portante from comment #19) > > We need this backported to 3.5 and 3.4, as this is masking log loss in those > > environments. > > My concern is that we need to fix the log loss problem at the same time, or > we are going to be hammered by 3.4 and 3.5 customer calls when their > fluentds start outputting lots of "error status=429" bulk index rejection > messages, just as we are seeing in 3.6. Agreed. > Do we have a plan for that? We are working on that ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716
This only made it to 3.6.z apparently. It does not appear to be backported to 3.4 and 3.5.