Bug 1489533
Summary: | logging-fluentd needs to periodically reconnect to logging-mux or elasticsearch to help balance sessions | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> | |
Component: | Logging | Assignee: | Rich Megginson <rmeggins> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Fiedler <mifiedle> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 3.6.0 | CC: | aos-bugs, bperkins, jcantril, pportant, rmeggins, tkatarki | |
Target Milestone: | --- | Keywords: | Reopened | |
Target Release: | 3.11.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | openshift3/ose-logging-fluentd:v3.11.0-0.16.0 | Doc Type: | Enhancement | |
Doc Text: |
Feature: Fluentd will now reconnect to Elasticsearch every 100 operations by default.
Reason: If one Elasticsearch node is started before the others in the cluster, the load balancer in the Elasticsearch service connects to that node only, and so do all of the Fluentd instances connecting to Elasticsearch.
Result: By having Fluentd reconnect periodically, the load balancer will be able to spread the load evenly among all of the Elasticsearch nodes in the cluster.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1616352 | Environment: | |
Last Closed: | 2018-12-21 15:16:40 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1616352, 1616354 |
Description
Mike Fiedler
2017-09-07 16:02:01 UTC
I think we might be disabling reconnecting by default, see https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf#L19

The non-mux case has the same issue. With a 3 node elasticsearch cluster, if an ES deploymentconfig is scaled down and back up the new ES pod will never get sessions from any fluentd clients.

Changing the summary of this bz - the core issue is fluentd never reconnects to help with session spreading.

Closing in favor of RFE trello card

*** Bug 1448951 has been marked as a duplicate of this bug. ***

I agree with Mike, tracking this as a trello card is worth while, but the bug appears to be present with all versions of aggregated logging which use fluentd. Seems like we need to keep this open, and clone this to all the versions we support.

Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/f084fee4de8c32f83c53694058320e6dc3e5d170
Bug 1489533 - logging-fluentd needs to periodically reconnect to logging-mux or elasticsearch to help balance sessions

https://bugzilla.redhat.com/show_bug.cgi?id=1489533

https://github.com/uken/fluent-plugin-elasticsearch/pull/459 implements support for reloading connections when the Elasticsearch is behind a proxy/load balancer, as in our case, and allows specifying the reload interval in terms of the number of operations.

This PR adds support for the following env. vars which can be set in the fluentd daemonset/mux deployment. The ability to set these is provided primarily for experimentation, not something which will ordinarily require tuning in production.

`ES_RELOAD_CONNECTIONS` - boolean - default `true`
`ES_RELOAD_AFTER` - integer - default `100`
`ES_SNIFFER_CLASS_NAME` - string - default `Fluent::Plugin::ElasticsearchSimpleSniffer`

There are also `OPS_` named env. vars which will override the corresponding `ES_` named env. var.
That is, by default, fluentd will reload connections to Elasticsearch every 100 operations (NOTE: not every 100 records!). These include internal `ping` operations, so will not exactly correspond to each bulk index request.

https://github.com/openshift/origin-aggregated-logging/commit/0ecf76a77627c2205f78da6c9ace4dbdc6b72197
Merge pull request #1284 from richm/bug-1489533

Bug 1489533 - logging-fluentd needs to periodically reconnect to logging-mux or elasticsearch to help balance sessions

Tested this with varying workloads from 50 to 700 messages/second/node from 100 pods per node, each in its own namespace. Tested with RELOAD off, default (100 operations) and 250 operations.

For the highest workload (700 1Kb messages/second/node), fluentd cpu utilization:

RELOAD off: 48%
RELOAD 100 operations: 52%
RELOAD 250 operations: 49%

For a workload of 250 messages/second/node:

RELOAD off: 19%
RELOAD 100 operations: 22%
RELOAD 250 operations: 21%

Different RELOAD levels had no impact on fluentd memory utilization.
Different RELOAD levels had no impact on elasticsearch cpu or memory.

Leaving it at 100 operations seems reasonable, but defaulting to 200 or 250 might provide some marginal cpu utilization savings.

@rmeggins, opinion on upping the default reload to 200 operations?

(In reply to Mike Fiedler from comment #17)
> @rmeggins, opinion on upping the default reload to 200 operations?

Sure, sounds good.

Verified on 3.11.0-0.25.0. Verified on a 500 node cluster that logging connections are spread evenly across ES systems and that re-connections occur. Will leave it to dev to decide if the default of 100 should change based on the data in comment 16.

(In reply to Mike Fiedler from comment #19)
> Verified on 3.11.0-0.25.0. Verified on a 500 node cluster that logging
> connections are spread evenly across ES systems and that re-connections
> occur. Will leave it to dev to decide if the default of 100 should change
> based on the data in comment 16.
openshift/origin-aggregated-logging/pull/1341

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.
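
For anyone experimenting with these settings, here is a rough Python sketch of how the `ES_`/`OPS_` env. vars described in the commit message might resolve to effective values. This is not the code from the PR. The function name `effective_settings`, the `ops` flag, and the assumption that an `OPS_` var only takes effect for the operations-log output are all hypothetical; only the variable names and defaults come from the commit message.

```python
import os

# Defaults as listed in the commit message for this bug.
DEFAULTS = {
    "RELOAD_CONNECTIONS": "true",
    "RELOAD_AFTER": "100",
    "SNIFFER_CLASS_NAME": "Fluent::Plugin::ElasticsearchSimpleSniffer",
}


def effective_settings(env=None, ops=False):
    """Resolve reload settings from a mapping of environment variables.

    Hypothetical helper: `ops=True` models the operations-log output, where
    an OPS_ var (if set) overrides the corresponding ES_ var. That scoping
    is an assumption, not something stated in the bug.
    """
    env = os.environ if env is None else env
    raw = {}
    for suffix, default in DEFAULTS.items():
        # Start from the ES_ value, falling back to the documented default.
        value = env.get("ES_" + suffix, default)
        if ops:
            # An OPS_ var, when present, wins over the ES_ value.
            value = env.get("OPS_" + suffix, value)
        raw[suffix] = value
    return {
        "reload_connections": raw["RELOAD_CONNECTIONS"].strip().lower() == "true",
        "reload_after": int(raw["RELOAD_AFTER"]),
        "sniffer_class_name": raw["SNIFFER_CLASS_NAME"],
    }


if __name__ == "__main__":
    # Example: raise the interval to 250 operations, as in the test runs above.
    print(effective_settings({"ES_RELOAD_AFTER": "250"}))
```

Note that `reload_after` counts operations, not records, so the value does not map directly onto a message rate.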