Bug 1506854
| Summary: | Default fluentd elasticsearch plugin request timeout is too short, leading to potential log loss and stalled log flow | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xiaoli Tian <xtian> |
| Component: | Logging | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.4.1 | CC: | anli, aos-bugs, jcantril, pportant, rmeggins, rromerom, tkatarki |
| Target Milestone: | --- | Keywords: | OpsBlocker, Reopened |
| Target Release: | 3.4.z | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: When the logging system is under heavy load, Elasticsearch may take longer than the 5-second timeout to respond, or it may respond with an error indicating that Fluentd needs to back off. Consequence: In the former case, Fluentd retries sending the records, which can produce duplicate records. In the latter case, if Fluentd is unable to retry, it drops records, leading to data loss. Fix: For the former case, set request_timeout to 10 minutes, so that Fluentd waits up to 10 minutes for the reply from Elasticsearch before retrying the request. For the latter case, Fluentd now blocks reading more input until the output queues and buffers have enough room to write more data. Result: Greatly reduced chance of duplicate data (though not entirely eliminated), and no data loss due to backpressure. | | |
| Story Points: | --- | | |
| Clone Of: | 1497836 | Environment: | |
| Last Closed: | 2017-12-07 07:14:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1497836 | ||
| Bug Blocks: | | | |
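The two-part fix in the Doc Text maps onto two fluentd output settings: a longer `request_timeout` on the Elasticsearch output, and a buffer policy that blocks input instead of dropping records when queues fill. The following is a minimal sketch only, assuming the fluent-plugin-elasticsearch output and fluentd v0.12-era buffer options; the match pattern, host, and buffer sizes are illustrative, not taken from the actual OpenShift logging patch.

```
<match **>
  @type elasticsearch
  host logging-es
  port 9200
  # Wait up to 10 minutes for Elasticsearch to reply before
  # retrying, instead of the former 5-second default.
  request_timeout 600s
  # When the output queue is full, block reading more input
  # (backpressure) rather than dropping records.
  buffer_queue_full_action block
  buffer_type memory
  buffer_chunk_limit 8m
  buffer_queue_limit 32
</match>
```

With `buffer_queue_full_action block`, slow Elasticsearch responses stall log flow upstream rather than losing data, which matches the "no data loss due to backpressure" result described above.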
Comment 2
Anping Li
2017-10-27 11:39:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3389