Bug 1891623
| Summary: | Fluentd constantly generating queue length pending alerts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Neil Girard <ngirard> |
| Component: | Logging | Assignee: | Vitalii Parfonov <vparfono> |
| Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.5 | CC: | achakrat, aos-bugs, jcantril, kelly.brown1, ocasalsa, periklis, pescorza, qitang, rrackow, ssonigra, steven.barre, stwalter, vparfono |
| Target Milestone: | --- | Keywords: | ServiceDeliveryImpact |
| Target Release: | 4.6.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | logging-core | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-01 12:03:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neil Girard
2020-10-26 21:05:41 UTC
Setting to Medium and targeting 4.7. This alert could be firing prematurely because we made some adjustments to allow more buffers and the alert wasn't adjusted accordingly. Additionally, this could be an indicator that fluent isn't able to push logs fast enough. That might be a storage issue, but it's hard to say without additional information. The fact it's firing means we simply may not be able to collect logs fast enough, and https://bugzilla.redhat.com/show_bug.cgi?id=1872465 is already addressing collector performance.

Any specific data collection you would like me to collect to aid in this?

(In reply to Neil Girard from comment #3)
> Any specific data collection you would like me to collect to aid in this?

I don't think so. I have an open discussion with our team, and we may remove the alert FluentdQueueLengthBurst, as AFAIK it's not actionable. It is an indicator that there was a spike in log generation by some service. You could easily see this alert when a fresh stack is installed and fluent is waiting to collect logs but is unable to push them to ES because the ES cluster is starting. The alert you reference in #c0, FluentdQueueLengthIncreasing, is more interesting, though, as it means the collector is unable to push logs faster than it can collect them. Note that it is in pending, which means the alert hasn't actually fired. It is an indicator, however, that admins may need to check the outputs to ensure they are available. The only real change you could make to the collector is to give it more CPU, but that has limitations. The collector is Ruby-based and can only process logs so fast.

After discussions with the team, we can say that a pending alert is not a fired alert or an issue; it just means that the Prometheus query has evaluated to true for at least one interval, so you can skip it.

Verified with clusterlogging.4.6.0-202105210952.p0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6.31 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2102
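The pending-versus-firing distinction discussed above can be sketched in a few lines. This is an illustrative model only, not Prometheus source code: a rule with a `for:` duration stays "pending" while its expression has been true for less than that window, and only transitions to "firing" once the expression holds continuously for the whole window (the function name and interval-counting simplification are assumptions for the sketch).

```python
def alert_state(eval_results, for_intervals):
    """Toy model of a Prometheus alerting rule with a `for:` clause.

    eval_results: list of booleans, one per evaluation interval,
        True when the rule expression evaluated to true.
    for_intervals: the `for:` duration expressed in intervals.
    Returns "inactive", "pending", or "firing".
    """
    streak = 0  # consecutive intervals the expression has been true
    for ok in eval_results:
        streak = streak + 1 if ok else 0
    if streak == 0:
        return "inactive"
    return "firing" if streak >= for_intervals else "pending"

# A single true evaluation leaves the alert pending, not firing:
alert_state([False, True], for_intervals=3)       # -> "pending"
# Only a full window of true evaluations fires the alert:
alert_state([True, True, True], for_intervals=3)  # -> "firing"
# The expression going false resets the alert:
alert_state([True, False], for_intervals=3)       # -> "inactive"
```

This mirrors the point made in the thread: a "pending" FluentdQueueLengthIncreasing alert means the condition was observed for at least one interval, not that the alert has fired.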