Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1891623

Summary: Fluentd constantly generating queue length pending alerts
Product: OpenShift Container Platform
Reporter: Neil Girard <ngirard>
Component: Logging
Assignee: Vitalii Parfonov <vparfono>
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: medium
Priority: medium
Version: 4.5
CC: achakrat, aos-bugs, jcantril, kelly.brown1, ocasalsa, periklis, pescorza, qitang, rrackow, ssonigra, steven.barre, stwalter, vparfono
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-core
Type: Bug
Last Closed: 2021-06-01 12:03:32 UTC

Description Neil Girard 2020-10-26 21:05:41 UTC
Description of problem:
A clean install of OCP 4.5 with default logging enabled continuously generates FluentdQueuelengthIncreasing alerts in the Pending state, but they never fire. The system is not even logging audit information, and no apps are installed (only the core OCP install).


Version-Release number of selected component (if applicable):
OCP: 4.5
CLO: 4.5.0-202010090328.p0-b348f79
EO: 4.5.0-202010081312.p0-c6a2ddc

How reproducible:
Customer reproduces this with several installs.

Steps to Reproduce:
1. Install the CLO with default settings.

Actual results:
The node on which the alert is pending seems to fluctuate.

Expected results:
No alert

Additional info:

System Info:
* VMware ESXi, 6.7.0, 16773714
* UCSB-B200-M4/M5
* Masters: 8 CPU / 32 GB; Infra/Worker: 10 CPU / 64 GB

Comment 2 Jeff Cantrill 2020-10-26 22:20:39 UTC
Setting to Medium and targeting 4.7.  This alert could be firing prematurely because we made some adjustments to allow more buffers and the alert wasn't adjusted accordingly.  Additionally, this could be an indicator that fluent isn't able to push logs fast enough.  That might be a storage issue, but it's hard to say without additional information.  The fact that it's firing means we simply may not be able to collect logs fast enough; https://bugzilla.redhat.com/show_bug.cgi?id=1872465 is already addressing collector performance.
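
For context: the collector exports buffer metrics (for example, fluentd_output_status_buffer_queue_length from fluent-plugin-prometheus), and the queue alerts are threshold rules over those series. A minimal sketch of a burst-style rule, with an assumed expression and threshold purely for illustration (the rule actually shipped by the cluster-logging-operator may differ):

# Sketch only: expression, threshold, and duration are illustrative,
# not the shipped rule.
groups:
- name: fluentd-queue-example
  rules:
  - alert: FluentdQueueLengthBurst
    # Trips while the output buffer queue exceeds a fixed length. If the
    # collector is later allowed more buffer chunks but the threshold is
    # left unchanged, normal operation can cross it.
    expr: fluentd_output_status_buffer_queue_length > 32
    for: 5m
    labels:
      severity: warning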

Comment 3 Neil Girard 2020-10-27 15:55:18 UTC
Is there any specific data you would like me to collect to aid in this?

Comment 4 Jeff Cantrill 2020-12-15 21:56:13 UTC
(In reply to Neil Girard from comment #3)
> Is there any specific data you would like me to collect to aid in this?

I don't think so.  I have an open discussion with our team and we may remove the FluentdQueueLengthBurst alert, as AFAIK it's not actionable.  It is an indicator that there was a spike in log generation by some service.  You could easily see this alert when a fresh stack is installed and fluent is waiting to collect logs but is unable to push them to ES because the ES cluster is still starting.  The alert you reference in #c0, FluentdQueuelengthIncreasing, is more interesting, though, as it means the collector is collecting logs faster than it can push them.  Note it is in pending, which means the alert hasn't actually fired.  It is an indicator, however, that admins may need to check the outputs to ensure they are available.  The only real change you could make to the collector is to give it more CPU, but that has limitations.  The collector is Ruby-based and can only process logs so fast.
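
On the CPU point: collector resources can be raised through the ClusterLogging custom resource. A sketch, assuming the logging.openshift.io/v1 API in this release; the values are illustrative, not tuning advice:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: fluentd
      fluentd:
        resources:
          requests:
            cpu: 500m      # illustrative bump over the default request
            memory: 736Mi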

Comment 5 Vitalii Parfonov 2021-02-23 08:57:40 UTC
After discussions with the team, we can say that a pending alert is not a real alert or an issue; it just means that the Prometheus query has evaluated to true for at least one interval, so you can skip it.
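
Concretely: an alerting rule with a for: clause moves through inactive -> pending -> firing, and only firing alerts are sent to Alertmanager. A minimal sketch with an assumed expression (the shipped FluentdQueuelengthIncreasing rule may differ):

# Sketch only: the real expression may differ.
- alert: FluentdQueuelengthIncreasing
  # True while the output buffer queue has grown over the last 10 minutes.
  expr: delta(fluentd_output_status_buffer_queue_length[10m]) > 0
  # The expression must hold for a full hour before the alert fires; until
  # then the alert is only "pending" and is never routed to Alertmanager.
  for: 1h
  labels:
    severity: warning

Pending alerts can still be seen in the console, or via the built-in ALERTS{alertstate="pending"} series, which is what was observed here.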

Comment 13 Qiaoling Tang 2021-05-31 03:11:40 UTC
Verified with clusterlogging.4.6.0-202105210952.p0.

Comment 15 errata-xmlrpc 2021-06-01 12:03:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.31 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2102