Bug 1846174 - [Logging 4.5] fluentd gets wedged/stops sending to ES at moderately high rates with latest images
Summary: [Logging 4.5] fluentd gets wedged/stops sending to ES at moderately high rates with latest images
Keywords:
Status: CLOSED DUPLICATE of bug 1833486
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1846094
Blocks: 1847027
 
Reported: 2020-06-11 00:17 UTC by Mike Fiedler
Modified: 2020-08-12 17:26 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1846094
Environment:
Last Closed: 2020-07-07 19:54:38 UTC
Target Upstream Version:
Embargoed:



Comment 1 Jeff Cantrill 2020-06-11 19:26:31 UTC
Moving to a z stream for resolution

Comment 4 Jeff Cantrill 2020-06-16 21:39:04 UTC
> 2020-06-15 11:32:07 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block 

This log indicates that fluent is blocking on reading more logs because it is unable to buffer them.  It is possible the tailed file is being rotated out from beneath fluent, so it never sees some logs.  I'm not convinced that fluent has stopped pushing messages; the messages may actually be lost to log collection altogether.  This [1] introduced a change to keep logging from bringing down the node, but in doing so we may have gone completely the other way.

We are investigating whether removing the 'queue_limit_length' parameter altogether and setting 'queued_chunks_limit_size' back to its previous value of '32' addresses the issue.  You can experiment yourself if needed by setting the clusterlogging instance to unmanaged, manually editing the conf, and deleting the fluent pods to restart them.
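
If you want to try that, the change in the fluentd configmap would look roughly like this (a sketch only; the surrounding <buffer> settings in the Elasticsearch output come from the operator-generated fluent.conf and may differ in your version):

    <buffer>
      # queue_limit_length removed entirely
      queued_chunks_limit_size 32
      # leave the remaining operator-generated buffer settings as they are
    </buffer>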

I also can't see how much CPU is allocated to fluent.  If it is the default, you should consider increasing it to at least 500m.

[1] https://github.com/openshift/cluster-logging-operator/commit/8ceb6215e369ebbb754ab2da263c8c67d7a624b8
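
For the CPU suggestion, on a managed instance the request/limit comes from the clusterlogging CR rather than the daemonset; roughly something like this (field names as I recall them from the 4.x ClusterLogging spec, and the memory values are placeholders - verify both against your cluster):

    spec:
      collection:
        logs:
          type: fluentd
          fluentd:
            resources:
              requests:
                cpu: 500m
                memory: 736Mi
              limits:
                cpu: 500m
                memory: 736Mi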

Comment 5 Mike Fiedler 2020-06-17 11:52:33 UTC
> I also don't see how much CPU is allocated to fluent.  If it is the default then you should consider increasing it to at least 500m or more.

The fluentd CPU request/limit is whatever the OOTB default is.  I will double check it when I try running unmanaged with the queue changes. 

I can confirm that no messages are going to ES - the doc counts are not changing for the app* indices.

Comment 6 Jeff Cantrill 2020-06-18 15:56:48 UTC
Moving to UpcomingSprint as unlikely to resolve before EOS

Comment 7 Mike Fiedler 2020-06-18 18:48:16 UTC
I changed the clusterlogging CR to Unmanaged and updated the fluentd configmap with:

            # queue_limit_length "#{ENV['BUFFER_QUEUE_LIMIT'] || '32' }"
            queued_chunks_limit_size '32'


But I am still hitting the issue.  After about 5 minutes of logging, the app-* index stops updating and /var/lib/fluentd/clo_default_output_es starts filling up and never drains.

Looking on the fluentd pod, I see the changes above in /etc/fluent/configs.d/user/..2020_06_18_17_57_24.830597441/fluent.conf
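
A quick way to confirm the mounted config picked up the edit (the pod name is a placeholder):

    oc -n openshift-logging exec <fluentd-pod> -- grep -n queued_chunks_limit_size /etc/fluent/configs.d/user/fluent.conf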

Comment 8 Jeff Cantrill 2020-06-30 19:59:01 UTC
(In reply to Mike Fiedler from comment #7)
> But I am still hitting the issue.  After about 5 minutes of logging the
> app-* index stops updating and /var/lib/fluentd/clo_default_output_es starts
> filling up and never draining.

Confirmed a similar experience today:

* Watching the buffers after the test shows the buffers cycling and never appearing to drain
* App logs no longer grow
* Infra logs continue to grow

I noted the config still includes infra container logs which cycle into the same buffers.  If you modify the source config of 'container-input' to exclude infra pods:

'exclude_path ["/var/log/containers/*-*_openshift*_*.log", "/var/log/containers/*-*_kube*_*.log", "/var/log/containers/*-*_default_*.log"]'

The following command will return with no buffers once the app pod is complete:

'oc exec $pod --  ls /var/lib/fluentd/clo_default_output_es'
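
For reference, that edit goes into the in_tail source for 'container-input' in the fluentd configmap; the modified stanza would look roughly like this (the path and the remaining parameters come from the operator-generated config and may differ in your version):

    <source>
      @type tail
      @id container-input
      path "/var/log/containers/*.log"
      exclude_path ["/var/log/containers/*-*_openshift*_*.log", "/var/log/containers/*-*_kube*_*.log", "/var/log/containers/*-*_default_*.log"]
      # remaining parameters (pos_file, tag, parser, etc.) unchanged from the generated config
    </source>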

IMO the result is that the buffers are not actually wedged; we are seeing buffers for logs that end up in a different index.  We still have the issue of logs being rotated out from under fluent before it knows they are there, but that is a known issue and we are capped at a known maximum before we start to miss logs.

Comment 9 Periklis Tsirakidis 2020-07-06 07:35:27 UTC
Adding the UpcomingSprint label; the investigation mentioned in [1] is ongoing, with no conclusive solution yet.

[1] https://coreos.slack.com/archives/CB3HXM2QK/p1593540395170500?thread_ts=1593308260.140300&cid=CB3HXM2QK

Comment 10 Mike Fiedler 2020-07-07 02:07:39 UTC
Trying the suggestion from comment 8 - possibly there is a problem with my procedure.  Here it is (rough equivalent commands are sketched below the steps).

1. Make clusterlogging instance Unmanaged
2. oc edit configmap fluentd
3. replaced the exclude_path in container-input with:  exclude_path ["/var/log/containers/*-*_openshift*_*.log", "/var/log/containers/*-*_kube*_*.log", "/var/log/containers/*-*_default_*.log"]
4. deleted fluentd pods (10 nodes) and let them restart
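
Roughly the same steps as commands (a sketch; the openshift-logging namespace and the component=fluentd label selector are my assumptions for a default install):

    oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
    oc -n openshift-logging edit configmap fluentd            # adjust exclude_path in the container-input source
    oc -n openshift-logging delete pods -l component=fluentd  # the daemonset recreates the pods with the new config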

After the fluentd pods restarted, I noticed the infra-* index was still growing.  Is this expected, or should it stop growing?  I verified the configmap was not being overwritten.

I continued on, ran the scenario described in this bug, and hit the same issue.  I noticed that both the app-* and infra-* indices stopped growing.

Let me know if my procedure is incorrect.

Comment 11 Mike Fiedler 2020-07-07 18:42:54 UTC
Ignore comment 10.  With @ematysek's help (and the workload ansible playbook) I disabled infra logging:

1. Ran the playbook to disable infra logging and verified the infra indices stopped growing
2. Ran ls /var/lib/fluentd/clo_default_output_es on all nodes and verified they were clean - buffers drained (see the sketch below the steps)
3. Started my workload.  
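
A minimal sketch of the buffer check in step 2, assuming the pods are in openshift-logging and labeled component=fluentd (adjust both for your install):

    for pod in $(oc -n openshift-logging get pods -l component=fluentd -o name); do
      echo "== $pod"
      oc -n openshift-logging exec $pod -- ls /var/lib/fluentd/clo_default_output_es  # empty output means the buffers have drained
    done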

Made 2 attempts.  On both, the app index stopped growing - once at 525725 entries and the second time at 420925 entries.  Nothing was rejected or odd looking in the ES queues.

Comment 12 Jeff Cantrill 2020-07-07 19:54:38 UTC
(In reply to Mike Fiedler from comment #11)
> Ignore comment 10.  With @ematysek's help (and the workload ansible playbook) I
> disabled infra logging
> 
> 1. Run the playbook to disable infra, verified infra indices stopped growing
> 2. Ran ls /var/lib/fluentd/clo_default_output_es on all nodes and
> verified they were clean - buffers drained
> 3. Started my workload.  
> 
> Made 2 attempts.  On both the app index stopped growing.  Once at 525725
> entries and the second time at 420925 entries. Nothing rejected or odd
> looking in ES queues.

This confirms what I was suggesting.  There is nothing "wedged" per se, with logs sitting in buffers.  You are experiencing the log rotation/collection issue exhibited by [1].  There appears to be some burstiness from crio where logs get rotated before fluent is able to collect them, and therefore they are gone altogether.

Closing this issue as NOTABUG in favor of resolving with [1] as there is nothing wrong on the "sending" side of log collection.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1833486

Comment 13 Mike Fiedler 2020-07-07 20:09:48 UTC
When I restart fluentd, the buffers drain and the waiting messages are indexed.  I don't think it is a log rotation issue, but I can wait and test again when bug 1833486 is fixed.

Comment 14 Mike Fiedler 2020-07-07 20:11:41 UTC
Changing to duplicate of 1833486

*** This bug has been marked as a duplicate of bug 1833486 ***

Comment 15 Marko Karg 2020-07-08 06:42:17 UTC
Not sure I follow here - how can this be a log rotation issue if the logs are already in fluentd and stuck there?

