Bug 1813759

Summary: BUFFER_QUEUE_FULL_ACTION=drop_oldest_chunk option ignores BUFFER_QUEUE_LIMIT=32 in fluentd
Product: OpenShift Container Platform
Component: Logging
Version: 3.11.0
Target Release: 4.5.0
Hardware: Unspecified
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: high
Reporter: Hyosun Kim <hyoskim>
Assignee: Periklis Tsirakidis <periklis>
QA Contact: Anping Li <anli>
CC: aos-bugs, jcantril, periklis
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2020-05-06 07:40:54 UTC

Description Hyosun Kim 2020-03-16 02:02:16 UTC
Description of problem:

+Environment
OCP 3.11.157
openshift-logging project
logging-es-data-master pod scaled to 0 (Elasticsearch temporarily stopped)
fluentd log forwarding by td-agent

+Issue:
After setting BUFFER_QUEUE_FULL_ACTION=drop_oldest_chunk, buffer files are not deleted and continue to accumulate.
Before setting the option, the number of buffer files was always 32.

Version-Release number of selected component (if applicable):
OCP 3.11.157

How reproducible:
Set the following options in fluentd:
BUFFER_QUEUE_FULL_ACTION=drop_oldest_chunk
BUFFER_QUEUE_LIMIT=32
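
This corresponds roughly to the following fluentd 0.12 file-buffer parameters in the generated Elasticsearch output config (a sketch only; the exact config rendered by the logging deployment can differ, and the buffer_path is inferred from the file names shown below):

  <match **>
    type elasticsearch
    buffer_type file
    buffer_path /var/lib/fluentd/buffer-output-es-config
    # from BUFFER_SIZE_LIMIT (8388608 bytes)
    buffer_chunk_limit 8m
    # from BUFFER_QUEUE_LIMIT
    buffer_queue_limit 32
    # from BUFFER_QUEUE_FULL_ACTION
    buffer_queue_full_action drop_oldest_chunk
  </match>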

Actual results: buffer files are not deleted and continue to accumulate.

1. Information from logging-fluentd-lhp5t
Connection #0 to host logging-es left intact
--Persistence stats for pod logging-fluentd-lhp5t
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/rhel-root  100G   12G   89G  12% /var/lib/fluentd
total 3483648
-rw-r--r--. 1 root root 8388418 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b6288a0cd093c.log
-rw-r--r--. 1 root root 8387804 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b62801e8ce806.log
-rw-r--r--. 1 root root 8387625 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b627acdf0f77c.log
-rw-r--r--. 1 root root 8387763 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b627667e58e63.log
-rw-r--r--. 1 root root 8387900 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b6271acb16780.log
-rw-r--r--. 1 root root 8387789 Mar 13 14:58 buffer-output-es-config.output_tag.q5a0b626d38872ee4.log
-rw-r--r--. 1 root root 8387285 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b626765d2b7d3.log
-rw-r--r--. 1 root root 8387374 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b6261fe1c6814.log
-rw-r--r--. 1 root root 8387143 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b625be7577f85.log
-rw-r--r--. 1 root root 8386508 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b6257678b2695.log
-rw-r--r--. 1 root root 8387108 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b62530f0f2dd0.log
-rw-r--r--. 1 root root 8387925 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b624e2bf083da.log
-rw-r--r--. 1 root root 8387289 Mar 13 14:57 buffer-output-es-config.output_tag.q5a0b6249e02b6d1e.log
...
Total 403 buffer-output-es-config.output_tag.* files.

2. logging-fluentd-lhp5t.log (trace level)
2020-03-13 14:57:22 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:26 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:32 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:36 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:41 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:47 +0900 [debug]: buffer queue is full. Dropping oldest chunk
2020-03-13 14:57:48 +0900 [info]: stats - namespace_cache_size: 4, pod_cache_size: 23, namespace_cache_api_updates: 23, pod_cache_api_updates: 23, id_cache_miss: 23


Expected results:
Old buffer files are deleted and only 32 files are maintained, as configured by this option:
BUFFER_QUEUE_LIMIT=32
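
For reference, with these settings the buffer directory should plateau at roughly 32 chunks x 8 MiB (BUFFER_SIZE_LIMIT) = 256 MiB, whereas the listing above shows 403 chunk files x ~8 MiB ≈ 3.1 GiB.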

Additional info:

Comment 9 Hyosun Kim 2020-04-28 01:53:19 UTC
Hi @Periklis

This is the result of running the script.

$ oc project openshift-logging
$ oc exec logging-fluentd-5bnlk -- /fluentd-env-debug.sh
NUM_OUTPUTS: 2
DF_LIMIT: 107321753600
FILE_BUFFER_LIMIT: 256Mi
TOTAL_LIMIT: 536870912
BUFFER_SIZE_LIMIT: 8388608
TOTAL_BUFFER_SIZE_LIMIT: 268435456
BUFFER_QUEUE_LIMIT: 32
BUFFER_QUEUE_FULL_ACTION: drop_oldest_chunk

Thank you.

Comment 10 Periklis Tsirakidis 2020-05-04 11:22:14 UTC
@Hyosun Kim

Regarding the numbers in [1], I've gone through the calculations and they are correct. This means we have neither a config generation issue nor a miscalculation of limits of any sort.
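
Concretely, reading only the figures in [1], the arithmetic is consistent:

  FILE_BUFFER_LIMIT       = 256Mi = 268435456 bytes
  TOTAL_LIMIT             = 268435456 x NUM_OUTPUTS (2) = 536870912
  TOTAL_BUFFER_SIZE_LIMIT = TOTAL_LIMIT / NUM_OUTPUTS = 268435456
  BUFFER_QUEUE_LIMIT      = TOTAL_BUFFER_SIZE_LIMIT / BUFFER_SIZE_LIMIT = 268435456 / 8388608 = 32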

I have repeatedly tried to reproduce this issue, but without any success. After a thorough analysis of the logs, I can say that the pile of chunk files from 2020-03-13 may be an effect of fluentd not being able to send data to the Elasticsearch store due to failures such as:

[warn]: temporarily failed to flush the buffer. next_retry=2020-03-12 08:38:28 +0900 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster

Do you still see chunk files piling up? Or do you only see chunk files that piled up between 2020-03-11 and 2020-03-13?

From what I can tell from source code analysis of fluentd 0.12, the behaviour of `block` essentially blocks the whole thread, which translates directly into a maximum number of files. The behaviour of `drop_oldest_chunk`, however, only pops one chunk, namely the oldest, from the buffer queue for each new incoming chunk, synchronising access to the queue. I cannot tell whether the latter is a bug in the fluentd runtime. The only trail I have for you is to check whether fluentd is unable to unlink old chunk files (see [2]); a quick check is sketched below.
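
For the unlink check, something like the following should be enough (a sketch; the pod name and paths are taken from the listings above and may differ in your environment):

  # count the buffer chunk files on the fluentd pod
  oc -n openshift-logging exec logging-fluentd-lhp5t -- sh -c 'ls /var/lib/fluentd/buffer-output-es-config.* | wc -l'

  # count the drop messages logged so far (use the log file instead if fluentd does not log to stdout)
  oc -n openshift-logging logs logging-fluentd-lhp5t | grep -c 'Dropping oldest chunk'

If the drop counter keeps increasing while the file count never goes down, chunks are being dropped from the queue without their files being unlinked.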

Can you check the nodes for any filesystem access errors?
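
For example (a sketch, assuming shell access to the node that runs the fluentd pod):

  # which filesystem backs the buffer directory and whether it is still mounted read-write
  findmnt -T /var/lib/fluentd

  # kernel messages pointing at I/O or filesystem errors
  dmesg -T | grep -iE 'i/o error|read-only|xfs|ext4'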

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1813759#c9
[2] https://github.com/varunkumta/fluentd/blob/2050bd793a28a0b8943fe82533aac6cefa027de6/lib/fluent/plugin/buf_file.rb#L61

Comment 12 Periklis Tsirakidis 2020-05-06 07:40:54 UTC
Based on the customer response in comment #11, I am closing this issue as not a bug.