Bug 1769585 - Fluentd log collector dropping logs
Summary: Fluentd log collector dropping logs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Lokesh Mandvekar
QA Contact: Eric Matysek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-06 21:54 UTC by Eric Matysek
Modified: 2020-01-23 11:11 UTC
CC List: 10 users

Fixed In Version: conmon-2.0.3-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:11:16 UTC
Target Upstream Version:
Embargoed:


Attachments
fluentd-w6dk2 logs (34.61 KB, application/gzip), 2019-11-06 21:54 UTC, Eric Matysek


Links
Red Hat Product Errata RHBA-2020:0062, last updated 2020-01-23 11:11:31 UTC

Description Eric Matysek 2019-11-06 21:54:12 UTC
Created attachment 1633469 [details]
fluentd-w6dk2 logs

Description of problem:
Only the first ~50k messages are being logged during logging tests.
I'm logging from a single pod on one node at a rate of 100 msg/s, 120k messages total, which is a very low rate based on previous testing.
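
(For anyone reproducing this without the centos-logtest image, a rough stand-in for the load generator; the pod name, image, and rate handling below are illustrative, not the actual test tool:)

# Hypothetical substitute for the logtest pod: ~100 msg/s, 120k messages total
oc run logtest-fluentd-100-0 --image=registry.access.redhat.com/ubi8/ubi-minimal --restart=Never -- \
  /bin/sh -c 'i=0; while [ "$i" -lt 120000 ]; do echo "logtest message $i"; i=$((i+1)); sleep 0.01; done'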


Version-Release number of selected component (if applicable):
ocp: 4.3.0-0.nightly-2019-11-02-092336
elo: 4.2.1-201910221723
clo: 4.2.1-201910221723

How reproducible:
100%

Steps to Reproduce:
1. log 60k+ messages
2. check elasticsearch index to see doc count


Actual results:
50k +/- 5k messages in elasticsearch

Expected results:
120k messages pushed to elasticsearch

Additional info:
The log file on the node is not rotating at all:
sh-4.4# pwd
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
sh-4.4# ls -lh
total 132M
-rw-------. 1 root root 132M Nov  6 17:37 0.log.20191106-212320

Attached fluentd logs

Comment 1 Rich Megginson 2019-11-06 22:06:20 UTC
Is this a regression?  I thought this was working?  When did you first notice the bug?  What changed?

> The log file on the node is not rotating at all:
> sh-4.4# pwd
> /var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
> sh-4.4# ls -lh
> total 132M
> -rw-------. 1 root root 132M Nov  6 17:37 0.log.20191106-212320

Where is the file 0.log?  Did you check the inode numbers? (ls -i)?

Comment 2 Eric Matysek 2019-11-06 22:40:38 UTC
This was working great on 4.2; this is my first time testing on 4.3, so this is the first I've noticed any issues.
I expected no issues since I'm not even testing the 4.3 operators, just the 4.2 operators on a 4.3 cluster. My first instinct is that something is different about how OCP 4.3 handles these log files. Log files should only grow as large as containerLogMaxSize, which is definitely less than 132MB.
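
(For reference, containerLogMaxSize is a kubelet setting; on OCP 4.x it is normally tuned through a KubeletConfig CR handled by the machine-config operator. A minimal sketch, with the pool label and sizes chosen only for illustration rather than taken from this cluster:)

oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-log-rotation
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: log-rotation   # assumes the worker MCP has been given this label
  kubeletConfig:
    containerLogMaxSize: 50Mi        # per-file cap before the kubelet asks cri-o to rotate
    containerLogMaxFiles: 5
EOF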

0.log.20191106-212320 was the only file in that directory when I created this bz; it looks like the file keeps being renamed back and forth between 0.log and 0.log.$datetime.
While checking just now, it was renamed from 0.log to 0.log.20191106-223000 and then back to 0.log:

sh-4.4# pwd
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
sh-4.4# ls
0.log

It contains all 120k messages:
sh-4.4# wc -l 0.log 
120085 0.log

inode search:
sh-4.4# ls -i 0.log.20191106-223000 
26557368 0.log.20191106-223000
sh-4.4# find /var/log/ -inum 26557368
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/0.log.20191106-223000

and after getting renamed again:
sh-4.4# find /var/log/ -inum 26557368
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/0.log.20191106-223411


Also, for documentation purposes, here is the related index holding 53k of the 120k messages:
health status index                                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.logtest-fluentd-100-0.af58ed21-71f3-44dd-90e5-f29c02ef49fe.2019.11.06 67ut_NY-QfybYKSDDr8TwA   3   1      53050            0     74.4mb         37.1mb
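
(The listing above is standard _cat/indices output; one way to pull it from a 4.2 EFK deployment is below. The label selector and certificate paths are assumptions about a typical install, not verified on this cluster:)

es_pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)
oc -n openshift-logging exec -c elasticsearch "$es_pod" -- \
  curl -s --cacert /etc/elasticsearch/secret/admin-ca \
          --cert /etc/elasticsearch/secret/admin-cert \
          --key /etc/elasticsearch/secret/admin-key \
          'https://localhost:9200/_cat/indices?v'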

Comment 3 Rich Megginson 2019-11-07 02:11:24 UTC
> This was working great on 4.2, this is my first time testing it on 4.3 so this is the first I've noticed any issues.
> I expected no issues since I'm not even testing 4.3 operators, just 4.2 operators on a 4.3 cluster.

So you are running the 4.2 logging operators and images?  So the only difference is possibly the kubelet and cri-o versions?

From what I've observed, kubelet/cri-o does file rotation like this:

write to 0.log - file has inode N
rename 0.log to 0.log.TS - file has inode N
write to 0.log - file has inode M
compress 0.log.TS to 0.log.TS.gz - file has inode P
rename 0.log to 0.log.TS - file has inode M
write to 0.log - file has inode N - yes, it will recycle the inode

All of this happens very, very fast; you would have to use a tool like inotifywait to see all of the activity.

If cri-o/kubelet in 4.3 is doing file rotation differently than it did in 4.2, that is likely the cause of the fluentd problems.
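
(A sketch of how to watch that activity on the node; inotifywait comes from inotify-tools, which is typically not on the host image, so it may need to run from a debug/toolbox container. The directory is the one from comment 0:)

# Monitor create/modify/rename/delete events on the pod's log directory
inotifywait -m -e create,modify,moved_from,moved_to,delete \
  /var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/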

Comment 6 Jeff Cantrill 2019-11-15 21:46:07 UTC
Please continue to investigate and provide feedback.

Comment 10 Mike Fiedler 2019-11-21 13:17:10 UTC
Marking this a TestBlocker as it is currently blocking fluentd performance testing for the OpenShift EFK logging stack. See bug 177107 for a related issue.

Comment 11 Peter Hunt 2019-11-21 15:38:40 UTC
I have identified the root cause. This is caused by a regression in conmon 2.0.2. It has been fixed upstream in conmon 2.0.3, which is not yet packaged for RHCOS.

Jindrich, can you PTAL?

Comment 12 Rich Megginson 2019-11-21 15:43:12 UTC
(In reply to Peter Hunt from comment #11)
> I have identified the root cause. This is caused by a regression in conmon
> 2.0.2. It has been fixed upstream in conmon 2.0.3, which is not yet
> packaged for RHCOS.
> 
> Jindrich, can you PTAL?

Is there an upstream conmon issue or pr?

Comment 13 Peter Hunt 2019-11-21 16:00:41 UTC
PR here https://github.com/containers/conmon/pull/82

Comment 16 Mike Fiedler 2019-11-22 15:51:46 UTC
This can't be ON_QA for OCP until the conmon build in comment 14 is included in an RHCOS image that ships in an OCP 4.3 build.
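
(One way to check which conmon an RHCOS node actually carries once a candidate 4.3 build lands; the node name is a placeholder:)

oc debug node/<node-name> -- chroot /host rpm -q conmon
# should report conmon-2.0.3-1.el8 or newer, per the Fixed In Version above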

Comment 19 errata-xmlrpc 2020-01-23 11:11:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

