Created attachment 1633469 [details]
fluentd-w6dk2 logs

Description of problem:
Only the first ~50k messages are being logged during logging tests. I'm logging from a single pod on one node at a rate of 100 msg/s for 120k messages, which is a very low rate based on previous testing.

Version-Release number of selected component (if applicable):
ocp: 4.3.0-0.nightly-2019-11-02-092336
elo: 4.2.1-201910221723
clo: 4.2.1-201910221723

How reproducible:
100%

Steps to Reproduce:
1. Log 60k+ messages.
2. Check the elasticsearch index to see the doc count.

Actual results:
50k +/- 5k messages in elasticsearch

Expected results:
120k messages pushed to elasticsearch

Additional info:
The log file on the node is not rotating at all:

sh-4.4# pwd
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
sh-4.4# ls -lh
total 132M
-rw-------. 1 root root 132M Nov 6 17:37 0.log.20191106-212320

Attached fluentd logs.
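For anyone reproducing this, a minimal sketch of step 2 (the openshift-logging namespace, the component=elasticsearch label, and the es_util helper inside the elasticsearch container are assumptions here; adjust to the actual deployment):

$ oc -n openshift-logging get pods -l component=elasticsearch
$ oc -n openshift-logging exec -c elasticsearch <elasticsearch-pod> -- es_util --query='_cat/indices?v' | grep logtest-fluentd-100-0

The docs.count column for the project index should reach 120k if no messages were dropped.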
Is this a regression? I thought this was working. When did you first notice the bug? What changed?

> The log file on the node is not rotating at all:
> sh-4.4# pwd
> /var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
> sh-4.4# ls -lh
> total 132M
> -rw-------. 1 root root 132M Nov 6 17:37 0.log.20191106-212320

Where is the file 0.log? Did you check the inode numbers (ls -i)?
This was working great on 4.2; this is my first time testing it on 4.3, so this is the first I've noticed any issues. I expected no issues since I'm not even testing 4.3 operators, just 4.2 operators on a 4.3 cluster.

My first instinct is that something is different with how OCP 4.3 handles these log files. Log files should only get as big as containerLogMaxSize, which is definitely less than 132MB.

0.log.20191106-212320 was the only file in that directory when I created this bz. It looks like the file keeps being renamed back and forth between 0.log and 0.log.$datetime. While checking just now, it was renamed from 0.log to 0.log.20191106-223000 and then back to 0.log:

sh-4.4# pwd
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest
sh-4.4# ls
0.log

It contains all 120k messages:

sh-4.4# wc -l 0.log
120085 0.log

inode search:

sh-4.4# ls -i 0.log.20191106-223000
26557368 0.log.20191106-223000
sh-4.4# find /var/log/ -inum 26557368
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/0.log.20191106-223000

and after getting renamed again:

sh-4.4# find /var/log/ -inum 26557368
/var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/0.log.20191106-223411

Also for documentation purposes, here is the related index with 53k/120k messages:

health status index                                                                         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.logtest-fluentd-100-0.af58ed21-71f3-44dd-90e5-f29c02ef49fe.2019.11.06 67ut_NY-QfybYKSDDr8TwA   3   1      53050            0     74.4mb         37.1mb
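For context on the containerLogMaxSize limit mentioned above: it is a kubelet setting. A hedged sketch of how it can be tuned on OCP 4.x via a KubeletConfig CR (the name, label, and values below are illustrative assumptions, not what was configured for this test):

$ oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-log-rotation
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: log-rotation     # hypothetical label added to the worker MachineConfigPool
  kubeletConfig:
    containerLogMaxSize: 50Mi          # rotate each container log once it reaches 50 MiB
    containerLogMaxFiles: 5            # keep at most 5 rotated files per container
EOF

Whatever the effective value on this cluster is, it is well under the 132MB file observed above, so rotation is clearly not honoring it.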
> This was working great on 4.2, this is my first time testing it on 4.3 so this is the first I've noticed any issues.
> I expected no issues since I'm not even testing 4.3 operators, just 4.2 operators on a 4.3 cluster.

So you are running the 4.2 logging operators and images? Then the only difference is possibly the kubelet and cri-o versions?

The way I've observed kubelet/cri-o doing file rotation is this:

write to 0.log                    - file has inode N
rename 0.log to 0.log.TS          - file has inode N
write to 0.log                    - file has inode M
compress 0.log.TS to 0.log.TS.gz  - file has inode P
rename 0.log to 0.log.TS          - file has inode M
write to 0.log                    - file has inode N - yes, it will recycle the inode

All of this happens very, very fast - you would have to use a tool like inotifywait to see all of the activity.

If cri-o/kubelet in 4.3 is doing file rotation differently than in 4.2, that is likely the cause of the fluentd problems.
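A hedged sketch of watching that rename/compress activity with inotifywait (assuming inotify-tools is available on the node or in a debug/toolbox container; the path is the pod log directory from the earlier comments):

sh-4.4# inotifywait -m -e create,modify,moved_from,moved_to,delete \
    /var/log/pods/logtest-fluentd-100-0_centos-logtest-tknvs_c86aae60-8297-4451-8d1f-304da0422199/centos-logtest/

Each rotation step should show up as a MOVED_FROM/MOVED_TO pair on 0.log, which makes it straightforward to compare the 4.2 and 4.3 behavior.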
Please continue to investigate and provide feedback
Marking this as a TestBlocker as it is currently blocking fluentd performance testing for the OpenShift EFK Logging stack. See bug 177107 for a related issue.
I have identified the root cause. This is caused by a regression in conmon 2.0.2. It has now been fixed upstream in conmon 2.0.3, which is not yet packaged for RHCOS.

Jindrich, can you PTAL?
(In reply to Peter Hunt from comment #11)
> I have identified the root cause. This is caused by a regression in conmon 2.0.2.
> It has now been fixed upstream in conmon 2.0.3, which is not yet packaged for RHCOS.
>
> Jindrich, can you PTAL?

Is there an upstream conmon issue or PR?
PR here https://github.com/containers/conmon/pull/82
This can't be ON_QA for OCP until the conmon build in comment 14 lands in an RHCOS image that ships with an OCP 4.3 build.
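Once a candidate build is available, a quick way to confirm which conmon an RHCOS node is actually running (the node name is a placeholder):

$ oc debug node/<node-name> -- chroot /host rpm -q conmon

Per the comments above, the reported package should correspond to conmon 2.0.3 or later before this moves forward.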
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062