Description of problem: rsyslog is currently dropping logs at lower logging rates than fluentd Version-Release number of selected component (if applicable): 4.2 How reproducible: 100% Steps to Reproduce: 1. Generate logs on one node 2. Verify ElasticSearch index contains expected number of logs Actual results: rsyslog starts dropping logs at a rate around 500/s (10 pods on one node producing 50/s each) fluentd is able to handle 2000/s or more Expected results: rsyslog should handle at least the same amount as fluentd Additional info: Related to bug 1722944
> rsyslog starts dropping logs at a rate around 500/s (10 pods on one node producing 50/s each) So rsyslog is still dropping logs? So https://bugzilla.redhat.com/show_bug.cgi?id=1722944 isn't fixed? I don't understand what you mean by "dropping logs"?
dropping logs = logs are not getting indexed into ElasticSearch
(In reply to Eric Matysek from comment #2) > dropping logs = logs are not getting indexed into ElasticSearch So how is this BZ different than https://bugzilla.redhat.com/show_bug.cgi?id=1722944?
It isn't, I was told I should open a new one though since that one is a lot to read through.
Per scrum, this bug is considered a blocker.
Hi Eric, man I ask you to run the same test with the following sets of rsyslog config change? It can be updated by "oc edit cm rsyslog -c rsyslog". case 1: increasing PersistStateInterval from 0 to 1000 --- /etc/rsyslog.d/60-mmk8s.conf 2019-08-03 22:41:43.416405303 -0700 +++ 60-mmk8s.conf 2019-08-23 13:30:44.492208637 -0700 @@ -8,8 +8,9 @@ file="/var/log/containers/*.log" tag="kubernetes" addmetadata="on" reopenOnTruncate="on" discardTruncatedMsg="on" msgDiscardingError="off" freshStartTail=`echo $RSYSLOG_FILE_READ_FROM_TAIL` + PersistStateInterval="1000" endmsg.regex="(^[^ ]+ (stdout|stderr) F )|(\\n\"}$)|(\\n\",\")") # " if ((strlen($!CONTAINER_NAME) > 0) and (strlen($!CONTAINER_ID_FULL) > 0)) or If we do not see any difference with the case 1, then can we try the polling mode? case 2: using polling mode --- /etc/rsyslog.d/10-viaq-modules.conf 2019-08-03 09:46:59.160801385 -0700 +++ 10-viaq-modules.conf 2019-08-23 13:42:22.098584866 -0700 @@ -1,6 +1,6 @@ # Read from file with inotify enabled -module(load="imfile" mode="inotify") +module(load="imfile" mode="polling" PollingInterval="1") # Read from journal module(load="imjournal" StateFile=`echo $RSYSLOG_IMJOURNAL_STATE` UsePid="both" RateLimit.Burst="1000000" RateLimit.Interval="10" PersistStateInterval="1000" The imfile doc recommends to use inotify mode, there was some cases similar issues solved by switching to polling mode, e.g., https://github.com/rsyslog/rsyslog/issues/1605.
Based on my testing today, neither config change looked like it made a significant impact of rsyslog performance
(In reply to Eric Matysek from comment #7) > Based on my testing today, neither config change looked like it made a > significant impact of rsyslog performance Thank you very much for your testing, Eric. I expected case 1 change could improve the rsyslog performance better and when a log rotation occurs there were no (or much less) backlog left which are likely the missing logs. Also, there was a hope that the inotify mode could be the cause of the problem. The inotify mode is rather new. And the combination of inotify and reading from symlink is even newer, i.e., it has not been used widely so far. But we are afraid this might fall into the Caveats/Known Bugs "symlink may not always be properly processed" in https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html. Continue investigating...
https://github.com/openshift/cluster-logging-operator/pull/236 doesn't fix the problem, but it helps - moving back to ASSIGNED
current status - rsyslog is still not working correctly I'm using https://github.com/richm/origin-aggregated-logging/commit/f8be57baa76f6b2ac969f0601d0d5cf0c217dc74 for performance testing in the CI environment There is definitely a problem with rsyslog correctly handling cri-o/kubelet log file rotation. I can see that rsyslog misses the last several records in the file. kubelet renames the file e.g. 0.log -> 0.log.timestamp - this means the file keeps the same inode and remains open. ls -ali /var/log/pods/perfbasic1/logging-load-driver total 216884 185544759 drwxr-xr-x. 3 root root 109 Sep 3 16:39 ../ 19466 -rw-r--r--. 1 root root 44766710 Sep 3 16:44 0.log.20190903-164321.gz 19471 -rw-r--r--. 1 root root 40145884 Sep 3 16:44 0.log.20190903-164404.gz 19472 -rw-r--r--. 1 root root 40778012 Sep 3 16:45 0.log.20190903-164447.gz 19464 -rw-------. 1 root root 62901287 Sep 3 16:45 0.log.20190903-164530 19462 drwxr-xr-x. 2 root root 144 Sep 3 16:45 ./ 19465 -rw-------. 1 root root 20338907 Sep 3 16:45 0.log rsyslog is notified immediately when the file rotates. rsyslog then tries to read from the file until EOF (until a read() system call returns 0 and reads 0 bytes). When this happens, it completely stops reading from the file. But if cri-o is logging at a high rate, it may continue writing to the file, even after kubelet has renamed it. The way fluentd works is that it continues reading from the file, _after_ it has been rotated, for up to rotate_wait seconds https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/lib/generate_throttle_configs/lib/generate_throttle_configs.rb#L71. We may need to add this functionality to rsyslog - it currently has no such setting https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html I've also tried setting reopenOnTruncate="off" (we set it to "on") but that just causes a lot of errors like {"@timestamp":"2019-09-03T16:47:40.973981+00:00", "host":"rsyslog-jx9rc", "level":"warning", "facility":"syslog", "syslog-tag":"rsyslogd-2175:", "source":"rsyslogd-2175", "procid":"-", "msgid":"-", "message":"imfile: internal error? inotify provided watch descriptor 106 which we could not find in our tables - ignored [v8.37.0-9.el7 try http:\/\/www.rsyslog.com\/e\/2175 ]"}
> I've also tried setting reopenOnTruncate="off" (we set it to "on") but that just causes a lot of errors like {"@timestamp":"2019-09-03T16:47:40.973981+00:00", "host":"rsyslog-jx9rc", "level":"warning", "facility":"syslog", "syslog-tag":"rsyslogd-2175:", "source":"rsyslogd-2175", "procid":"-", "msgid":"-", "message":"imfile: internal error? inotify provided watch descriptor 106 which we could not find in our tables - ignored [v8.37.0-9.el7 try http:\/\/www.rsyslog.com\/e\/2175 ]"} I get this error with reopenOnTruncate="on" as well - but it doesn't seem to completely stop rsyslog from sending logs to elasticsearch - I can see logs being added still.
Another strategy - perhaps fluentd doesn't use this because it doesn't use inotify - keep reading from the file until it is closed. man inotify: IN_CLOSE_WRITE (+) File opened for writing was closed. I think there is only one process actually writing to the file, so when we get IN_CLOSE_WRITE on the rotated file, we know it has been closed. Using inotifywait during the run seems to support this theory: /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ MOVED_FROM 0.log /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ MOVED_TO 0.log.20190903-181831 /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ MODIFY 0.log.20190903-181831 ... /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ MODIFY 0.log.20190903-181831 /var/log/containers/perfbasic1_perfbasic1_logging-load-driver-54786addf92ec1337a62f5ba98330b7fa5588509f884db9b78049994265cd5ed.log MODIFY /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ CLOSE_WRITE,CLOSE 0.log.20190903-181831 /var/log/containers/perfbasic1_perfbasic1_logging-load-driver-54786addf92ec1337a62f5ba98330b7fa5588509f884db9b78049994265cd5ed.log CLOSE_WRITE,CLOSE MOVED_FROM/MOVED_TO is the rename - this is rotating the file by renaming it - the inode is still the same. cri-o/conmon/kubelet continues writing to the file - MODIFY. Then the file is closed CLOSE_WRITE,CLOSE. /var/log/pods/perfbasic1_perfbasic1_944f6464-ce76-11e9-b9ee-0ac699f7f852/logging-load-driver/ CLOSE_NOWRITE,CLOSE 0.log.20190903-181525 I believe this is rsyslog closing the file - NOWRITE means the file was not opened for writing, which is probably rsyslog. I think we can use this inotify behavior to make rsyslog wait until it gets a CLOSE_WRITE,CLOSE on the file before it stops reading from it.
I think this bz is related to https://bugzilla.redhat.com/show_bug.cgi?id=1763746
Closing deferred as its uncertain if/when rsyslog may be be released as a collector
this is likely related to https://bugzilla.redhat.com/show_bug.cgi?id=1772672