Bug 1741663
| Summary: | Fluentd single pod logging performance regression in 4.x | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Matysek <ematysek> |
| Component: | Logging | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Eric Matysek <ematysek> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.2.0 | CC: | aos-bugs, mfiedler, mifiedle, nhosoi, rmeggins |
| Target Milestone: | --- | ||
| Target Release: | 4.2.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:36:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1765646 | ||
Description
Eric Matysek
2019-08-15 18:26:17 UTC
Is this the same problem we are seeing with rsyslog? When was the last time we did a performance test of fluentd? Is this a regression from that test?

> It looks like Fluentd isn't picking up the initial logs before they get gzipped (since in both cases it is the first 10-20k messages that are missed, then none missing after that).

Before they get gzipped? I don't know what you mean.

> We are again seeing that the missing messages come in multiples of 10,000, which is equivalent to the out-of-the-box max pod log size of 10MB (10,000 1KB messages).

So is it possible that in order to fix this problem, we need to adjust the crio max log size and log rotation parameters?

(In reply to Rich Megginson from comment #1)
> Is this the same problem we are seeing with rsyslog?

It is very similar; the key difference is that the missing messages are always at the beginning for fluentd, while rsyslog has random chunks of missing logs.

> When was the last time we did a performance test of fluentd? Is this a regression from that test?

The last performance test of fluentd was for 3.11 if I'm not mistaken, likely before I joined, so Mike F. may have to chime in there.

> before they get gzipped? I don't know what you mean.

Sorry, maybe my statement was not correct. I meant when the logs are rotated as the log file hits the max pod log size.

> So is it possible that in order to fix this problem, we need to adjust the crio max log size and log rotation parameters?

I suspect that could fix it; we did see that help with rsyslog, and maybe the answer is that the out-of-the-box max pod log size should be bumped up. I'll run a few tests with an increased max pod log size and see if I get different results.

I was able to reproduce this for 4.1.x fluentd with logging at 1000 mps in a single namespace (i.e. a single pod log growing and rolling/wrapping very fast). I suspect it exists for any release on CRI-O with a very high message rate in a single log. Eric notes that if the kubelet containerLogMaxSize is increased to 50MB (default 10MB) the problem does not occur. This leads me to 3 hypotheses we need to try to prove or disprove:

1. Eric is consistently seeing the initial set of messages as the missing ones in the index. If this is true, it implies fluentd is not picking up that there is a new pod log to watch until after it has wrapped the first time.
2. If the problem is proven to also occur for middle chunks of messages, it could be a related issue that a full wrap of the pod log occurs in between fluentd checks.
3. CRI-O is messing up the symlinks in /var/log/containers and fluentd is not being shown some set of the pod logs. I'm skeptical on this one, but it's possible.

@richm - For experimental purposes, is there any way to tune the frequency at which fluentd checks a log? At a size of 10MB and at 1000 1KB messages/sec, these logs are wrapping every 10 sec. At 2K messages/sec it would be every 5 sec.

Opened bug 1741955 for an OOTB default pod log size > 10MB.

Can you confirm that we have the same problem in both fluentd and rsyslog? Is https://bugzilla.redhat.com/show_bug.cgi?id=1739558 a duplicate of this one?

For fluentd - this looks like it is controlled by the https://docs.fluentd.org/input/tail#refresh_interval parameter - the default is 60 seconds, which is probably too long. We would need to set it here - https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/lib/generate_throttle_configs/lib/generate_throttle_configs.rb - and probably allow changing it via an env var.
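To make the timing concern concrete, here is a small back-of-the-envelope sketch using only the figures quoted in this thread (10MB default pod log size, 1KB messages, 1000-2000 messages/sec); the numbers are illustrative, not measured.

```bash
#!/bin/bash
# Illustrative arithmetic only, using the figures quoted above:
# a 10MB pod log filled with 1KB messages wraps long before a 60s refresh_interval fires.
log_max_kb=$((10 * 1024))            # default pod log size, expressed in KB
for rate in 1000 2000; do            # message rates (1KB messages/sec) used in the tests
    wrap_secs=$((log_max_kb / rate)) # seconds until the pod log rotates/wraps
    echo "at ${rate} msg/s the log wraps every ~${wrap_secs}s (fluentd in_tail refresh_interval default: 60s)"
done
```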
For rsyslog - not sure how to fix this - there is the https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html#pollinginterval parameter, but that only applies to poll mode, and that mode is deprecated in favor of inotify mode. I'll have to dig into the code to see if there is a way to improve this.

> Can you confirm that we have the same problem in both fluentd and rsyslog? Is https://bugzilla.redhat.com/show_bug.cgi?id=1739558 a duplicate of this one?
My initial reaction is no, since the rsyslog missing records occur with a number of namespaces where fluentd does not have any loss, but it is possible if rsyslog is even slower to read. We need to experiment with increasing the pod log size and see if rsyslog does better. @ematysek have we already tried rsyslog with increased pod log sizes?
In any case, we can prove/disprove some of this if we can tune down the refresh_interval, even for experimental purposes
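For the pod-log-size side of the experiment (the other knob discussed above), one way to raise the kubelet's containerLogMaxSize on OpenShift 4.x is a KubeletConfig CR; this is only a sketch, and the pool label and the 50Mi value are assumptions taken from Eric's observation rather than a recommended default.

```bash
# Sketch only: bump the per-container log size from the 10Mi default to 50Mi,
# matching the size at which Eric no longer saw missing messages.
# The machineConfigPoolSelector label is an assumption for illustration; the
# target MachineConfigPool must carry the same label for this to take effect.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: larger-container-logs
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: larger-container-logs
  kubeletConfig:
    containerLogMaxSize: 50Mi
EOF
```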
(In reply to Rich Megginson from comment #5)
> For rsyslog - not sure how to fix this - there is the https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html#pollinginterval parameter, but that only applies to poll mode, and that mode is deprecated in favor of inotify mode. I'll have to dig into the code to see if there is a way to improve this.

Rsyslog imfile watches both the symlink and the real log file it links to:
https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L530-L531
And as soon as the real file is rotated, it should be notified and processed here:
https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2393

It might be a silly question... The imfile plugin opens and reads from the symlink /var/log/containers/*.log; if the file is rotated before it finishes reading all the logs in the file, are the unread logs dropped?

I'm also curious about the reopenOnTruncate="on" set in the imfile input (60-mmk8s.conf). I went through the imfile.c code, but I could not find the implementation of strm.SetbReopenOnTruncate (or strmSetbReopenOnTruncate, https://github.com/rsyslog/rsyslog/blob/master/runtime/stream.c#L2550). I thought we could try without setting reopenOnTruncate="on", but if it has no impact, it's not worth it...

input(type="imfile" ..... tag="kubernetes" addmetadata="on" reopenOnTruncate="on" .....)

Here is the code that does the inotify processing: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2333
It first does an initial scan of the given wildcard files/directories, including following symlinks, and adds those fds to ino_fd: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2350
Then it does a tight loop of read(ino_fd) and processes the fds hit. Note that there is no polling in this loop (runModConf->haveReadTimeouts is 0 by default, and we don't set it), so files/directories should be processed as fast as possible.

The only thing I can think of is that perhaps there should be a separate thread for handling new/deleted files/dirs and a separate thread for actually reading from the files. That is, maybe the function spends so much time reading that it misses file adds/deletes from watched directories. There is debugging: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2392 - maybe there is a way to enable debugging for just this code. Or maybe there is some other way to prioritize handling new/deleted files so that rsyslog doesn't miss any. This sort of reminds me of the problem with directory server with a thundering herd of new connections - finding some way to handle adding new connections as fast as possible without starving the threads doing operation reading and processing . . .
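As one possible way to get the debugging Rich mentions without patching the code, rsyslog's documented debug environment variables can dump what imfile's inotify handling is doing; the file paths and the grep filter below are only suggestions.

```bash
# Suggestion only: rsyslog's standard debug switches (from the rsyslog
# troubleshooting docs) capture internal activity, including imfile/inotify events.
export RSYSLOG_DEBUG="Debug NoStdOut"            # enable debug output, keep stdout quiet
export RSYSLOG_DEBUGLOG="/tmp/rsyslog-debug.log" # arbitrary example path
rsyslogd -n -f /etc/rsyslog.conf &               # -n keeps rsyslogd in the foreground; backgrounded so the grep can run in this shell
# ...drive the 1000-2000 msg/s workload, then look at what imfile saw around rotation:
grep -i imfile /tmp/rsyslog-debug.log | tail -n 50
```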
(In reply to Mike Fiedler from comment #3)
> 2. If the problem is proven to also occur for middle chunks of messages, it could be a related issue that a full wrap of the pod log occurs in between fluentd checks

I'm curious about this item. Has it been proven or ruled out? And if the original problem ("It looks like Fluentd isn't picking up the initial logs") occurs only on the first log file(s) from logtest-fluentd, is it considered a bug or not?

Moving back to ASSIGNED - the PR openshift/origin-aggregated-logging/pull/1732 was not intended to fix the bug, only to provide QE/dev with a performance knob.

I was searching for similar posts on the web and ran into this stackoverflow question [1]. This is what I tried to describe for rsyslog in #c8. It seems some people have experienced skipped logs with fluentd, too. The post proposes tuning "rotate_wait" (default 5 sec) [2]. By increasing it, even after the log file is rotated, fluentd keeps reading the old file for that many seconds before switching to the newly created one. (The problem is that there is no perfect value that works for every case, though...)

[1] - https://stackoverflow.com/questions/51097947/fluentd-logging-on-kubernetes-skips-logs-on-log-rotation
[2] - https://docs.fluentd.org/input/tail#rotate_wait

And I'd think there's no such parameter for rsyslog. "Bug 1739558 - rsyslog underperforming compared to fluentd" might be related: when a log file is rotated while receiving messages at a very high rate, fluentd at least waits 5 sec, but rsyslog goes to the new file immediately, so rsyslog could lose more messages than fluentd? I guess we'd like to know how far behind fluentd/rsyslog are in reading the log file when it is rotated. Also, there are 3 players here - CRI-O (log writer), kubelet (log rotator), and fluentd/rsyslog (log reader). I wonder whether it is even guaranteed that, when a log file is rotated, CRI-O has already stopped writing to the to-be-rotated file and switched to the new one.

2 questions...
1) Currently, rotate_wait is not configurable for our fluentd using an env var. Do we want to add it?
2) Do we want to add such a parameter to rsyslog imfile? (Not sure how easy it'd be, but when a log rotation is notified, we may be able to watch the log file size and, once the delta becomes 0, happily switch to the new file?) But for now, adding rotate_wait may be good enough to narrow down the problem. If it turns out not to help, we don't have to worry about it. :)

(In reply to Noriko Hosoi from comment #13)
> 1) currently, rotate_wait is not configurable for our fluentd using an env var. Do we want to add it?

Yes.

> 2) do we want to add such a parameter to rsyslog imfile?

Take a look at the rsyslog imfile code and see if you can figure out how/where it handles rotation, and see if you can figure out where we might add functionality like rotate_wait.

(In reply to Rich Megginson from comment #12)
> Moving back to ASSIGNED - the PR openshift/origin-aggregated-logging/pull/1732 was not intended to fix the bug, only to provide QE/dev with a performance knob.

Same here with openshift/origin-aggregated-logging/pull/1733.

Could you please rerun the test after setting the environment variable CONTAINER_LOGS_ROTATE_WAIT with "oc set env ds/fluentd CONTAINER_LOGS_ROTATE_WAIT=10" (default is 5 sec) and check whether the log drop rate changes? Note: if this means the missing logs occur only at startup, reconfiguring rotate_wait may not have any impact.
> Missing log line(s): 1-12375 (12375)

Setting the env var CONTAINER_LOGS_REFRESH_INTERVAL=5 from PR 1732 looks like it does fix the missing initial logs issue. I had 5 runs with fluentd keeping up with 2000 logs per second:

green open project.logtest-fluentd-3-0.c38c1e6c-c382-11e9-a356-0678ec3fd994.2019.08.20 lcdP4jCuTGqRzFnvUEoiqQ 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-4-0.61e28824-c386-11e9-9929-02926b7b81e2.2019.08.20 I8Kdv8OmRVC6ZZbEOTY4QQ 3 1 2400000 0 3.3gb 1.6gb
green open project.logtest-fluentd-5-0.ef7304e2-c38c-11e9-9929-02926b7b81e2.2019.08.20 aXnAm5OIQw-0gT_UFDjB4A 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-6-0.e5a9d5b1-c390-11e9-9929-02926b7b81e2.2019.08.20 E81LdajqT3O698T7dvBNGw 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-7-0.4ad5759f-c394-11e9-9929-02926b7b81e2.2019.08.20 G18vY_uARG2mWXx7NhtAzQ 3 1 2400000 0 3.2gb 1.6gb

All 2.4M log lines made it.

(In reply to Eric Matysek from comment #16)
> Setting the env var CONTAINER_LOGS_REFRESH_INTERVAL=5 from PR 1732 looks like it does fix the missing initial logs issue.
> All 2.4M log lines made it.

Should we set the default value to 5 then? Do you plan on doing a test with higher rates than 2000 records per second?

(In reply to Rich Megginson from comment #17)
> Should we set the default value to 5 then? Do you plan on doing a test with higher rates than 2000 records per second?

The above tests are from a single namespace; today I'll verify whether setting the env var has any effect on logging from multiple namespaces. I believe 2000 records per second is the max we plan to test.

Did not observe any regression with multiple logging namespaces when lowering CONTAINER_LOGS_REFRESH_INTERVAL to 5.
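For anyone repeating the verification, the index/doc-count listings above look like _cat/indices output; a quick way to re-check (assuming the openshift-logging namespace, the component=elasticsearch pod label, and the es_util helper shipped in the OpenShift logging Elasticsearch image) would be:

```bash
# Assumptions: openshift-logging namespace, component=elasticsearch pod label,
# and the es_util wrapper inside the Elasticsearch container.
es_pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
oc -n openshift-logging exec -c elasticsearch "${es_pod#pod/}" -- \
  es_util --query='_cat/indices/project.logtest-fluentd-*?v&h=index,docs.count'
# Each logtest index should report the full expected count (2,400,000 in the runs above).
```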
If this has no side effects then it looks like 5 is a good default value if we want to support 2000+ messages per second out of the box.

Confirmed bug is no longer present with default settings now. Marking verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922