Bug 1741663
| Summary: | Fluentd single pod logging performance regression in 4.x | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Matysek <ematysek> |
| Component: | Logging | Assignee: | Noriko Hosoi <nhosoi> |
| Status: | CLOSED ERRATA | QA Contact: | Eric Matysek <ematysek> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.2.0 | CC: | aos-bugs, mfiedler, mifiedle, nhosoi, rmeggins |
| Target Milestone: | --- | ||
| Target Release: | 4.2.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:36:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1765646 | ||
Description
Eric Matysek
2019-08-15 18:26:17 UTC
Is this the same problem we are seeing with rsyslog? When was the last time we did a performance test of fluentd? Is this a regression from that test?

> It looks like Fluentd isn't picking up the initial logs before they get gzipped (since in both cases it is the first 10-20k messages that are missed, then none missing after that).

Before they get gzipped? I don't know what you mean.

> We are again seeing that the missing messages come in multiples of 10,000, which is equivalent to the out-of-the-box max pod log size of 10MB (10,000 1KB messages).

So is it possible that in order to fix this problem, we need to adjust the crio max log size and log rotation parameters?

(In reply to Rich Megginson from comment #1)
> Is this the same problem we are seeing with rsyslog?

It is very similar; the key difference is that the missing messages are always at the beginning for fluentd, while rsyslog has random chunks of missing logs.

> When was the last time we did a performance test of fluentd? Is this a regression from that test?

The last performance test of fluentd was for 3.11 if I'm not mistaken, likely before I joined, so Mike F. may have to chime in there.

> before they get gzipped? I don't know what you mean.

Sorry, maybe my statement was not correct. I meant when the logs are rotated as the log file hits the max pod log size.

> So is it possible that in order to fix this problem, we need to adjust the crio max log size and log rotation parameters?

I suspect that could fix it; we did see that help with rsyslog, and maybe the answer is that the out-of-the-box max pod log size should be bumped up. I'll run a few tests with an increased max pod log size and see if I get different results.

I was able to reproduce this for 4.1.x fluentd with logging at 1000 mps in a single namespace (i.e. a single pod log growing and rolling/wrapping very fast). I suspect it exists for any release on CRI-O with a very high message rate in a single log. Eric notes that if the kubelet containerLogMaxSize is increased to 50MB (default 10MB) the problem does not occur. This leads me to 3 hypotheses we need to try to prove or disprove:

1. Eric is consistently seeing the initial set of messages as the missing ones in the index. If this is true, it implies fluentd is not picking up that there is a new pod log to watch until after it has wrapped the first time.
2. If the problem is proven to also occur for middle chunks of messages, it could be a related issue that a full wrap of the pod log occurs in between fluentd checks.
3. CRI-O is messing up the symlinks in /var/log/containers and fluentd is not being shown some set of the pod logs. I'm skeptical on this one, but it's possible.

@richm - For experimental purposes, is there any way to tune the frequency at which fluentd checks a log? At a size of 10MB and at 1000 1KB messages/sec, these logs are wrapping every 10 sec. At 2K messages/sec it would be every 5 sec.

Opened bug 1741955 for an OOTB default pod log size > 10MB.

Can you confirm that we have the same problem in both fluentd and rsyslog? Is https://bugzilla.redhat.com/show_bug.cgi?id=1739558 a duplicate of this one?

For fluentd - this looks like it is controlled by the https://docs.fluentd.org/input/tail#refresh_interval parameter - the default is 60 seconds, which is probably too long. We would need to set it here - https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/lib/generate_throttle_configs/lib/generate_throttle_configs.rb - and probably allow changing it via an env var.
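To make the timing concern concrete, here is a small back-of-the-envelope sketch using only the figures quoted in this thread (10MB default pod log size, 1KB messages, 1000-2000 messages/sec); the numbers are illustrative, not measured.

```bash
#!/bin/bash
# Illustrative arithmetic only, using the figures quoted above:
# a 10MB pod log filled with 1KB messages wraps long before a 60s refresh_interval fires.
log_max_kb=$((10 * 1024))            # default pod log size, expressed in KB
for rate in 1000 2000; do            # message rates (1KB messages/sec) used in the tests
    wrap_secs=$((log_max_kb / rate)) # seconds until the pod log rotates/wraps
    echo "at ${rate} msg/s the log wraps every ~${wrap_secs}s (fluentd in_tail refresh_interval default: 60s)"
done
```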
For rsyslog - not sure how to fix this - there is the https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html#pollinginterval parameter, but that only applies to poll mode, and that mode is deprecated in favor of inotify mode. I'll have to dig into the code to see if there is a way to improve this.

> Can you confirm that we have the same problem in both fluentd and rsyslog? Is https://bugzilla.redhat.com/show_bug.cgi?id=1739558 a duplicate of this one?
My initial reaction is no, since the rsyslog missing records occur with a number of namespaces where fluentd does not have any loss, but it is possible if rsyslog is even slower to read. We need to experiment with increasing the pod log size and see if rsyslog does better. @ematysek have we already tried rsyslog with increased pod log sizes?
In any case, we can prove/disprove some of this if we can tune down the refresh_interval, even for experimental purposes
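For the pod-log-size side of the experiment (the other knob discussed above), one way to raise the kubelet's containerLogMaxSize on OpenShift 4.x is a KubeletConfig CR; this is only a sketch, and the pool label and the 50Mi value are assumptions taken from Eric's observation rather than a recommended default.

```bash
# Sketch only: bump the per-container log size from the 10Mi default to 50Mi,
# matching the size at which Eric no longer saw missing messages.
# The machineConfigPoolSelector label is an assumption for illustration; the
# target MachineConfigPool must carry the same label for this to take effect.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: larger-container-logs
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: larger-container-logs
  kubeletConfig:
    containerLogMaxSize: 50Mi
EOF
```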
(In reply to Rich Megginson from comment #5)
> For rsyslog - not sure how to fix this - there is the https://www.rsyslog.com/doc/v8-stable/configuration/modules/imfile.html#pollinginterval parameter, but that only applies to poll mode, and that mode is deprecated in favor of inotify mode. I'll have to dig into the code to see if there is a way to improve this.

Rsyslog imfile watches both the symlink and the real log file it links to:
https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L530-L531
And as soon as the real file is rotated, it should be notified and processed here:
https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2393

It might be a silly question... The imfile plugin opens and reads from the symlink /var/log/containers/*.log; if the file is rotated before it finishes reading all the logs in the file, are the unread logs dropped?

I'm also curious about the reopenOnTruncate="on" set in the imfile input (60-mmk8s.conf). I went through the imfile.c code, but I could not find the implementation of strm.SetbReopenOnTruncate (or strmSetbReopenOnTruncate, https://github.com/rsyslog/rsyslog/blob/master/runtime/stream.c#L2550). I thought we could try without setting reopenOnTruncate="on", but if it has no impact, it's not worth it...

input(type="imfile" ..... tag="kubernetes" addmetadata="on" reopenOnTruncate="on" .....)

Here is the code that does the inotify processing: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2333
It first does an initial scan of the given wildcard files/directories, including following symlinks, and adds those fds to ino_fd: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2350
Then it does a tight loop of read(ino_fd) and processes the fds hit. Note that there is no polling in this loop (runModConf->haveReadTimeouts is 0 by default, and we don't set it), so files/directories should be processed as fast as possible.

The only thing I can think of is that perhaps there should be a separate thread for handling new/deleted files/dirs and a separate thread for actually reading from the files. That is, maybe the function spends so much time reading that it misses file adds/deletes from watched directories. There is debugging: https://github.com/rsyslog/rsyslog/blob/master/plugins/imfile/imfile.c#L2392 - maybe there is a way to enable debugging for just this code. Or maybe there is some other way to prioritize handling new/deleted files so that rsyslog doesn't miss any. This sort of reminds me of the problem with directory server with a thundering herd of new connections - finding some way to handle adding new connections as fast as possible without starving the threads doing operation reading and processing . . .
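As one possible way to get the debugging Rich mentions without patching the code, rsyslog's documented debug environment variables can dump what imfile's inotify handling is doing; the file paths and the grep filter below are only suggestions.

```bash
# Suggestion only: rsyslog's standard debug switches (from the rsyslog
# troubleshooting docs) capture internal activity, including imfile/inotify events.
export RSYSLOG_DEBUG="Debug NoStdOut"            # enable debug output, keep stdout quiet
export RSYSLOG_DEBUGLOG="/tmp/rsyslog-debug.log" # arbitrary example path
rsyslogd -n -f /etc/rsyslog.conf &               # -n keeps rsyslogd in the foreground; backgrounded so the grep can run in this shell
# ...drive the 1000-2000 msg/s workload, then look at what imfile saw around rotation:
grep -i imfile /tmp/rsyslog-debug.log | tail -n 50
```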
(In reply to Mike Fiedler from comment #3)
> 2. If the problem is proven to also occur for middle chunks of messages, it could be a related issue that a full wrap of the pod log occurs in between fluentd checks

I'm curious about this item. Has it been proven or ruled out? And if the original problem ("It looks like Fluentd isn't picking up the initial logs") occurs only on the first log file(s) from logtest-fluentd, is it considered a bug or not?

Moving back to ASSIGNED - the PR openshift/origin-aggregated-logging/pull/1732 was not intended to fix the bug, only to provide QE/dev with a performance knob.

I was searching for similar posts on the web and ran into this stackoverflow question [1]. This is what I tried to describe for rsyslog in #c8. It seems some people have experienced skipped logs with fluentd, too. The post proposes tuning "rotate_wait" (default 5 sec) [2]. By increasing it, even after the log file is rotated, fluentd keeps reading the old file for that many seconds before switching to the newly created one. (The problem is that there is no perfect value that works for every case, though...)

[1] - https://stackoverflow.com/questions/51097947/fluentd-logging-on-kubernetes-skips-logs-on-log-rotation
[2] - https://docs.fluentd.org/input/tail#rotate_wait

And I'd think there's no such parameter for rsyslog. "Bug 1739558 - rsyslog underperforming compared to fluentd" might be related: when a log file is rotated while receiving messages at a very high rate, fluentd at least waits 5 sec, but rsyslog goes to the new file immediately, so rsyslog could lose more messages than fluentd? I guess we'd like to know how far behind fluentd/rsyslog are in reading the log file when it is rotated. Also, there are 3 players here - CRI-O (log writer), kubelet (log rotator), and fluentd/rsyslog (log reader). I wonder whether it is even guaranteed that, when a log file is rotated, CRI-O has already stopped writing to the to-be-rotated file and switched to the new one.

2 questions...
1) Currently, rotate_wait is not configurable for our fluentd using an env var. Do we want to add it?
2) Do we want to add such a parameter to rsyslog imfile? (Not sure how easy it'd be, but when a log rotation is notified, we may be able to watch the log file size and, once the delta becomes 0, happily switch to the new file?) But for now, adding rotate_wait may be good enough to narrow down the problem. If it turns out not to help, we don't have to worry about it. :)

(In reply to Noriko Hosoi from comment #13)
> 1) currently, rotate_wait is not configurable for our fluentd using an env var. Do we want to add it?

Yes.

> 2) do we want to add such a parameter to rsyslog imfile?

Take a look at the rsyslog imfile code and see if you can figure out how/where it handles rotation, and see if you can figure out where we might add functionality like rotate_wait.

(In reply to Rich Megginson from comment #12)
> Moving back to ASSIGNED - the PR openshift/origin-aggregated-logging/pull/1732 was not intended to fix the bug, only to provide QE/dev with a performance knob.

Same here with openshift/origin-aggregated-logging/pull/1733.

Could you please rerun the test after setting the environment variable CONTAINER_LOGS_ROTATE_WAIT with "oc set env ds/fluentd CONTAINER_LOGS_ROTATE_WAIT=10" (default is 5 sec) and check whether the log drop rate changes? Note: if this means the missing logs occur only at startup, reconfiguring rotate_wait may not have any impact.
> Missing log line(s): 1-12375 (12375)

Setting the env var CONTAINER_LOGS_REFRESH_INTERVAL=5 from PR 1732 looks like it does fix the missing initial logs issue. I had 5 runs with fluentd keeping up with 2000 logs per second:

green open project.logtest-fluentd-3-0.c38c1e6c-c382-11e9-a356-0678ec3fd994.2019.08.20 lcdP4jCuTGqRzFnvUEoiqQ 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-4-0.61e28824-c386-11e9-9929-02926b7b81e2.2019.08.20 I8Kdv8OmRVC6ZZbEOTY4QQ 3 1 2400000 0 3.3gb 1.6gb
green open project.logtest-fluentd-5-0.ef7304e2-c38c-11e9-9929-02926b7b81e2.2019.08.20 aXnAm5OIQw-0gT_UFDjB4A 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-6-0.e5a9d5b1-c390-11e9-9929-02926b7b81e2.2019.08.20 E81LdajqT3O698T7dvBNGw 3 1 2400000 0 3.2gb 1.6gb
green open project.logtest-fluentd-7-0.4ad5759f-c394-11e9-9929-02926b7b81e2.2019.08.20 G18vY_uARG2mWXx7NhtAzQ 3 1 2400000 0 3.2gb 1.6gb

All 2.4M log lines made it.

(In reply to Eric Matysek from comment #16)
> Setting the env var CONTAINER_LOGS_REFRESH_INTERVAL=5 from PR 1732 looks like it does fix the missing initial logs issue.
> All 2.4M log lines made it.

Should we set the default value to 5 then? Do you plan on doing a test with higher rates than 2000 records per second?

(In reply to Rich Megginson from comment #17)
> Should we set the default value to 5 then? Do you plan on doing a test with higher rates than 2000 records per second?

The above tests are from a single namespace; today I'll verify whether setting the env var has any effect on logging from multiple namespaces. I believe 2000 records per second is the max we plan to test.

Did not observe any regression with multiple logging namespaces when lowering CONTAINER_LOGS_REFRESH_INTERVAL to 5.
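For anyone repeating the verification, the index/doc-count listings above look like _cat/indices output; a quick way to re-check (assuming the openshift-logging namespace, the component=elasticsearch pod label, and the es_util helper shipped in the OpenShift logging Elasticsearch image) would be:

```bash
# Assumptions: openshift-logging namespace, component=elasticsearch pod label,
# and the es_util wrapper inside the Elasticsearch container.
es_pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
oc -n openshift-logging exec -c elasticsearch "${es_pod#pod/}" -- \
  es_util --query='_cat/indices/project.logtest-fluentd-*?v&h=index,docs.count'
# Each logtest index should report the full expected count (2,400,000 in the runs above).
```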
If this has no side effects then it looks like 5 is a good default value if we want to support 2000+ messages per second out of the box.

Confirmed bug is no longer present with default settings now. Marking verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922