Bug 1584665
Description
Mike Fiedler
2018-05-31 12:20:36 UTC
Created attachment 1446260 [details]
fluentd rss no growth during 3.9.2
There are some dropouts in the telemetry, but you can see steady rss at ~220MB.
Let me know if you want to try a specific earlier build (pre new fluentd + retry plugins?) or some other scenario.

Bulk queue stats for the run:

host          bulk.completed  bulk.rejected  bulk.queue  bulk.active  bulk.queueSize
172.22.0.134  465179          0              0           0            50
172.20.2.47   436505          0              0           0            50
172.21.2.72   465163          0              0           0            50

This is a non-mux configuration. The referenced bug 1502764 states that it was for a mux configuration, but the problem pre-jemalloc occurred for both mux and non-mux. The 3.9 comparison in comment 1 was a non-mux run.

RSS = resident set size, which is approximately what kube uses for limits.

With the new image from @jcantril, the memory growth and OOM kills go away. I'll attach a new graph of steady state. For this workload (500 messages/second/node), fluentd stabilizes and oscillates in the 400-420MB range vs the 220-250MB range for 3.9, which may be a consequence of the new retry methods. Do we want to consider increasing the OOTB memory limit for logging-fluentd?

Created attachment 1446426 [details]
With fix: rss for 500 1K messages/second/node
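For context on how bulk queue stats like the table above can be gathered, here is a minimal sketch against Elasticsearch's _cat/thread_pool API. The endpoint URL, TLS handling, and column aliases are assumptions (they vary by Elasticsearch version and by how the logging cluster is exposed); this is not necessarily the exact query used for the numbers above.

# Hypothetical sketch: pull bulk thread-pool stats from Elasticsearch's
# _cat/thread_pool API. The URL, auth, and column aliases are assumptions;
# adjust for the ES version and how the logging cluster is reached.
import requests

ES_URL = "https://logging-es:9200"  # assumed endpoint
COLUMNS = "host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize"

resp = requests.get(
    f"{ES_URL}/_cat/thread_pool",
    params={"v": "true", "h": COLUMNS},
    verify=False,   # the logging ES normally requires proper client certs
    timeout=10,
)
resp.raise_for_status()
print(resp.text)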
Created attachment 1446632 [details]
fluentd rss with prom plugin and no memory limit
7 hours of steady state running the same workload (500 1K messages/sec/node) with the 3.10.0-0.54.0 image and no memory limit for logging-fluentd. It looks like the memory usage may be stable in the 500-700MB range, but I believe there is still an overall upward trend.
More investigation is required before enabling the plugin as GA.
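One way to make the "overall upward trend" call less subjective is to fit a line through the sampled RSS values and look at the slope per hour. A minimal sketch, assuming the telemetry is available as (elapsed seconds, RSS in MB) pairs; the sample values below are placeholders, not data from this run.

# Hypothetical sketch: estimate whether sampled fluentd RSS is trending upward
# by fitting a least-squares line through (elapsed_seconds, rss_mb) samples.
# The samples below are placeholders, not measurements from this bug.
samples = [(0, 510.0), (3600, 545.0), (7200, 560.0), (10800, 590.0)]


def rss_slope_mb_per_hour(samples):
    n = len(samples)
    xs = [t / 3600.0 for t, _ in samples]   # hours since start
    ys = [rss for _, rss in samples]        # RSS in MB
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


print(f"trend: ~{rss_slope_mb_per_hour(samples):.1f} MB/hour")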
Created attachment 1446678 [details]
fluentd rss with prom plugin and no memory limit

Replacing bad screenshot from comment 8.

Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/22611b39a28e223102fe790fb7b37417b75382fe
bug 1584665. Remove prometheus plugin

https://github.com/openshift/origin-aggregated-logging/commit/90a8f48000ff644c224f1e8507867f795a2c3dc4
Merge pull request #1193 from jcantrill/1584665_remove_prom
bug 1584665. Remove prometheus plugin

Moving to MODIFIED until this is available in an OCP puddle (latest build is 2 June).

Created attachment 1450180 [details]
fluentd 3.10.0-0.64.0 - memory usage "ratcheting" up

On 3.10.0-0.64.0 I am seeing different behavior than I did when testing with the private build provided in comment 6. Instead of the flat memory usage (screenshot in comment 10), I'm seeing memory usage slowly ratchet up. Going to run overnight to see if it continues to grow and OOMs. Any significant changes to fluentd between the fix on June 1st and the current build?

There are virtually no differences between the "good" image used for https://bugzilla.redhat.com/show_bug.cgi?id=1584665#c7 and the "bad" image. Perhaps we would have seen memory growth even with the "good" image given a longer time frame?

Created attachment 1450482 [details]
3.10.0-0.64.0 fluentd memory growth and oom_kill during long run

Ran overnight and 2 fluentd pods were oom_killed. The growth in 3.10.0-0.64.0 is slower than originally reported, but there is a clear upward trend (see attached rss graph for the overnight run).

re: comment 17: The test with the scratch build (comment 7) ran over 2 hours and does not show the pattern that I currently see in 3.10.0-0.64.0. But yes, it was a shorter test.

Actions on the QE side:
1. I still have the scratch build container image. I'll re-run with a mix of 3.10.0-0.64.0 and the scratch build for a side-by-side comparison.
2. Run without a fluentd memory limit and see if it stabilizes above 512MB.
3. Anything else you want to try.

Performing a long run (~24 hours) with no memory limits set on fluentd at a message rate of 500 1K messages/second/node, the fluentd memory usage eventually settles in at 400-450MB. During the ramp up, it does exceed the current 512MB limit several times, explaining the OOM in comment 18. At 500 2K messages/sec/node the usage is stable at 550-600MB. Performing a third test now with 750 2K messages/sec/node and will report results.

Marking this bz as verified on 3.10.0-0.64.0 and will open a new bz to consider raising the OOTB memory request limit.

Created attachment 1450971 [details]
fluentd rss initial ramp up and settling into steady state
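For anyone reproducing the overnight run above, one way to confirm which fluentd pods were oom_killed is to inspect each container's last terminated state. A sketch using the Kubernetes Python client; the "logging" namespace and "component=fluentd" label selector are assumptions about this deployment and may need adjusting.

# Hypothetical sketch: list fluentd pods whose last container termination was
# an OOM kill. Namespace and label selector are assumptions; adjust as needed.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("logging", label_selector="component=fluentd")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if term and term.reason == "OOMKilled":
            print(f"{pod.metadata.name}: OOMKilled at {term.finished_at}, "
                  f"restarts={cs.restart_count}")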
Created attachment 1450972 [details]
fluentd steady state at 500 1K mps
The jump at the end was when the logfile generators were stopped. A phenomenon that should be investigated: when fluentd goes idle after being busy for a prolonged period, RSS jumps for a while before settling back down.
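For reference, the "500 1K messages/second/node" workload shape can be approximated with a trivial generator like the sketch below. This is only an illustration of the load pattern; it is not the logfile generator actually used in these runs, and the rate and payload size are parameters, not exact reproductions of the test tooling.

# Hypothetical sketch of the workload shape: emit roughly RATE messages per
# second, each about PAYLOAD bytes, to stdout (where the container runtime
# would pick them up for fluentd). Illustration only, not the real test tool.
import sys
import time

RATE = 500        # messages per second per node
PAYLOAD = 1024    # ~1K per message

line = "x" * PAYLOAD
interval = 1.0 / RATE

seq = 0
while True:
    start = time.time()
    sys.stdout.write(f"{seq} {line}\n")
    seq += 1
    if seq % RATE == 0:
        sys.stdout.flush()
    delay = interval - (time.time() - start)
    if delay > 0:
        time.sleep(delay)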
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1590920 to track consideration of a higher limit.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816