Bug 1466005

Summary: [DOCS] When Fluentd logger is unable to keep up with high amounts of logs, the cpu and memory limits are configurable.
Product: OpenShift Container Platform
Reporter: Noriko Hosoi <nhosoi>
Component: Documentation
Assignee: Brandi Munilla <bmcelvee>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact: Vikram Goyal <vigoyal>
Priority: medium
Version: 3.6.0
CC: aos-bugs, a, bmcelvee, jcantril, jokerman, mmccomas, nhosoi, pportant, rhowe, rmeggins, xiazhao
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Clone Of: 1445053
Last Closed: 2017-08-09 20:33:51 UTC
Type: Bug
Bug Depends On: 1445053    
Bug Blocks:    

Description Noriko Hosoi 2017-06-28 16:46:04 UTC
This is a clone of bug 1445053 for the documentation.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

When the Fluentd logger is unable to keep up, the following steps are recommended.
Edit the Fluentd daemonset:
$ oc edit daemonset logging-fluentd

Find the resource limits:

        resources:
          limits:
            cpu: 100m
            memory: 512Mi

Increase the values according to the available compute resources [1].  For instance:

        resources:
          limits:
            cpu: 150m
            memory: 1Gi

The memory limit is used to calculate the fluentd buffer_queue_limit as follows:

buffer_queue_limit = resource memory limit / (number of outputs * buffer_chunk_size)

buffer_chunk_size is 1MB, by default.

For instance, if the fluentd outputs to 2 elasticsearch pods, then the new buffer_queue_limit would be 1G / (2 * 1M) = 512.
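
The arithmetic above can be sketched in Python. This is only an illustration of the formula, not the code the logging image actually runs: the function name and parameters are hypothetical, the limits are treated as binary units (1Gi = 1024Mi), and rounding down to a whole number is an assumption.

```python
MIB = 1024 * 1024  # bytes in 1Mi


def buffer_queue_limit(memory_limit_bytes, num_outputs, buffer_chunk_size_bytes=MIB):
    """Hypothetical helper: memory limit / (number of outputs * buffer_chunk_size).

    Rounds down to a whole number of chunks (an assumption, not confirmed
    by the bug report).
    """
    if num_outputs < 1 or buffer_chunk_size_bytes < 1:
        raise ValueError("num_outputs and buffer_chunk_size_bytes must be positive")
    return memory_limit_bytes // (num_outputs * buffer_chunk_size_bytes)


# 1Gi memory limit, 2 elasticsearch outputs, default 1MB chunk size
print(buffer_queue_limit(1024 * MIB, 2))  # 512
```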


The same reconfiguration is available for the mux if the mux server falls behind the incoming logs.

Edit the mux deploymentconfig:
$ oc edit deploymentconfig logging-mux

Find the resource limits:

        resources:
          limits:
            cpu: 500m
            memory: 2Gi

Increase the values according to the available compute resources [1].  For instance:

        resources:
          limits:
            cpu: 600m
            memory: 2.5Gi

The memory limit is used to calculate the mux buffer_queue_limit as follows:

buffer_queue_limit = resource memory limit / (number of outputs * buffer_chunk_size)

buffer_chunk_size is 1MB, by default.

For instance, if the fluentd outputs to 1 elasticsearch pod, then the new buffer_queue_limit would be 2.5G / (1 * 1M) = 2560.
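
A fractional limit such as 2.5Gi is easier to check when the quantity string is converted to bytes first. The sketch below is a hypothetical helper, not part of OpenShift or Fluentd: it only understands the binary suffixes Ki, Mi, and Gi, and it assumes the result is rounded down.

```python
from fractions import Fraction

# Binary suffixes only; decimal suffixes such as M or G are not handled here.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity):
    """Convert a string such as '2.5Gi' or '512Mi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            # Fraction keeps '2.5' exact, avoiding floating-point rounding.
            return int(Fraction(quantity[: -len(suffix)]) * factor)
    return int(Fraction(quantity))  # no suffix: plain bytes


def mux_buffer_queue_limit(memory, num_outputs, chunk_size="1Mi"):
    """memory limit / (number of outputs * buffer_chunk_size), rounded down."""
    return parse_memory(memory) // (num_outputs * parse_memory(chunk_size))


# 2.5Gi memory limit, 1 elasticsearch output, default 1MB chunk size
print(mux_buffer_queue_limit("2.5Gi", 1))  # 2560
```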

[1] https://docs.openshift.com/container-platform/3.5(<=6?)/admin_guide/overcommit.html#requests-and-limits

Comment 3 Brandi Munilla 2017-07-24 19:59:06 UTC
Hi Noriko and Xia,

I opened PR 4837 [1] with the new Tune Buffer Chunk Limit section. Please review when you get a chance.

Thanks!

[1]https://github.com/openshift/openshift-docs/pull/4837

Comment 4 Junqi Zhao 2017-07-28 04:19:02 UTC
@Brandi,

Documentation is ok, but IMHO, I see the following:
The memory limit is used to calculate the Fluentd buffer_queue_limit by dividing the resource memory limit by the number of output multiplied by the buffer_chunk_size

From the user's perspective, it would be clearer if we described it like this:
buffer_queue_limit = resource memory limit / (number of output * buffer_chunk_size).

Also, I think we should describe "number of output" more clearly; I think "number of elasticsearch pods output" is better.

@Noriko,
What do you think?

Comment 5 Noriko Hosoi 2017-07-28 05:50:55 UTC
@Junqi, @Brandi, sorry for this change at the last moment.

Fluentd does not use memory for the buffer queue as of OCP 3.6.  It switches to file buffering to reduce memory usage and prevent data loss.

On the Fluentd and Mux pods, a persistent volume at /var/lib/fluentd must be prepared, e.g., by a PVC or host mount. That area is then used for the file buffers.

The buffer_type and buffer_path are configured in the fluentd config files as follows:
$ egrep "buffer_type|buffer_path" *.conf
es-copy-config.conf:       buffer_type file
es-copy-config.conf:       buffer_path '/var/lib/fluentd/buffer-es-copy-config'
es-ops-copy-config.conf:   buffer_type file
es-ops-copy-config.conf:   buffer_path '/var/lib/fluentd/buffer-es-ops-copy-config'
output-es-config.conf:     buffer_type file
output-es-config.conf:     buffer_path '/var/lib/fluentd/buffer-output-es-config'
output-es-ops-config.conf: buffer_type file
output-es-ops-config.conf: buffer_path '/var/lib/fluentd/buffer-output-es-ops-config'

Fluentd's buffer_chunk_limit is determined by the environment variable BUFFER_SIZE_LIMIT.  The file buffer size per output is determined by the environment variable FILE_BUFFER_LIMIT.  The persistent volume must be larger than FILE_BUFFER_LIMIT times the number of outputs.  For instance, if fluentd outputs logs to 2 Elasticsearch instances, the pod must have more disk space than (FILE_BUFFER_LIMIT * 2).

fluentd's buffer_queue_limit is calculated as (FILE_BUFFER_LIMIT / BUFFER_SIZE_LIMIT).
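
As a rough sketch of the file-buffer sizing described in this comment: the function name is hypothetical, the 256Mi and 8Mi values are made-up inputs rather than documented defaults, and rounding down is an assumption.

```python
MIB = 1024 * 1024  # bytes in 1Mi


def file_buffer_plan(file_buffer_limit, buffer_size_limit, num_outputs):
    """Return (buffer_queue_limit, minimum volume size in bytes).

    buffer_queue_limit = FILE_BUFFER_LIMIT / BUFFER_SIZE_LIMIT
    The volume has to be larger than FILE_BUFFER_LIMIT * number of outputs.
    """
    queue_limit = file_buffer_limit // buffer_size_limit
    min_volume = file_buffer_limit * num_outputs
    return queue_limit, min_volume


# Hypothetical values: FILE_BUFFER_LIMIT=256Mi, BUFFER_SIZE_LIMIT=8Mi, 2 outputs
queue_limit, min_volume = file_buffer_plan(256 * MIB, 8 * MIB, 2)
print(queue_limit)        # 32
print(min_volume // MIB)  # 512 (the volume must be larger than 512Mi)
```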

Comment 6 Junqi Zhao 2017-07-28 07:19:24 UTC
@Brandi

We have to change this documentation based on Comment 5

Comment 7 Brandi Munilla 2017-08-04 17:32:58 UTC
@Junqi, @Noriko,

I updated the PR to reflect the changes requested in Comment 5. Please review for accuracy.

Thanks!

Comment 8 Noriko Hosoi 2017-08-04 18:48:23 UTC
(In reply to Brandi from comment #7)

Ahhh, so sorry, @Brandi.  I should have updated this bug the day before yesterday...  The feature described in #c5 was not merged to 3.6, but deferred to 3.6.1. (;_;)

Could you please back out the fluentd changes in https://github.com/openshift/openshift-docs/pull/4837?  Please keep the changes you made (I've reviewed them and added some comments) for the 3.6.1 release.

Let me reset the status to ASSIGNED again (sorry...).

Since the previous version was reviewed by @Junqi (See #c4), I think we could just ack it.  But if you could give me one more chance, I'd appreciate it.  Thanks!

Comment 9 Brandi Munilla 2017-08-04 19:36:36 UTC
Thank you @Noriko! I saved the original changes in a separate file just in case we'll need them in the future. 

And thank you for your review. I'll have the PR ready for another review in just a few minutes.

Comment 10 Junqi Zhao 2017-08-07 06:32:02 UTC
(In reply to Brandi from comment #9)

@Noriko, @Brandi

Is the PR still https://github.com/openshift/openshift-docs/pull/4837?
I think the documentation in PR 4837 describes the feature "Use `file` buffer instead of `memory` buffer for fluentd" (https://trello.com/c/XpreI533/509-5-use-file-buffer-instead-of-memory-buffer-for-fluentdloggingepic-ois-agl-perf) from Comment 5, which is deferred to 3.6.1.

From Comment 8, I think the description should be like Comment 0, and I have some suggestions; see Comment 4.

@Noriko, am I right?

Comment 11 Noriko Hosoi 2017-08-07 14:54:31 UTC
(In reply to Junqi Zhao from comment #10)

Yes, you are right, @Junqi.  The doc should not include the "file buffer" at all for 3.6...   We have to wait for 3.6.1 to use the file buffer version doc.
Thanks!

Comment 14 Junqi Zhao 2017-08-08 06:18:28 UTC
The documentation is wrong: "tune-buffer-chunk-limit" should not be in this file. Please see Comment 4 and Comment 11. Your original file is right; the description should be like the content in Comment 0.

Comment 16 Junqi Zhao 2017-08-09 00:34:36 UTC
Documentation is fine, set it to VERIFIED

Comment 17 Noriko Hosoi 2017-08-09 00:37:55 UTC
(In reply to Junqi Zhao from comment #16)
> Documentation is fine, set it to VERIFIED

+1

Thank you, @Brandi.  Thank you, @Junqi.

Comment 18 openshift-github-bot 2017-08-09 00:42:28 UTC
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/e6198aea7bb54ce2be336dcaed460985ecce3292
Bug 1466005 Add Tune Buffer Chunk Limit

Comment 19 Brandi Munilla 2017-08-09 00:44:22 UTC
Thank you @Noriko and @Junqi!