Bug 1470814
Summary: | on loaded clusters, upgrades of logging pods can fail | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Dan Yocum <dyocum> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED NOTABUG | QA Contact: | Xia Zhao <xiazhao> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.5.1 | CC: | aos-bugs, avagarwa, decarr, dyocum, eparis, jcantril, jeder, jokerman, mmccomas, nhosoi, nraghava, pportant, pweil, rmeggins, sdodson, whearn |
Target Milestone: | --- | Keywords: | OpsBlocker |
Target Release: | 3.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-09-12 18:43:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Dan Yocum
2017-07-13 18:07:24 UTC
Because of this limitation we cannot guarantee that metrics and logging will be upgraded properly on Dedicated Customer clusters.

Do we have the logs from the ansible playbook deployment? Let's make sure we gather pod logs from all failed pods as well.

To confirm, I assume this pod 'logging-deployer-v0uzt' is left over from the initial installation of the 3.4 logging components?

Also, regarding:

    logging-fluentd-cx58t   0/1   OutOfmemory   0   1h
    logging-fluentd-kl1dh   0/1   OutOfmemory

Doesn't this imply:
1. The fluentd DS was not given enough memory to function
2. Possibly your infra node doesn't have enough memory

@Peter, do we have any guidelines for sizing fluentd based on the number of namespaces and pods?

@Jeff Cantrill The OutOfmemory is not so much that the pod does not have enough memory defined; it is more that the nodes those pods are scheduled on are overloaded and do not have memory available to also run fluentd. So we need a way to say that the fluentd pods should have priority when being scheduled onto nodes.

Info provided to Peter in c2. No info to provide in c4 - that should go to Peter to answer.

So why did we have to add the memory and cpu limits to fluentd? What was the driving change for that? Was there a previous BZ?

fluentd needs to be able to keep up with logs on the node, and these changes can severely limit its ability to do that.

Fluentd is a single-threaded process (a ruby limitation), and we don't use the multi-process fluentd extension that allows multiple processors to be engaged. So it has a natural 1000m CPU limit already.

(In reply to Peter Portante from comment #19)
> So why did we have to add the memory and cpu limits to fluentd? What was
> the driving change for that? Was there a previous BZ?

Which change are you talking about? Here is the default for 3.3:
https://github.com/openshift/origin-aggregated-logging/blob/release-1.3/deployer/templates/fluentd.yaml#L65

          resources:
            limits:
              cpu: 100m

3.4 added a memory limit too:
https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/templates/fluentd.yaml#L66

          resources:
            limits:
              cpu: 100m
              memory: 512Mi

The defaults were unchanged in 3.5:
https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/defaults/main.yml#L68

    openshift_logging_fluentd_cpu_limit: 100m
    openshift_logging_fluentd_memory_limit: 512Mi

https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/templates/fluentd.j2#L35

          resources:
            limits:
              cpu: {{openshift_logging_fluentd_cpu_limit}}
              memory: {{openshift_logging_fluentd_memory_limit}}

Same in 3.6:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/defaults/main.yml#L11

> fluentd needs to be able to keep up with logs on the node, and these changes
> can severely limit its ability to do that.
>
> Fluentd is a single-threaded process (ruby limitation), and we don't use the
> multi-process fluentd extension that allows multiple processors to be
> engaged. So it has a natural 1000m CPU limit already.

Additional info (3.6):

    defaults/main.yml:openshift_logging_mux_cpu_limit: 500m
    defaults/main.yml:openshift_logging_mux_memory_limit: 2Gi

The 500m limit on fluentd will hinder its ability to handle logs, given what we have seen so far. It needs 1000m, or 1 CPU (it is single-threaded).

Has the 2Gi limit been verified to accommodate fluentd's operational memory needs of running the ruby interpreter and the configured buffer queue (chunk size * buffer queue limit), with some overhead room to spare?

What was the reason for adding the memory limit in 3.5?

It seems that the change in 3.5 to add a memory limit is the base cause of this problem. Now the Kube scheduler will not run this pod if the available memory of all other pods on the box doesn't leave enough room for the fluentd pod, when we want it to run regardless, right?

But doesn't it also mean that the pod will be capped and OOM killed if it starts to exceed that limit? So if we do not have that limit properly calculated, won't we have a problem still?
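For illustration, the knobs in play are the openshift_logging_* variables quoted above. A minimal sketch of overriding them for a loaded cluster, written in the same key: value form as the role defaults (in an INI-style inventory these would be variable=value lines); the numbers simply echo the figures discussed in this bug and are not a validated sizing recommendation:

    # Illustrative overrides of the openshift_logging role defaults quoted above.
    openshift_logging_fluentd_cpu_limit: 1000m     # fluentd is single-threaded, so roughly one full CPU
    openshift_logging_fluentd_memory_limit: 512Mi  # only safe if the buffer arithmetic below fits with headroom
    openshift_logging_mux_cpu_limit: 1000m
    openshift_logging_mux_memory_limit: 2Gi        # the value reported to help in the RHV deployment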
(In reply to Peter Portante from comment #22)
> The 500m limit on fluentd will hinder its ability to handle logs, given what
> we have seen so far. It needs 1000m, or 1 CPU (it is single-threaded).

You are suggesting to raise the default for fluentd and mux to 1000m?

> Has the 2Gi limit been verified to accommodate fluentd's operational memory
> needs of running the ruby interpreter and the configured buffer queue (chunk
> size * buffer queue limit), with some overhead room to spare?

For mux, we did find that using 2Gi helped in the RHV deployment.

> What was the reason for adding the memory limit in 3.5?

You mean 3.4.
https://github.com/openshift/origin-aggregated-logging/commit/a2eca92a817c737206fcae281f888172f55915f9

> It seems that the change in 3.5 to add a memory limit is the base cause of
> this problem.

The change was done in 3.4.

> Now the Kube scheduler will not run this pod if the available memory of all
> other pods on the box doesn't leave enough room for the fluentd pod, when we
> want it to run regardless, right?

I guess so.

> But doesn't it also mean that the pod will be capped and OOM killed if it
> starts to exceed that limit? So if we do not have that limit properly
> calculated, won't we have a problem still?

What should the limit be? Are you suggesting we revert that change and remove the limit?

(In reply to Rich Megginson from comment #23)
> You are suggesting to raise the default for fluentd and mux to 1000m?

Yes, otherwise we'll have a log slow-down problem that could be avoided.

> For mux, we did find that using 2Gi helped in the RHV deployment.

We have to test specifically that if the buffer queue fills up, we don't get OOM killed.

> What should the limit be? Are you suggesting we revert that change and
> remove the limit?

We have to verify that buffer_queue_limit * buffer_chunk_size for each elasticsearch output plugin instance, and all other forwarding setups, fit in the allotted memory.

We have two output buffers, one for .operations indices and one for apps, whether or not we have the operations ES instance enabled. So at the least, those settings times 2 is required.

If we have additional outputs that use a memory buffer, we'll need to add those to the calculations.

We should also consider sizing the two buffers separately when using the ES ops option, and merging the two output queues when using a single ES instance.
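To make that verification concrete, here is a worked example of the buffer arithmetic using purely illustrative numbers; the real buffer_chunk_size and buffer_queue_limit values come from the deployed fluentd configuration and are not stated in this bug:

    # Assumed, illustrative buffer settings (not the shipped defaults):
    #   buffer_chunk_size  = 8 MiB per chunk
    #   buffer_queue_limit = 32 chunks per output
    #
    # Per-output buffer:                   8 MiB * 32  = 256 MiB
    # Two outputs (.operations and apps):  256 MiB * 2 = 512 MiB
    # Plus the ruby interpreter and plugin overhead on top of that.
    #
    # Under these assumptions, a 512Mi pod memory limit is fully consumed as soon
    # as both output buffers fill, which is exactly the OOM-kill risk described above.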
(In reply to Peter Portante from comment #24)
> (In reply to Rich Megginson from comment #23)
> > You are suggesting to raise the default for fluentd and mux to 1000m?
>
> Yes, otherwise we'll have a log slow-down problem that could be avoided.

That might kill and play havoc with upgrades, if they have already planned their cpu usages...

> > For mux, we did find that using 2Gi helped in the RHV deployment.
>
> We have to test specifically that if the buffer queue fills up, we don't get
> OOM killed.
>
> > What should the limit be? Are you suggesting we revert that change and
> > remove the limit?
>
> We have to verify that buffer_queue_limit * buffer_chunk_size for each
> elasticsearch output plugin instance, and all other forwarding setups, fit
> in the allotted memory.
>
> We have two output buffers, one for .operations indices and one for apps,
> whether or not we have the operations ES instance enabled. So at the least,
> those settings times 2 is required.
>
> If we have additional outputs that use a memory buffer, we'll need to add
> those to the calculations.
>
> We should also consider sizing the two buffers separately when using the ES
> ops option, and merging the two output queues when using a single ES
> instance.

Noriko already did a lot of work in this area for 3.6:
https://github.com/openshift/origin-aggregated-logging/commit/3818101b713d92f11a1ff33fe44f6d735fda2eda

I think we have to revisit the change posted above for 3.6.

I reviewed the PR and posted comments there to consider.

Setting minimum and maximum memory limits is one thing. What is going to be done to ensure that enough RAM is _reserved_ on the node to guarantee that the fluentd pod can start?

This bug was filed against 3.5.1, and is targeted at 3.7.

So if we deploy 3.7 logging, are all the pods deployed by default with requests == limits for memory?

The long-term fix is to use priority/pre-emption:
https://github.com/kubernetes/kubernetes.github.io/pull/5328

Cluster services should have the highest priority, and therefore other work would be pre-empted to make room. The earliest this will become beta upstream is 1.9.

I'm thinking we should close this BZ as NOTABUG. We should open cards for the logging/metrics team, for 3.8/3.9, to make sure the installer uses the new priority/preemption. While it will be alpha in 3.8, we can NOT turn it on in starter; there are NO access controls in the alpha 3.8 implementation. But enough should be available for the logging/metrics team to do their part.

I think whoever opens those cards should close this BZ.

(In reply to Peter Portante from comment #39)
> This bug was filed against 3.5.1, and is targeted at 3.7.
>
> So if we deploy 3.7 logging, are all the pods deployed by default with
> requests == limits for memory?

Are you talking about these PRs?

3.6 https://github.com/openshift/openshift-ansible/pull/5276
3.7 https://github.com/openshift/openshift-ansible/pull/5158

If so, have these been built into a released atomic-openshift-ansible package? If so, does that fix this bz, and can we move this bz to ON_QA?

Closing per comment #41.

Created card: https://trello.com/c/Dw9Uwyqp
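For reference, a rough sketch of what the priority/pre-emption approach mentioned above could look like once the feature is available. This assumes the upstream Kubernetes 1.8 alpha API (scheduling.k8s.io/v1alpha1); the class name and value are made up for illustration, and nothing like this shipped in 3.7:

    # Hypothetical PriorityClass for the cluster logging agents (illustration only).
    apiVersion: scheduling.k8s.io/v1alpha1
    kind: PriorityClass
    metadata:
      name: cluster-logging-critical      # made-up name
    value: 1000000                        # higher value = scheduled ahead of ordinary workloads
    globalDefault: false
    description: "Lets fluentd/mux preempt lower-priority pods on fully loaded nodes."

    # The fluentd daemonset pod spec would then reference it:
    #   spec:
    #     priorityClassName: cluster-logging-critical

With something like this in place, the scheduler could evict lower-priority pods instead of leaving fluentd Pending or OutOfmemory on loaded nodes, which is the failure mode this bug describes.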