Bug 1470814
Summary: | on loaded clusters, upgrades of logging pods can fail | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Dan Yocum <dyocum> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED NOTABUG | QA Contact: | Xia Zhao <xiazhao> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.5.1 | CC: | aos-bugs, avagarwa, decarr, dyocum, eparis, jcantril, jeder, jokerman, mmccomas, nhosoi, nraghava, pportant, pweil, rmeggins, sdodson, whearn |
Target Milestone: | --- | Keywords: | OpsBlocker |
Target Release: | 3.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-09-12 18:43:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Dan Yocum
2017-07-13 18:07:24 UTC
Because of this limitation we cannot guarantee that metrics and logging will be upgraded properly on Dedicated Customer clusters.

Do we have the logs from the ansible playbook deployment? Let's make sure we gather pod logs from all failed pods as well.

To confirm, I assume this pod 'logging-deployer-v0uzt' is left over from the initial installation of the 3.4 logging components?

Also, regarding:

    logging-fluentd-cx58t   0/1   OutOfmemory   0   1h
    logging-fluentd-kl1dh   0/1   OutOfmemory

Doesn't this imply:
1. The fluentd DS was not given enough memory to function
2. Possibly your infra node doesn't have enough memory

@Peter, do we have any guidelines for sizing fluentd based on the number of namespaces and pods?

@Jeff Cantrill The OutOfmemory is not so much that the pod does not have enough memory defined; it is more that the nodes those pods are scheduled on are overloaded and do not have memory available to also run fluentd. So we need a way to say that the fluentd pods should have priority when being scheduled onto nodes.

Info provided to Peter in c2. No info to provide in c4 - that should go to Peter to answer.

So why did we have to add the memory and cpu limits to fluentd? What was the driving change for that? Was there a previous BZ?

fluentd needs to be able to keep up with logs on the node, and these changes can severely limit its ability to do that.

Fluentd is a single-threaded process (a ruby limitation), and we don't use the multi-process fluentd extension that allows multiple processors to be engaged. So it has a natural 1000m CPU limit already.

(In reply to Peter Portante from comment #19)
> So why did we have to add the memory and cpu limits to fluentd? What was
> the driving change for that? Was there a previous BZ?

Which change are you talking about? Here is the default for 3.3:
https://github.com/openshift/origin-aggregated-logging/blob/release-1.3/deployer/templates/fluentd.yaml#L65

          resources:
            limits:
              cpu: 100m

3.4 added a memory limit too:
https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/templates/fluentd.yaml#L66

          resources:
            limits:
              cpu: 100m
              memory: 512Mi

The defaults were unchanged in 3.5:
https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/defaults/main.yml#L68

    openshift_logging_fluentd_cpu_limit: 100m
    openshift_logging_fluentd_memory_limit: 512Mi

https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/templates/fluentd.j2#L35

          resources:
            limits:
              cpu: {{openshift_logging_fluentd_cpu_limit}}
              memory: {{openshift_logging_fluentd_memory_limit}}

Same in 3.6:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_fluentd/defaults/main.yml#L11

> fluentd needs to be able to keep up with logs on the node, and these changes
> can severely limit its ability to do that.
>
> Fluentd is a single-threaded process (ruby limitation), and we don't use the
> multi-process fluentd extension that allows multiple processors to be
> engaged. So it has a natural 1000m CPU limit already.

Additional info (3.6):

    defaults/main.yml:openshift_logging_mux_cpu_limit: 500m
    defaults/main.yml:openshift_logging_mux_memory_limit: 2Gi

The 500m limit on fluentd will hinder its ability to handle logs, given what we have seen so far. It needs 1000m, or 1 CPU (it is single-threaded).

Has the 2Gi limit been verified to accommodate fluentd's operational memory needs of running the ruby interpreter and the configured buffer queue (chunk size * buffer queue limit), with some overhead room to spare?

What was the reason for adding the memory limit in 3.5?

It seems that the change in 3.5 to add a memory limit is the base cause of this problem. Now the Kube scheduler will not run this pod if the available memory of all other pods on the box doesn't leave enough room for the fluentd pod, when we want it to run regardless, right?

But doesn't it also mean that the pod will be capped and OOM killed if it starts to exceed that limit? So if we do not have that limit properly calculated, won't we have a problem still?
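For illustration, the knobs in play are the openshift_logging_* variables quoted above. A minimal sketch of overriding them for a loaded cluster, written in the same key: value form as the role defaults (in an INI-style inventory these would be variable=value lines); the numbers simply echo the figures discussed in this bug and are not a validated sizing recommendation:

    # Illustrative overrides of the openshift_logging role defaults quoted above.
    openshift_logging_fluentd_cpu_limit: 1000m     # fluentd is single-threaded, so roughly one full CPU
    openshift_logging_fluentd_memory_limit: 512Mi  # only safe if the buffer arithmetic below fits with headroom
    openshift_logging_mux_cpu_limit: 1000m
    openshift_logging_mux_memory_limit: 2Gi        # the value reported to help in the RHV deployment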
(In reply to Peter Portante from comment #22)
> The 500m limit on fluentd will hinder its ability to handle logs, given what
> we have seen so far. It needs 1000m, or 1 CPU (it is single-threaded).

You are suggesting to raise the default for fluentd and mux to 1000m?

> Has the 2Gi limit been verified to accommodate fluentd's operational memory
> needs of running the ruby interpreter and the configured buffer queue (chunk
> size * buffer queue limit), with some overhead room to spare?

For mux, we did find that using 2Gi helped in the RHV deployment.

> What was the reason for adding the memory limit in 3.5?

You mean 3.4.
https://github.com/openshift/origin-aggregated-logging/commit/a2eca92a817c737206fcae281f888172f55915f9

> It seems that the change in 3.5 to add a memory limit is the base cause of
> this problem.

The change was done in 3.4.

> Now the Kube scheduler will not run this pod if the available memory of all
> other pods on the box doesn't leave enough room for the fluentd pod, when we
> want it to run regardless, right?

I guess so.

> But doesn't it also mean that the pod will be capped and OOM killed if it
> starts to exceed that limit? So if we do not have that limit properly
> calculated, won't we have a problem still?

What should the limit be? Are you suggesting we revert that change and remove the limit?

(In reply to Rich Megginson from comment #23)
> You are suggesting to raise the default for fluentd and mux to 1000m?

Yes, otherwise we'll have a log slow-down problem that could be avoided.

> For mux, we did find that using 2Gi helped in the RHV deployment.

We have to test specifically that if the buffer queue fills up, we don't get OOM killed.

> What should the limit be? Are you suggesting we revert that change and
> remove the limit?

We have to verify that buffer_queue_limit * buffer_chunk_size for each elasticsearch output plugin instance, and all other forwarding setups, fit in the allotted memory.

We have two output buffers, one for .operations indices and one for apps, whether or not we have the operations ES instance enabled. So at the least, those settings times 2 is required.

If we have additional outputs that use a memory buffer, we'll need to add those to the calculations.

We should also consider sizing the two buffers separately when using the ES ops option, and merging the two output queues when using a single ES instance.
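To make that verification concrete, here is a worked example of the buffer arithmetic using purely illustrative numbers; the real buffer_chunk_size and buffer_queue_limit values come from the deployed fluentd configuration and are not stated in this bug:

    # Assumed, illustrative buffer settings (not the shipped defaults):
    #   buffer_chunk_size  = 8 MiB per chunk
    #   buffer_queue_limit = 32 chunks per output
    #
    # Per-output buffer:                   8 MiB * 32  = 256 MiB
    # Two outputs (.operations and apps):  256 MiB * 2 = 512 MiB
    # Plus the ruby interpreter and plugin overhead on top of that.
    #
    # Under these assumptions, a 512Mi pod memory limit is fully consumed as soon
    # as both output buffers fill, which is exactly the OOM-kill risk described above.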
(In reply to Peter Portante from comment #24)
> (In reply to Rich Megginson from comment #23)
> > You are suggesting to raise the default for fluentd and mux to 1000m?
>
> Yes, otherwise we'll have a log slow-down problem that could be avoided.

That might kill and play havoc with upgrades, if they have already planned their cpu usages...

> > For mux, we did find that using 2Gi helped in the RHV deployment.
>
> We have to test specifically that if the buffer queue fills up, we don't get
> OOM killed.
>
> > What should the limit be? Are you suggesting we revert that change and
> > remove the limit?
>
> We have to verify that buffer_queue_limit * buffer_chunk_size for each
> elasticsearch output plugin instance, and all other forwarding setups, fit
> in the allotted memory.
>
> We have two output buffers, one for .operations indices and one for apps,
> whether or not we have the operations ES instance enabled. So at the least,
> those settings times 2 is required.
>
> If we have additional outputs that use a memory buffer, we'll need to add
> those to the calculations.
>
> We should also consider sizing the two buffers separately when using the ES
> ops option, and merging the two output queues when using a single ES
> instance.

Noriko already did a lot of work in this area for 3.6:
https://github.com/openshift/origin-aggregated-logging/commit/3818101b713d92f11a1ff33fe44f6d735fda2eda

I think we have to revisit the change posted above for 3.6.

I reviewed the PR and posted comments there to consider.

Setting minimum and maximum memory limits is one thing. What is going to be done to ensure that enough RAM is _reserved_ on the node to guarantee that the fluentd pod can start?

This bug was filed against 3.5.1, and is targeted at 3.7.

So if we deploy 3.7 logging, are all the pods deployed by default with requests == limits for memory?

The long-term fix is to use priority/pre-emption:
https://github.com/kubernetes/kubernetes.github.io/pull/5328

Cluster services should have the highest priority, and therefore other work would be pre-empted to make room. The earliest this will become beta upstream is 1.9.

I'm thinking we should close this BZ as NOTABUG. We should open cards for the logging/metrics team, for 3.8/3.9, to make sure the installer uses the new priority/preemption. While it will be alpha in 3.8, we can NOT turn it on in starter; there are NO access controls in the alpha 3.8 implementation. But enough should be available for the logging/metrics team to do their part.

I think whoever opens those cards should close this BZ.

(In reply to Peter Portante from comment #39)
> This bug was filed against 3.5.1, and is targeted at 3.7.
>
> So if we deploy 3.7 logging, are all the pods deployed by default with
> requests == limits for memory?

Are you talking about these PRs?

3.6 https://github.com/openshift/openshift-ansible/pull/5276
3.7 https://github.com/openshift/openshift-ansible/pull/5158

If so, have these been built into a released atomic-openshift-ansible package? If so, does that fix this bz, and can we move this bz to ON_QA?

Closing per comment #41.

Created card: https://trello.com/c/Dw9Uwyqp
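For reference, a rough sketch of what the priority/pre-emption approach mentioned above could look like once the feature is available. This assumes the upstream Kubernetes 1.8 alpha API (scheduling.k8s.io/v1alpha1); the class name and value are made up for illustration, and nothing like this shipped in 3.7:

    # Hypothetical PriorityClass for the cluster logging agents (illustration only).
    apiVersion: scheduling.k8s.io/v1alpha1
    kind: PriorityClass
    metadata:
      name: cluster-logging-critical      # made-up name
    value: 1000000                        # higher value = scheduled ahead of ordinary workloads
    globalDefault: false
    description: "Lets fluentd/mux preempt lower-priority pods on fully loaded nodes."

    # The fluentd daemonset pod spec would then reference it:
    #   spec:
    #     priorityClassName: cluster-logging-critical

With something like this in place, the scheduler could evict lower-priority pods instead of leaving fluentd Pending or OutOfmemory on loaded nodes, which is the failure mode this bug describes.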