Created attachment 1334446 [details]
oc describe node output

Description of problem:

OCP 3.7 should be able to handle 500 pods per compute node. When deploying 450 pause pods (gcr.io/google_containers/pause-amd64:3.0) or more, syslog and the journal start filling up with looping messages on the compute nodes:

Oct 04 18:30:12 ip-172-31-61-211.us-west-2.compute.internal atomic-openshift-node[121634]: I1004 18:30:12.913801 121634 fsHandler.go:131] du and find on following dirs took 1.033634575s: [/var/lib/docker/overlay2/db37f9c43f4922f28eea582ab57a59b19a1125d04d7a47be7fe45c37996e3279 /var/lib/docker/containers/0f0fbd7daa586d4ff66a8dc575efda6560d7d081b21cef6b4324e42acc2c162d]
Oct 04 18:30:12 ip-172-31-61-211.us-west-2.compute.internal atomic-openshift-node[121634]: I1004 18:30:12.948987 121634 fsHandler.go:131] du and find on following dirs took 1.407000223s: [/var/lib/docker/overlay2/65cdb97fba8477dc02e37d9dcff3c114dabedf48ab4fa8da8f35a87d48cf10f5 /var/lib/docker/containers/12da28aa7e65933c889859af1930b11348d25ecf4a5bb041f15a2b071143192b]
Oct 04 18:30:13 ip-172-31-61-211.us-west-2.compute.internal atomic-openshift-node[121634]: I1004 18:30:13.020452 121634 fsHandler.go:131] du and find on following dirs took 2.546725632s: [/var/lib/docker/overlay2/27de9d8f5420dc3939bb1511d294979165f620d0de0837cdaa87377889700343 /var/lib/docker/containers/0d7ab284078da73a7bba280b781ff8792aaaf9579dba0c325f56b20ff24c760d]
Oct 04 18:30:13 ip-172-31-61-211.us-west-2.compute.internal atomic-openshift-node[121634]: I1004 18:30:13.041726 121634 fsHandler.go:131] du and find on following dirs took 1.648106469s: [/var/lib/docker/overlay2/80ed41d0b54316c9dfa7344a0b7b6698b9c97d5bf90fcc434020b31dc8567100 /var/lib/docker/containers/53e240765945e23f4e08119eec18420133d2792c35812f2593aeda4d6beece72]

This is seen on AWS EC2 clusters (1 master/etcd, 1 infra node, 2 compute nodes) with instance types m4.xlarge (4 vCPUs, 16 GB memory, 750 Mbps dedicated EBS bandwidth) and m4.2xlarge (8 vCPUs, 32 GB memory, 1000 Mbps dedicated EBS bandwidth).

Similar messages were reported in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1459265 and were attributed to normal cadvisor behavior.

The compute nodes with these looping log messages are not starved for resources: overall node CPU utilization was around 30%, memory usage 35%, 55 peak IOPS, and 13 Mbits/sec peak network. CPU and memory requests for the pause pods are 0%.

The concern is running out of disk space over time, as several million of these messages are written to the logs in the space of a day or two. On one compute node, after leaving the pods running overnight and repeating the test over a span of 2 days:

/var/log # grep -c "du and find" messages
3418737
/var/log # grep -c "du and find" messages-20171001
1488200

Version-Release number of selected component (if applicable):

openshift v3.7.0-0.127.0
kubernetes v1.7.0+80709908fd
etcd 3.2.1

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-61.git85d7426.el7.x86_64
 Go version:      go1.8.3
 Git commit:      85d7426/1.12.6
 Built:           Tue Sep 26 15:30:51 2017
 OS/Arch:         linux/amd64

How reproducible:

All the time, as long as 450 or more pause pods are deployed per compute node. We ended up with 500 pause pods per node.

Steps to Reproduce:
1. Git clone the openshift svt repo.
2. cd svt/openshift_scalability and run cluster-loader.py (https://github.com/openshift/svt/blob/master/openshift_scalability/README.md) with the config file https://github.com/openshift/svt/blob/master/openshift_scalability/config/nodeVertical.yaml:
   python ./cluster-loader.py -vf config/nodeVertical.yaml
3. Wait until at least 450 pause pods are deployed per node. The script will deploy 500 pods per compute node (1000 total per cluster).
4. tail -f /var/log/messages or journalctl -f

Actual results:

syslog and the journal fill up with looping messages such as this one:

Oct 04 18:30:12 ip-172-31-61-211.us-west-2.compute.internal atomic-openshift-node[121634]: I1004 18:30:12.913801 121634 fsHandler.go:131] du and find on following dirs took 1.033634575s:

Expected results:

These log messages should not be written constantly in a loop.

Additional info:

pbench benchmarking data and a link to the syslog messages file from one of the compute nodes with 500 pause pods will be sent in a private comment.
Created attachment 1334447 [details]
top command output on one of the compute nodes with looping messages
Walid,

I noticed that too when I was running 512 pods/node on a 20-node cluster. On nodes in my cluster, I had 512 pods running for the past 7 days and the count in messages is:

/var/log # grep -c "du and find" messages
3815820

So, filling disk space at this rate is really alarming. Roughly, it would take

The offending line in code seems to be:
https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L131

where glog.V(2) is being used, and in OCP we have the default log level set to 2, so this line is continuously being logged.

There are 2 solutions that I could think of:

- Increase the glog.V(2) to a higher value like glog.V(4) or above.

- Increase the timeout at https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L130 (which is always 1 sec, irrespective of the number of directories being listed).
(In reply to ravig from comment #3)
> Walid,
>
> I noticed that too when I was running 512 pods/node on a 20-node cluster.
> On nodes in my cluster, I had 512 pods running for the past 7 days and the
> count in messages is:
>
> /var/log # grep -c "du and find" messages
> 3815820
>
> So, filling disk space at this rate is really alarming. Roughly, it would
> take
>
> The offending line in code seems to be:
> https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L131
>
> where glog.V(2) is being used, and in OCP we have the default log level set
> to 2, so this line is continuously being logged.
>
> There are 2 solutions that I could think of:
>
> - Increase the glog.V(2) to a higher value like glog.V(4) or above.
>
> - Increase the timeout at
> https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L130
> (which is always 1 sec, irrespective of the number of directories being listed).

Roughly, it would take 23-25 days to fill up 10GB of disk at this rate and log level.
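For reference, a minimal sketch of the logging pattern being described (assumed names, not the verbatim cadvisor source): the per-container du/find measurement is compared against a fixed one-second threshold and logged at V(2), so a node running at log level 2 emits a message on every scan that crosses one second.

package fsstats

import (
	"time"

	"github.com/golang/glog"
)

// timedUsageCheck is a hypothetical stand-in for the per-container du/find
// measurement that cadvisor's fsHandler performs.
func timedUsageCheck(dirs []string, runDuAndFind func([]string)) {
	const longOp = time.Second // fixed threshold, regardless of how many dirs are scanned

	start := time.Now()
	runDuAndFind(dirs)
	duration := time.Since(start)

	// At node log level 2 this fires on every pass once the scan of a
	// container's directories takes longer than one second.
	if duration > longOp {
		glog.V(2).Infof("du and find on following dirs took %v: %v", duration, dirs)
	}
}

With ~500 pause pods per node, many containers exceed the fixed 1s threshold on every housekeeping pass, which is why the message floods the journal.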
cadvisor PR: https://github.com/google/cadvisor/pull/1766
Origin PR: https://github.com/openshift/origin/pull/16697
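For context, one way to picture the adaptive threshold approach the PRs take (assumed names and simplified logic, not the actual cadvisor change): raise the per-container logging threshold each time it is exceeded, so repeated scans of roughly the same duration stay quiet. The thresholds this produces are consistent with the values seen in the later verification comments (a 1.1s scan yields a 2s threshold, a 12.1s scan yields a 13s threshold).

package fsstats

import (
	"time"

	"github.com/golang/glog"
)

// usageLogger is a hypothetical per-container logger with an adaptive threshold.
type usageLogger struct {
	logThreshold time.Duration
}

func (u *usageLogger) maybeLog(dirs []string, duration time.Duration) {
	if duration <= u.logThreshold {
		return
	}
	// Raise the threshold just past what was observed so the next pass of
	// roughly the same length does not log again.
	u.logThreshold = duration.Round(time.Second) + time.Second
	glog.V(2).Infof("du and find on following dirs took %v: %v; will not log again for this container unless duration exceeds %v",
		duration, dirs, u.logThreshold)
}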
Verified the fix on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.178.0, 1 master/etcd, 1 infra node and 2 compute nodes. "du and find" log messages no longer continuously loop with 500 pause pods per compute node.

# openshift version
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

After 4 hours with 500 pods running on that compute node:

# journalctl -r | grep -c "du and find"
9116

# journalctl -r | grep "du and find" | head -1
Oct 27 15:46:06 ip-172-31-50-56.us-west-2.compute.internal atomic-openshift-node[15749]: I1027 15:46:06.593787 15749 fsHandler.go:135] du and find on following dirs took 10.364326077s: [/var/lib/docker/overlay2/6af7b0be155344d1727089ff5d33e4d0aa50705a3bce31d6ab7f0d910ecc4a8a/diff /var/lib/docker/containers/2c77970c4140884c62deb5de46597703ab68af470ac5362e7a3571e4f00b0745]; will not log again for this container unless duration exceeds 11000000000 seconds.

Do we want to keep the timeout duration at this very large value of 11000000000 seconds, or choose a reasonably lower value?
Moving this back to ON_QA to get confirmation that the above message is expected.

@Seth Is 11000000000 seconds the expected value? I took a quick look at the code and it wasn't clear that this is how the adaptive backoff should trigger.
So the logic is correct, but the duration value is not being formatted correctly for the log message. It looks like it is printing the time in nanoseconds rather than seconds. I'll create a follow-on PR to clean that up.
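To illustrate the formatting issue (a simplified sketch, not the actual patch): printing the integer value of a Go time.Duration reports nanoseconds, which is how an 11s threshold reads as "11000000000 seconds"; formatting the Duration value itself with %v prints "11s".

package fsstats

import (
	"fmt"
	"time"
)

func formatExamples() {
	threshold := 11 * time.Second

	// Buggy style: int64(threshold) is 11000000000 (nanoseconds), so the
	// message reads "11000000000 seconds".
	fmt.Printf("will not log again unless duration exceeds %d seconds\n", int64(threshold))

	// Corrected style: %v on a time.Duration prints "11s".
	fmt.Printf("will not log again unless duration exceeds %v\n", threshold)
}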
FYI, for the message cleanup:
Upstream PR: https://github.com/google/cadvisor/pull/1785
Origin PR: https://github.com/openshift/origin/pull/17088
The new PR has not merged yet (https://github.com/openshift/origin/pull/17088). Moving back to MODIFIED
Verified that the duration timeout period shows the correct value in seconds on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.190.0, 1 master/etcd, 1 infra node and 2 compute nodes. "du and find" log messages show an adaptive timeout duration, starting at 2 seconds and ending up at about 13 seconds after 1 hour with 500 pods per node.

# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

# journalctl | grep -i "du and find" | head -1
Nov 03 19:01:25 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 19:01:25.675164 15899 fsHandler.go:135] du and find on following dirs took 1.103386407s: [/var/lib/docker/overlay2/1b0e691253eb2d926dae2cf4bbdbed7cd39db2811346435c55f3d2cb76b9dea8/diff /var/lib/docker/containers/7559b82305fd84657385cdf06be17f61cc6949e27e04217c0bb4de7b5393a14d]; will not log again for this container unless duration exceeds 2s

After 1 hour with 500 pods deployed on the compute node, the timeout duration now shows 13 seconds:

# journalctl | grep -i "du and find" | tail -1
Nov 03 20:52:51 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 20:52:51.550766 15899 fsHandler.go:135] du and find on following dirs took 12.081128767s: [/var/lib/docker/overlay2/3c506ad7a98b2e26b2f6850ab3c447fe8b5a1b364dae67f0fb08ced58455609c/diff /var/lib/docker/containers/e5b6a944d8b68776c6b1d9ad737b5d496a836317f619467a91ef1ef8c9d0eb48]; will not log again for this container unless duration exceeds 13s

# journalctl | grep -c "du and find"
12010
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188