Bug 1498632
Summary: | OCP 3.7 syslog and journal filling up with looping "du and find on following dirs took..." messages when exceeding 450 pods per node
---|---
Product: | OpenShift Container Platform
Component: | Node
Version: | 3.7.0
Target Release: | 3.7.0
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | medium
Priority: | unspecified
Reporter: | Walid A. <wabouham>
Assignee: | Seth Jennings <sjenning>
QA Contact: | Walid A. <wabouham>
CC: | aos-bugs, jeder, jokerman, mifiedle, mmccomas, rgudimet, sjenning, xtian
Doc Type: | No Doc Update
Type: | Bug
Last Closed: | 2017-11-28 22:15:02 UTC
Description (Walid A., 2017-10-04 19:54:47 UTC)
Created attachment 1334447: top command output on one of the compute nodes showing the looping messages
Comment from ravig (comment #3):

Walid, I noticed that too when I was running 512 pods/node on a 20-node cluster. On the nodes in my cluster, I had 512 pods running for the past 7 days, and the count in messages is:

```
/var/log # grep -c "du and find" messages
3815820
```

So, filling disk space at this rate is really alarming. Roughly, it would take

The offending line in code seems to be:
https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L131

where glog.V(2) is being used, and in OCP we have the default log level set to 2. So, this line is continuously being logged.

There are two solutions that I could think of:

- Increase the glog.V(2) to a higher value like glog.V(4) or above.
- Increase the timeout at https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L130 (which is always 1 sec, irrespective of the number of directories being listed).

(In reply to ravig from comment #3)

Roughly, it would take 23-25 days to fill up 10GB of disk at this rate and log level (about 545,000 lines per day from the count above; at several hundred bytes per journal entry, that is a few hundred MB per day).

cadvisor PR: https://github.com/google/cadvisor/pull/1766

Verified fix on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.178.0: 1 master/etcd, 1 infra node, and 2 compute nodes. "du and find" log messages no longer continuously loop with 500 pause pods per compute node.

```
# openshift version
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8
```

After 4 hours with 500 pods running on that compute node:

```
# journalctl -r | grep -c "du and find"
9116

# journalctl -r | grep "du and find" | head -1
Oct 27 15:46:06 ip-172-31-50-56.us-west-2.compute.internal atomic-openshift-node[15749]: I1027 15:46:06.593787 15749 fsHandler.go:135] du and find on following dirs took 10.364326077s: [/var/lib/docker/overlay2/6af7b0be155344d1727089ff5d33e4d0aa50705a3bce31d6ab7f0d910ecc4a8a/diff /var/lib/docker/containers/2c77970c4140884c62deb5de46597703ab68af470ac5362e7a3571e4f00b0745]; will not log again for this container unless duration exceeds 11000000000 seconds.
```

Do we want to keep the timeout duration at this very large value of 11000000000 seconds, or choose a reasonably lower value? Moving this back to ON_QA to get confirmation that the above message is expected.

@Seth Is 11000000000 seconds the expected value? I took a quick look at the code and it wasn't clear that's how the adaptive backoff should trigger.

So the logic is correct, but the duration value is not being formatted correctly for the log message. It looks like it is formatting the time in nanoseconds rather than seconds. I'll create a follow-on PR to clean that up.
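For context on the mechanism discussed in this thread: glog.V(n) calls only emit when the process runs at log level n or higher, so a V(2) call fires under the node's default --v=2, and the fix in cadvisor PR 1766 raises a per-container threshold after each slow run. Below is a minimal sketch of that gate-and-backoff pattern. The fsHandler type, the reportSlowScan name, and the round-up-to-the-next-second rule are assumptions inferred from the log output in this bug (1.103s -> 2s, 12.081s -> 13s), not the actual cadvisor code:

```go
package main

import (
	"flag"
	"time"

	"github.com/golang/glog"
)

// fsHandler is a hypothetical stand-in for cadvisor's handler; only the
// logging-related state is sketched here.
type fsHandler struct {
	// logThreshold is the duration a du/find pass must exceed before it
	// is logged; it is raised after every log line, so each container
	// logs at most once per new worst-case duration.
	logThreshold time.Duration
}

func (fh *fsHandler) reportSlowScan(dirs []string, took time.Duration) {
	if took <= fh.logThreshold {
		return
	}
	// Round the observed duration up to the next whole second and adopt
	// that as the new threshold (assumed rule, matching the 1.103s -> 2s
	// and 12.081s -> 13s progression in the verification output below).
	fh.logThreshold = took.Truncate(time.Second) + time.Second
	// V(2) fires at OCP's default node log level (--v=2); raising this
	// to V(4) was the other fix ravig proposed above.
	glog.V(2).Infof("du and find on following dirs took %v: %v; will not log again for this container unless duration exceeds %v",
		took, dirs, fh.logThreshold)
}

func main() {
	flag.Parse() // glog registers -v and friends on the default flag set
	fh := &fsHandler{logThreshold: time.Second}
	fh.reportSlowScan([]string{"/var/lib/docker/overlay2/example/diff"}, 1103*time.Millisecond)
	glog.Flush()
}
```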
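And a minimal sketch of the formatting slip itself: a Go time.Duration is an int64 count of nanoseconds, so printing it with an integer verb next to a hard-coded "seconds" label yields the "11000000000 seconds" message above. This is an illustration of the class of bug, assuming a %d verb was used; it is not the actual diff in the cleanup PR below:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 11 * time.Second

	// Buggy pattern: %d prints the Duration's underlying int64, which
	// counts nanoseconds, so an 11s timeout renders as
	// "11000000000 seconds".
	fmt.Printf("will not log again unless duration exceeds %d seconds\n", timeout)

	// Fixed pattern: %v lets the Duration format itself, printing "11s".
	fmt.Printf("will not log again unless duration exceeds %v\n", timeout)
}
```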
FYI, for the message cleanup:

Upstream PR: https://github.com/google/cadvisor/pull/1785
Origin PR: https://github.com/openshift/origin/pull/17088

The new PR has not merged yet (https://github.com/openshift/origin/pull/17088). Moving back to MODIFIED.

Verified that the duration timeout now shows the correct value in seconds on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.190.0: 1 master/etcd, 1 infra node, and 2 compute nodes. "du and find" log messages show an adaptive timeout duration, starting at 2 seconds and ending up at about 13 seconds after 1 hour with 500 pods per node.

```
# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

# journalctl | grep -i "du and find" | head -1
Nov 03 19:01:25 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 19:01:25.675164 15899 fsHandler.go:135] du and find on following dirs took 1.103386407s: [/var/lib/docker/overlay2/1b0e691253eb2d926dae2cf4bbdbed7cd39db2811346435c55f3d2cb76b9dea8/diff /var/lib/docker/containers/7559b82305fd84657385cdf06be17f61cc6949e27e04217c0bb4de7b5393a14d]; will not log again for this container unless duration exceeds 2s
```

After 1 hour with 500 pods deployed on the compute node, the timeout duration now shows 13 seconds:

```
# journalctl | grep -i "du and find" | tail -1
Nov 03 20:52:51 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 20:52:51.550766 15899 fsHandler.go:135] du and find on following dirs took 12.081128767s: [/var/lib/docker/overlay2/3c506ad7a98b2e26b2f6850ab3c447fe8b5a1b364dae67f0fb08ced58455609c/diff /var/lib/docker/containers/e5b6a944d8b68776c6b1d9ad737b5d496a836317f619467a91ef1ef8c9d0eb48]; will not log again for this container unless duration exceeds 13s

# journalctl | grep -c "du and find"
12010
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188