Bug 1498632
Summary: | OCP 3.7 syslog and journal filling up with looping "du and find on following dirs took..." messages when exceeding 450 pods per node
---|---
Product: | OpenShift Container Platform
Component: | Node
Version: | 3.7.0
Target Release: | 3.7.0
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | medium
Priority: | unspecified
Reporter: | Walid A. <wabouham>
Assignee: | Seth Jennings <sjenning>
QA Contact: | Walid A. <wabouham>
CC: | aos-bugs, jeder, jokerman, mifiedle, mmccomas, rgudimet, sjenning, xtian
Doc Type: | No Doc Update
Type: | Bug
Last Closed: | 2017-11-28 22:15:02 UTC
Description (Walid A., 2017-10-04 19:54:47 UTC)
Created attachment 1334447: top command output on one of the compute nodes showing the looping messages
Comment from ravig (comment #3):

Walid, I noticed that too when I was running 512 pods/node on a 20-node cluster. On the nodes in my cluster, I had 512 pods running for the past 7 days, and the count in messages is:

```
/var/log # grep -c "du and find" messages
3815820
```

So, filling disk space at this rate is really alarming. Roughly, it would take

The offending line in code seems to be:
https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L131

where glog.V(2) is being used, and in OCP we have the default log level set to 2. So, this line is continuously being logged.

There are two solutions that I could think of:

- Increase the glog.V(2) to a higher value like glog.V(4) or above.
- Increase the timeout at https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L130 (which is always 1 sec, irrespective of the number of directories being listed).

(In reply to ravig from comment #3)

Roughly, it would take 23-25 days to fill up 10GB of disk at this rate and log level (about 545,000 lines per day from the count above; at several hundred bytes per journal entry, that is a few hundred MB per day).

cadvisor PR: https://github.com/google/cadvisor/pull/1766

Verified fix on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.178.0: 1 master/etcd, 1 infra node, and 2 compute nodes. "du and find" log messages no longer continuously loop with 500 pause pods per compute node.

```
# openshift version
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8
```

After 4 hours with 500 pods running on that compute node:

```
# journalctl -r | grep -c "du and find"
9116

# journalctl -r | grep "du and find" | head -1
Oct 27 15:46:06 ip-172-31-50-56.us-west-2.compute.internal atomic-openshift-node[15749]: I1027 15:46:06.593787 15749 fsHandler.go:135] du and find on following dirs took 10.364326077s: [/var/lib/docker/overlay2/6af7b0be155344d1727089ff5d33e4d0aa50705a3bce31d6ab7f0d910ecc4a8a/diff /var/lib/docker/containers/2c77970c4140884c62deb5de46597703ab68af470ac5362e7a3571e4f00b0745]; will not log again for this container unless duration exceeds 11000000000 seconds.
```

Do we want to keep the timeout duration at this very large value of 11000000000 seconds, or choose a reasonably lower value? Moving this back to ON_QA to get confirmation that the above message is expected.

@Seth Is 11000000000 seconds the expected value? I took a quick look at the code and it wasn't clear that's how the adaptive backoff should trigger.

So the logic is correct, but the duration value is not being formatted correctly for the log message. It looks like it is formatting the time in nanoseconds rather than seconds. I'll create a follow-on PR to clean that up.
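For context on the mechanism discussed in this thread: glog.V(n) calls only emit when the process runs at log level n or higher, so a V(2) call fires under the node's default --v=2, and the fix in cadvisor PR 1766 raises a per-container threshold after each slow run. Below is a minimal sketch of that gate-and-backoff pattern. The fsHandler type, the reportSlowScan name, and the round-up-to-the-next-second rule are assumptions inferred from the log output in this bug (1.103s -> 2s, 12.081s -> 13s), not the actual cadvisor code:

```go
package main

import (
	"flag"
	"time"

	"github.com/golang/glog"
)

// fsHandler is a hypothetical stand-in for cadvisor's handler; only the
// logging-related state is sketched here.
type fsHandler struct {
	// logThreshold is the duration a du/find pass must exceed before it
	// is logged; it is raised after every log line, so each container
	// logs at most once per new worst-case duration.
	logThreshold time.Duration
}

func (fh *fsHandler) reportSlowScan(dirs []string, took time.Duration) {
	if took <= fh.logThreshold {
		return
	}
	// Round the observed duration up to the next whole second and adopt
	// that as the new threshold (assumed rule, matching the 1.103s -> 2s
	// and 12.081s -> 13s progression in the verification output below).
	fh.logThreshold = took.Truncate(time.Second) + time.Second
	// V(2) fires at OCP's default node log level (--v=2); raising this
	// to V(4) was the other fix ravig proposed above.
	glog.V(2).Infof("du and find on following dirs took %v: %v; will not log again for this container unless duration exceeds %v",
		took, dirs, fh.logThreshold)
}

func main() {
	flag.Parse() // glog registers -v and friends on the default flag set
	fh := &fsHandler{logThreshold: time.Second}
	fh.reportSlowScan([]string{"/var/lib/docker/overlay2/example/diff"}, 1103*time.Millisecond)
	glog.Flush()
}
```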
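And a minimal sketch of the formatting slip itself: a Go time.Duration is an int64 count of nanoseconds, so printing it with an integer verb next to a hard-coded "seconds" label yields the "11000000000 seconds" message above. This is an illustration of the class of bug, assuming a %d verb was used; it is not the actual diff in the cleanup PR below:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 11 * time.Second

	// Buggy pattern: %d prints the Duration's underlying int64, which
	// counts nanoseconds, so an 11s timeout renders as
	// "11000000000 seconds".
	fmt.Printf("will not log again unless duration exceeds %d seconds\n", timeout)

	// Fixed pattern: %v lets the Duration format itself, printing "11s".
	fmt.Printf("will not log again unless duration exceeds %v\n", timeout)
}
```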
FYI, for the message cleanup:

Upstream PR: https://github.com/google/cadvisor/pull/1785
Origin PR: https://github.com/openshift/origin/pull/17088

The new PR has not merged yet (https://github.com/openshift/origin/pull/17088). Moving back to MODIFIED.

Verified that the duration timeout now shows the correct value in seconds on an OCP cluster, m4.xlarge instance type on EC2, v3.7.0-0.190.0: 1 master/etcd, 1 infra node, and 2 compute nodes. "du and find" log messages show an adaptive timeout duration, starting at 2 seconds and ending up at about 13 seconds after 1 hour with 500 pods per node.

```
# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

# journalctl | grep -i "du and find" | head -1
Nov 03 19:01:25 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 19:01:25.675164 15899 fsHandler.go:135] du and find on following dirs took 1.103386407s: [/var/lib/docker/overlay2/1b0e691253eb2d926dae2cf4bbdbed7cd39db2811346435c55f3d2cb76b9dea8/diff /var/lib/docker/containers/7559b82305fd84657385cdf06be17f61cc6949e27e04217c0bb4de7b5393a14d]; will not log again for this container unless duration exceeds 2s
```

After 1 hour with 500 pods deployed on the compute node, the timeout duration now shows 13 seconds:

```
# journalctl | grep -i "du and find" | tail -1
Nov 03 20:52:51 ip-172-31-10-208.us-west-2.compute.internal atomic-openshift-node[15899]: I1103 20:52:51.550766 15899 fsHandler.go:135] du and find on following dirs took 12.081128767s: [/var/lib/docker/overlay2/3c506ad7a98b2e26b2f6850ab3c447fe8b5a1b364dae67f0fb08ced58455609c/diff /var/lib/docker/containers/e5b6a944d8b68776c6b1d9ad737b5d496a836317f619467a91ef1ef8c9d0eb48]; will not log again for this container unless duration exceeds 13s

# journalctl | grep -c "du and find"
12010
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188