Bug 1328913 - Long running reliability tests show network errors on nodes
Summary: Long running reliability tests show network errors on nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Seth Jennings
QA Contact: Vikas Laad
URL:
Whiteboard:
Duplicates: 1357052
Depends On: 1367141
Blocks: OSOPS_V3 1496245
 
Reported: 2016-04-20 14:33 UTC by Vikas Laad
Modified: 2018-01-08 13:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1496245
Environment:
Last Closed: 2017-11-28 21:51:43 UTC
Target Upstream Version:


Attachments
dm-failure.log (38.76 KB, text/plain)
2016-07-15 21:58 UTC, Seth Jennings


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1367141 unspecified CLOSED docker not removing cgroups for exited containers 2020-10-14 00:28:05 UTC
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Internal Links: 1367141

Description Vikas Laad 2016-04-20 14:33:11 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Vikas Laad 2016-04-20 14:44:21 UTC
Description of problem:
Long-running reliability tests have hit this problem many times. These tests create a few sample applications (ruby, cakephp, dancer and eap) at the start and then keep rebuilding and accessing them for a few days. I have the output of the network debug script but could not attach it to the bug because of its size. Please ping me on IRC; my nick is vlaad.

Version-Release number of selected component (if applicable):
openshift v3.2.0.16

How reproducible:


Steps to Reproduce:
1. Please see the description

Actual results:
Logs are full of the following errors:
Apr 20 10:42:31 ip-172-31-7-135 atomic-openshift-node: I0420 10:42:31.779154   63549 helpers.go:101] Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory


Expected results:


Additional info:

Comment 2 Vikas Laad 2016-04-20 14:45:22 UTC
I still have the environment running in case someone wants to look at it.

Comment 3 Ben Bennett 2016-04-20 15:56:31 UTC
These messages are usually from cadvisor / heapster trying to get the stats from a pod that's just gone away.  Are you getting messages repeated for the same pid for a long time?  Or are the pids changing?  If they are changing, I'm told that it's somewhat expected (but annoying) behavior.
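
One way to answer the "same pid or changing pids" question is to count how often each pid appears in the error lines. The following is a minimal sketch, not part of the original report: the regular expression is derived only from the log line quoted in this bug, and the function names and the `threshold` value are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Pattern taken from the single log line quoted in this bug; other
# cadvisor versions may word the message differently (assumption).
PID_RE = re.compile(r"Unable to get network stats from pid (\d+):")


def count_pids(lines):
    """Return a Counter mapping pid -> number of matching error lines."""
    counts = Counter()
    for line in lines:
        match = PID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def repeated_pids(counts, threshold=10):
    """Return the pids seen at least `threshold` times, sorted ascending.

    The default threshold is arbitrary; pick one that suits the log window.
    """
    return sorted(pid for pid, n in counts.items() if n >= threshold)
```

Passing the lines of a saved node log to `count_pids` and then calling `repeated_pids` on the result gives a rough split: a short or empty list suggests the pids are churning, while a few pids with very high counts suggest a container that cadvisor keeps trying to monitor.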

Comment 4 Vikas Laad 2016-04-20 17:26:59 UTC
pids are changing.

Comment 5 dhodovsk 2016-06-03 14:34:17 UTC
Could you please try to reproduce the issue again with a higher log level (5 should be fine) and send me the logs?

Comment 6 Vikas Laad 2016-06-06 15:24:04 UTC
I am running the tests, will update the bug with information when I have it reproduced.

Comment 8 Sten Turpin 2016-07-15 15:21:41 UTC
*** Bug 1357052 has been marked as a duplicate of this bug. ***

Comment 9 Seth Jennings 2016-07-15 21:39:32 UTC
The direct cause is that the container process has exited but cadvisor continues to try to monitor it.

This is because docker containers are removed from cadvisor monitoring implicitly by the removal of their cgroup, which cadvisor watches for with inotify.

A disconnect is occurring where the pid 1 of the docker container exits but the cgroup isn't removed, leading to a systemd docker container slice with no tasks:

# find /sys/fs/cgroup/systemd/system.slice/docker* -name tasks | xargs wc -l | sort -rn
...
0 /sys/fs/cgroup/systemd/system.slice/docker-4af4dc9c32a97a7a0bf0f26464898426389f18228dedf835e1ec8bad61d4c623.scope/tasks
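
The same check can be scripted so it is easier to run repeatedly during a long test. This is a rough sketch under stated assumptions: it assumes the cgroup v1 layout shown in the path above, it only looks at the top-level `tasks` file of each `docker-*.scope` directory (the `find` command above also descends into subdirectories), and the function name `empty_docker_scopes` is hypothetical.

```python
import os


def empty_docker_scopes(root="/sys/fs/cgroup/systemd/system.slice"):
    """Return docker-*.scope directories under `root` whose tasks file is empty.

    The default root is the path quoted in this bug; other hosts may differ.
    """
    empty = []
    try:
        entries = sorted(os.listdir(root))
    except FileNotFoundError:
        return empty  # no such cgroup hierarchy on this host
    for name in entries:
        if not (name.startswith("docker-") and name.endswith(".scope")):
            continue
        tasks_path = os.path.join(root, name, "tasks")
        try:
            with open(tasks_path) as f:
                has_tasks = any(line.strip() for line in f)
        except OSError:
            continue  # scope vanished or is unreadable; skip it
        if not has_tasks:
            empty.append(os.path.join(root, name))
    return empty
```

Each returned path is a container scope with no remaining tasks, which is the condition described above as keeping cadvisor's monitoring alive after the container process has exited.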

Comment 10 Seth Jennings 2016-07-15 21:58:20 UTC
Created attachment 1180293
dm-failure.log

Attaching a log with a selected section regarding a container that resulted in a dead cgroup.  It shows a device mapper removal failure that might be causing docker to bail out before the cgroup teardown.

Comment 11 Seth Jennings 2016-08-08 20:33:01 UTC
Upstream issue https://github.com/kubernetes/kubernetes/issues/30171

Comment 14 Derek Carr 2017-07-21 15:05:15 UTC
Opened a PR to cadvisor to reduce log spam:
https://github.com/google/cadvisor/pull/1700

Comment 17 Seth Jennings 2017-09-06 21:31:02 UTC
Origin PR:
https://github.com/openshift/origin/pull/16189

This will reduce the spam, but good log rotation practices must still be used to avoid filling the disk.

Comment 20 Neeraj 2017-09-26 09:15:03 UTC
Please let me know whether this fix has been backported to 3.2.1.

Comment 26 Vikas Laad 2017-10-26 17:15:53 UTC
I do not see these errors at loglevel=2 in the following version and later:

openshift v3.7.0-0.143.2
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Comment 30 errata-xmlrpc 2017-11-28 21:51:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

