Bug 1425301 - [3.3] Node 100% CPU usage with many stale container volumes and volume_stat_caculator panics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.3.1
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-21 07:00 UTC by Takayoshi Kimura
Modified: 2020-05-14 15:39 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where the OpenShift node logs a panic with a nil dereference during volume teardown.
Clone Of:
Environment:
Last Closed: 2017-03-15 20:03:06 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Product Errata RHBA-2017:0512 (Priority: normal)
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform 3.4.1.10, 3.3.1.17, and 3.2.1.28 bug fix update
Last Updated: 2017-03-16 00:01:17 UTC

Description Takayoshi Kimura 2017-02-21 07:00:08 UTC
Description of problem:

For some reason there are many stale container volumes; the node process keeps running `du` against them, volume_stat_caculator panics, and the node consumes 100% CPU.

Restarting the node process seems to alleviate the problem temporarily.

Logs will be attached.

* du logs

atomic-openshift-node: I0217 fsHandler.go:116] `du` on following dirs took 1.047866932s: [ /var/lib/docker/containers/51eb2d3df53ee04711af3c688b7b8f05fff59793283d6372cbda26f90d55015b]

* panic logs

atomic-openshift-node: I0220 operation_executor.go:824] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/57b508f8-f5e6-11e6-8d55-0050568348cc-default-token-d6lb3" (OuterVolumeSpecName: "default-token-d6lb3") pod "57b508f8-f5e6-11e6-8d55-0050568348cc" (UID: "57b508f8-f5e6-11e6-8d55-0050568348cc"). InnerVolumeSpecName "default-token-d6lb3". PluginName "kubernetes.io/secret", VolumeGidValue ""
atomic-openshift-node: E0220 19:32:52.565299  102547 runtime.go:52] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:472
atomic-openshift-node: /usr/lib/golang/src/runtime/panic.go:443
atomic-openshift-node: /usr/lib/golang/src/runtime/panic.go:62
atomic-openshift-node: /usr/lib/golang/src/runtime/sigpanic_unix.go:24
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/stats/volume_stat_caculator.go:98
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/server/stats/volume_stat_caculator.go:63
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:88
atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:89
atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:1998

* stale volumes

# df -h | grep openshift.local.volumes | grep tmpfs | wc -l
6329
# df -h | grep openshift.local.volumes | grep tmpfs | head -1
tmpfs                                   32G   12K   32G   1% /var/lib/origin/openshift.local.volumes/pods/ca4300a4-d2af-11e6-8a27-0050568379a2/volumes/kubernetes.io~secret/default-token-6jo4n

Version-Release number of selected component (if applicable):

atomic-openshift-3.3.1.7-1.git.0.0988966.el7.x86_64

How reproducible:

Not yet known. Observed once in a customer environment; no changes were made around the time the issue started.

Steps to Reproduce:
1.
2.
3.

Actual results:

For some reason there are many stale container volumes; the node process keeps running `du` against them, volume_stat_caculator panics, and the node consumes 100% CPU.

Expected results:

No stale volumes and no panic.

Additional info:
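
For scripted monitoring of the leak, here is a rough Go equivalent of the df pipeline from the description. This is a sketch based only on the paths shown in this report, not part of any fix; it counts tmpfs mounts under openshift.local.volumes by parsing /proc/mounts.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	count := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// /proc/mounts fields: device mountpoint fstype options dump pass
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[2] == "tmpfs" &&
			strings.Contains(fields[1], "openshift.local.volumes") {
			count++
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("tmpfs mounts under openshift.local.volumes: %d\n", count)
}

On the affected node this should report a figure in the thousands, matching the `wc -l` output above.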

Comment 12 Seth Jennings 2017-02-21 18:20:11 UTC
The panics are fixed in OCP 3.4 and higher by this:
https://github.com/openshift/origin/commit/4f830f3

I will backport to OSE 3.3.
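
For illustration only, a minimal sketch of the class of fix; this is not the commit linked above, and every type and method name below is hypothetical. The panic at volume_stat_caculator.go:98 is a nil dereference during stat collection, and the defensive pattern is to skip volumes whose metrics disappear when they are torn down between listing and measurement.

package main

import "log"

// Metrics stands in for a volume's usage numbers; pointer fields model
// values that can be missing once the underlying mount disappears.
type Metrics struct {
	Used     *int64
	Capacity *int64
}

// Volume stands in for a mounted pod volume.
type Volume interface {
	Name() string
	GetMetrics() (*Metrics, error)
}

// calcVolumeStats collects stats defensively: it skips, rather than
// dereferences, volumes whose metrics vanished during teardown.
func calcVolumeStats(vols []Volume) {
	for _, v := range vols {
		m, err := v.GetMetrics()
		if err != nil || m == nil {
			log.Printf("skipping volume %q: metrics unavailable", v.Name())
			continue
		}
		if m.Used == nil || m.Capacity == nil {
			continue // partial metrics; nothing safe to report
		}
		log.Printf("volume %q: %d/%d bytes used", v.Name(), *m.Used, *m.Capacity)
	}
}

// stubVolume simulates a volume torn down between listing and measurement.
type stubVolume struct{ name string }

func (s stubVolume) Name() string                  { return s.name }
func (s stubVolume) GetMetrics() (*Metrics, error) { return nil, nil }

func main() {
	calcVolumeStats([]Volume{stubVolume{"default-token-d6lb3"}})
}

Run against the torn-down stub, the sketch logs a skip message instead of panicking on a nil pointer, which is the behavior the backport should restore.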

Comment 18 Troy Dawson 2017-02-24 20:36:11 UTC
This has been merged into OCP and is in v3.3.1.16 or newer.

Comment 20 DeShuai Ma 2017-03-03 07:46:10 UTC
Verified on OCP v3.3.1.16; this panic no longer occurs on the node.
[root@ip-172-18-12-128 ~]# openshift version
openshift v3.3.1.16
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

Comment 22 errata-xmlrpc 2017-03-15 20:03:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0512

