Bug 1368155

Summary: schedulercache panics on master controller during scale run @ 500 nodes/13K project/52K pods
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Node
Assignee: Clayton Coleman <ccoleman>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: aos-bugs, ccoleman, jeder, jokerman, mifiedle, mmccomas, tdawson, tstclair, xtian
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-27 09:45:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Mike Fiedler 2016-08-18 14:30:00 UTC
Description of problem: 

This is on the CNCF scale cluster. During a run attempting to scale up to 500 nodes/20K projects/80K pods, things went fairly smoothly up to the 13K project/52K pod level, when repeated panics started showing in the master controller logs. Prior to that, a few deployments timed out but could be successfully re-attempted.

Two nodes were also marked unschedulable earlier due to a Docker devicemapper issue.

There were two types of panics: a nil pointer dereference panic and a "Key found in assumed set but not in podStates" panic.

The nil pointer dereference panics came first, in a large burst starting at Aug 17 20:57:53:

Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: I0817 20:27:54.715542  117702 vnids.go:114] Associate netid 12467 to namespace "cncf87445"
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: E0817 20:27:54.990751  117702 runtime.go:52] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:62
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/sigpanic_unix.go:24
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:317
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998


The repeated "Key found in assumed set but not in podStates" panic starts at Aug 18 08:22:51:

Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.714270   93951 vnids.go:114] Associate netid 12669 to namespace "cncfa156"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: E0818 08:22:51.788621   71171 runtime.go:52] Recovered from panic: "Key found in assumed set but not in podStates. Potentially a logical error." (Key found in assumed set but not in podStates. Potentially a logical error.)
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:315
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.945076   93951 vnids.go:114] Associate netid 12670 to namespace "cncfa153"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.954411   93951 vnids.go:114] Associate netid 12671 to namespace "cncfa154"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.963181   93951 vnids.go:114] Associate netid 12672 to namespace "cncfa155"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.970751   93951 vnids.go:114] Associate netid 12673 to namespace "cncfa151"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.977170   93951 vnids.go:114] Associate netid 12674 to namespace "cncfa152"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.982920   93951 vnids.go:114] Associate netid 12675 to namespace "cncfa157"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.991279   93951 vnids.go:114] Associate netid 12676 to namespace "cncfa158"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.995738   93951 vnids.go:114] Associate netid 12677 to namespace "cncfa159"




Version-Release number of selected component (if applicable):

3.3.0.18


Environment:

3 m4.10xlarge etcd (40 vCPU, 160GB memory)
3 m4.10xlarge masters
1 m4.4xlarge master load balancer
2 m4.4xlarge infra (router/registry)
500 m4.xlarge nodes

Each project has

1 user
3 dc w/4 running pods
3 rc
3 svc
3 routes
3 bc
6 builds
1 is
20 svc
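
As a sanity check on the scale figures, the per-project counts above can be multiplied out. This is a small Go sketch; it assumes "13K" and "20K" mean exactly 13000 and 20000 projects and that each project contributes 4 running pods in total, which is the reading that matches the 52K and 80K pod figures in the description. The perProject type and totals function are hypothetical names.

```go
package main

import "fmt"

// perProject holds per-project object counts taken from the list above.
type perProject struct {
	runningPods, dcs, routes, builds int
}

// totals scales the per-project counts to the given number of projects.
func totals(projects int, p perProject) perProject {
	return perProject{
		runningPods: projects * p.runningPods,
		dcs:         projects * p.dcs,
		routes:      projects * p.routes,
		builds:      projects * p.builds,
	}
}

func main() {
	// Assumption: 4 running pods per project in total.
	p := perProject{runningPods: 4, dcs: 3, routes: 3, builds: 6}
	for _, n := range []int{13000, 20000} {
		t := totals(n, p)
		fmt.Printf("%d projects: %d running pods, %d dc, %d routes, %d builds\n",
			n, t.runningPods, t.dcs, t.routes, t.builds)
	}
}
```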

Will attach system log and master-config shortly

Comment 2 Clayton Coleman 2016-08-18 17:51:53 UTC
Fixed in https://github.com/kubernetes/kubernetes/pull/29093

Comment 3 Clayton Coleman 2016-08-18 18:37:59 UTC
Fixed in https://github.com/openshift/origin/pull/10518

Comment 4 Mike Fiedler 2016-08-18 19:20:29 UTC
MODIFIED until in a puddle

Comment 5 Troy Dawson 2016-08-19 20:57:05 UTC
This has been merged into ose and is in OSE v3.3.0.23 or newer.

Comment 7 Mike Fiedler 2016-08-23 18:50:01 UTC
Verified this on 3.3.0.23 - panic is gone for the same type of run.   Will continue to watch for it in other horizontal scale runs.

Comment 9 errata-xmlrpc 2016-09-27 09:45:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933