Bug 1368155 - schedulercache panics on master controller during scale run @ 500 nodes/13K project/52K pods
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Assignee: Clayton Coleman
QA Contact: Mike Fiedler
Reported: 2016-08-18 14:30 UTC by Mike Fiedler
Modified: 2016-09-27 09:45 UTC (History)

Last Closed: 2016-09-27 09:45:02 UTC



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2016:1933 | 0 | normal | SHIPPED_LIVE | Red Hat OpenShift Container Platform 3.3 Release Advisory | 2016-09-27 13:24:36 UTC

Description Mike Fiedler 2016-08-18 14:30:00 UTC
Description of problem: 

This is on the CNCF scale cluster. During a run attempting to scale up to 500 nodes/20K projects/80K pods, things went fairly smoothly up to the 13K project/52K pod level, when repeated panics started showing up in the master controller logs. Prior to that, a few deployments timed out but were successfully re-attempted.

Two nodes were also marked unschedulable earlier due to a Docker devicemapper issue.

There were two types of panics: a nil pointer dereference panic and a "Key found in assumed set but not in podStates" panic.

The nil pointer dereference panics came first, in a large burst starting at Aug 17 20:27:54:

Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: I0817 20:27:54.715542  117702 vnids.go:114] Associate netid 12467 to namespace "cncf87445"
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: E0817 20:27:54.990751  117702 runtime.go:52] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:62
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/sigpanic_unix.go:24
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:317
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998


The repeated "Key found in assumed set but not in podStates" panic starts at Aug 18 08:22:51:

Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.714270   93951 vnids.go:114] Associate netid 12669 to namespace "cncfa156"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: E0818 08:22:51.788621   71171 runtime.go:52] Recovered from panic: "Key found in assumed set but not in podStates. Potentially a logical error." (Key found in assumed set but not in podStates. Potentially a logical error.)
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:315
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.945076   93951 vnids.go:114] Associate netid 12670 to namespace "cncfa153"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.954411   93951 vnids.go:114] Associate netid 12671 to namespace "cncfa154"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.963181   93951 vnids.go:114] Associate netid 12672 to namespace "cncfa155"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.970751   93951 vnids.go:114] Associate netid 12673 to namespace "cncfa151"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.977170   93951 vnids.go:114] Associate netid 12674 to namespace "cncfa152"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.982920   93951 vnids.go:114] Associate netid 12675 to namespace "cncfa157"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.991279   93951 vnids.go:114] Associate netid 12676 to namespace "cncfa158"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.995738   93951 vnids.go:114] Associate netid 12677 to namespace "cncfa159"
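
Both traces above share the outer cache.go frames (303 and 299) under wait.go and differ only in the innermost frame (317 for the nil pointer dereference, 315 for the explicit panic), which suggests that both come from the same periodic pass over the assumed set. The Go sketch below is a toy model of that situation, not the Kubernetes source: toyCache, podState, newState, cleanupAssumed and recoverCleanup are hypothetical names, and the idea that the nil dereference comes from an entry without a deadline is an assumption made only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// podState is a hypothetical per-pod record; a nil deadline models an
// entry that was never given an expiry (an assumption for this sketch).
type podState struct {
	deadline *time.Time
}

// toyCache models two maps that are expected to stay in sync:
// every key in assumed should also be present in podStates.
type toyCache struct {
	assumed   map[string]bool
	podStates map[string]*podState
}

// newState returns a state whose deadline is secondsFromNow away
// (negative values are already expired).
func newState(secondsFromNow int) *podState {
	d := time.Now().Add(time.Duration(secondsFromNow) * time.Second)
	return &podState{deadline: &d}
}

// cleanupAssumed drops assumed pods whose deadline has passed.
func (c *toyCache) cleanupAssumed(now time.Time) {
	for key := range c.assumed {
		ps, ok := c.podStates[key]
		if !ok {
			// Invariant broken: the two maps disagree.
			panic("Key found in assumed set but not in podStates. Potentially a logical error.")
		}
		// Dereferencing a nil deadline triggers a runtime nil pointer panic.
		if now.After(*ps.deadline) {
			delete(c.assumed, key)
			delete(c.podStates, key)
		}
	}
}

// recoverCleanup runs one pass and returns the recovered panic message,
// or "" when the pass completes normally.
func recoverCleanup(c *toyCache) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	c.cleanupAssumed(time.Now())
	return ""
}

func main() {
	orphan := &toyCache{
		assumed:   map[string]bool{"ns/pod-a": true},
		podStates: map[string]*podState{},
	}
	noDeadline := &toyCache{
		assumed:   map[string]bool{"ns/pod-b": true},
		podStates: map[string]*podState{"ns/pod-b": &podState{}},
	}
	fmt.Println(recoverCleanup(orphan))
	fmt.Println(recoverCleanup(noDeadline))
}
```

In this model, recovering the panic keeps the process alive, but the inconsistent entry is still there on the next pass, so the same panic is expected to repeat on every pass. That would fit the repeated entries in the controller log, though the sketch does not establish the actual root cause.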




Version-Release number of selected component (if applicable):

3.3.0.18


Environment:

3 m4.10xlarge etcd (40 vCPU, 160GB memory)
3 m4.10xlarge masters
1 m4.4xlarge master load balancer
2 m4.4xlarge infra (router/registry)
500 m4.xlarge nodes

Each project has

1 user
3 dc w/4 running pods
3 rc
3 svc
3 routes
3 bc
6 builds
1 is
20 svc
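
As a rough cross-check of the figures quoted in this report (500 nodes, 13K projects/52K pods when the panics began, 20K projects/80K pods as the target), the Go sketch below derives pods per project and average pods per node. perUnit is a hypothetical helper, and an even spread of pods across all 500 nodes is an assumption (two nodes were marked unschedulable during the run).

```go
package main

import "fmt"

// perUnit returns total/units using integer division.
func perUnit(total, units int) int {
	return total / units
}

func main() {
	const nodes = 500 // node count from the environment description

	// Level at which the panics started, and the target level of the run.
	fmt.Println("pods per project at panic level:", perUnit(52000, 13000))
	fmt.Println("pods per project at target level:", perUnit(80000, 20000))

	// Average density, assuming pods are spread evenly over all nodes.
	fmt.Println("avg pods per node at panic level:", perUnit(52000, nodes))
	fmt.Println("avg pods per node at target level:", perUnit(80000, nodes))
}
```

Both levels work out to 4 pods per project, which suggests the "3 dc w/4 running pods" line means 4 running pods per project in total rather than 4 per dc.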

Will attach system log and master-config shortly
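
The "Recovered from panic" lines quoted above carry the panic message in double quotes, so the two panic types can be counted separately once the full system log is at hand. A minimal Go sketch, assuming the log is piped in on stdin; tallyPanics and tallyPanicText are hypothetical helper names, not part of any OpenShift tooling:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"regexp"
	"strings"
)

// panicRe captures the quoted message of a "Recovered from panic" log line.
var panicRe = regexp.MustCompile(`Recovered from panic: "([^"]*)"`)

// tallyPanics counts recovered panics per message in a log stream.
func tallyPanics(r io.Reader) (map[string]int, error) {
	counts := make(map[string]int)
	sc := bufio.NewScanner(r)
	// Raise the per-line limit so unusually long log lines do not abort the scan.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		if m := panicRe.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts, sc.Err()
}

// tallyPanicText is a convenience wrapper for in-memory log text.
// The scan error is dropped here; use tallyPanics to inspect it.
func tallyPanicText(text string) map[string]int {
	counts, _ := tallyPanics(strings.NewReader(text))
	return counts
}

func main() {
	counts, err := tallyPanics(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	for msg, n := range counts {
		fmt.Printf("%6d  %s\n", n, msg)
	}
}
```

Stack-frame lines and unrelated entries such as the vnids.go messages do not match the pattern and are ignored.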

Comment 2 Clayton Coleman 2016-08-18 17:51:53 UTC
Fixed in https://github.com/kubernetes/kubernetes/pull/29093

Comment 3 Clayton Coleman 2016-08-18 18:37:59 UTC
Fixed in https://github.com/openshift/origin/pull/10518

Comment 4 Mike Fiedler 2016-08-18 19:20:29 UTC
MODIFIED until in a puddle

Comment 5 Troy Dawson 2016-08-19 20:57:05 UTC
This has been merged into ose and is in OSE v3.3.0.23 or newer.

Comment 7 Mike Fiedler 2016-08-23 18:50:01 UTC
Verified on 3.3.0.23: the panic is gone for the same type of run. Will continue to watch for it in other horizontal scale runs.

Comment 9 errata-xmlrpc 2016-09-27 09:45:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

