Description of problem:

This is on the CNCF scale cluster. During a run attempting to scale up to 500 nodes/20K projects/80K pods, things went fairly smoothly up to the 13K project/52K pod level, when repeated panics started showing in the master controller logs. Prior to that, a few deployments timed out but could be successfully re-attempted. Two nodes were also marked unschedulable earlier due to a Docker devicemapper issue.

There were two types of panics: a nil pointer dereference panic and "Key found in assumed set but not in podStates".

The nil pointer dereference panics first appear as a large burst starting at Aug 17 20:57:53

Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: I0817 20:27:54.715542 117702 vnids.go:114] Associate netid 12467 to namespace "cncf87445"
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: E0817 20:27:54.990751 117702 runtime.go:52] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:62
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/sigpanic_unix.go:24
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:317
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 17 20:27:54 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998

The repeated "Key found in assumed set but not in podStates" panic starts Aug 18 08:22:51

Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.714270 93951 vnids.go:114] Associate netid 12669 to namespace "cncfa156"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: E0818 08:22:51.788621 71171 runtime.go:52] Recovered from panic: "Key found in assumed set but not in podStates. Potentially a logical error." (Key found in assumed set but not in podStates. Potentially a logical error.)
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:472
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/panic.go:443
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:315
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:303
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache/cache.go:299
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:86
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:87
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /builddir/build/BUILD/atomic-openshift-git-0.bca8829/_build/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-master-controllers: /usr/lib/golang/src/runtime/asm_amd64.s:1998
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.945076 93951 vnids.go:114] Associate netid 12670 to namespace "cncfa153"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.954411 93951 vnids.go:114] Associate netid 12671 to namespace "cncfa154"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.963181 93951 vnids.go:114] Associate netid 12672 to namespace "cncfa155"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.970751 93951 vnids.go:114] Associate netid 12673 to namespace "cncfa151"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.977170 93951 vnids.go:114] Associate netid 12674 to namespace "cncfa152"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.982920 93951 vnids.go:114] Associate netid 12675 to namespace "cncfa157"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.991279 93951 vnids.go:114] Associate netid 12676 to namespace "cncfa158"
Aug 18 08:22:51 mvirt-m-1 atomic-openshift-node: I0818 08:22:51.995738 93951 vnids.go:114] Associate netid 12677 to namespace "cncfa159"

Version-Release number of selected component (if applicable):

3.3.0.18

Environment:

3 m4.10xlarge etcd (40 vCPU, 160GB memory)
3 m4.10xlarge masters
1 m4.4xlarge master load balancer
2 m4.4xlarge infra (router/registry)
500 m4.xlarge nodes

Each project has:

1 user
3 dc w/4 running pods
3 rc
3 svc
3 routes
3 bc
6 builds
1 is
20 svc

Will attach system log and master-config shortly.
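For anyone triaging a similar run, the two panic types reported above can be tallied from a saved system log with a short script. This is only a sketch: it assumes the log lines keep the `runtime.go:NN] Recovered from panic: "..."` shape shown in the excerpts, and the function name `count_panics` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the "Recovered from panic" line as it appears in the log
# excerpts of this report; the quoted string is the panic message.
PANIC_RE = re.compile(r'runtime\.go:\d+\] Recovered from panic: "([^"]*)"')


def count_panics(lines):
    """Return a Counter mapping each panic message to its occurrence count."""
    counts = Counter()
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from stdin and print one row per distinct panic message.
    for message, n in count_panics(sys.stdin).most_common():
        print(f"{n:6d}  {message}")
```

Saved as, say, count_panics.py, it can be fed the attached system log on stdin (python count_panics.py < messages). Stack frame lines and unrelated messages such as the vnids.go entries are ignored.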
Fixed in https://github.com/kubernetes/kubernetes/pull/29093
Fixed in https://github.com/openshift/origin/pull/10518
Leaving this in MODIFIED until the fix is in a puddle.
This has been merged into ose and is in OSE v3.3.0.23 or newer.
Verified this on 3.3.0.23 - panic is gone for the same type of run. Will continue to watch for it in other horizontal scale runs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933