Created attachment 1215014
Node log

Description of problem:

After migrating etcd storage from V2 to V3 and configuring the API servers to use storage-backend=etcd3, the nodes (which were not stopped during this time) started panicking repeatedly when the API servers came back up.

Sample:

Oct 27 19:59:52 localhost atomic-openshift-node: E1027 19:59:52.799799 16576 runtime.go:64] Observed a panic: "unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces" (unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces)
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:479
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/panic.go:458
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:187
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:34
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:573
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:312
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:490
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:343
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:271
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:202
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:88
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:89
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:2086

I let it run for about 10 minutes and there was no recovery. I'll run again and see if there is eventual recovery.

Version-Release number of selected component (if applicable):
3.4.0.16

How reproducible:
always

Steps to Reproduce:
1. Install an HA cluster (3 masters, 3 etcd) with OCP 3.4.0.16 + etcd 2.3.7
2. Create projects with running deployments
3. Shut down masters and etcd. Leave OpenShift nodes running.
4. On each etcd: yum swap etcd3 etcd to install etcd3 3.0.12-3.
5. On each etcd: etcdctl migrate --data-dir /var/lib/etcd
6. Start etcd on each
7. Start OpenShift masters

Actual results:
Nodes will get repeated panics (see above). Cluster is inoperable - no operations involving nodes work.

Expected results:
Nodes recover and re-set their watches/lists when an etcd API version change occurs without having to restart the entire cluster.
It's expected that ResourceVersion be out of date and force a re-list. It's not expected to panic or cause an outage.
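The expected behavior can be sketched as follows. This is only an illustration, not the Kubernetes reflector code: fakeServer, errTooOld and resume are hypothetical names, and the server is a stand-in that rejects watches from versions older than what it still retains. The point is that a stale version should lead to a fresh list and a new watch, not to a crash.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooOld is a hypothetical sentinel for "the requested resource
// version is no longer available on the server".
var errTooOld = errors.New("resource version too old")

// fakeServer is a hypothetical stand-in for an API server.
type fakeServer struct {
	oldest  int      // oldest resource version that can still be watched
	current int      // resource version returned by a fresh list
	items   []string // current objects
}

// List returns the current objects and the version they were read at.
func (s *fakeServer) List() ([]string, int) { return s.items, s.current }

// Watch fails with errTooOld when asked to start from a stale version.
func (s *fakeServer) Watch(from int) error {
	if from < s.oldest {
		return errTooOld
	}
	return nil
}

// resume tries to continue watching from lastRV. If the server reports
// that version as stale, it re-lists and watches from the fresh version.
// It returns the version the watch started from and whether a re-list
// was needed.
func resume(s *fakeServer, lastRV int) (int, bool, error) {
	err := s.Watch(lastRV)
	if err == nil {
		return lastRV, false, nil
	}
	if !errors.Is(err, errTooOld) {
		return 0, false, err
	}
	_, fresh := s.List()
	if err := s.Watch(fresh); err != nil {
		return 0, true, err
	}
	return fresh, true, nil
}

func main() {
	s := &fakeServer{oldest: 100, current: 120, items: []string{"svt664"}}
	rv, relisted, err := resume(s, 5) // 5 predates anything the server retains
	fmt.Println(rv, relisted, err)
}
```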
This panic is inside the OpenShift SDN code, during an object conversion. Reassigning.
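For illustration, here is a minimal sketch of a key function that reports an error instead of panicking. It assumes (this report does not confirm it) that the "unkeyable object" was a deletion tombstone: Kubernetes' cache package has a DeletedFinalStateUnknown wrapper with Key and Obj fields, and the two-field value in the panic message looks like one. The types tombstone, named and netNamespace below are hypothetical stand-ins, and this is not necessarily what the actual fix does.

```go
package main

import (
	"errors"
	"fmt"
)

// tombstone is a hypothetical stand-in for the wrapper a delta queue may
// enqueue when, after a re-list, it finds that an object was deleted
// while the watch was down. It carries the last known key.
type tombstone struct {
	Key string
	Obj interface{}
}

// named is a hypothetical minimal interface for objects that can be keyed.
type named interface {
	GetName() string
}

// netNamespace is a hypothetical resource type used only in this example.
type netNamespace struct {
	Name string
}

func (n *netNamespace) GetName() string { return n.Name }

var errUnkeyable = errors.New("unkeyable object")

// keyOf returns the store key for obj. Tombstones are unwrapped to their
// recorded key; unknown types yield an error rather than a panic, so the
// caller can log and skip the item.
func keyOf(obj interface{}) (string, error) {
	switch t := obj.(type) {
	case tombstone:
		return t.Key, nil
	case named:
		return t.GetName(), nil
	default:
		return "", fmt.Errorf("%w: %T", errUnkeyable, obj)
	}
}

func main() {
	objs := []interface{}{
		&netNamespace{Name: "svt664"},
		tombstone{Key: "svt664"},
		42, // not keyable: reported as an error, no panic
	}
	for _, obj := range objs {
		key, err := keyOf(obj)
		fmt.Printf("key=%q err=%v\n", key, err)
	}
}
```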
upstream fix: https://github.com/openshift/origin/pull/11792
This has been merged into OSE and is in OSE v3.4.0.24 or newer.
Verified in 3.4.0.24. The non-restarted node no longer panics when the master is brought up in etcd3 storage mode. There are other issues with the node communicating with the master, but this issue is gone.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066