Bug 1389770 - Node panics repeatedly with unkeyable object error after migrating storage etcd2->etcd3 with node up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Williams
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-28 14:25 UTC by Mike Fiedler
Modified: 2017-03-08 18:43 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:47:45 UTC
Target Upstream Version:
Embargoed:


Attachments
Node log (514.30 KB, text/plain), 2016-10-28 14:25 UTC, Mike Fiedler


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Origin (Github) | 11792 | 0 | None | None | None | 2016-11-07 13:39:20 UTC
Red Hat Product Errata | RHBA-2017:0066 | 0 | normal | SHIPPED_LIVE | Red Hat OpenShift Container Platform 3.4 RPM Release Advisory | 2017-01-18 17:23:26 UTC

Description Mike Fiedler 2016-10-28 14:25:58 UTC
Created attachment 1215014
Node log

Description of problem:

After migrating etcd storage from v2 to v3 and configuring the API servers to use storage-backend=etcd3, the nodes (which were not stopped during the migration) started panicking repeatedly when the API servers came back up.

Sample:

Oct 27 19:59:52 localhost atomic-openshift-node: E1027 19:59:52.799799   16576 runtime.go:64] Observed a panic: "unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces" (unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces)
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:479
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/panic.go:458
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:187
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:34
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:573
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:312
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:490
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:343
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:271
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:202
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:88
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:89
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:2086


I let it run for about 10 minutes and there was no recovery. I'll run it again to see whether it eventually recovers.


Version-Release number of selected component (if applicable): 3.4.0.16


How reproducible: always


Steps to Reproduce:
1. Install an HA cluster (3 masters, 3 etcd) with OCP 3.4.0.16 + etcd 2.3.7.
2. Create projects with running deployments.
3. Shut down the masters and etcd. Leave the OpenShift nodes running.
4. On each etcd host: yum swap etcd3 etcd to install etcd3 3.0.12-3.
5. On each etcd host: etcdctl migrate --data-dir /var/lib/etcd
6. Start etcd on each host.
7. Start the OpenShift masters.

Actual results:

Nodes panic repeatedly (see above). The cluster is inoperable: no operations involving nodes work.

Expected results:

Nodes recover and re-set their watches/lists when an etcd API version change occurs without having to restart the entire cluster.

Comment 1 Timothy St. Clair 2016-10-28 15:19:02 UTC
It's expected that the ResourceVersion will be out of date and force a re-list. It's not expected to panic or cause an outage.
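
To illustrate the expected behavior, here is a minimal Go sketch of a client that falls back to a full re-list when its stored resource version is rejected, instead of failing. Everything in it is hypothetical and for illustration only: the `lister` interface, `errStaleVersion`, `syncOnce`, and `fakeServer` are not part of Kubernetes or OpenShift, and the real reflector logic is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleVersion is a hypothetical sentinel for a watch that the server
// rejects because it no longer knows the requested resource version.
var errStaleVersion = errors.New("resource version too old")

// lister is a hypothetical, minimal view of an API client.
type lister interface {
	List() (items []string, resourceVersion string, err error)
	Watch(resourceVersion string) error
}

// syncOnce tries to resume the watch from lastVersion. If the server
// rejects that version as stale, it does a full list and watches from
// the fresh version. It returns the resource version now in use.
func syncOnce(c lister, lastVersion string) (string, error) {
	err := c.Watch(lastVersion)
	if err == nil {
		return lastVersion, nil
	}
	if !errors.Is(err, errStaleVersion) {
		return lastVersion, err // unrelated failure: report it, do not panic
	}
	_, fresh, err := c.List()
	if err != nil {
		return lastVersion, err
	}
	if err := c.Watch(fresh); err != nil {
		return fresh, err
	}
	return fresh, nil
}

// fakeServer stands in for an API server whose storage was migrated, so
// resource versions handed out before the migration are no longer valid.
type fakeServer struct {
	current string
	lists   int
}

func (f *fakeServer) List() ([]string, string, error) {
	f.lists++
	return []string{"svt664"}, f.current, nil
}

func (f *fakeServer) Watch(rv string) error {
	if rv != f.current {
		return fmt.Errorf("watch from %q: %w", rv, errStaleVersion)
	}
	return nil
}

func main() {
	srv := &fakeServer{current: "7"}
	rv, err := syncOnce(srv, "123456") // version remembered from before the migration
	fmt.Println(rv, err, srv.lists)
}
```

The point of the sketch is only that a stale version is an ordinary, recoverable error path that ends in a re-list.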

Comment 2 Timothy St. Clair 2016-10-31 18:47:19 UTC
This panic is inside the OpenShift SDN code, on an object conversion. Reassigning.
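
For readers unfamiliar with this failure mode, the sketch below shows one way a key function can avoid an "unkeyable object" error. It rests on an assumption that this report does not confirm: the `{svt664 &TypeMeta{...}}` value in the panic text has the two-field shape of a deletion tombstone (a key plus a last-known object), which a delta FIFO can emit after a re-list for objects that disappeared in the meantime. The types here (`ObjectMeta`, `NetNamespace`, `DeletedFinalStateUnknown`) are simplified local stand-ins and `keyOf` is a hypothetical helper; none of this is the code from the actual fix.

```go
package main

import "fmt"

// ObjectMeta is a simplified stand-in for API object metadata.
type ObjectMeta struct {
	Namespace string
	Name      string
}

// metaObject is a hypothetical interface for objects that carry metadata.
type metaObject interface {
	GetObjectMeta() ObjectMeta
}

// NetNamespace is a simplified stand-in for a watched API object.
type NetNamespace struct{ Meta ObjectMeta }

func (n *NetNamespace) GetObjectMeta() ObjectMeta { return n.Meta }

// DeletedFinalStateUnknown is a local stand-in for a deletion tombstone:
// the key is known, but the object's final state may be stale or missing.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// keyOf returns a queue key for obj. Tombstones are handled explicitly,
// and unknown types produce an error for the caller instead of a panic.
func keyOf(obj interface{}) (string, error) {
	switch o := obj.(type) {
	case DeletedFinalStateUnknown:
		return o.Key, nil // the tombstone already carries its key
	case metaObject:
		m := o.GetObjectMeta()
		if m.Namespace == "" {
			return m.Name, nil
		}
		return m.Namespace + "/" + m.Name, nil
	default:
		return "", fmt.Errorf("unkeyable object: %v", obj)
	}
}

func main() {
	live := &NetNamespace{Meta: ObjectMeta{Name: "svt664"}}
	tombstone := DeletedFinalStateUnknown{Key: "svt664", Obj: live}
	for _, obj := range []interface{}{live, tombstone, 42} {
		key, err := keyOf(obj)
		fmt.Println(key, err)
	}
}
```

Returning an error from the key function leaves the decision to the caller, which can log and skip the event rather than take the node down.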

Comment 3 Dan Williams 2016-11-04 18:54:14 UTC
Upstream fix: https://github.com/openshift/origin/pull/11792

Comment 4 Troy Dawson 2016-11-09 19:43:23 UTC
This has been merged into ose and is in OSE v3.4.0.24 or newer.

Comment 8 Mike Fiedler 2016-11-10 08:50:04 UTC
Verified in 3.4.0.24. The non-restarted node no longer panics when the master is brought up in etcd3 storage mode. There are other issues with the node communicating with the master, but this issue is gone.

Comment 10 errata-xmlrpc 2017-01-18 12:47:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

