Bug 1389770 - Node panics repeatedly with unkeyable object error after migrating storage etcd2->etcd3 with node up
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: x86_64 Linux
Priority: medium
Severity: high
Assigned To: Dan Williams
Mike Fiedler
Depends On:
Blocks:
Reported: 2016-10-28 10:25 EDT by Mike Fiedler
Modified: 2017-03-08 13 EST

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 07:47:45 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Node log (514.30 KB, text/plain)
2016-10-28 10:25 EDT, Mike Fiedler


External Trackers
Tracker: Origin (Github); ID: 11792; Priority: None; Status: None; Summary: None; Last Updated: 2016-11-07 08:39 EST
Tracker: Red Hat Product Errata; ID: RHBA-2017:0066; Priority: normal; Status: SHIPPED_LIVE; Summary: Red Hat OpenShift Container Platform 3.4 RPM Release Advisory; Last Updated: 2017-01-18 12:23:26 EST

Description Mike Fiedler 2016-10-28 10:25:58 EDT
Created attachment 1215014
Node log

Description of problem:

After migrating etcd storage from v2 to v3 and configuring the API servers to use storage-backend=etcd3, the nodes (which were not stopped during the migration) started panicking repeatedly when the API servers came back up.

Sample:

Oct 27 19:59:52 localhost atomic-openshift-node: E1027 19:59:52.799799   16576 runtime.go:64] Observed a panic: "unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces" (unkeyable object: {svt664 &TypeMeta{Kind:,APIVersion:,}}, object has no meta: object does not implement the Object interfaces)
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:479
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/panic.go:458
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:187
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/pkg/sdn/plugin/eventqueue.go:34
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:573
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:312
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:490
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:343
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:271
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/reflector.go:202
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:88
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:89
Oct 27 19:59:52 localhost atomic-openshift-node: /builddir/build/BUILD/atomic-openshift-git-0.cc70b72/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/wait/wait.go:49
Oct 27 19:59:52 localhost atomic-openshift-node: /usr/lib/golang/src/runtime/asm_amd64.s:2086


I let it run for about 10 minutes and there was no recovery. I'll run it again to see whether the nodes eventually recover.


Version-Release number of selected component (if applicable): 3.4.0.16


How reproducible: always


Steps to Reproduce:
1. Install an HA cluster (3 masters, 3 etcd) with OCP 3.4.0.16 + etcd 2.3.7.
2. Create projects with running deployments.
3. Shut down the masters and etcd. Leave the OpenShift nodes running.
4. On each etcd host: yum swap etcd3 etcd to install etcd3 3.0.12-3.
5. On each etcd host: etcdctl migrate --data-dir /var/lib/etcd
6. Start etcd on each host.
7. Start the OpenShift masters.

Actual results:

Nodes panic repeatedly (see above). The cluster is inoperable: no operations involving nodes work.

Expected results:

Nodes recover and re-set their watches/lists when an etcd API version change occurs without having to restart the entire cluster.
Comment 1 Timothy St. Clair 2016-10-28 11:19:02 EDT
An out-of-date ResourceVersion that forces a re-list is expected. A panic or an outage is not.
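
To illustrate the expected behaviour described above, here is a minimal Go sketch of a client that re-lists when its cached resource version is rejected, rather than failing. Everything in it (fakeServer, errStaleVersion, resync) is a hypothetical stand-in written for this example; it is not the Kubernetes reflector code and makes no claim about how the real client is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleVersion is a hypothetical error returned when the client asks to
// watch from a resource version the server no longer recognises.
var errStaleVersion = errors.New("resource version is out of date")

// fakeServer is a hypothetical stand-in for an API server. After a storage
// migration its current version no longer matches what clients have cached.
type fakeServer struct {
	current int
}

// List returns the full state plus the version it corresponds to.
func (s *fakeServer) List() ([]string, int) {
	return []string{"svt664"}, s.current
}

// Watch succeeds only when the caller starts from the current version.
func (s *fakeServer) Watch(from int) error {
	if from != s.current {
		return errStaleVersion
	}
	return nil
}

// resync tries to resume watching from cachedVersion. If the version is
// stale it re-lists to obtain a fresh version and watches again, so a stale
// version costs one extra List call instead of a crash.
func resync(s *fakeServer, cachedVersion int) (int, bool, error) {
	err := s.Watch(cachedVersion)
	if err == nil {
		return cachedVersion, false, nil
	}
	if !errors.Is(err, errStaleVersion) {
		return 0, false, err
	}
	_, fresh := s.List()
	if err := s.Watch(fresh); err != nil {
		return 0, true, err
	}
	return fresh, true, nil
}

func main() {
	server := &fakeServer{current: 7}
	version, relisted, err := resync(server, 3)
	fmt.Println("version:", version, "relisted:", relisted, "err:", err)
}
```

The point of the sketch is only that a stale version is treated as a recoverable condition with a defined fallback.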
Comment 2 Timothy St. Clair 2016-10-31 14:47:19 EDT
This panic is inside the OpenShift SDN code, on an object conversion. Reassigning.
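
As a rough illustration of how a key function can avoid this class of panic, the Go sketch below returns an error for objects it cannot key and accepts a placeholder that carries only a key. The `{svt664 &TypeMeta{...}}` value in the log looks like a wrapper around a key rather than a plain API object; that reading is an assumption and is not confirmed in this report. The types object and tombstone and the function keyOf are hypothetical names for this example, not the code in eventqueue.go and not the change made in the upstream fix.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical stand-in for an API object with metadata.
type object struct {
	Namespace string
	Name      string
}

// tombstone is a hypothetical placeholder for an object whose final state is
// unknown. It carries the cache key directly and may hold no usable object.
type tombstone struct {
	Key string
	Obj interface{}
}

// errUnkeyable marks values for which no cache key can be derived.
var errUnkeyable = errors.New("unkeyable object")

// keyOf derives a cache key from obj. Unknown types produce an error so the
// caller can skip or log them instead of crashing the process.
func keyOf(obj interface{}) (string, error) {
	switch o := obj.(type) {
	case tombstone:
		// The placeholder already knows its key; no metadata lookup needed.
		return o.Key, nil
	case *object:
		if o == nil {
			return "", errUnkeyable
		}
		if o.Namespace == "" {
			return o.Name, nil
		}
		return o.Namespace + "/" + o.Name, nil
	default:
		return "", fmt.Errorf("%w: %T", errUnkeyable, obj)
	}
}

func main() {
	inputs := []interface{}{
		&object{Namespace: "svt664", Name: "web"},
		tombstone{Key: "svt664"},
		42,
	}
	for _, in := range inputs {
		key, err := keyOf(in)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("key:", key)
	}
}
```

Returning an error leaves the decision to the caller, which can drop the event and keep processing the queue.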
Comment 3 Dan Williams 2016-11-04 14:54:14 EDT
upstream fix: https://github.com/openshift/origin/pull/11792
Comment 4 Troy Dawson 2016-11-09 14:43:23 EST
This has been merged into OSE and is in OSE v3.4.0.24 or newer.
Comment 8 Mike Fiedler 2016-11-10 03:50:04 EST
Verified in 3.4.0.24. The non-restarted node no longer panics when the master is brought up in etcd3 storage mode. There are other issues with the node communicating with the master, but this issue is gone.
Comment 10 errata-xmlrpc 2017-01-18 07:47:45 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
