Bug 1389773

Summary: Unexpected object conversion after migration to etcd3 storage
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Node
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED NOTABUG
QA Contact: Mike Fiedler <mifiedle>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: aos-bugs, jokerman, mifiedle, mmccomas, tstclair
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-31 21:08:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description: master log - search for "About to convert" and "failed to handle"
Flags: none

Description Mike Fiedler 2016-10-28 14:41:36 UTC
Created attachment 1215015
master log - search for "About to convert" and "failed to handle"

Description of problem:

After migrating cluster storage from etcd v2 to etcd v3, a large number of object conversion and conversion failure messages appear in the logs when the masters are restarted.

Opening this BZ for confirmation that conversion is expected when the only change is to etcd storage - i.e. no format/schema/version change in the underlying OCP data.

Example (see attached log)

Oct 27 20:23:02 192 atomic-openshift-node: I1027 20:23:02.339957    2922 conversion.go:133] failed to handle multiple devices for container. Skipping Filesystem stats
Oct 27 20:23:02 192 atomic-openshift-node: I1027 20:23:02.340009    2922 conversion.go:133] failed to handle multiple devices for container. Skipping Filesystem stats
Oct 27 20:23:02 192 atomic-openshift-master-api: I1027 20:23:02.485103  120864 trace.go:61] Trace "Update /api/v1/nodes/192.1.18.20/status" (started 2016-10-27 20:22:55.868264451 -0400 EDT):
Oct 27 20:23:02 192 atomic-openshift-master-api: [57.021µs] [57.021µs] About to convert to expected version
Oct 27 20:23:02 192 atomic-openshift-master-api: [174.003µs] [116.982µs] Conversion done
Oct 27 20:23:02 192 atomic-openshift-master-api: [180.27µs] [6.267µs] About to store object in database
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.616439956s] [6.616259686s] Object stored in database
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.616456216s] [16.26µs] Self-link added
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.616754191s] [297.975µs] END
Oct 27 20:23:02 192 atomic-openshift-master-api: I1027 20:23:02.485155  120864 trace.go:61] Trace "Update /api/v1/nodes/192.1.18.218/status" (started 2016-10-27 20:22:55.813897248 -0400 EDT):
Oct 27 20:23:02 192 atomic-openshift-master-api: [113.948µs] [113.948µs] About to convert to expected version
Oct 27 20:23:02 192 atomic-openshift-master-api: [306.242µs] [192.294µs] Conversion done
Oct 27 20:23:02 192 atomic-openshift-master-api: [315.259µs] [9.017µs] About to store object in database
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.670909865s] [6.670594606s] Object stored in database
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.670922331s] [12.466µs] Self-link added
Oct 27 20:23:02 192 atomic-openshift-master-api: [6.671170995s] [248.664µs] END
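
The trace excerpt above can be summarized by step duration. The following is a minimal Python sketch, not part of the original report. It assumes the trace lines keep the `[elapsed] [step] message` layout shown above and that each duration carries a single unit suffix (ns, µs, ms, or s), so a duration with a minutes component would not be matched. The function name `slow_steps` and the 1-second threshold are hypothetical choices for illustration.

```python
import re

# Matches trace step lines such as:
#   "... [6.616439956s] [6.616259686s] Object stored in database"
# Group 1/2: elapsed time since trace start; group 3/4: duration of this step.
STEP_RE = re.compile(
    r"\[([\d.]+)(ns|µs|us|ms|s)\] \[([\d.]+)(ns|µs|us|ms|s)\] (.+)$"
)

# Conversion factors from the unit suffix to seconds.
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def slow_steps(lines, threshold_s=1.0):
    """Yield (step_seconds, message) for trace steps at or above threshold_s.

    Lines that are not trace steps (for example the 'Trace "Update ..."'
    header line) are skipped.
    """
    for line in lines:
        match = STEP_RE.search(line)
        if match is None:
            continue
        step_seconds = float(match.group(3)) * UNIT_SECONDS[match.group(4)]
        if step_seconds >= threshold_s:
            yield step_seconds, match.group(5).strip()
```

Applied to the excerpt above, this flags only the two "Object stored in database" steps (about 6.6 s each); the conversion steps themselves take on the order of 100 to 200 µs.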

Version-Release number of selected component (if applicable): 3.4.0.16 and etcd 3.0.12-3


How reproducible: always on first restart after etcd data migration to v3


Steps to Reproduce:
1.  Install an HA cluster (3 masters, 3 etcd) with OCP 3.4.0.16 + etcd 2.3.7
2.  Create projects with running deployments
3.  Shut down masters and etcd. Leave OpenShift nodes running.
4.  On each etcd:  yum swap etcd3 etcd to install etcd3 3.0.12-3.
5.  On each etcd:  etcdctl migrate --data-dir /var/lib/etcd
6.  Start etcd on each etcd host
7.  Start OpenShift masters

Actual results:

OpenShift master logs have many messages (see above) for object conversion and conversion failures.

Expected results:

Unknown. It may be unexpected that conversions take place when the objects are unchanged in terms of version or schema. Looking for confirmation of the correct behavior.

Comment 3 Timothy St. Clair 2016-10-31 20:11:34 UTC
Q1: Before running 'etcdctl migrate', were you seeing a large number of conversion traces?

C1: The "failed to handle multiple devices for container. Skipping Filesystem stats" message is orthogonal to the API conversion data.

C2: There are storage errors on leases, which were expected because we used the default migrator.
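
The separation described in C1 can be checked by tallying the two search strings per logging unit. The following is a minimal Python sketch, not part of the original thread. It assumes syslog-style lines in the layout of the excerpt in the description (`Mon DD HH:MM:SS host unit: message`); the function name `tally` is a hypothetical choice for illustration.

```python
import re
from collections import Counter

# Syslog-style prefix as seen in the excerpt, e.g.
#   "Oct 27 20:23:02 192 atomic-openshift-node: I1027 ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+ [\d:]{8} \S+ (?P<unit>[\w-]+): (?P<rest>.*)$"
)

# The two search strings named in the attachment description.
MARKERS = ("About to convert", "failed to handle")


def tally(lines):
    """Count occurrences of each marker, keyed by (unit, marker)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        for marker in MARKERS:
            if marker in match.group("rest"):
                counts[(match.group("unit"), marker)] += 1
    return counts
```

In the excerpt, the "failed to handle" lines come from atomic-openshift-node and the "About to convert" lines come from atomic-openshift-master-api, so a per-unit tally keeps the two kinds of message apart.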

Comment 4 Timothy St. Clair 2016-10-31 21:08:00 UTC
After chatting with Clayton: there is indeed an implicit 'v2mode:json -> v3mode:protobuf' conversion.

We will simply need to document that the cluster should be brought back online slowly to allow for the data conversion.

We will also need to be certain that TTL keys are wiped on the migration. xref (https://github.com/coreos/etcd/issues/6767)