Bug 1393744 - [etcd3] All defined namespaces and contents disappear after migrating from etcd2 to etcd3 storage

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	3.4.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Mo
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-11-10 09:03 UTC by Mike Fiedler
Modified:	2017-07-24 14:11 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The code used to build the root etcd prefix was not the same between etcdv2 and etcdv3. Consequence: After migrating from etcdv2 to etcdv3, the cluster was unable to find any data if a root etcd prefix was used that did not start with a "/" (which is the default case for OpenShift). Fix: Use the same code to build the root etcd prefix for both etcdv2 and etcdv3. Result: After a migration, the cluster is able to find migrated data as expected.
Clone Of:
Environment:
Last Closed:	2017-04-12 19:16:32 UTC
Target Upstream Version:
Embargoed:

Attachments
Before dump (keys only) (53.33 KB, text/plain) 2017-02-14 20:41 UTC, Mike Fiedler
After dump (3.26 MB, text/plain) 2017-02-14 20:42 UTC, Mike Fiedler
Log for initial master startup after etcd data migration (516.35 KB, text/plain) 2017-02-15 15:09 UTC, Mike Fiedler
Dump after master restart and creating 1 project (keys only) (52.74 KB, text/plain) 2017-02-15 15:43 UTC, Mike Fiedler
Dump after master restart and creating 1 project (keys/values) (3.74 MB, text/plain) 2017-02-15 15:44 UTC, Mike Fiedler

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift origin pull 13298	None	closed	UPSTREAM: 42622: Preserve custom etcd prefix compatibility for etcd3	2020-08-26 17:53:26 UTC
Github	openshift origin pull 13299	None	closed	UPSTREAM: 42622: Preserve custom etcd prefix compatibility for etcd3	2020-08-26 17:53:26 UTC
Red Hat Product Errata	RHBA-2017:0884	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.5 RPM Release Advisory	2017-04-12 22:50:07 UTC

Description Mike Fiedler 2016-11-10 09:03:02 UTC

Description of problem:

Now that https://bugzilla.redhat.com/show_bug.cgi?id=1389770 is fixed, we can look at the actual results of a migration. Two major issues appear after migrating etcd2 data to etcd3 and configuring the API server to use the etcd3 storage backend.

1. The nodes will not register with the master without a restart. They periodically log the following messages:

Nov 10 03:37:08 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:08.228713 12042 kubelet_node_status.go:316] Unable to update node status: update node status exceeds retry count
Nov 10 03:37:10 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:10.354985 12042 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-172-31-2-242.us-west-2.compute.internal' not found
Nov 10 03:37:18 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:18.231659 12042 kubelet_node_status.go:324] Error updating node status, will retry: error getting node "ip-172-31-2-242.us-west-2.compute.internal": nodes "ip-172-31-2-242.us-west-2.compute.internal" not found

2. Closely related (I think): namespaces, deployments, builds, etc. defined before the migration are gone afterwards. Only the default namespaces remain, and they are empty as well: no deployments, secrets, imagestreams, etc.

Version-Release number of selected component (if applicable): 3.4.0.24

How reproducible: Always

Steps to Reproduce:
1. Install a 3.4 cluster: 1 etcd (3.0.14-1), 1 master, 1 infra node, 2 app nodes
2. Create some new namespaces, create some quick start applications, verify pods are running
3. Shut down the master. Shut down etcd
4. Back up /var/lib/etcd on the etcd server
5. On etcd: ETCDCTL_API=3 ./etcdctl migrate --data-dir=/var/lib/etcd --no-ttl
6. On the master, edit master-config.yaml and add the following to apiServerArguments:

apiServerArguments:
  storage-backend:
  - "etcd3"

7. Restart etcd, restart the master

Actual results:

1. No nodes will register with the master until they are restarted. The messages above appear in the syslog on the nodes.
2. oc get projects, oc get pods --all-namespaces, etc. return no resources.

Expected results:

Cluster has the same content as pre-migration.

Additional info:

Messages from etcd migration:

root@ip-172-31-53-209: /var/lib # ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd --no-ttl
using default transformer
2016-11-10 03:31:44.712825 I | api: enabled capabilities for version 3.0
2016-11-10 03:31:44.712859 I | membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster 0 from store
2016-11-10 03:31:44.712867 I | membership: set the cluster version to 3.0 from store
finished transforming keys

Comment 7 Mike Fiedler 2017-02-14 20:41:11 UTC

Dumped keys after migration - content seems to be there. Lots of list failures in the master logs:

Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.662287   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664074   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.LimitRange: User "system:openshift-master" cannot list all limitranges in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664192   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664393   74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.SecurityContextConstraints: User "system:openshift-master" cannot list all securitycontextconstraints in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664406   74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.ImageStream: User "system:openshift-master" cannot list all imagestreams in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.762346   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:103: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.763258   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Failed to list *api.ResourceQuota: User "system:openshift-master" cannot list all resourcequotas in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.778946   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed to list *api.Secret: User "system:openshift-master" cannot list all secrets in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.779168   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:75: Failed to list *storage.StorageClass: User "system:openshift-master" cannot list all storage.k8s.io.storageclasses in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.020282   74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:101: Failed to list *api.ClusterResourceQuota: User "system:openshift-master" cannot list all clusterresourcequotas in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.020383   74760 reflector.go:188] github.com/openshift/origin/pkg/project/cache/cache.go:107: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668403   74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.SecurityContextConstraints: User "system:openshift-master" cannot list all securitycontextconstraints in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668493   74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.ImageStream: User "system:openshift-master" cannot list all imagestreams in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668574   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668635   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.LimitRange: User "system:openshift-master" cannot list all limitranges in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668691   74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.763775   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:103: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.765069   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Failed to list *api.ResourceQuota: User "system:openshift-master" cannot list all resourcequotas in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.802041   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:75: Failed to list *storage.StorageClass: User "system:openshift-master" cannot list all storage.k8s.io.storageclasses in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.802262   74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed to list *api.Secret: User "system:openshift-master" cannot list all secrets in the cluster

Comment 8 Mike Fiedler 2017-02-14 20:41:44 UTC

Created attachment 1250335
Before dump (keys only)

Comment 9 Mike Fiedler 2017-02-14 20:42:11 UTC

Created attachment 1250336
After dump

Comment 10 Timothy St. Clair 2017-02-15 01:10:19 UTC

It's blocked on the admission controller post conversion.

Comment 11 Jordan Liggitt 2017-02-15 02:02:16 UTC

Some number of denials are normal at server start as the authz cache fills. Do the list errors continue indefinitely in the log?
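
One way to answer that question from a saved master log is to record, per resource kind, when the reflector list failures were first and last seen. The following is a minimal sketch in Python, assuming the log lines look like the ones quoted in comment 7 and are in chronological order; the function name and the regular expression are illustrative and not part of any OpenShift tooling.

```python
import re

# Matches the klog header followed by a reflector "Failed to list" message,
# as seen in the log excerpt in comment 7 (assumed format).
LIST_FAILURE = re.compile(
    r"[EW](?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+\d+ "
    r"reflector\.go:\d+\] .*?Failed to list \*(?P<kind>[\w.]+):"
)


def summarize_list_failures(lines):
    """Map each resource kind to (count, first_seen, last_seen).

    Assumes the lines are in chronological order, so the last match
    for a kind carries its latest timestamp.
    """
    summary = {}
    for line in lines:
        match = LIST_FAILURE.search(line)
        if match is None:
            continue
        kind = match.group("kind")
        stamp = f"{match.group('date')} {match.group('time')}"
        count, first_seen, _ = summary.get(kind, (0, stamp, stamp))
        summary[kind] = (count + 1, first_seen, stamp)
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < master.log
    for kind, (count, first, last) in sorted(
        summarize_list_failures(sys.stdin).items()
    ):
        print(f"{kind}: {count} failures, first {first}, last {last}")
```

If the last-seen time for every kind stops advancing shortly after startup, the denials are probably the transient kind described above; if it keeps advancing, they are not.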

Comment 12 Mike Fiedler 2017-02-15 13:08:58 UTC

Will re-test today and give it more time.   Will keep the env around as well if anyone wants to take a look at it.

Comment 13 Mike Fiedler 2017-02-15 15:08:53 UTC

OCP 3.5.0.20 and etcd 3.1.0

- created 2 projects with deployments, builds, secrets, routes, services etc.   Everything working.
- shutdown etcd, shutdown master (single master)
- ETCDCTL_API=3 etcdctl migrate  --data-dir=/var/lib/etcd --no-ttl
- Updated master-config.yaml to use storage-backend "etcd3"
- started etcd, started master, restarted all nodes

There were some initial list failures in the master log, but as indicated in comment 11 they eventually went away.   They did not continue indefinitely.   The logs were quiet after the nodes re-registered.

However, oc get on projects, pods, dc, services, builds, etc. did not return any of the resources created pre-migration. No errors in the log (attached).

The default projects are there but no imagestreams, templates, etc that are part of the install.   

New resources can be created and displayed, but pre-migration items are gone even though they still appear in the output of etcdctl get "" --from-key.

Comment 14 Mike Fiedler 2017-02-15 15:09:42 UTC

Created attachment 1250628
Log for initial master startup after etcd data migration

Comment 15 Jordan Liggitt 2017-02-15 15:12:58 UTC

whatever the migration is doing, it is making etcd appear completely empty to the master at startup:

ensure.go:222] No cluster policy found.  Creating bootstrap policy based on: /etc/origin/master/policy.json

Comment 16 Jordan Liggitt 2017-02-15 15:19:42 UTC

can you dump the contents of etcd after the master has started up after migration? want to find out where the new content is getting stored
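
Comparing a keys-only dump taken before the master restart with one taken afterwards shows where new content lands. The following is a minimal sketch in Python, assuming each dump is plain text with one key per line (blank lines are ignored); the function names and the sample keys are hypothetical and not taken from the attached dumps.

```python
from __future__ import annotations


def read_keys(dump_text: str) -> set[str]:
    """Return the non-empty lines of a keys-only dump as a set of keys."""
    return {line.strip() for line in dump_text.splitlines() if line.strip()}


def new_keys(before: str, after: str) -> list[str]:
    """Return the keys present in the 'after' dump but not in 'before'."""
    return sorted(read_keys(after) - read_keys(before))


if __name__ == "__main__":
    # Hypothetical dumps; real ones would be read from the saved files.
    before = "/example/namespaces/a\n\n/example/namespaces/b\n"
    after = before + "\nexample/namespaces/c\n"
    for key in new_keys(before, after):
        print(key)
```

A set difference is enough here because only the key names matter; keys that contain newlines would need a different dump format.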

Comment 17 Mike Fiedler 2017-02-15 15:43:30 UTC

Created attachment 1250636
Dump after master restart and creating 1 project (keys only)

Comment 18 Mike Fiedler 2017-02-15 15:44:03 UTC

Created attachment 1250637
Dump after master restart and creating 1 project (keys/values)

Comment 19 Mike Fiedler 2017-02-15 15:44:53 UTC

Projects mff0 and mff1 created before migration.

Project mff (sorry for similarity) created after migration and master restart.

Comment 20 Jordan Liggitt 2017-02-15 15:47:59 UTC

There are two sets of keys, one with leading slashes and one without.

Looks like the etcd3 client does not include leading '/' when accessing the data, which makes all existing data seem to disappear.

New data is created without leading slashes
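
The effect described above can be modelled in a few lines. This is a minimal sketch in Python, not the actual Kubernetes or etcd client code: the function names are hypothetical, and `kubernetes.io` is used only as an example of a root prefix that has no leading slash. One helper always anchors the prefix at "/", the other uses the configured prefix verbatim, so the two disagree exactly when the prefix lacks a leading slash.

```python
import posixpath


def rooted_key(prefix: str, key: str) -> str:
    """Model of a storage layer that always anchors the prefix at "/"."""
    return posixpath.join("/", prefix, key.lstrip("/"))


def verbatim_key(prefix: str, key: str) -> str:
    """Model of a storage layer that uses the configured prefix as-is."""
    return posixpath.join(prefix, key.lstrip("/"))


if __name__ == "__main__":
    # Example prefixes (assumptions): one without and one with a leading slash.
    for prefix in ("kubernetes.io", "/kubernetes.io"):
        old = rooted_key(prefix, "namespaces/mff0")
        new = verbatim_key(prefix, "namespaces/mff0")
        print(f"{prefix!r}: rooted={old!r} verbatim={new!r} same={old == new}")
```

In this model, data written under the rooted key is invisible to a reader that looks under the verbatim key, which matches the two sets of keys seen in the dumps. Building the key the same way in both code paths removes the mismatch.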

Comment 21 Mo 2017-03-04 00:08:25 UTC

Part of the fix is https://github.com/kubernetes/kubernetes/pull/42506

I will open another PR to handle some decoder issues.

Comment 22 Mo 2017-03-06 15:20:22 UTC

Upon further research, the only change we need is https://github.com/kubernetes/kubernetes/pull/42506 (it may be a while before we have this change in origin).

Decoder issues only occur in unsupported configurations (going from etcdv2+protobuf to etcdv3+protobuf instead of the supported etcdv2+json to etcdv3+protobuf).

Comment 23 Mo 2017-03-08 05:15:38 UTC

Origin PR open at https://github.com/openshift/origin/pull/13298

Comment 24 Mo 2017-03-11 20:33:50 UTC

The fix is merged into master and the 1.5 release branch.  No pick is required for OSE.

https://github.com/openshift/origin/pull/13298

https://github.com/openshift/origin/pull/13299

Comment 28 Mo 2017-03-14 10:54:24 UTC

https://github.com/openshift/origin/pull/13299

Comment 29 Troy Dawson 2017-03-14 14:23:41 UTC

This has been merged into ocp and is in OCP v3.5.0.52 or newer.

Comment 31 Mike Fiedler 2017-03-14 17:04:13 UTC

Verified on 3.5.0.52.   

1. Run cluster-loader to create 10 projects with builds, bcs, svc, routes, dcs, rcs, pods and secrets.
2. shut down master and etcd
3. ETCDCTL_API=3 ./etcdctl migrate  --data-dir=${data_dir} --no-ttl
4. restart etcd
5. configure master-api for etcd3 storage
6. restart master

Verify all expected resources exist, all pods are running, users can log in, etc.

Create new resources and verify they work as expected.

Comment 33 errata-xmlrpc 2017-04-12 19:16:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884
