Description of problem:

Now that https://bugzilla.redhat.com/show_bug.cgi?id=1389770 is fixed, we can look at the actual results of a migration. There are two major issues after migrating etcd2 data to etcd3 and configuring the apiServer to use the etcd3 storage mode.

1. The nodes will not register with the master without a restart. They just periodically log the following messages:

Nov 10 03:37:08 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:08.228713 12042 kubelet_node_status.go:316] Unable to update node status: update node status exceeds retry count
Nov 10 03:37:10 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:10.354985 12042 eviction_manager.go:162] eviction manager: unexpected err: failed GetNode: node 'ip-172-31-2-242.us-west-2.compute.internal' not found
Nov 10 03:37:18 ip-172-31-2-242 atomic-openshift-node: E1110 03:37:18.231659 12042 kubelet_node_status.go:324] Error updating node status, will retry: error getting node "ip-172-31-2-242.us-west-2.compute.internal": nodes "ip-172-31-2-242.us-west-2.compute.internal" not found

2. Closely related (I think): namespaces, deployments, builds, etc. defined before the migration are gone afterwards. Only the default namespaces are there, but they are empty as well: no deployments, secrets, imagestreams, etc.

Version-Release number of selected component (if applicable):
3.4.0.24

How reproducible:
Always

Steps to Reproduce:
1. Install a 3.4 cluster: 1 etcd (3.0.14-1), 1 master, 1 infra node, 2 app nodes
2. Create some new namespaces, create some quick start applications, verify pods are running
3. Shut down the master. Shut down etcd
4. Back up /var/lib/etcd on the etcd server
5. On etcd: ETCDCTL_API=3 ./etcdctl migrate --data-dir=/var/lib/etcd --no-ttl
6. On the master, edit master-config.yaml and add the following to apiServerArguments:

   apiServerArguments:
     storage-backend:
     - "etcd3"

7. Restart etcd, restart the master

Actual results:
1. No nodes will register with the master until restarted. The messages above appear in the syslog on the nodes.
2. oc get projects, oc get pods --all-namespaces, etc. return no resources.

Expected results:
Cluster has the same content as pre-migration.

Additional info:
Messages from etcd migration:

root@ip-172-31-53-209: /var/lib # ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd --no-ttl
using default transformer
2016-11-10 03:31:44.712825 I | api: enabled capabilities for version 3.0
2016-11-10 03:31:44.712859 I | membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster 0 from store
2016-11-10 03:31:44.712867 I | membership: set the cluster version to 3.0 from store
finished transforming keys
Dumped keys after migration - content seems to be there. Lots of list failures in the master logs:

Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.662287 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664074 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.LimitRange: User "system:openshift-master" cannot list all limitranges in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664192 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664393 74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.SecurityContextConstraints: User "system:openshift-master" cannot list all securitycontextconstraints in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.664406 74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.ImageStream: User "system:openshift-master" cannot list all imagestreams in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.762346 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:103: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.763258 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Failed to list *api.ResourceQuota: User "system:openshift-master" cannot list all resourcequotas in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.778946 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed to list *api.Secret: User "system:openshift-master" cannot list all secrets in the cluster
Feb 14 15:32:51 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:51.779168 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:75: Failed to list *storage.StorageClass: User "system:openshift-master" cannot list all storage.k8s.io.storageclasses in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.020282 74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:101: Failed to list *api.ClusterResourceQuota: User "system:openshift-master" cannot list all clusterresourcequotas in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.020383 74760 reflector.go:188] github.com/openshift/origin/pkg/project/cache/cache.go:107: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668403 74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.SecurityContextConstraints: User "system:openshift-master" cannot list all securitycontextconstraints in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668493 74760 reflector.go:199] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:89: Failed to list *api.ImageStream: User "system:openshift-master" cannot list all imagestreams in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668574 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668635 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.LimitRange: User "system:openshift-master" cannot list all limitranges in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.668691 74760 reflector.go:199] pkg/controller/informers/factory.go:89: Failed to list *api.Namespace: User "system:openshift-master" cannot list all namespaces in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.763775 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:103: Failed to list *api.ServiceAccount: User "system:openshift-master" cannot list all serviceaccounts in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.765069 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/resourcequota/resource_access.go:83: Failed to list *api.ResourceQuota: User "system:openshift-master" cannot list all resourcequotas in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.802041 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/storageclass/default/admission.go:75: Failed to list *storage.StorageClass: User "system:openshift-master" cannot list all storage.k8s.io.storageclasses in the cluster
Feb 14 15:32:52 ip-172-31-2-44 atomic-openshift-master: E0214 15:32:52.802262 74760 reflector.go:199] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: Failed to list *api.Secret: User "system:openshift-master" cannot list all secrets in the cluster
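To judge whether list failures like these taper off or keep recurring, it can help to group them by resource kind and note the first and last time each was seen. The following is a minimal sketch, not a tool used in this report: the function name summarize_list_failures and the regular expression are assumptions written against the message format quoted above, and it expects one log entry per input line.

```python
import re
from collections import defaultdict

# Matches reflector failures of the form quoted above, for example:
#   ... E0214 15:32:51.662287 74760 reflector.go:199] <source>: Failed to list *api.Secret: ...
LIST_FAILURE = re.compile(
    r"E\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ reflector\.go:\d+\] "
    r"\S+: Failed to list \*(?P<kind>[\w.]+):"
)


def summarize_list_failures(lines):
    """Return {kind: (count, first_time, last_time)} for 'Failed to list' entries.

    Times are compared as strings, which is only valid while all entries
    fall on the same day and use the same zero-padded format.
    """
    seen = defaultdict(list)
    for line in lines:
        match = LIST_FAILURE.search(line)
        if match:
            seen[match.group("kind")].append(match.group("time"))
    return {kind: (len(times), min(times), max(times)) for kind, times in seen.items()}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize.py < master.log
    for kind, (count, first, last) in sorted(summarize_list_failures(sys.stdin).items()):
        print(f"{kind}: {count} failures, first {first}, last {last}")
```

If the last-seen time for every kind stops advancing shortly after startup, the denials are probably the transient kind; if it keeps moving, they are not.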
Created attachment 1250335 [details] Before dump (keys only)
Created attachment 1250336 [details] After dump
It's blocked on the admission controller post conversion.
Some number of denials are normal at server start as the authz cache fills. Do the list errors continue indefinitely in the log?
Will re-test today and give it more time. Will keep the env around as well if anyone wants to take a look at it.
OCP 3.5.0.20 and etcd 3.1.0

- created 2 projects with deployments, builds, secrets, routes, services etc. Everything working.
- shut down etcd, shut down master (single master)
- ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd --no-ttl
- Updated master-config.yaml to use storage-backend "etcd3"
- started etcd, started master, restarted all nodes

There were some initial list failures in the master log, but as indicated in comment 11 they eventually went away; they did not continue indefinitely. The logs were quiet after the nodes re-registered.

However, oc get on projects, pods, dc, services, builds, etc. did not return any of the resources created pre-migration. No errors in the log (attached). The default projects are there, but none of the imagestreams, templates, etc. that are part of the install. New resources can be created and displayed, but pre-migration items are gone even though they still seem to exist in the output of etcdctl get "" --from-key
Created attachment 1250628 [details] Log for initial master startup after etcd data migration
Whatever the migration is doing, it is making etcd appear completely empty to the master at startup:

ensure.go:222] No cluster policy found. Creating bootstrap policy based on: /etc/origin/master/policy.json
Can you dump the contents of etcd after the master has started up after migration? I want to find out where the new content is getting stored.
Created attachment 1250636 [details] Dump after master restart and creating 1 project (keys only)
Created attachment 1250637 [details] Dump after master restart and creating 1 project (keys/values)
Projects mff0 and mff1 created before migration. Project mff (sorry for similarity) created after migration and master restart.
There are two sets of keys, one with leading slashes and one without. It looks like the etcd3 client does not include the leading '/' when accessing the data, so all existing data seems to disappear. New data is created without leading slashes.
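Since etcd3 stores keys in a flat key space, a key with a leading '/' and the same key without it are simply two unrelated keys, and a prefix query for one form never returns the other. The sketch below simulates that with a plain dictionary rather than a real etcd client. The helper names list_prefix and slash_twins and the key paths are illustrative assumptions, not copied from the attached dumps; only the project names mff0, mff1 and mff come from this report.

```python
def list_prefix(store, prefix):
    """Return the keys in a flat key space that start with prefix, sorted."""
    return sorted(key for key in store if key.startswith(prefix))


def slash_twins(keys):
    """Return keys that exist both with and without a leading '/'."""
    keys = set(keys)
    return sorted(k for k in keys if not k.startswith("/") and "/" + k in keys)


# Illustrative keys only: migrated data keeps its leading '/',
# data written after the migration has none.
store = {
    "/kubernetes.io/namespaces/default": "migrated",
    "/kubernetes.io/namespaces/mff0": "migrated",
    "/kubernetes.io/namespaces/mff1": "migrated",
    "kubernetes.io/namespaces/default": "created after migration",
    "kubernetes.io/namespaces/mff": "created after migration",
}

# What a client that omits the leading '/' would see: only the new keys.
print(list_prefix(store, "kubernetes.io/namespaces/"))

# The migrated keys are still present under the other prefix.
print(list_prefix(store, "/kubernetes.io/namespaces/"))

# Keys duplicated across both forms, such as re-created defaults.
print(slash_twins(store))
```

Running slash_twins over a keys-only dump would be one way to confirm that the two sets overlap only where the master re-created objects at startup.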
Part of the fix is https://github.com/kubernetes/kubernetes/pull/42506
I will open another PR to handle some decoder issues.
Upon further research the only change we need is https://github.com/kubernetes/kubernetes/pull/42506 (it may be a bit before we have this change in origin). Decoder issues only occur in unsupported configurations (going from etcdv2+protobuf to etcdv3+protobuf instead of the supported etcdv2+json to etcdv3+protobuf).
Origin PR open at https://github.com/openshift/origin/pull/13298
The fix is merged into master and the 1.5 release branch. No pick is required for OSE. https://github.com/openshift/origin/pull/13298 https://github.com/openshift/origin/pull/13299
This has been merged into ocp and is in OCP v3.5.0.52 or newer.
Verified on 3.5.0.52.

1. Run cluster-loader to create 10 projects with builds, bcs, svc, routes, dcs, rcs, pods and secrets.
2. Shut down master and etcd
3. ETCDCTL_API=3 ./etcdctl migrate --data-dir=${data_dir} --no-ttl
4. Restart etcd
5. Configure master-api for etcd3 storage
6. Restart master

Verify all expected resources exist, all pods are running, users can log in, etc. Create new resources and verify they work as expected.
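The "all expected resources exist" check can be made mechanical by recording resource identifiers before the migration and comparing them with what the API returns afterwards. This is only a sketch of that comparison: diff_inventory is a hypothetical helper, the identifier format is an assumption, and how the identifiers are collected is left open.

```python
def diff_inventory(before, after):
    """Compare two collections of resource identifiers.

    Returns (missing, added): identifiers present only before the
    migration, and identifiers present only after it, both sorted.
    """
    before, after = set(before), set(after)
    return sorted(before - after), sorted(after - before)


# Hypothetical identifiers in "<project>/<kind>/<name>" form.
pre_migration = ["proj0/dc/app", "proj0/svc/app", "proj1/dc/app"]
post_migration = ["proj0/dc/app", "proj0/svc/app", "proj1/dc/app", "proj2/dc/new"]

missing, added = diff_inventory(pre_migration, post_migration)
print("missing after migration:", missing)
print("created after migration:", added)
```

An empty missing list corresponds to the expected result in this report; anything in it points at content that did not survive the migration.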
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884