Bug 1501752 - OCP cluster does not work after migrate from etcd2 to etcd3 if no .snap file is created before migrate
Summary: OCP cluster does not work after migrate from etcd2 to etcd3 if no .snap file ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: liujia
URL:
Whiteboard:
Duplicates: 1552252
Depends On:
Blocks: 1724792
 
Reported: 2017-10-13 06:42 UTC by liujia
Modified: 2019-06-28 16:04 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The etcd v3 data was migrated before the first snapshot of the v2 data was written. Consequence: Without a v2 snapshot, the v3 data was not propagated properly to the remaining etcd members, resulting in the loss of some v3 data. Fix: Check that there is at least one v2 snapshot before the etcd data migration proceeds. Result: The etcd v3 data is properly distributed among all members.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:16:57 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description liujia 2017-10-13 06:42:13 UTC
Description of problem:
Upgrade OCP v3.5 to v3.6 (HA), ensure OpenShift works well, and then migrate the etcd v2 data to etcd v3. After the migration finishes, OpenShift does not work; for example, some original data cannot be retrieved with the "oc get" command, and the original app and project are missing.

# cat master-config.yaml (excerpt)
storage-backend:
- etcd3
storage-media-type:
- application/vnd.kubernetes.protobuf
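
These arguments live under kubernetesMasterConfig.apiServerArguments in master-config.yaml; in full context the excerpt reads roughly:

kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf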

# oc get pod
No resources found.
# oc get project  //There should be a project named install-test.
NAME              DISPLAY NAME   STATUS
default                          Active
kube-public                      Active
kube-system                      Active
openshift                        Active
openshift-infra                  Active

==========================
some debug info
==========================
1) Restart all master API and controller services
Still abnormal, but "oc get pod" and "oc get project" work now, while "oc get node" does not.
# oc get project
NAME               DISPLAY NAME   STATUS
default                           Active
install-test                      Active
kube-public                       Active
kube-system                       Active
logging                           Active
management-infra                  Active
openshift                         Active
openshift-infra                   Active

# oc get node
No resources found.

2) Restart etcd service
# oc get node
No resources found.

3) Stop all master and etcd services, add "ETCD_FORCE_NEW_CLUSTER=true" to etcd.conf on the etcd1 host, and start the master and etcd services on the etcd1 host.

works well

4) Stop all master and etcd services, add "ETCD_FORCE_NEW_CLUSTER=true" to etcd.conf on the etcd2 host, and start the master and etcd services on the etcd2 host.
# oc get node
Error from server (Forbidden): User "system:admin" cannot list all nodes in the cluster
# oc get pod
Error from server (Forbidden): User "system:admin" cannot list pods in project "default"


Version-Release number of the following components:
atomic-openshift-utils-3.6.173.0.21-2.git.0.44a4038.el7.noarch
openshift-ansible-roles-3.6.173.0.21-2.git.0.44a4038.el7.noarch
openshift-ansible-3.6.173.0.21-2.git.0.44a4038.el7.noarch
openshift-ansible-playbooks-3.6.173.0.21-2.git.0.44a4038.el7.noarch
ansible-2.4.0.0-5.el7.noarch
etcd-3.2.5-1.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. HA install OCP v3.5
2. Upgrade v3.5 to v3.6
3. Run the etcd migration
# ansible-playbook -i hosts  /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml 

Actual results:
After the migration, the cluster does not work.

Expected results:
The cluster should work well after the etcd migration.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2017-10-13 13:49:37 UTC
What version of etcd is installed on the etcd hosts? This may be 

https://bugzilla.redhat.com/show_bug.cgi?id=1489168

which is fixed by upgrading to etcd 3.2.7, which should ship soon. You can get a brew build here:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=597908

Can you try that?

Comment 4 Scott Dodson 2017-10-13 13:49:59 UTC
Also can you please grab master logs from all masters?

Comment 6 Jordan Liggitt 2017-10-13 15:25:45 UTC
no, `oc get` does not work from the watch cache

Comment 7 Scott Dodson 2017-10-13 19:22:48 UTC
(In reply to liujia from comment #0)

> 3) Stop all master and etcd service, and add "ETCD_FORCE_NEW_CLUSTER=true"
> in etcd.conf on one etcd1 host, and start master and etcd service on etcd1
> host.
> 
> works well
> 
> 4) Stop all master and etcd service, and add "ETCD_FORCE_NEW_CLUSTER=true"
> in etcd.conf on one etcd2 host, and start master and etcd service on etcd2
> host.
> # oc get node
> Error from server (Forbidden): User "system:admin" cannot list all nodes in
> the cluster
> # oc get pod
> Error from server (Forbidden): User "system:admin" cannot list pods in
> project "default"

I think you just did this to debug, but you should never do #3 or #4 after having run the migration. You should only ever force a new cluster on one host; after doing that, you need to add the other members one by one, which is what the migration playbooks do for you.

Reading through the migration log I can't find anything out of the ordinary. We'll definitely need master logs and etcd host logs. Ideally full journals from all of them.

Is there any chance that the load balancer in front of the api server is simply misconfigured? Can you try overriding the server endpoint?

`oc --server https://qe-jliu-ha-master-etcd-1.1013-16d.qe.rhcloud.com:8443 get nodes`

etc., to target each API server and isolate which one is returning invalid results.
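
For reference, the safe re-add sequence (what the migration playbooks do for you) looks roughly like this, with illustrative hostnames and the usual OpenShift etcd certificate paths:

# On the surviving member, after forcing a new cluster:
etcdctl --ca-file /etc/etcd/ca.crt --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key --endpoints https://etcd1:2379 \
        member add etcd2 https://etcd2:2380
# Then on etcd2: set ETCD_INITIAL_CLUSTER_STATE=existing in /etc/etcd/etcd.conf,
# start etcd with an empty data directory, and repeat for etcd3.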

Comment 8 liujia 2017-10-16 02:44:34 UTC
(In reply to Scott Dodson from comment #7)

> I think you just did this to debug, but you should never do #3 or #4 after
> having run the migration. You should only ever force a new cluster on one
> host; after doing that, you need to add the other members one by one, which
> is what the migration playbooks do for you.
> 
Yes, doing #3 and #4 was just for debugging, because I suspected the etcd data was inconsistent after the migration.

> Reading through the migration log I can't find anything out of the ordinary.
> We'll definitely need master logs and etcd host logs. Ideally full journals
> from all of them.
> 
> Is there any chance that the load balancer in front of the api server is
> simply misconfigured? Can you try overriding the server endpoint?
> 
> `oc --server https://qe-jliu-ha-master-etcd-1.1013-16d.qe.rhcloud.com:8443
> get nodes`
> 
> etc., to target each API server and isolate which one is returning invalid
> results.

I will do that and try to give more logs.

Comment 9 liujia 2017-10-16 02:45:48 UTC
(In reply to Scott Dodson from comment #3)
> What version of etcd is installed on the etcd hosts? This may be 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1489168
> 
> which is fixed by upgrading to etcd 3.2.7, which should ship soon. You can
> get a brew build here:
> 
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=597908
> 
> Can you try that?

The etcd version is etcd-3.2.5-1.el7.x86_64; I will try again with etcd 3.2.7.

Comment 17 Jan Chaloupka 2017-10-16 19:44:58 UTC
Liujia, can you be more specific on the exact steps you took to deploy the cluster, upgrade it, and migrate it? E.g. how did you create the install-test project? Where is the template/command I can use to create it? What rpm version of openshift-ansible did you use to deploy the 3.5 cluster? Did you use the attached inventory for the deployment, upgrade, and etcd data migration? Or was there any modification between the steps?

At the moment I cannot tell whether the migration failed or the etcd v3 data got corrupted after the migration. If the migrate.yml playbook ran successfully, something had to delete the nodes.

Are you able to reproduce it again? If so, can you do the following?

1) deploy a new 3.5 cluster and back up etcd afterwards
2) upgrade the 3.5 cluster to 3.6 and back up etcd afterwards
3) migrate the v2 etcd data to v3 in the 3.6 cluster and back up etcd afterwards

The upgrade itself creates the backup before and after the control plane is upgraded. However, I would like to have a backup right after the entire cluster is upgraded so I know the state of the etcd data right after the upgrade is done. So at the end I would like to ask you for:

1) a backup right after the 3.5 cluster is deployed
2) a backup right after the 3.5->3.6 cluster upgrade
3) a backup right after the v2->v3 migration.

It will help me to compare the list of nodes right after deployment, right after the upgrade and right after the migration. Thank you.
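
If it helps, a v2 backup at each stage can be taken on an etcd host with something like this (the backup directory name is just an example):

# etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd-backup-<stage>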

Comment 18 liujia 2017-10-17 06:04:39 UTC
Jan,

The issue can be reproduced both with openshift-ansible-3.6.173.0.21-2 (the latest available released version when we hit bug 1500631, to which QE fell back) and with atomic-openshift-utils-3.6.173.0.53-1 (the latest build I just tried, which fixes bug 1500631).

The exact steps:
1. HA install ocp v3.5.5.31.36 with openshift-ansible-3.5.134-1.git.0.e5f4029.el7.noarch.rpm.

2. Create new project and new-app
#oc new-project install-test
#oc new-app nodejs-mongodb-example

3. Upgrade v3.5 to v3.6.
After the upgrade, check that "oc get" works well and the original app works well.

4. Run the etcd migration with openshift-ansible-3.6.173.0.21-2.git.0.44a4038.el7.noarch.
After the migration, do the same checks as in step 3; that is when the issue appears.

All of the above operations used the same inventory file; the only difference was that "openshift_disable_check=*" was added for the upgrade/migration.

I will leave the new environment for you; the old one has been cleaned up.

Comment 27 Scott Dodson 2017-10-18 14:12:58 UTC
We cannot reproduce this with the latest version of the 3.6 playbooks. Can you please re-test with those?

openshift-ansible-3.6.173.0.48-1.git.0.1609d30.el7 is the latest available to customers.

Comment 28 Scott Dodson 2017-10-18 14:14:54 UTC
When attempting to reproduce, please ensure that no additional steps are taken other than investigating the status of objects.

i.e.: 3.5 install, 3.6 upgrade, playbooks/byo/openshift-etcd/migrate.yml, `oc get nodes`

If the problem is reproduced, let's gather the complete journal from all hosts in the cluster with `journalctl --no-pager`.

Comment 35 Scott Dodson 2017-10-25 20:05:05 UTC
Inspecting the controller logs from comment 11 shows that after the migration is completed the controllers can no longer look up the instances. Moving this over to the master team to debug, but without API server, controller, and etcd logs from each master this will be hard to pin down; unless I'm missing it, I don't see any attachment thus far that includes the complete set of logs from all masters.

Comment 36 Scott Dodson 2017-10-25 20:24:49 UTC
A bit of log analysis; there's more in the attachment in comment 11 than what I've posted here.

Oct 16 02:29:35 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopping Atomic OpenShift Master Controllers...
Oct 16 02:29:35 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopped Atomic OpenShift Master Controllers.
Oct 16 02:29:37 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopping Atomic OpenShift Master API...
Oct 16 02:29:37 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopped Atomic OpenShift Master API.
Oct 16 02:30:01 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopping Etcd Server...

master and etcd are stopped, migration happens here

Oct 16 02:30:20 qe-jliu-35ha-master-etcd-1 systemd[1]: Starting Etcd Server...
Oct 16 02:30:20 qe-jliu-35ha-master-etcd-1 etcd[54572]: recognized and used environment variable ETCD_FORCE_NEW_CLUSTER=true
Oct 16 02:30:24 qe-jliu-35ha-master-etcd-1 systemd[1]: Started Etcd Server.
Oct 16 02:30:27 qe-jliu-35ha-master-etcd-1 systemd[1]: Stopping Etcd Server...
Oct 16 02:30:27 qe-jliu-35ha-master-etcd-1 systemd[1]: Starting Etcd Server...

A new cluster is forced, then restarted.

Oct 16 02:30:52 qe-jliu-35ha-master-etcd-1 etcd[54727]: added member 285e252ad759c81f [https://10.240.0.18:2380] to cluster b1307064eb59fc6e
Oct 16 02:30:52 qe-jliu-35ha-master-etcd-1 etcd[54727]: starting peer 285e252ad759c81f...
Oct 16 02:32:18 qe-jliu-35ha-master-etcd-1 etcd[54727]: peer 285e252ad759c81f became active

2nd etcd host is added and becomes active after copying over a snapshot

Oct 16 02:32:19 qe-jliu-35ha-master-etcd-1 etcd[54727]: 6fc5620f15e1a419 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
Oct 16 02:32:53 qe-jliu-35ha-master-etcd-1 etcd[54727]: added member 5da5bafb874ce969 [https://10.240.0.19:2380] to cluster b1307064eb59fc6e
Oct 16 02:34:31 qe-jliu-35ha-master-etcd-1 etcd[54727]: peer 5da5bafb874ce969 became active

3rd etcd host is added and becomes active after copying over a snapshot


Oct 16 02:34:33 qe-jliu-35ha-master-etcd-1 etcd[54727]: health check for peer 5da5bafb874ce969 could not connect: dial tcp 10.240.0.19:2380: getsockopt: connection refused

Why'd that happen?


Oct 16 02:35:55 qe-jliu-35ha-master-etcd-1 systemd[1]: Starting Atomic OpenShift Master API...
Oct 16 02:35:57 qe-jliu-35ha-master-etcd-1 systemd[1]: Started Atomic OpenShift Master API.
Oct 16 02:35:55 qe-jliu-35ha-master-etcd-1 openshift[57858]: Failed to dial qe-jliu-35ha-master-etcd-2:2379: connection error: desc = "transport: context canceled"; please retry.
Oct 16 02:35:55 qe-jliu-35ha-master-etcd-1 openshift[57858]: Failed to dial qe-jliu-35ha-master-etcd-1:2379: connection error: desc = "transport: context canceled"; please retry.
Oct 16 02:35:55 qe-jliu-35ha-master-etcd-1 openshift[57858]: Failed to dial qe-jliu-35ha-master-etcd-2:2379: connection error: desc = "transport: context canceled"; please retry.
Oct 16 02:35:57 qe-jliu-35ha-master-etcd-1 systemd[1]: Started Atomic OpenShift Master API.
Oct 16 02:35:57 qe-jliu-35ha-master-etcd-1 openshift[57858]: Failed to dial qe-jliu-35ha-master-etcd-1:2379: connection error: desc = "transport: context canceled"; please retry.
Oct 16 02:35:57 qe-jliu-35ha-master-etcd-1 openshift[57858]: Failed to dial qe-jliu-35ha-master-etcd-2:2379: grpc: the connection is closing; please retry.    
Oct 16 02:38:11 qe-jliu-35ha-master-etcd-1 openshift[58027]: Failed to dial qe-jliu-35ha-master-etcd-2:2379: connection error: desc = "transport: context canceled"; please retry.
Oct 16 02:38:11 qe-jliu-35ha-master-etcd-1 openshift[58027]: Failed to dial qe-jliu-35ha-master-etcd-3:2379: connection error: desc = "transport: context canceled"; please retry.

The API service starts, and for about 3 minutes there are etcd connection problems?

OK, now let's focus on what happens with node "qe-jliu-35ha-node-primary-1"

Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.029977   58027 subnets.go:35] Found existing HostSubnet qe-jliu-35ha-node-primary-1 (host: "qe-jliu-35ha-node-primary-1", ip: "10.240.0.22", subnet: "10.2.10.0/23")
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:38:12.779581   58027 actual_state_of_world.go:468] Failed to set statusUpdateNeeded to needed true because nodeName="qe-jliu-35ha-node-primary-1"  does not exist
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:38:12.779593   58027 actual_state_of_world.go:482] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true because nodeName="qe-jliu-35ha-node-primary-1"  does not exist
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.810838   58027 nodecontroller.go:616] NodeController observed a new Node: "qe-jliu-35ha-node-primary-1"
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.810843   58027 controller_utils.go:273] Recording Registered Node qe-jliu-35ha-node-primary-1 in NodeController event message for node qe-jliu-35ha-node-primary-1
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: W1016 02:38:12.811002   58027 nodecontroller.go:956] Missing timestamp for Node qe-jliu-35ha-node-primary-1. Assuming now as a timestamp.
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.811125   58027 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-35ha-node-primary-1", UID:"35bc7512-b21e-11e7-a3a9-42010af00013", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node qe-jliu-35ha-node-primary-1 event: Registered Node qe-jliu-35ha-node-primary-1 in NodeController
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.811125   58027 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-35ha-node-primary-1", UID:"35bc7512-b21e-11e7-a3a9-42010af00013", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node qe-jliu-35ha-node-primary-1 event: Registered Node qe-jliu-35ha-node-primary-1 in NodeController
Oct 16 02:38:12 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:12.855919   58027 servicecontroller.go:656] Detected change in list of current cluster nodes. New node set: [qe-jliu-35ha-node-primary-1 qe-jliu-35ha-node-primary-2 qe-jliu-35ha-node-registry-router-1 qe-jliu-35ha-node-registry-router-2]

Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:52.828968   58027 controller_utils.go:284] Recording status change NodeNotReady event message for node qe-jliu-35ha-node-primary-1
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:52.829018   58027 controller_utils.go:202] Update ready status of pods on node [qe-jliu-35ha-node-primary-1]
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:52.829187   58027 event.go:217] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"qe-jliu-35ha-node-primary-1", UID:"35bc7512-b21e-11e7-a3a9-42010af00013", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node qe-jliu-35ha-node-primary-1 status is now: NodeNotReady
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:38:52.832232   58027 controller_utils.go:219] Updating ready status of pod pltest-1-mjh11 to false
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: W1016 02:38:52.839449   58027 controller_utils.go:222] Failed to update status for pod "pltest-1-mjh11_prozyp(03aa3f98-b238-11e7-bd29-42010af00012)": namespaces "prozyp" not found
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:38:52.839494   58027 nodecontroller.go:748] Unable to mark all pods NotReady on node qe-jliu-35ha-node-primary-1: namespaces "prozyp" not found
Oct 16 02:38:52 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:38:52.943730   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:00 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:00.464889   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:06 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:06.846460   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:13 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:13.070572   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:18 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:18.772693   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:25 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:25.724411   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:31 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:31.766773   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:38 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:38.307731   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound
Oct 16 02:39:45 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: E1016 02:39:45.078817   58027 gce.go:2910] getInstanceByName: failed to get instance qe-jliu-35ha-node-primary-1; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b/instances/qe-jliu-35ha-node-primary-1' was not found, notFound

Oct 16 02:43:13 qe-jliu-35ha-master-etcd-1 atomic-openshift-master-controllers[58027]: I1016 02:43:13.989593   58027 nodecontroller.go:723] Node is unresponsive. Adding Pods on Node qe-jliu-35ha-node-primary-1 to eviction queues: 2017-10-1

The node is finally removed here?

Comment 45 liujia 2017-10-30 08:20:15 UTC
This bug blocks testing of some user-story-related cases, and QE needs to complete the user story testing before code freeze, so I am adding the testblocker tag.

Comment 49 Michal Fojtik 2017-10-31 12:19:54 UTC
If I'm reading this properly, this is not a bug, and the data will eventually be propagated to the second and third members. I'm moving this off the 3.7 blocker list; maybe we should improve our documentation about this process?

Comment 50 Jan Chaloupka 2017-10-31 14:08:08 UTC
I removed the second member via `etcdctl member remove` before it got added back to the cluster (as we do with the current playbook). The second member no longer reported any cluster ID mismatch, and all the data got propagated there:

+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.240.0.44:2379 | b7231084fa99cfa9 |   3.2.7 |   11 MB |      true |        54 |     132482 |
| https://10.240.0.46:2379 | bfd22ad79e936107 |   3.2.7 |   11 MB |     false |        54 |     132482 |
| https://10.240.0.56:2379 | 9e8140186cbee2e3 |   3.2.7 |  2.6 MB |     false |        54 |     132482 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+

After re-running the migration playbook, all members synced and the same data were present in all members. Once the quorum is established again (the first and the second member), the third member does not complain about the ID mismatch (actually, it never did). Only the original etcd members that made up the quorum needed to have the same cluster ID.
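
For reference, the removal and the status table above can be produced with commands along these lines (member ID and endpoints taken from the table; certificate paths illustrative):

# etcdctl --ca-file /etc/etcd/ca.crt --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --endpoints https://10.240.0.44:2379 member remove bfd22ad79e936107
# ETCDCTL_API=3 etcdctl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --endpoints https://10.240.0.44:2379,https://10.240.0.46:2379,https://10.240.0.56:2379 endpoint status -w table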

Comment 51 Scott Dodson 2017-10-31 15:23:25 UTC
Summarizing various discussions here, we're going to amend the process to remove all but the first member before performing the migration on the first member. We will remove the steps that force a new cluster.

Comment 55 Jan Chaloupka 2017-11-01 20:53:26 UTC
Upstream PR to check there is at least one v2 snapshot before the migration proceeds: https://github.com/openshift/openshift-ansible/pull/5982
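
Conceptually the guard is a small pre-migration task; a sketch of what it can look like (illustrative, not the exact PR contents; the path is the default etcd data directory):

- name: Check if there is at least one v2 snapshot
  find:
    paths: /var/lib/etcd/member/snap
    patterns: '*.snap'
  register: snap_files

- name: Abort the migration when no v2 snapshot is present
  fail:
    msg: "No v2 snapshot found; lower ETCD_SNAPSHOT_COUNT and restart etcd to force one before migrating."
  when: snap_files.matched == 0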

Comment 56 Jordan Liggitt 2017-11-01 22:08:30 UTC
opened coreos issue reporting problem adding members after running a v2->v3 migration when no .snap files exist - https://github.com/coreos/etcd/issues/8804

Comment 57 Scott Dodson 2017-11-02 00:05:44 UTC
A bit more detail: this only happens when there have been fewer than 10000 updates made to etcd. Once you've hit 10000 updates, a snapshot will be created. When testing in the environment that QE provided, by the time we ran the migration 10 hours had passed, two snapshots had already been created, and the migration went fine.

While this is understandably a testing challenge, it's not likely that our customers will face this, as they'll be performing the upgrade on clusters that have long since generated at least one snapshot.

A workaround is to first lower the snapshot count on your etcd hosts to trigger a snapshot on all hosts.

echo "ETCD_SNAPSHOT_COUNT=1000" >> /etc/etcd/etcd.conf && systemctl restart etcd

After having done this, you should now have a snapshot in /var/lib/etcd/member/snap/*.snap, and the migration will complete normally.

For now, in order to minimize the impact, we'll simply add a check that blocks the migration when there are no snapshots.
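
To verify the precondition before running migrate.yml, check each etcd host for a snapshot (assuming the default data directory):

# ls -l /var/lib/etcd/member/snap/*.snap

If nothing is listed, apply the ETCD_SNAPSHOT_COUNT workaround above and wait for enough updates to accumulate before migrating.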

Comment 58 liujia 2017-11-02 05:10:29 UTC
Got it. I will continue the test with the workaround. BTW, I changed the title to make it clearer.

Comment 60 liujia 2017-11-03 06:21:38 UTC
The latest 3.6 build did not yet include this PR.

Comment 61 Jan Chaloupka 2017-11-03 09:40:05 UTC
3.6 backport merged as well: https://github.com/openshift/openshift-ansible/pull/5986

Comment 62 Scott Dodson 2017-11-07 01:08:33 UTC
Fix should be in openshift-ansible-3.7.0-0.196.0

Comment 63 liujia 2017-11-07 09:26:44 UTC
Verified on openshift-ansible-3.6.173.0.69-1.git.0.5cdc460.el7.noarch.

Migration without a .snap file fails at the task [Check if there is at least one v2 snapshot].
Migration with a .snap file succeeds.

Comment 66 errata-xmlrpc 2017-11-28 22:16:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

Comment 67 Scott Dodson 2018-03-06 22:00:03 UTC
*** Bug 1552252 has been marked as a duplicate of this bug. ***

