Bug 1463493 - Migration fails when using the embedded etcd
Summary: Migration fails when using the embedded etcd
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-21 06:23 UTC by Anping Li
Modified: 2017-10-12 11:43 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-12 11:43:13 UTC
Target Upstream Version:
Embargoed:



Description Anping Li 2017-06-21 06:23:00 UTC
Description of problem:
The task 'Get member item health status' cannot detect the cluster status when the embedded etcd is used, so the migration fails.
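
For reference, the failing task boils down to an etcd v2 health query of roughly this kind (a manual sketch only, not the role's literal code; the endpoint and certificate paths are assumptions):

etcdctl --ca-file /etc/etcd/ca.crt \
        --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key \
        --endpoints https://172.16.120.249:2379 \
        cluster-health

With the embedded etcd there is no standalone etcd member answering on the expected endpoint, so the check reports the member as unhealthy.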

Version-Release number of selected component (if applicable):
openshift-ansible/pull/4492

How reproducible:
always

Steps to Reproduce:
1. install OCP v3.5 with the embedded etcd (no dedicated etcd hosts)
2. upgrade to v3.6
3. migrate to etcd v3
   ansible-playbook openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml


Actual results:
TASK [etcd_migrate : Get member item health status] ****************************

TASK [etcd_migrate : Check the etcd cluster health] ****************************
fatal: [host-8-175-59.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Etcd member 172.16.120.249 is not healthy

    to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

PLAY RECAP *********************************************************************
host-8-175-59.host.centralci.eng.rdu2.redhat.com : ok=12   changed=2    unreachable=0    failed=1   
localhost                  : ok=10   changed=0    unreachable=0    failed=0   

Expected results:
The health check detects the cluster status of the embedded etcd and the migration completes successfully.

Additional info:

Comment 1 Anping Li 2017-06-21 06:34:05 UTC
The example inventory
[OSEv3:children]
masters
nodes
[OSEv3:vars]
ansible_ssh_user=root
xxxx
xxxx
[masters]
master.example.com
[nodes]
master.example.com
node.example.com
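Note that the inventory defines no [etcd] group, so etcd runs embedded in the master process; this is the configuration the health check fails on.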

Comment 2 Jan Chaloupka 2017-06-22 13:07:21 UTC
The `etcd_migrate` role should not be run at all if the [etcd] hosts list is empty. After updating the tasks and running etcd_migrate over oo_first_master, I get the following error:

2017-06-22 09:01:00.347242 I | etcdserver/membership: added member 395f2befffe04859 [https://172.16.186.29:7001] to cluster 0
2017-06-22 09:01:00.347442 N | etcdserver/membership: set the initial cluster version to 3.2
2017-06-22 09:01:00.347452 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.1.9 is lower than determined cluster version: 3.2).

# openshift version
openshift v3.6.121
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

So before the embedded etcd migration is even run, the etcd client's x.y version must be at least as high as the etcd server's x.y version.
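
A minimal pre-flight check along these lines would surface the mismatch before the migration starts (illustrative shell only; the host and client port are assumptions based on the embedded etcd peer URL in the log above):

# client (etcdctl) version, e.g. "etcdctl version: 3.1.9"
etcdctl --version
# embedded etcd server version, e.g. {"etcdserver":"3.2.0","etcdcluster":"3.2.0"}
curl -sk https://172.16.186.29:4001/version

The migration should be refused whenever the client's x.y is lower than the server's x.y.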

Comment 3 Jan Chaloupka 2017-06-23 08:49:45 UTC
With the current implementation of the migration, we cannot migrate the embedded etcd. The migration workflow corresponds to the following (offline migration):

1. disable master API
2. disable etcd members
3. migrate each etcd member
4. enable etcd members
5. re-attach leases
6. validate data (optional atm)
7. enable master API
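
In shell terms, the per-member steps are roughly the following (a sketch assuming a 3.6 host with dedicated etcd; service names, paths, and the lease re-attachment command are illustrative, not the playbook's literal code):

systemctl stop atomic-openshift-master        # 1. disable master API
systemctl stop etcd                           # 2. disable etcd members
ETCDCTL_API=3 etcdctl migrate \
    --data-dir=/var/lib/etcd                  # 3. migrate v2 data into the v3 store
systemctl start etcd                          # 4. enable etcd members
oadm migrate etcd-ttl \
    --etcd-address=https://<etcd-host>:2379 \
    --ttl-keys-prefix=/kubernetes.io/events \
    --lease-duration=1h                       # 5. re-attach leases to TTL keys
                                              # 6. (optional) validate v2 vs v3 data
systemctl start atomic-openshift-master       # 7. enable master API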

Given that the embedded etcd is part of the master, there is no such thing as "disable/enable etcd". So in order to re-attach the leases, one needs to start the master API, which makes the validation impossible (once the master is started, the v3 data starts to change). At the same time, the leases cannot be re-attached, as the master API is enabled outside of the etcd_migrate role.

At most we can migrate the v2 data to v3, but without any validation or lease re-attachment.

Comment 4 Jan Chaloupka 2017-06-23 09:44:13 UTC
Or, we could run the etcd daemon just for the duration of the lease re-attachment (and validation). Once done, we simply stop the daemon and continue. One disadvantage of this approach is the assumption that the etcd rpm will always be installed on the master host. So far the rpm has been installed because of the etcdctl command. Once/if the etcd rpm gets split into an etcd rpm (with the etcd binary) and an etcd-client rpm (with the etcdctl binary), this approach will no longer work, unless we install the etcd binary explicitly.
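
In shell terms, the idea sketched here would look something like this (assuming the etcd rpm, and thus the etcd unit and binary, is present on the master, and that the unit is pointed at the embedded data directory; all paths are assumptions):

systemctl stop atomic-openshift-master        # embedded etcd goes down with the master
ETCDCTL_API=3 etcdctl migrate \
    --data-dir=/var/lib/origin/openshift.local.etcd
systemctl start etcd                          # temporary standalone etcd over the same data
oadm migrate etcd-ttl --etcd-address=https://127.0.0.1:2379 \
    --ttl-keys-prefix=/kubernetes.io/events --lease-duration=1h
systemctl stop etcd                           # stop the temporary daemon again
systemctl start atomic-openshift-master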

Comment 5 Anping Li 2017-06-23 09:55:33 UTC
Jan, we cannot build an etcd cluster with the embedded etcd, so the migration doesn't need to count etcd members.
Also, Scott mentioned that OCP will not support the embedded etcd, but I have not found the official announcement. I think we can give this bug low priority.

Comment 6 Jan Chaloupka 2017-06-23 10:24:52 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/4558

Comment 9 Jan Chaloupka 2017-10-12 11:43:13 UTC
The v2->v3 migration of an embedded etcd is deprecated. Instead, one needs to run:
1. `playbooks/byo/openshift-etcd/embedded2external.yml` to migrate the embedded etcd to an external one (see https://github.com/openshift/openshift-ansible/pull/5672)
2. then `playbooks/byo/openshift-etcd/migrate.yml` to migrate the v2 data to v3 data
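
Concretely, a typical invocation of these two steps looks like this (the inventory path is a placeholder):

ansible-playbook -i /path/to/inventory \
    playbooks/byo/openshift-etcd/embedded2external.yml
ansible-playbook -i /path/to/inventory \
    playbooks/byo/openshift-etcd/migrate.yml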

Upstream PR to enforce the limitation: https://github.com/openshift/openshift-ansible/pull/5733

