Bug 1523814 - etcd3 migrate playbook fails with single master topology
Summary: etcd3 migrate playbook fails with single master topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-08 19:30 UTC by Timothy Rees
Modified: 2018-02-26 02:09 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When running the etcd v2 to v3 migration playbooks included in the 3.7 release, the playbooks incorrectly assumed that all master services were HA services (i.e., atomic-openshift-master-api and atomic-openshift-master-controllers rather than atomic-openshift-master), which is the norm on 3.7. However, the migration playbooks are meant to be executed prior to upgrading to 3.7, so this assumption was incorrect. The migration playbooks have been updated to start and stop the correct services, ensuring proper migration.
Clone Of:
Environment:
Last Closed: 2018-01-23 17:59:00 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0113 normal SHIPPED_LIVE OpenShift Container Platform 3.7 and 3.6 bug fix and enhancement update 2018-01-23 22:55:59 UTC

Description Timothy Rees 2017-12-08 19:30:15 UTC
Description of problem:

When performing an etcd3 schema migration using the provided Ansible playbook on a single-master topology (v3.6) in preparation for a 3.7 upgrade, the schema migration fails when stopping the masters. The playbook assumes multiple masters when only a single master is configured.
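On a single (non-HA) master only the combined atomic-openshift-master unit exists; the split -api/-controllers units appear only with native-HA masters. A minimal sketch of the service selection the playbook would need (the helper name and the "ha" flag are hypothetical; a real playbook would detect the topology itself):

```shell
# Hypothetical helper: return the master service names for this topology.
# "$1" is "ha" for native-HA masters (split units); anything else means a
# single combined master unit, as on a 3.6 single-master install.
pick_master_services() {
  if [ "$1" = "ha" ]; then
    echo "atomic-openshift-master-api atomic-openshift-master-controllers"
  else
    echo "atomic-openshift-master"
  fi
}

pick_master_services ha      # -> atomic-openshift-master-api atomic-openshift-master-controllers
pick_master_services single  # -> atomic-openshift-master
```

Stopping the HA unit names on a single-master host produces exactly the "Could not find the requested service" failures shown below.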

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.6.173.0.75-1.git.0.0a44128.el7.noarch
rpm -q ansible
ansible-2.4.1.0-1.el7.noarch
ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]

How reproducible:
Always

Steps to Reproduce:
1. Start with deployment of v3.6 OCP with etcd v2 schema
2. (per docs [1]) Enable v3.7 repo, install latest atomic-openshift-utils package
# subscription-manager repos --disable="rhel-7-server-ose-3.6-rpms"     --enable="rhel-7-server-ose-3.7-rpms"     --enable="rhel-7-server-extras-rpms"     --enable="rhel-7-fast-datapath-rpms"
# yum update atomic-openshift-utils
3. (not documented step) Disable v3.7 repo and enable v3.6 repo to avoid issue "[openshift_version : Fail if rpm version and docker image version are different]"
# subscription-manager repos --disable="rhel-7-server-ose-3.7-rpms"     --enable="rhel-7-server-ose-3.6-rpms"
4. (per docs [1]) Run etcd migrate playbook
# ansible-playbook -i .config/openshift/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml

Actual results:

TASK [Stop masters] ************************************************************************************************************************************************
failed: [ocp-master] (item=atomic-openshift-master-controllers) => {"changed": false, "failed": true, "item": "atomic-openshift-master-controllers", "msg": "Could not find the requested service atomic-openshift-master-controllers: host"}
failed: [ocp-master] (item=atomic-openshift-master-api) => {"changed": false, "failed": true, "item": "atomic-openshift-master-api", "msg": "Could not find the requested service atomic-openshift-master-api: host"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

PLAY RECAP *********************************************************************************************************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0
ocp-master                 : ok=80   changed=4    unreachable=0    failed=1
ocp-node1                  : ok=63   changed=6    unreachable=0    failed=0
ocp-node2                  : ok=63   changed=6    unreachable=0    failed=0


Expected results:

Playbook to complete.

Additional info:

Worked around the issue by replacing lines 31 and 32 of:
https://github.com/openshift/openshift-ansible/blob/0f98871d0f4cf39eded2fcd6041fcea4f83bbed6/playbooks/openshift-etcd/private/migrate.yml#L35
(playbook file: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-etcd/migrate.yml)
with the following:

      master_services:
#      - "{{ openshift.common.service_type + '-master-controllers' }}"
#      - "{{ openshift.common.service_type + '-master-api' }}"
      - "{{ openshift.common.service_type + '-master' }}"

[1] https://docs.openshift.com/container-platform/3.7/install_config/upgrading/migrating_etcd.html#install-config-upgrading-etcd-data-migration
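A more general workaround would derive the list from the topology instead of hard-coding either form. This is only a sketch, not the merged fix from the PR referenced below; `openshift_master_ha` stands in for whatever HA detection the playbook actually uses:

```yaml
      # Sketch only: pick the service names from the detected topology.
      # openshift_master_ha is a placeholder for the playbook's real HA
      # detection; the actual fix may differ.
      master_services: "{{ [openshift.common.service_type + '-master-api',
                            openshift.common.service_type + '-master-controllers']
                           if openshift_master_ha | default(false) | bool
                           else [openshift.common.service_type + '-master'] }}"
```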

Comment 1 Scott Dodson 2017-12-11 17:07:44 UTC
This technically only happens when using the 3.7 playbooks, so I've updated the version to 3.7.0.

Proposed fix https://github.com/openshift/openshift-ansible/pull/6428

Comment 2 liujia 2017-12-13 03:38:12 UTC
@Scott

This should not be a bug. It seems the user used the wrong playbook. AFAIK, the etcd v2 to v3 migration should be done with the v3.6 playbook (the previous release) before upgrading to v3.7.

So the right steps should be:
If the current cluster is v3.6 and was upgraded from v3.5:
1) Migrate v2 to v3 with the v3.6 playbook.
2) Update atomic-openshift-utils for the v3.7 upgrade.

If the current cluster is v3.6 and was newly installed:
1) No migration is needed.
2) Update atomic-openshift-utils for the v3.7 upgrade.

Comment 3 Timothy Rees 2017-12-13 07:19:29 UTC
(In reply to liujia from comment #2)
> @Scott
> 
> This should not be a bug. It seems the user used the wrong playbook. AFAIK,
> the etcd v2 to v3 migration should be done with the v3.6 playbook (the
> previous release) before upgrading to v3.7.
> 
> So the right steps should be:
> If the current cluster is v3.6 and was upgraded from v3.5:
> 1) Migrate v2 to v3 with the v3.6 playbook.
> 2) Update atomic-openshift-utils for the v3.7 upgrade.

If this is actually the case, then the v3.7 docs need amending to reflect this procedure. It is not what is currently outlined.

Comment 4 Scott Dodson 2017-12-13 13:55:16 UTC
Jia,

We didn't deliver the embedded-to-external migration playbooks until the 3.7 release, so I think it's highly likely that when an admin attempts to upgrade their environment from 3.6 to 3.7, they will first be blocked by needing to migrate from v2 to v3. Then, when they attempt to run that migration, they'll be informed that they need to migrate from embedded to external etcd, which is currently only possible using the 3.7 playbooks. Therefore I think we should support both migration playbooks in the 3.7 playbooks.

Comment 5 liujia 2017-12-14 02:21:11 UTC
(In reply to Scott Dodson from comment #4)
> Jia,
> 
> We didn't deliver the embedded-to-external migration playbooks until the
> 3.7 release, so I think it's highly likely that when an admin attempts to
> upgrade their environment from 3.6 to 3.7, they will first be blocked by
> needing to migrate from v2 to v3. Then, when they attempt to run that
> migration, they'll be informed that they need to migrate from embedded to
> external etcd, which is currently only possible using the 3.7 playbooks.
> Therefore I think we should support both migration playbooks in the 3.7
> playbooks.

Scott,

Yes, we delivered the embedded-to-external migration in v3.7, but we delivered the etcd v2 to v3 migration in v3.6. When an admin attempts to upgrade a cluster from v3.6 to v3.7, they should first migrate v2 to v3 with the v3.6 playbook, so here the playbook should still be v3.6. Then they can do further upgrades through the v3.7 playbooks, for example migrating embedded to external and then upgrading to v3.7 with the v3.7 playbook.

> "For existing clusters that upgraded to OpenShift Container Platform 3.6, however, the etcd data must be migrated from v2 to v3 as a post-upgrade step. "

The doc strongly suggests doing the v2 to v3 migration as a post-upgrade step for a v3.6 cluster that was upgraded, so I think the user should use the v3.6 playbook. However, the doc also contains many misleading descriptions and steps about the migration, which lead users to update to the v3.7 playbooks first.
 
Anyway, considering compatibility, I agree with you that we can support it in v3.7 and later versions too.

Comment 7 liujia 2018-01-04 10:52:17 UTC
Reproduced on openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch
1. Install OCP v3.5
2. Upgrade OCP v3.5 to v3.6 without migrating etcd.
3. Run the etcd migration with the v3.7 openshift-ansible.

TASK [Stop masters] *********************************************************************************************************************************************************
failed: [x.x.x.x] (item=atomic-openshift-master-controllers) => {"changed": false, "item": "atomic-openshift-master-controllers", "msg": "Could not find the requested service atomic-openshift-master-controllers: host"}
failed: [x.x.x.x] (item=atomic-openshift-master-api) => {"changed": false, "item": "atomic-openshift-master-api", "msg": "Could not find the requested service atomic-openshift-master-api: host"}

Comment 8 liujia 2018-01-04 14:45:41 UTC
Version:
openshift-ansible-3.7.18-1.git.0.a01e769.el7.noarch

1. Install OCP v3.5
2. Upgrade OCP v3.5 to v3.6 without migrating etcd.
3. Run the etcd migration with the v3.7 openshift-ansible.

Migration succeeded.
# master-config.yaml
storage-backend:
    - etcd3
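One way to confirm the migration took effect is to check the storage backend in master-config.yaml (normally /etc/origin/master/master-config.yaml on the master). The sketch below uses a temp file with an assumed fragment shape so it is self-contained; on a real host you would grep the actual config:

```shell
# Sketch: verify the API server's storage backend after migration.
# The heredoc stands in for /etc/origin/master/master-config.yaml; the
# fragment shape (kubernetesMasterConfig.apiServerArguments) is an
# assumption based on typical OCP 3.x configs.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd3
EOF
backend=$(grep -q 'etcd3' "$cfg" && echo etcd3 || echo etcd2)
echo "storage backend: $backend"
rm -f "$cfg"
```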

Comment 11 errata-xmlrpc 2018-01-23 17:59:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0113

