Bug 1544737 - first containerized etcd not upgraded to latest image when migrating etcd v2-> v3
Summary: first containerized etcd not upgraded to latest image when migrating etcd v2-> v3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Vadim Rutkovsky
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-13 12:11 UTC by daniel
Modified: 2018-04-12 06:04 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-12 06:03:40 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1106 0 None None None 2018-04-12 06:04:58 UTC

Description daniel 2018-02-13 12:11:16 UTC
Description of problem:
When migrating etcd from v2 -> v3 in a containerized environment, the first etcd container never gets updated to the latest available image, while the others do.


Version-Release number of selected component (if applicable):
 3.6.173.0.96

How reproducible:


Steps to Reproduce:
1. Upgrade an existing OCP 3.6 cluster to the latest available 3.6 version (3.6.173.0.96), as well as ansible and related packages
2. run # ansible-playbook -i /etc/ansible/hosts  /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml
3. check versions, e.g.: 
# export ETCD_LISTEN_CLIENT_URLS="https://10.0.0.152:2379,https://10.0.0.153:2379,https://10.0.0.154:2379"
# ETCDCTL_API=3 /usr/bin/etcdctl --cert="/etc/etcd/peer.crt" --key="/etc/etcd/peer.key" --cacert="/etc/etcd/ca.crt"  --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint status -w table


Actual results:
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.0.152:2379   | ec57d625662aa394 |   3.2.9 |  9.2 MB |      true |        98 |    4960148 |
| https://10.0.0.153:2379   | 316d70307263b11d |  3.2.11 |  9.2 MB |     false |        98 |    4960148 |
| https://10.0.0.154:2379   | f1ee01ab7b3d594f |  3.2.11 |  9.2 MB |     false |        98 |    4960148 |
+---------------------------+------------------+---------+---------+-----------+-----------+------------+

Expected results:
all etcd nodes are at the same version, i.e. 3.2.11

Additional info:
Tried this several times and always got the same result. I think the first active node (etcd on masters) is always skipped/ignored. Going to that node, stopping etcd_container, removing the container and the image, and then starting it again brings all etcds to the same version.
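
For reference, the manual workaround described above amounts to roughly the following on the affected node. This is a sketch only: it assumes etcd runs as the etcd_container systemd unit and uses the registry.access.redhat.com/rhel7/etcd image, so the unit, container, and image names may need adjusting for a given environment.

# systemctl stop etcd_container
# docker rm etcd_container                                  (remove the stale container, if still present)
# docker rmi registry.access.redhat.com/rhel7/etcd:<tag>    (drop the cached image; <tag> is a placeholder)
# systemctl start etcd_container                            (the unit pulls the image again and starts etcd)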


Comment 1 Scott Dodson 2018-02-20 23:02:37 UTC
Daniel,

We're going to strip out all the extraneous steps that affect the etcd installation. This is certainly an interesting side effect: basically we do the migration on the first host and then currently run the scale-up playbooks on the other two, which is pretty heavy-handed. That replaces certificates and effectively re-installs etcd.

Once we update the playbooks to just re-add those hosts and start them back up, the outcome will be that all etcd hosts remain unchanged aside from the data migration.

Vadim,

Not sure whether we should mark these all as dupes or not; they're all different symptoms of the same root cause. I guess for now let's leave them all open so that QE can test to ensure that each symptom is resolved by our work.

Comment 2 Vadim Rutkovsky 2018-02-22 12:40:16 UTC
I'll check whether this is still reproducible with https://github.com/openshift/openshift-ansible/pull/7226 - I suspect the etcd upgrade during migration is actually unwanted.

Comment 3 Vadim Rutkovsky 2018-02-27 14:54:44 UTC
Fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e - the etcd migrate playbook no longer includes scaleup, so the container version won't change.
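
A quick way to confirm that a given openshift-ansible build includes the change (illustrative check, run on a host with the playbooks installed): grep the migrate playbook for the scaleup include; with the fix in place it should return nothing.

# grep -r "scaleup" /usr/share/ansible/openshift-ansible/playbooks/common/openshift-etcd/migrate.yml
(no output expected once the fix is included)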

Comment 5 liujia 2018-03-07 06:32:41 UTC
Version:
openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

Steps:
1. HA containerized install of OCP v3.5 (v3.5.5.31.48) with containerized etcd version v3.2.7.
# docker run -it --entrypoint rpm registry.access.redhat.com/rhel7/etcd:3.2.7 -qa etcd
etcd-3.2.7-1.el7.x86_64

2. Upgrade v3.5 to latest ocp v3.6.173.0.104

+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://aos-138.lab.sjc.redhat.com:2379 | b0c1d3f602268dc8 |   3.2.7 |   25 kB |      true |         9 |      75544 |
| https://aos-152.lab.sjc.redhat.com:2379 | 53c604bc13ce5de3 |   3.2.7 |   25 kB |     false |         9 |      75545 |
| https://aos-155.lab.sjc.redhat.com:2379 | 122ce3db037f9bf3 |   3.2.7 |   25 kB |     false |         9 |      75546 |
+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+

3. Run the etcd migration; migration succeeded. Checked etcd versions in the cluster.

+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://aos-138.lab.sjc.redhat.com:2379 | b0c1d3f602268dc8 |   3.2.7 |  7.7 MB |      true |       132 |      81829 |
| https://aos-152.lab.sjc.redhat.com:2379 | 658edf5257cc6250 |  3.2.11 |  7.7 MB |     false |       132 |      81829 |
| https://aos-155.lab.sjc.redhat.com:2379 | 18822de21a345140 |  3.2.11 |  7.7 MB |     false |       132 |      81829 |
+-----------------------------------------+------------------+---------+---------+-----------+-----------+------------+

Then checked and found that the PR was not merged into the latest v3.6 build.
# rpm -qa|grep openshift-ansible
openshift-ansible-playbooks-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-docs-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-roles-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-lookup-plugins-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-callback-plugins-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-filter-plugins-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch
openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

# grep -r "scaleup" /usr/share/ansible/openshift-ansible/playbooks/common/openshift-etcd/migrate.yml 
- include: ./scaleup.yml

Changing the bug back to MODIFIED to wait for the PR to be merged.

Comment 6 Vadim Rutkovsky 2018-03-07 16:56:16 UTC
Right, the PR is merged, but the release is not yet prepared

Comment 7 Scott Dodson 2018-03-12 17:22:41 UTC
openshift-ansible-3.6.173.0.105-1 no longer calls scaleup

Comment 8 liujia 2018-03-13 08:27:22 UTC
Verification is blocked due to bz1554707.

Comment 9 liujia 2018-03-20 09:43:22 UTC
Version:
openshift-ansible-3.6.173.0.110-1.git.0.ca81843.el7.noarch

Steps:
1. HA containerized install of OCP v3.5 with containerized etcd version v3.2.7.
# docker run -it --entrypoint rpm registry.access.redhat.com/rhel7/etcd:3.2.7 -qa etcd
etcd-3.2.7-1.el7.x86_64

2. Upgrade v3.5 to latest ocp v3.6.173.0.110
+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-18-2-113.ec2.internal:2379 | 9459974f264be826 |   3.2.7 |   25 kB |      true |        13 |      48877 |
| https://ip-172-18-2-175.ec2.internal:2379 | c1b23e750866c037 |   3.2.7 |   25 kB |     false |        13 |      48878 |
| https://ip-172-18-10-56.ec2.internal:2379 |  be6ae0df781edce |   3.2.7 |   25 kB |     false |        13 |      48878 |
+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+


3. Run the etcd migration; migration succeeded. Checked etcd versions in the cluster.

+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-18-2-113.ec2.internal:2379 | 9459974f264be826 |   3.2.7 |  8.3 MB |      true |        18 |      63557 |
| https://ip-172-18-2-175.ec2.internal:2379 | b7256a130b6be421 |   3.2.7 |  8.3 MB |     false |        18 |      63557 |
| https://ip-172-18-10-56.ec2.internal:2379 | 289a4e3dbe9b8ac7 |   3.2.7 |  8.3 MB |     false |        18 |      63557 |
+-------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

I think this is the expected behavior now. Etcd migration should not perform an etcd upgrade or scaleup, so all etcd members keep the same version they had before the migration and remain at the same version as each other in the cluster.

Comment 12 errata-xmlrpc 2018-04-12 06:03:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106

