Bug 1544395 - [GSS] Migrating etcd Data: v2 to v3 playbook issues (if missing certificates)
Summary: [GSS] Migrating etcd Data: v2 to v3 playbook issues (if missing certificates)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.6.z
Assignee: Vadim Rutkovsky
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-12 11:20 UTC by Francesco Marchioni
Modified: 2018-04-11 04:44 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-13 08:36:27 UTC
Target Upstream Version:
Embargoed:


Attachments
Error related to /etc/etcd/ca folder (339.56 KB, text/plain)
2018-02-12 11:20 UTC, Francesco Marchioni


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1106 0 normal SHIPPED_LIVE OpenShift Container Platform 3.6 and 3.5 bug fix update 2018-04-16 05:20:11 UTC

Description Francesco Marchioni 2018-02-12 11:20:47 UTC
Created attachment 1394846 [details]
Error related to /etc/etcd/ca folder

Description of problem:
The following issue has been reported during the migration of etcd data from v2 to v3 with the migrate playbook.
If some of the masters do not contain the /etc/etcd/ca folder, the migration fails with the following message:

fatal: [vmz2mastp05.lab-boae.paas.gsnetcloud.corp -> vmz1mastp04.lab-boae.paas.gsnetcloud.corp]: FAILED! => {
    "changed": true, 
    "cmd": [
        "openssl", 
        "req", 
        "-new", 
        "-keyout", 
        "server.key", 
        "-config", 
        "/etc/etcd/ca/openssl.cnf", 
        "-out", 
        "server.csr", 
        "-reqexts", 
        "etcd_v3_req", 
        "-batch", 
        "-nodes", 
        "-subj", 
        "/CN=10.106.1.6"
    ], 
    "delta": "0:00:00.024894", 
    "end": "2018-01-31 18:02:42.109612", 
    "rc": 1, 
    "start": "2018-01-31 18:02:42.084718"

The problem is that some other instances of etcd might already have been migrated, so we eventually end up with an inconsistent etcd cluster and a manual rollback is necessary.
 
The observed problems are:
- The existence of /etc/etcd/ca is not verified by the playbook
- No rollback is available after failure
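
A minimal pre-flight sketch (not part of the playbook) to spot the missing directory before migrating, assuming an inventory file named "hosts" with an [etcd] group:

# Check that the etcd CA config exists on every host in the [etcd] group;
# any host reporting "exists": false would hit the failure above.
ansible -i hosts etcd -m stat -a "path=/etc/etcd/ca/openssl.cnf"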
 


Comment 1 Vadim Rutkovsky 2018-02-20 21:13:47 UTC
(In reply to Francesco Marchioni from comment #0)
> The problem is that some other instances of etcd might already have been
> migrated, so we eventually end up with an inconsistent etcd cluster and a
> manual rollback is necessary.

Could you provide full logs of this run with the '-vvv' flag? It's not clear which task requested the cert regeneration. Most likely https://github.com/openshift/openshift-ansible/pull/7226 would fix it.
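
One way to capture the requested logs (the playbook path below assumes the 3.6-era RPM layout and may differ in other checkouts):

# Run the migration with verbose output and keep a copy of the log
ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml \
    -vvv 2>&1 | tee etcd-migrate-vvv.log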

> - No rollback is available after failure

There is no playbook to do that, but etcd data is backed up before migration.
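
A rough way to confirm that backup exists on each etcd host (the backup directory name varies by release, so the glob is deliberately loose and the default data directory is an assumption):

# List any pre-migration backups under the default etcd data directory
ansible -i hosts etcd -m shell -a "ls -ld /var/lib/etcd/*backup* 2>/dev/null || true"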

Comment 3 Vadim Rutkovsky 2018-02-27 14:55:31 UTC
The fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e - CA certs are no longer required during migration.

Comment 4 Vadim Rutkovsky 2018-02-28 09:57:58 UTC
The fix for this issue is not yet released; sorry for the noise.

Comment 6 liujia 2018-03-02 09:30:05 UTC
QE cannot reproduce it on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch.

steps:
1. HA install of OCP v3.5 (etcd v2)
2. Upgrade v3.5 to v3.6 (etcd v3)
3. Migrate v2 to v3
The migration succeeded (rough commands are sketched below).
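
A rough sketch of the commands behind steps 1-3 (playbook paths follow the byo layout of openshift-ansible from that era and are assumptions; step 1 is run with the 3.5 installer, steps 2-3 with the 3.6 installer):

ansible-playbook -i hosts playbooks/byo/config.yml                                    # 1. HA install of OCP v3.5
ansible-playbook -i hosts playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml   # 2. upgrade v3.5 -> v3.6
ansible-playbook -i hosts playbooks/byo/openshift-etcd/migrate.yml                    # 3. migrate etcd data v2 -> v3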

Checking the attached log, the failure happened on vmz1mastp04.lab-boae.paas.gsnetcloud.corp because there was no openssl.cnf file in the ca directory. Comparing with the provided hosts file, we found that vmz1mastp04.lab-boae.paas.gsnetcloud.corp was the first etcd host (openshift_ca_host).

> "If some of the masters do not contain the /etc/etcd/ca folder"

We need to confirm why the ca folder is gone. The CA folder should be present on each etcd host after a fresh install of OCP v3.5, and after the upgrade to v3.6.

1) If the ca folder went missing during the migration, then we should resolve this issue. However, this cannot be reproduced based on the current info.

2) If the ca folder was missing before the migration, then the cluster is not regarded as healthy, and the migration (or any other operation) will fail completely.

I tried 2); the migration definitely failed.

@Francesco
Could you provide more info about this bug? Do you know whether case 1) or 2) happened to the customer? More info is needed to verify that the bug is fixed.

Comment 7 liujia 2018-03-02 09:42:29 UTC
> 2. Upgrade v3.5 to v3.6 (etcd v3)

A typo: it should be etcd v2 after the upgrade.

Comment 8 liujia 2018-03-05 09:28:57 UTC
@Francesco

I cannot reproduce the issue on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch. There is no info about what extra operations were done, and especially about why the ca folder was gone when it should be there by default.

As in comment 6, if the ca folder was missing on the first etcd host (openshift_ca_host), the PR did not resolve it at all. QE cannot verify it without steps that reproduce it. PR 7226 will be tracked by QE in bz1544399; as for this bug, do you think it can be closed directly?

Comment 10 liujia 2018-03-06 09:20:48 UTC
Version:
openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

Steps:
1. HA install of OCP v3.5
2. Upgrade v3.5 to v3.6
3. Remove the ca folder on the first etcd host (this step is not realistic; we just assume the ca folder was missing before the migration)
4. Run the migration (rough commands are sketched below)
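
A rough sketch of steps 3-4 (the hostname is hypothetical, and the playbook path is the same assumption as in comment 6):

ssh etcd0.example.com 'mv /etc/etcd/ca /etc/etcd/ca.bak'            # 3. move the CA directory aside on the first etcd host
ansible-playbook -i hosts playbooks/byo/openshift-etcd/migrate.yml  # 4. run the migration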

The migration failed at task [etcd_server_certificates : Create the server csr]:

TASK [etcd_server_certificates : Create the server csr] *********************************************************************************************************************
fatal: [etcd[1]-> etcd[0]]: FAILED! => {
    "changed": true, 
    "cmd": [
        "openssl", 
        "req", 
        "-new", 
        "-keyout", 
        "server.key", 
        "-config", 
        "/etc/etcd/ca/openssl.cnf", 
        "-out", 
        "server.csr", 
        "-reqexts", 
        "etcd_v3_req", 
        "-batch", 
        "-nodes", 
        "-subj", 
        "/CN=aos-146.lab.sjc.redhat.com"
    ], 
    "delta": "0:00:00.006864", 
    "end": "2018-03-06 04:10:10.600955", 
    "rc": 1, 
    "start": "2018-03-06 04:10:10.594091"
}

STDERR:

error on line -1 of /etc/etcd/ca/openssl.cnf
140100024891296:error:02001002:system library:fopen:No such file or directory:bss_file.c:175:fopen('/etc/etcd/ca/openssl.cnf','rb')
140100024891296:error:2006D080:BIO routines:BIO_new_file:no such file:bss_file.c:182:
140100024891296:error:0E078072:configuration file routines:DEF_LOAD:no such file:conf_def.c:195:


MSG:

non-zero return code

To summarize: if the ca folder was missing before the migration and we need to support this scenario, then the issue was not fixed in v3.6.173.0.104. If the ca folder was missing before the migration and we do not support this scenario, then this bug should be closed as NOTABUG. If the ca folder went missing during the migration, then the bug needs to be assigned back to resolve the missing-folder issue. So from the QE side, this bug needs to be assigned back first.

Comment 12 Scott Dodson 2018-03-12 17:15:37 UTC
We have refactored the v2 to v3 migration playbooks so that they no longer trigger etcd certificate generation and only migrate then scale the cluster back up. As such, the task that this was originally reported against will no longer execute and the bug should go away.

If I were to guess as to how this problem happened in the first place this is how I would attempt to reproduce the issue.

1) Provision a 3.5 cluster with 3 etcd hosts
2) Upgrade to 3.6
3) Alter the order of [etcd] hosts so that the host that has /etc/etcd/ca is no longer the first host in the [etcd] group
4) Run v2 to v3 migration
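
For step 3, a hypothetical inventory fragment (hostnames invented) to illustrate the reorder; the point is only that the host owning /etc/etcd/ca is no longer listed first:

Before (etcd0 holds /etc/etcd/ca and is first):
[etcd]
etcd0.example.com
etcd1.example.com
etcd2.example.com

After (reordered so etcd0 is no longer first):
[etcd]
etcd1.example.com
etcd0.example.com
etcd2.example.com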

Comment 14 liujia 2018-03-13 08:18:11 UTC
(In reply to Scott Dodson from comment #12)
> We have refactored the v2 to v3 migration playbooks so that they no longer
> trigger etcd certificate generation and only migrate then scale the cluster
> back up. As such, the task that this was originally reported against will no
> longer execute and the bug should go away.
> 
> If I were to guess as to how this problem happened in the first place this
> is how I would attempt to reproduce the issue.
> 
> 1) Provision a 3.5 cluster with 3 etcd hosts
> 2) Upgrade to 3.6
> 3) Alter the order of [etcd] hosts so that the host that has /etc/etcd/ca is
> no longer the first host in the [etcd] group
> 4) Run v2 to v3 migration

@Scott
All etcd hosts in the [etcd] group have the /etc/etcd/ca directory by default. A fresh installation of v3.5 with HA etcd creates the /etc/etcd/ca directory on all etcd hosts in the [etcd] group, and the upgrade from v3.5 to v3.6 does not change this directory. So your steps should not reproduce it either.

