Created attachment 1394846 [details]
Error related to /etc/etcd/ca folder

Description of problem:

The following issue has been reported during the migration from etcd data v2 to v3 with the migrate playbook. If some of the masters do not contain the /etc/etcd/ca folder, the migration fails with the following message:

fatal: [vmz2mastp05.lab-boae.paas.gsnetcloud.corp -> vmz1mastp04.lab-boae.paas.gsnetcloud.corp]: FAILED! => {
    "changed": true,
    "cmd": [
        "openssl",
        "req",
        "-new",
        "-keyout",
        "server.key",
        "-config",
        "/etc/etcd/ca/openssl.cnf",
        "-out",
        "server.csr",
        "-reqexts",
        "etcd_v3_req",
        "-batch",
        "-nodes",
        "-subj",
        "/CN=10.106.1.6"
    ],
    "delta": "0:00:00.024894",
    "end": "2018-01-31 18:02:42.109612",
    "rc": 1,
    "start": "2018-01-31 18:02:42.084718"

The problem is that some other etcd instances might already have been migrated, so we end up with an etcd cluster that is inconsistent, and a manual rollback is necessary.

In summary:
- The existence of /etc/etcd/ca is not verified by the playbook
- No rollback is available after failure
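For reference, a quick way to see which masters are affected is to check for the file the failing task expects (a diagnostic sketch only; the paths are taken from the error above):

# run on each master / etcd host; paths come from the error message above
ls -l /etc/etcd/ca/ 2>/dev/null || echo "/etc/etcd/ca is missing"
test -f /etc/etcd/ca/openssl.cnf && echo "openssl.cnf present" || echo "openssl.cnf missing"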
(In reply to Francesco Marchioni from comment #0)
> The problem is that some other etcd instances might already have been
> migrated, so we end up with an etcd cluster that is inconsistent, and a
> manual rollback is necessary.

Could you provide full logs of this run with the '-vvv' flag? It's not clear which task requested the cert regeneration. Most likely https://github.com/openshift/openshift-ansible/pull/7226 would fix it.

> - No rollback is available after failure

There is no playbook to do that, but the etcd data is backed up before migration.
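In case it helps, a typical invocation to capture those logs looks like the sketch below (the playbook and inventory paths are assumptions for a 3.6 openshift-ansible install; adjust them to your environment):

# sketch only; playbook and inventory paths are assumptions
ansible-playbook -vvv -i /path/to/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml \
    2>&1 | tee etcd-migrate-vvv.log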
Fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e - CA certs are no longer required during migration.
The fix for the issue is not yet released, sorry for the noise.
QE cannot reproduce it on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch.

Steps:
1. HA install OCP v3.5 (etcd v2)
2. Upgrade v3.5 to v3.6 (etcd v3)
3. Migrate v2 to v3

Migration succeeded.

Checking the attached log, the failure happened on vmz1mastp04.lab-boae.paas.gsnetcloud.corp because there is no openssl.cnf file in the ca directory. Comparing with the provided hosts file, we found that vmz1mastp04.lab-boae.paas.gsnetcloud.corp was the first etcd host (openshift_ca_host).

> "If some of the masters do not contain the /etc/etcd/ca folder"

We need to confirm why the ca folder is gone. The CA folder should exist on each etcd host after a fresh install of OCP v3.5, and it is still there after the upgrade to v3.6.

1) If the ca folder went missing during migration, then we should resolve this issue. However, this cannot be reproduced based on the current info.
2) If the ca folder was missing before migration, then the cluster is not regarded as healthy, and migration or any other operation will fail completely.

I tried 2), and migration definitely failed.

@Francesco
Could you help provide more info about this bug? Do you know whether case 1) or 2) happened to the customer? I think more info is needed to help verify the bug fix.
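For reference, one way to confirm that data is visible over the v3 API after step 3 is a key listing like the one below (a sketch only; the endpoint and certificate paths are assumptions and depend on the host, run it on an etcd member):

# sketch; cert/key paths and endpoint are assumptions, adjust to your etcd hosts
ETCDCTL_API=3 etcdctl \
    --endpoints https://127.0.0.1:2379 \
    --cacert /etc/etcd/ca.crt \
    --cert /etc/etcd/peer.crt \
    --key /etc/etcd/peer.key \
    get / --prefix --keys-only | head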
> 2. Upgrade v3.5 to v3.6 (etcd v3)

A typo: it should be etcd v2 after the upgrade.
@Francesco
I cannot reproduce the issue on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch. There is no info about what extra operations were done, and especially why the ca folder is gone when it should be there by default.

As noted in comment 6, if the ca folder was missing on the first etcd host (openshift_ca_host), the PR does not resolve it at all. QE cannot verify the fix unless we can reproduce the issue. So PR 7226 will be tracked by QE in bz1544399; for this bug, do you think it can be closed directly?
Version: openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

Steps:
1. HA install OCP v3.5
2. Upgrade v3.5 to v3.6
3. Remove the ca folder on the first etcd host (this step is not realistic; we just assume the ca folder was missing before migration)
4. Run migrate

Migration failed at task [etcd_server_certificates : Create the server csr].

TASK [etcd_server_certificates : Create the server csr] *********************************************************************************************************************
fatal: [etcd[1] -> etcd[0]]: FAILED! => {
    "changed": true,
    "cmd": [
        "openssl",
        "req",
        "-new",
        "-keyout",
        "server.key",
        "-config",
        "/etc/etcd/ca/openssl.cnf",
        "-out",
        "server.csr",
        "-reqexts",
        "etcd_v3_req",
        "-batch",
        "-nodes",
        "-subj",
        "/CN=aos-146.lab.sjc.redhat.com"
    ],
    "delta": "0:00:00.006864",
    "end": "2018-03-06 04:10:10.600955",
    "rc": 1,
    "start": "2018-03-06 04:10:10.594091"
}

STDERR:

error on line -1 of /etc/etcd/ca/openssl.cnf
140100024891296:error:02001002:system library:fopen:No such file or directory:bss_file.c:175:fopen('/etc/etcd/ca/openssl.cnf','rb')
140100024891296:error:2006D080:BIO routines:BIO_new_file:no such file:bss_file.c:182:
140100024891296:error:0E078072:configuration file routines:DEF_LOAD:no such file:conf_def.c:195:

MSG:

non-zero return code

To summarize:
- If the ca folder was missing before migration and we need to support this scenario, then the issue is not fixed in v3.6.173.0.104.
- If the ca folder was missing before migration and we do not support this scenario, then this bug should be closed as NOTABUG.
- If the ca folder went missing during migration, then the bug needs to be assigned back to resolve the missing-folder issue.

So from QE's side, this bug needs to be assigned back first.
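As a side note, a pre-flight check like the one below against the migration inventory would show which etcd hosts lack the file before the playbook runs (a sketch; the inventory path is a placeholder):

# ad-hoc check across the [etcd] group; /path/to/hosts is the inventory used for migration
ansible -i /path/to/hosts etcd -m stat -a "path=/etc/etcd/ca/openssl.cnf"
# hosts reporting "exists": false are the ones the csr task would fail on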
We have refactored the v2 to v3 migration playbooks so that they no longer trigger etcd certificate generation; they only migrate and then scale the cluster back up. As such, the task that this was originally reported against will no longer execute and the bug should go away.

If I were to guess how this problem happened in the first place, this is how I would attempt to reproduce the issue.

1) Provision a 3.5 cluster with 3 etcd hosts
2) Upgrade to 3.6
3) Alter the order of [etcd] hosts so that the host that has /etc/etcd/ca is no longer the first host in the [etcd] group
4) Run v2 to v3 migration
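To illustrate step 3, this is a hypothetical inventory excerpt (host names are made up) where the host holding /etc/etcd/ca is no longer listed first in the [etcd] group:

# hypothetical inventory excerpt for step 3; host names are made up
[etcd]
# master2 and master3 have no /etc/etcd/ca
master2.example.com
master3.example.com
# master1 holds /etc/etcd/ca but is no longer the first host in the group
master1.example.com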
(In reply to Scott Dodson from comment #12)
> We have refactored the v2 to v3 migration playbooks so that they no longer
> trigger etcd certificate generation; they only migrate and then scale the
> cluster back up. As such, the task that this was originally reported against
> will no longer execute and the bug should go away.
> 
> If I were to guess how this problem happened in the first place, this is how
> I would attempt to reproduce the issue.
> 
> 1) Provision a 3.5 cluster with 3 etcd hosts
> 2) Upgrade to 3.6
> 3) Alter the order of [etcd] hosts so that the host that has /etc/etcd/ca is
> no longer the first host in the [etcd] group
> 4) Run v2 to v3 migration

@Scott
All etcd hosts in the [etcd] group have the /etc/etcd/ca directory by default. A fresh installation of v3.5 with HA etcd creates /etc/etcd/ca on all etcd hosts in the [etcd] group, and the upgrade from v3.5 to v3.6 does not change this directory. So your steps should not reproduce it either.