Description of problem:

etcd-3.2.15-2.el7, which shipped in RHEL 7.5, was rebuilt using golang 1.9, which has stricter certificate validation rules: the cert must have not only a CN but also a SAN that matches. This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1536217, where the API server in OCP 3.9 was rebuilt using golang 1.9 and had similar problems.

Version-Release number of the following components:

etcd-3.2.15-2.el7

How reproducible:

Requires an environment provisioned using playbooks prior to OCP 3.6, where we did not properly add the hostname to the SAN. You can verify this is the case by looking at the serving cert:

BAD
# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name:
                IP Address:10.10.10.10

GOOD
# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name:
                IP Address:10.10.10.10, DNS:master.example.com

Steps to Reproduce:
1. Provision a cluster using 3.5
2. Upgrade to 3.6
3. Upgrade to 3.7

Alternatively, you could provision a 3.6 cluster, modify the certs to only have the IP address in the SAN like so, then perform a 3.7 upgrade:

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
    -config /etc/etcd/ca/openssl.cnf \
    -out server.csr \
    -reqexts etcd_v3_req -batch -nodes \
    -subj /CN=ose3-master.example.com
SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
    -config /etc/etcd/ca/openssl.cnf \
    -out server.crt \
    -in server.csr \
    -extensions etcd_v3_ca_server -batch

Actual results:

Playbooks abort because etcdctl cannot verify cluster health:

TASK [etcd : Restart etcd] *****************************************************
Tuesday 10 April 2018  16:17:33 +0000 (0:00:00.800)       0:06:07.732 *********
changed: [host.example.com]

TASK [etcd : Verify cluster is healthy] ****************************************
Tuesday 10 April 2018  16:17:34 +0000 (0:00:01.321)       0:06:09.054 *********
FAILED - RETRYING: Verify cluster is healthy (3 retries left).
FAILED - RETRYING: Verify cluster is healthy (2 retries left).
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [host.example.com]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://host.example.com:2379", "cluster-health"], "delta": "0:00:00.065643", "end": "2018-04-10 16:18:06.654776", "msg": "non-zero return code", "rc": 4, "start": "2018-04-10 16:18:06.589133", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-61-120.eu-west-1.compute.internal\n\nerror #0: x509: certificate is not valid for any names, but wanted to match host.example.com", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match host.example.com", "", "error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-61-120.eu-west-1.compute.internal"], "stdout": "cluster may be unhealthy: failed to list members", "stdout_lines": ["cluster may be unhealthy: failed to list members"]}

Expected results:

Successful upgrade

Additional info:

Please attach logs from ansible-playbook with the -vvv flag
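The BAD/GOOD check above can be scripted as a quick audit. A minimal sketch (the helper name `check_etcd_san` is hypothetical, not part of openshift-ansible; it assumes openssl's usual two-line SAN text output as shown in the BAD/GOOD examples):

```shell
# Hypothetical helper, not part of openshift-ansible: flag an etcd serving
# cert whose SAN has no DNS entry for the given host -- exactly the certs
# that golang-1.9-built etcd/etcdctl will reject.
check_etcd_san() {
    cert="$1"
    host="$2"
    # Grab the line following the SAN header, as in the BAD/GOOD examples above.
    san=$(openssl x509 -in "$cert" -noout -text \
          | grep -A1 'Subject Alternative Name' | tail -n1)
    if printf '%s\n' "$san" | grep -q "DNS:${host}"; then
        echo "GOOD: SAN includes DNS:${host}"
    else
        echo "BAD: SAN is '${san}' (no DNS:${host}); golang 1.9 clients will reject it"
    fi
}

# Usage on a master:
# check_etcd_san /etc/etcd/server.crt "$(hostname -f)"
```

Run against each etcd host's /etc/etcd/server.crt before upgrading; any BAD result means the upgrade playbooks will fail the "Verify cluster is healthy" task.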
The fix is to backport this to 3.7; it ensures that during the upgrade we check for bad certs and correct them.

https://github.com/openshift/openshift-ansible/pull/6859
Workaround: On each host, downgrade or upgrade to etcd-3.2.15-1.el7:

yum downgrade etcd-3.2.15-1.el7 && systemctl restart etcd
yum upgrade etcd-3.2.15-1.el7 && systemctl restart etcd

Also note that this does not affect actual operation of OCP 3.7 or earlier versions, but it will prevent /usr/bin/etcdctl from interacting with the cluster the way the upgrade playbooks do.
We should also backport https://github.com/openshift/openshift-ansible/pull/6926 so that it automatically determines which host is the etcd CA host.
https://github.com/openshift/openshift-ansible/pull/7914
Could reproduce this issue with the released 3.7 openshift-ansible - openshift-ansible-3.7.42-1.git.2.9ee4e71.el7.noarch.rpm.

Reproduce steps:

1. Prepare a v3.5.5.31.67 OCP cluster with etcd-3.1.9-2 installed; the SAN of the etcd server.crt only has an IP address:

[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name:
                IP Address:172.18.7.242
[root@ip-172-18-7-242 ~]# rpm -q etcd
etcd-3.1.9-2.el7.x86_64

2. Upgrade it to v3.6.173.0.113 without upgrading etcd.

3. Upgrade it to v3.7.42, also upgrading etcd to etcd-3.2.15-2, using openshift-ansible-3.7.42-1.git.2.9ee4e71.el7.noarch.rpm. The upgrade fails as below:

TASK [etcd : Verify cluster is healthy] *************************************************************************************************************************************
FAILED - RETRYING: Verify cluster is healthy (3 retries left).
FAILED - RETRYING: Verify cluster is healthy (2 retries left).
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [ec2-52-90-234-197.compute-1.amazonaws.com]: FAILED!
=> {"attempts": 3, "changed": true, "cmd": ["etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://ip-172-18-7-242.ec2.internal:2379", "cluster-health"], "delta": "0:00:00.050016", "end": "2018-04-26 04:46:52.074893", "failed": true, "msg": "non-zero return code", "rc": 4, "start": "2018-04-26 04:46:52.024877", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal\n\nerror #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal", "", "error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal"], "stdout": "cluster may be unhealthy: failed to list members", "stdout_lines": ["cluster may be unhealthy: failed to list members"]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

Verified this bug with the above steps using openshift-ansible-3.7.44-1.git.6.f5308bb.el7. The upgrade fails with the same error; some post-checks on the etcd host:

[root@ip-172-18-7-242 ~]# rpm -q etcd
etcd-3.2.15-2.el7.x86_64
[root@ip-172-18-7-242 ~]# ll /etc/etcd/
total 56
drwx------. 5 root root  212 Apr 26 05:01 ca
-rw-------. 1 etcd etcd 1895 Apr 26 01:17 ca.crt
-rw-r--r--. 1 root root  654 Apr 26 05:05 etcd.conf
-rw-r--r--. 1 root root  539 Apr 26 01:17 etcd.conf.23499.2018-04-26@04:28:09~
-rw-r--r--. 1 root root  567 Apr 26 04:28 etcd.conf.23646.2018-04-26@04:28:19~
-rw-r--r--. 1 root root 1341 Feb 21  2017 etcd.conf.5198.2018-04-26@01:17:27~
-rw-r--r--. 1 root root 1686 Feb  6 09:23 etcd.conf.rpmnew
drwx------. 4 root root  221 Apr 26 01:20 generated_certs
-rw-------. 1 etcd etcd 5902 Apr 26 01:17 peer.crt
-rw-r--r--. 1 root root 1001 Apr 26 01:17 peer.csr
-rw-------. 1 etcd etcd 1704 Apr 26 01:17 peer.key
-rw-------. 1 etcd etcd 5858 Apr 26 01:17 server.crt
-rw-r--r--. 1 root root 1001 Apr 26 01:17 server.csr
-rw-------. 1 etcd etcd 1704 Apr 26 01:17 server.key
[root@ip-172-18-7-242 ~]# ll /etc/etcd/generated_certs/
total 36
-rw-r--r--. 1 root root 14415 Apr 26 01:17 etcd_ca.tgz
drwx------. 2 root root   122 Apr 26 01:17 etcd-ip-172-18-7-242.ec2.internal
-rw-r--r--. 1 root root 10743 Apr 26 01:17 etcd-ip-172-18-7-242.ec2.internal.tgz
drwx------. 2 root root   122 Apr 26 01:20 openshift-master-ip-172-18-6-109.ec2.internal
-rw-r--r--. 1 root root  6621 Apr 26 01:20 openshift-master-ip-172-18-6-109.ec2.internal.tgz
[root@ip-172-18-7-242 ~]# ll /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/
total 36
-rw-r--r--. 3 root root 1895 Apr 26 01:17 ca.crt
-rw-r--r--. 1 root root 5976 Apr 26 05:01 peer.crt
-rw-r--r--. 1 root root 1041 Apr 26 05:01 peer.csr
-rw-r--r--. 1 root root 1708 Apr 26 05:01 peer.key
-rw-r--r--. 1 root root 5933 Apr 26 05:01 server.crt
-rw-r--r--. 1 root root 1041 Apr 26 05:01 server.csr
-rw-r--r--. 1 root root 1704 Apr 26 05:01 server.key
[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name:
                IP Address:172.18.7.242, DNS:ip-172-18-7-242.ec2.internal
[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name:
                IP Address:172.18.7.242

The /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/server.crt was generated as expected, but /etc/etcd/server.crt was not replaced by the new one. It seems the tarball of newly generated etcd certs was not updated during etcd certificate redeployment:
TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
ok: [ec2-52-90-234-197.compute-1.amazonaws.com -> ec2-52-90-234-197.compute-1.amazonaws.com] => {"changed": false, "cmd": "tar -czvf /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz\n -C /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal .", "failed": false, "rc": 0, "stdout": "skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz exists", "stdout_lines": ["skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz exists"]}

Full upgrade log attached.
Created attachment 1427146 [details] Full ansible log when upgrading to ocp-3.7 and etcd-3.2.15-2
Not entirely sure if these steps are correct - 3.5 should be installed with etcd v2.x. This is probably why the existing certs are not being replaced. I'll prepare a PR to make sure certs from /etc/etcd/generated_certs are always copied to /etc/etcd/server.crt.

Scott, does that seem correct to you?
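For a cluster already in the state shown in comment 9 (good regenerated certs sitting under /etc/etcd/generated_certs but never installed), a manual copy is one way out. A hedged sketch only, assuming the directory layout from comment 9; the helper name `deploy_generated_etcd_certs` is hypothetical, and the paths are parameters so nothing real is touched by default:

```shell
# Hypothetical helper (not part of openshift-ansible): install the
# regenerated certs from generated_certs/etcd-<fqdn>/ over the stale ones.
# Assumes the layout observed in comment 9.
deploy_generated_etcd_certs() {
    etcd_dir="$1"   # normally /etc/etcd
    host="$2"       # normally $(hostname -f)
    src="$etcd_dir/generated_certs/etcd-$host"
    for f in server.crt server.key peer.crt peer.key; do
        cp -p "$src/$f" "$etcd_dir/$f" || return 1
    done
}

# Usage on a real master (restart etcd so it picks up the new serving cert):
# deploy_generated_etcd_certs /etc/etcd "$(hostname -f)" && systemctl restart etcd
```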
It should be fine if etcd-3.1.9 was the starting state. The problem is how the certificates were generated by openshift-ansible in the 3.5 playbooks, which it looks like QE has accurately reproduced based on the first output in comment 7.

Yes, I think making sure that the server.crt is always updated when we call the playbook is fine. When I tested your PR that was a problem I ran into, and I added this commit, which resolved the problem for me:

https://github.com/openshift/openshift-ansible/pull/7914/commits/1c96402177ddc2c9887780047e7a6ed1554b603b

BTW, the build openshift-ansible-3.7.44-1.git.6.f5308bb.el7 comes from the branch https://github.com/openshift/openshift-ansible/tree/openshift-ansible-3.7-fix because, while the pull request referenced here has been merged into release-3.7, we had to 'hotfix' this atop the 3.7.44 tag.
(In reply to Scott Dodson from comment #10)
> Yes, I think making sure that the server.crt is always updated when we call
> the playbook is fine. When I'd tested your PR that was a problem I ran into
> and I added this commit which resolved the problem for me.
>
> https://github.com/openshift/openshift-ansible/pull/7914/commits/1c96402177ddc2c9887780047e7a6ed1554b603b

Right, that should do the trick.

> BTW, the build openshift-ansible-3.7.44-1.git.6.f5308bb.el7 comes from this
> branch https://github.com/openshift/openshift-ansible/tree/openshift-ansible-3.7-fix
> because while the pull request referenced here has been merged into
> release-3.7 we had to 'hotfix' this atop the 3.7.44 tag.

Okay, I'll check if 1c96402177ddc2c9887780047e7a6ed1554b603b fixes this.

Moving to MODIFIED, as it's not yet released.
Additional fix in openshift-ansible-3.7.44-1.git.7.49cfcd8.el7 https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=678900
First attempt at upgrading failed with this in the master-api log:

Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840114    2930 start_master.go:530] Starting master on 0.0.0.0:8443 (v3.7.44)
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840131    2930 start_master.go:531] Public master address is https://ec2-34-226-216-106.compute-1.amazonaws.com:8443
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840145    2930 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please
Apr 26 15:01:39 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: F0426 15:02:09.549256    2930 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: , grpc: timed out when dialing]
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: Failed to start Atomic OpenShift Master API.
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: atomic-openshift-master-api.service failed.

Attaching the ansible log (it is the last run in this log). Going to try again with a fresh cluster; this cluster went through some hardship getting to the point where it was ready to upgrade to 3.7.
Created attachment 1427394 [details] Log for 3.6 -> 3.7 upgrade
I am seeing the error message from the api logs in comment 16 (x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal) during the 3.5->3.6 upgrade - hit it in each of 2 attempts. Opened bug 1572377 to track that separately from this bug.
I also hit the same problem as mentioned in comment #16 while upgrading from 3.6 to 3.7:

Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: I0426 21:26:34.322911   10578 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:26:39 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: F0426 21:27:09.944535   10578 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: , grpc: timed out when dialing]
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal systemd[1]: Failed to start Atomic OpenShift Master API.
Tried again with the new build - openshift-ansible-3.7.44-1.git.7.49cfcd8.el7. It still failed with the same error; /etc/etcd/server.crt did not get updated before the etcd upgrade. Key steps in the upgrade log:

[root@gpei-preserved 0427]# grep "Create a tarball of the etcd certs" upgrade_1 -nA1
2288:TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
2289-ok: [ec2-52-55-105-116.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": false, "cmd": "tar -czvf /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz\n -C /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal .", "failed": false, "rc": 0, "stdout": "skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz exists", "stdout_lines": ["skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz exists"]}
--
2378:TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
2379-changed: [ec2-54-161-17-75.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": true, "cmd": ["tar", "-czvf", "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal.tgz", "-C", "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal", "."], "delta": "0:00:00.014403", "end": "2018-04-26 22:23:19.793260", "failed": false, "rc": 0, "start": "2018-04-26 22:23:19.778857", "stderr": "", "stderr_lines": [], "stdout": "./\n./master.etcd-client.key\n./master.etcd-client.csr\n./master.etcd-client.crt\n./master.etcd-ca.crt", "stdout_lines": ["./", "./master.etcd-client.key", "./master.etcd-client.csr", "./master.etcd-client.crt", "./master.etcd-ca.crt"]}
[root@gpei-preserved 0427]# grep -r "Delete existing certificate tarball if certs need to be regenerated" upgrade_1 -nA1
2375:TASK [etcd : Delete existing certificate tarball if certs need to be regenerated] *******************************************************************************************
2376-changed: [ec2-54-161-17-75.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": true, "failed": false, "path": "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal.tgz", "state": "absent"}

The proposed fix in https://github.com/openshift/openshift-ansible/commit/4a2c8415f8b45c897be72a3610a8ed0fd0509c68 only removed the client cert tarball used for masters; it didn't delete the etcd server cert tarball.

After checking the code, my thought is: compared with the real etcd cert redeploy playbook [1], the way we call etcd cert redeployment during upgrade [2] lacks the include of playbooks/common/openshift-cluster/redeploy-certificates/etcd-backup.yml, which removes the generated certificates. Alternatively, we could simply apply the same check in roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml as https://github.com/openshift/openshift-ansible/commit/4a2c8415f8b45c897be72a3610a8ed0fd0509c68.

[1] https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.7-fix/playbooks/byo/openshift-cluster/redeploy-etcd-certificates.yml#L10-L18
[2] https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.7-fix/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml#L38-L42
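The stale-tarball failure mode is easy to demonstrate outside Ansible. A sketch under assumed semantics: the tar task's `creates:` guard is modeled as a plain existence check, and the file contents stand in for the certs:

```shell
# Demonstration of the stale-tarball bug: the upgrade's "Create a tarball of
# the etcd certs" task is guarded by Ansible's creates:, so once the tarball
# exists it is never rebuilt -- even after the certs underneath it change.
workdir=$(mktemp -d)
mkdir "$workdir/certs" "$workdir/deployed"
echo "cert-without-DNS-SAN" > "$workdir/certs/server.crt"
tar -czf "$workdir/etcd.tgz" -C "$workdir/certs" .        # initial tarball

echo "cert-with-DNS-SAN" > "$workdir/certs/server.crt"    # certs regenerated
# creates: semantics -- skipped because the tarball already exists:
[ -e "$workdir/etcd.tgz" ] || tar -czf "$workdir/etcd.tgz" -C "$workdir/certs" .

tar -xzf "$workdir/etcd.tgz" -C "$workdir/deployed"
cat "$workdir/deployed/server.crt"                        # still the old cert

# The fix along the lines of PR 8173: delete the tarball whenever certs are
# regenerated, so the tar task actually runs again.
rm -f "$workdir/etcd.tgz"
tar -czf "$workdir/etcd.tgz" -C "$workdir/certs" .
```

This is why deleting only the master client-cert tarball in commit 4a2c841 was not enough: the etcd server-cert tarball kept short-circuiting its tar task in exactly the same way.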
@Mike and Vikas, the error you hit in comment 16 and comment 19 should be another scenario caused by this etcd server.crt SAN issue, which is more similar to https://bugzilla.redhat.com/show_bug.cgi?id=1536217. We could simply reproduce it in a normal 3.7.44 OCP env:

1. Modify /etc/etcd/server.crt to only have the IP address in the SAN:

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
    -config /etc/etcd/ca/openssl.cnf \
    -out server.csr \
    -reqexts etcd_v3_req -batch -nodes \
    -subj /CN=ose3-master.example.com
SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
    -config /etc/etcd/ca/openssl.cnf \
    -out server.crt \
    -in server.csr \
    -extensions etcd_v3_ca_server -batch

2. Restart etcd and atomic-openshift-master-api. The master-api service will then fail as below:

Apr 27 02:25:38 ip-172-18-8-229.ec2.internal atomic-openshift-master-api[21534]: I0427 02:25:38.616637   21534 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 27 02:25:38 ip-172-18-8-229.ec2.internal openshift[21534]: Failed to dial ip-172-18-1-152.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-152.ec2.internal"; please retry.

The root cause is a bad SAN in the cluster's etcd server certs, which is originally a 3.5 fresh-install problem. If we perform the 3.5 -> 3.6 upgrade with etcd also upgraded to etcd-3.2.15-2.el7, we hit BZ#1572377. So I didn't update etcd when upgrading to 3.6, in order to verify this 3.7 upgrade bug. After an incomplete upgrade to 3.6 (a 2nd attempt at upgrading to 3.6 will skip the etcd upgrade, since the etcd rpm package is already the latest one), the etcd server certs are still not fixed, so the 3.6 -> 3.7 upgrade fails in the end.
(In reply to Gaoyun Pei from comment #20)
> Tried again with the new build -
> openshift-ansible-3.7.44-1.git.7.49cfcd8.el7. Still failed with the same
> error. /etc/etcd/server.crt is not got updated before etcd upgrade.
...
> The proposed fix in
> https://github.com/openshift/openshift-ansible/commit/4a2c8415f8b45c897be72a3610a8ed0fd0509c68
> only removed the client cert tarball used for masters, didn't delete the
> etcd server cert tarball.

Right, removing the tarball for server certs too seems to have done the trick. Created https://github.com/openshift/openshift-ansible/pull/8173 to merge into the hotfix branch. These changes will later be cherry-picked to master and down to the 3.6 branch.
We're addressing this bug by rebuilding 3.7 with golang 1.8.3 instead of 1.9.
Verified on 3.7.44-2.git.0.6b061d4.el7. Installed 3.7.42 directly and upgraded the rpms to 3.7.44. Was able to start the master API.
I was able to do a successful full 3.5 -> 3.6 -> 3.7 upgrade with no errors, like this:

1. Install 3.5.5.31.67 + etcd 3.1.9
2. Upgrade using 3.6.173.0.113 + the PR from https://bugzilla.redhat.com/show_bug.cgi?id=1572377#c9
3. Migrate etcd storage using the openshift-ansible from comment 13 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13)
4. Upgrade to 3.7.44-2 (build from comment 27 of this bz) using the openshift-ansible from comment 13 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13)

No cert errors - a successful upgrade all the way.
Comment 28 is rpm-only; I did not run the containerized equivalent.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1261