Bug 1572377
Summary: | 3.5->3.6 Upgrade fails: Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match <hostname> | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
Status: | CLOSED ERRATA | QA Contact: | Gaoyun Pei <gpei> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.6.0 | CC: | aos-bugs, dlbewley, gpei, jokerman, mmccomas, vlaad, vrutkovs, wmeng |
Target Milestone: | --- | ||
Target Release: | 3.6.z | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
etcd 3.2.15 is compiled with Go 1.9, which tightens certificate validation.
Consequence:
Certificates created without a SAN entry are now treated as invalid by etcd 3.2.15.
Fix:
Certificates are re-generated on the 3.5 -> 3.6 upgrade so that they are compatible with the new etcd.
Result:
openshift-ansible generates valid certificates for etcd 3.2.15.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-05-07 20:20:14 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Mike Fiedler
2018-04-26 21:31:32 UTC
Created attachment 1427401 [details]
inventory, ansible log, backups of /etc dirs
Created attachment 1427402 [details]
ansible log, additional /etc capture
In both upgrade attempts where this error was encountered, the 2nd attempt at upgrading to 3.6 succeeded. Attaching the ansible log and tars of /etc/origin and /etc/etcd after the successful attempt.
As with https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c16, after the 2nd attempt to upgrade to 3.6 succeeded, trying to further upgrade the cluster to 3.7 failed. The master-api service failed to restart:

TASK [Restart master API] ****************************************************
fatal: [ec2-34-226-216-106.compute-1.amazonaws.com]: FAILED! => {"changed": false, "msg": "Unable to restart service atomic-openshift-master-api: Job for atomic-openshift-master-api.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-api.service\" and \"journalctl -xe\" for details.\n"}

with the subject messages in the api-server log:

Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413123 3359 start_master.go:530] Starting master on 0.0.0.0:8443 (v3.7.44)
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413131 3359 start_master.go:531] Public master address is https://ec2-34-226-216-106.compute-1.amazonaws.com:8443
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413146 3359 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal openshift[3359]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 18:58:54 ip-172-18-4-245.ec2.internal openshift[3359]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.

I'll leave the cluster in this state and attach the same ansible log and /etc archives.

Created attachment 1427403 [details]
3.7 upgrade failure ansible log and /etc files
Created attachment 1427404 [details]
replacement: inventory, /etc/ files and ansible log for original failure on 3.5 to 3.6 upgrade
After etcd is upgraded to etcd-3.2.15-2, if only the IP address is configured in the SAN of the etcd server.crt, then only the IP address is accepted as an endpoint in etcdctl commands:

[root@ip-172-18-1-165 ~]# rpm -q etcd
etcd-3.2.15-2.el7.x86_64

[root@ip-172-18-1-165 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
X509v3 Subject Alternative Name:
    IP Address:172.18.1.165

[root@ip-172-18-1-165 ~]# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://ip-172-18-1-165.ec2.internal:2379 cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-165.ec2.internal
error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-165.ec2.internal

[root@ip-172-18-1-165 ~]# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://172.18.1.165:2379 cluster-health
member 8e9e05c52164694d is healthy: got healthy result from https://172.18.1.165:2379
cluster is healthy

So we need to back-port the fix for BZ#1565762 to 3.6 ASAP to make sure the etcd server.crt SAN gets fixed during the 3.5 -> 3.6 upgrade, so clusters upgrade from 3.5 to 3.6 cleanly (the 3.6 upgrade did fail, though).

I'll certainly cherry-pick the fix from BZ#1565762 to 3.6 - just concerned that it's a different problem. Not sure why this is happening - why was a new etcd installed in 3.6? Was it a containerized / system-containers / all-RPM install?

The install is all RPM. I ensured etcd 3.1.9 was installed at the beginning, but I did not check it after the first upgrade to 3.6 failed or after the second one succeeded. I will watch for that on the next attempt. Let me know if it is critical to try that in order to proceed with the fix.
This happens when a 3.5 cluster is set up with etcd < 3.2.15.

Created PR https://github.com/openshift/openshift-ansible/pull/8181 and tested this using:
* a 3.5 cluster with 'etcd_version=3.1.9'
* commented out etcd_version, upgraded to 3.6

I was able to do a successful full 3.5-3.6-3.7 upgrade with no errors like this:
1. Install 3.5.5.31.67 + etcd 3.1.9
2. Upgrade using 3.6.173.0.113 + the PR from comment 9 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1572377#c9)
3. Migrate etcd storage using the openshift-ansible from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13
4. Upgrade to 3.7.44-2 (build from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c26) using the openshift-ansible from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13

No cert errors - a successful upgrade all of the way.

Comment 10 is RPM only; did not run the containerized equivalent.

Merged, fixes in openshift-ansible-3.6.173.0.113-1.git.13.f3b3b1d.el7.

Verified this bug with openshift-ansible-3.6.173.0.113-1.git.13.f3b3b1d.el7.noarch. Upgraded 3.5.5.31.67 + etcd 3.1.9 to 3.6.173.0.113 + etcd 3.2.15; during the etcd upgrade tasks, etcd server.crt was updated to a new one which has both the IP address and the DNS name in the SAN. No errors in the upgrade.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1335