Bug 1536217
| Summary: | Need to validate etcd serving certs before 3.9 upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Cluster Version Operator | Assignee: | Andrew Butcher <abutcher> |
| Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.9.0 | CC: | abutcher, aos-bugs, dmace, jokerman, jupierce, mmccomas, rhowe, sdodson, wmeng |
| Target Milestone: | --- | Keywords: | DeliveryBlocker |
| Target Release: | 3.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-28 14:20:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Scott Dodson
2018-01-18 22:27:09 UTC
Example sanitized log message encountered after restarting with the 3.9 API server:

Jan 18 22:10:20 ip-1-3-5-1.ec2.internal openshift[28067]: Failed to dial ip-1-3-5-1.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-1-3-5-1.ec2.internal"; please retry.

We need to validate the history of how we've configured the client and the history of the SANs we've been applying to the etcd serving certs, but I imagine ultimately the best path forward is to update all client config to connect to the IP rather than the hostname.

Dan, can I get your opinion on the best way forward here?

Okay, through talking with Jordan and Scott, here's what we found:

* As of Go 1.8, the SAN list is (correctly) authoritative, and the absence of the etcd hostname in the SAN list results in the rejection we're seeing
* Older origin builds predating Go 1.8 worked by coincidence (incorrectly falling back to the CN even in the presence of SAN, i.e. not treating SAN as authoritative)

Bottom line: our cert SAN sections are essentially invalid and need to be updated to include the hostname.

The SAN was corrected in https://github.com/openshift/openshift-ansible/commit/33e181c39d5024ecd226567139a7b0d36683bf2c so clusters created prior to 3.6 have incorrect etcd serving certificates.

Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/a97704f4db140037a67aeb2ca45254e4ecffed6e
Merge pull request #6859 from abutcher/bz1536217

Automatic merge from submit-queue.

Bug 1536217: Need to validate etcd serving certs before 3.9 upgrade

@Andrew I tried to reproduce the issue but failed. Going through your comments in this bug, I gathered the following. Could you help me correct it so that I can proceed on this bug? Thanks!
1) In the 3.7 to 3.9 upgrade, openshift-ansible needs a check in upgrade_control_plane to ensure that etcd's hostname is used in server.crt; if it is not, openshift-ansible will redeploy the certificate to generate a new server.crt with the hostname (as in PR 6859 in comment 5).
2) In the 3.7 to 3.9 upgrade (for a fresh 3.7 installation), there will be no issue with "validates the CN/SAN of the endpoint" as described in Scott's description, because starting with 3.6 installations, openshift-ansible generates etcd's server.crt with the hostname (as in the PR in comment 4).

So, if I need to build a 3.7 cluster with an etcd cert signed with an IP only, how do I do that?

@liujia

The easiest way to test this is probably to install a single master + etcd environment using 3.7, then regenerate a bad certificate like this:

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
  -config /etc/etcd/ca/openssl.cnf \
  -out server.csr \
  -reqexts etcd_v3_req -batch -nodes \
  -subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
  -config /etc/etcd/ca/openssl.cnf \
  -out server.crt \
  -in server.csr \
  -extensions etcd_v3_ca_server -batch

Verify that the SAN contains only the IP address:

openssl x509 -in server.crt -text -noout

Then, you can perform a 3.9 upgrade using the old playbooks, which should fail because the 3.9 API server cannot connect to etcd due to a cert error.

Using the new playbooks, the upgrade should automatically add the hostname to the SAN, and the upgrade should be successful.
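The manual SAN inspection in the verification step above could be scripted. What follows is only a minimal sketch, not the check that the openshift-ansible playbooks perform; the function name `san_has_dns` is hypothetical. It reads the text form of a certificate on standard input and succeeds only if the Subject Alternative Name line lists the given hostname as a DNS entry.

```shell
# san_has_dns HOSTNAME   (hypothetical helper, for illustration only)
# Reads the output of `openssl x509 -text -noout` on stdin.
# Exit status 0: the SAN extension contains DNS:HOSTNAME.
# Exit status 1: no SAN extension, or HOSTNAME is not among its DNS entries.
san_has_dns() {
    want="$1"
    # The SAN values are printed on the line after the extension header,
    # separated by commas; split them, trim whitespace, and look for an
    # exact, literal match on the DNS entry.
    grep -A 1 'Subject Alternative Name' \
        | tail -n 1 \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fqx "DNS:${want}"
}
```

Under these assumptions, `openssl x509 -in /etc/etcd/server.crt -text -noout | san_has_dns ose3-master.example.com` would exit non-zero for the IP-only certificate generated above, and zero once the hostname has been added to the SAN.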
(In reply to Scott Dodson from comment #8)

Scott, thanks a lot for your helpful message.

Reproduced on openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7.noarch.

Steps:
1. Install OCP 3.7 with external etcd.
2. Change the etcd server cert so that it is signed with an IP only, following Scott's method in comment 8.
3. Trigger the upgrade; restarting atomic-openshift-master-api failed.

aster-etcd-1:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match qe-jliu-rp-master-etcd-1"; please retry.

Version: openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

Steps:
1. Install OCP 3.7 with external etcd.
2. Change the etcd server cert so that it is signed with an IP only, following Scott's method in comment 8.
3. Trigger the upgrade: succeeded.
4. Restart atomic-openshift-master-api: succeeded.
5. Check that the hostname was added to the etcd server cert's SAN:

# cat server.crt |grep -A 1 "Subject Alternative Name"
X509v3 Subject Alternative Name:
IP Address:10.240.0.37, DNS:qe-jliu-rpmr-master-etcd-1

Added case ocp-17925.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

This also apparently happens when simply upgrading etcd to etcd-3.2.15-2, which rebuilt etcd using golang 1.9. |