It looks like the etcd client in 3.9 now validates the CN/SAN of the endpoint's certificate against the client URL that's specified. Certs that we've generated in the past apparently did not include the hostname, so they stop validating upon upgrade. A workaround is to change the client URLs from hostnames to IP addresses, but we need to look into this more.
Example sanitized log message encountered after restarting with the 3.9 API server:

Jan 18 22:10:20 ip-1-3-5-1.ec2.internal openshift[28067]: Failed to dial ip-1-3-5-1.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-1-3-5-1.ec2.internal"; please retry.

We need to validate the history of how we've configured the client and the history of the SANs we've been applying to the etcd serving certs, but I imagine ultimately the best path forward is to update all client config to connect to the IP rather than the hostname.
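To see which names the API server is actually trying to match, the hostnames can be pulled out of these errors. The following is only a sketch: `wanted_names` is a hypothetical helper name, and it assumes the log lines have exactly the wording shown in the message above.

```shell
# wanted_names
# Reads log lines on stdin and prints each distinct name that the client
# wanted to match but did not find in the certificate.
wanted_names() {
    sed -n 's/.*x509: certificate is not valid for any names, but wanted to match \([^"]*\)".*/\1/p' \
        | sort -u
}

# Possible usage (assumes the API server logs to the journal under this unit):
#   journalctl -u atomic-openshift-master-api | wanted_names
```

Each name printed is one that would have to be present in the etcd serving cert's SAN list, or be replaced by an IP in the client config.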
Dan, Can I get your opinion on the best way forward here?
Okay, through talking with Jordan and Scott, here's what we found:
* As of Go 1.8, the SAN list is (correctly) authoritative, and the absence of the etcd hostname in the SAN list results in the rejection we're seeing
* Older origin builds predating Go 1.8 worked by coincidence (incorrectly falling back to the CN even in the presence of SAN; i.e. not treating SAN as authoritative)
Bottom line is our cert SAN sections are essentially invalid and need to be updated to include the hostname.
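To check whether an existing serving cert would pass this stricter verification, something like the following could be used. This is only a sketch: `san_has_dns` is a hypothetical helper name, it assumes the SAN values sit on the line right after the extension header in `openssl x509 -text` output, and it does an exact match on DNS entries only (no wildcard handling).

```shell
# san_has_dns HOSTNAME
# Reads `openssl x509 -text -noout` output on stdin and succeeds only if
# HOSTNAME is listed as a DNS entry in the Subject Alternative Name.
# The CN is deliberately ignored, mirroring a client that treats the SAN
# list as authoritative.
san_has_dns() {
    grep -A 1 'Subject Alternative Name' \
        | tail -n 1 \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fx "DNS:$1" >/dev/null
}

# Possible usage (the name passed in must be the one used in the etcd
# client URL, which may differ from the output of `hostname`):
#   openssl x509 -in /etc/etcd/server.crt -text -noout \
#       | san_has_dns ose3-master.example.com || echo "hostname missing from SAN"
```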
SAN was corrected in https://github.com/openshift/openshift-ansible/commit/33e181c39d5024ecd226567139a7b0d36683bf2c so clusters created prior to 3.6 have incorrect etcd serving certificates.
Commit pushed to master at https://github.com/openshift/openshift-ansible https://github.com/openshift/openshift-ansible/commit/a97704f4db140037a67aeb2ca45254e4ecffed6e Merge pull request #6859 from abutcher/bz1536217 Automatic merge from submit-queue. Bug 1536217: Need to validate etcd serving certs before 3.9 upgrade
@Andrew I tried to reproduce the issue but failed. Going through the comments in this bug, I gathered the following; could you correct anything I have wrong so that I can proceed? Thanks!
1) In the 3.7 to 3.9 upgrade, openshift-ansible needs to add a check in upgrade_control_plane that etcd's hostname is used in server.crt; if it is not, openshift-ansible will redeploy the certificate to generate a new server.crt with the hostname (as in PR 6859 from comment 5).
2) In a 3.7 to 3.9 upgrade starting from a fresh 3.7 installation, the "validates the CN/SAN of the endpoint" issue from Scott's description will not occur, because since 3.6 openshift-ansible has generated etcd's server.crt with the hostname (as in the commit from comment 4).
So, if I need to build a 3.7 cluster with an etcd cert signed with only an IP, how do I do that?
@liujia

The easiest way to test this is probably to install a single master + etcd environment using 3.7, then re-generate a bad certificate like this:

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
  -config /etc/etcd/ca/openssl.cnf \
  -out server.csr \
  -reqexts etcd_v3_req -batch -nodes \
  -subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
  -config /etc/etcd/ca/openssl.cnf \
  -out server.crt \
  -in server.csr \
  -extensions etcd_v3_ca_server -batch

Verify the SAN only has the IP address:
openssl x509 -in server.crt -text -noout

Then, you can perform a 3.9 upgrade using old playbooks, which should fail because the 3.9 API server cannot connect to etcd with a cert error.

Using new playbooks, the upgrade should automatically add the hostname to the SAN, and the upgrade should be successful.
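The "verify the SAN only has IP address" step can be scripted rather than checked by eye. A minimal sketch, assuming the usual `openssl x509 -text` layout where the SAN values follow the extension header on the next line; `san_entries` and `san_is_ip_only` are hypothetical helper names, not part of openshift-ansible:

```shell
# san_entries
# Reads `openssl x509 -text -noout` output on stdin and prints one SAN
# entry per line (for example "IP Address:1.2.3.4").
san_entries() {
    grep -A 1 'Subject Alternative Name' \
        | tail -n 1 \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//; /^$/d'
}

# san_is_ip_only
# Succeeds only if the cert has a SAN list and every entry in it is an
# IP address, i.e. the cert is "bad" in the sense needed for this test.
san_is_ip_only() {
    entries="$(san_entries)"
    [ -n "$entries" ] || return 1
    # Any line that is not an IP entry (such as DNS:...) means failure.
    if printf '%s\n' "$entries" | grep -v '^IP Address:' >/dev/null; then
        return 1
    fi
    return 0
}

# Possible usage, run from /etc/etcd/ after re-generating the cert:
#   openssl x509 -in server.crt -text -noout | san_is_ip_only \
#       && echo "cert is IP-only, ready to test the upgrade"
```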
(In reply to Scott Dodson from comment #8)
Scott, thanks a lot for your helpful message.
Reproduced on openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7.noarch.
Steps:
1. Install OCP 3.7 with external etcd.
2. Re-sign the etcd server cert with an IP-only SAN, following Scott's steps in comment 8.
3. Trigger the upgrade; restarting atomic-openshift-master-api fails:
aster-etcd-1:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match qe-jliu-rp-master-etcd-1"; please retry.
Version: openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch
Steps:
1. Install OCP 3.7 with external etcd.
2. Re-sign the etcd server cert with an IP-only SAN, following Scott's steps in comment 8.
3. Trigger the upgrade: succeeds.
4. Restart atomic-openshift-master-api: succeeds.
5. Check that the hostname was added to the etcd server cert's SAN:
# cat server.crt |grep -A 1 "Subject Alternative Name"
X509v3 Subject Alternative Name:
IP Address:10.240.0.37, DNS:qe-jliu-rpmr-master-etcd-1
Added case ocp-17925
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489
This also apparently happens when simply upgrading etcd to etcd-3.2.15-2 which rebuilt etcd using golang 1.9.