
Bug 1536217

Summary: Need to validate etcd serving certs before 3.9 upgrade
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.9.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: DeliveryBlocker
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Reporter: Scott Dodson <sdodson>
Assignee: Andrew Butcher <abutcher>
QA Contact: liujia <jiajliu>
CC: abutcher, aos-bugs, dmace, jokerman, jupierce, mmccomas, rhowe, sdodson, wmeng
Type: Bug
Last Closed: 2018-03-28 14:20:40 UTC

Description Scott Dodson 2018-01-18 22:27:09 UTC

It looks like the etcd client in 3.9 now validates the CN/SAN of the endpoint against the client URL that's specified. It also looks like the certs we generated in the past did not include the hostname in the SAN, so they stop validating upon upgrade.

A workaround is to change from hostnames to IP addresses, but we need to look into this more.

Comment 1 Scott Dodson 2018-01-18 22:58:22 UTC

Example sanitized log message encountered after restarting with the 3.9 API server:

Jan 18 22:10:20 ip-1-3-5-1.ec2.internal openshift[28067]: Failed to dial ip-1-3-5-1.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-1-3-5-1.ec2.internal"; please retry.

We need to review the history of how we've configured the client and of the SANs we've been applying to the etcd serving certs, but I imagine the best path forward is ultimately to update all client config to connect to the IP rather than the hostname.

Comment 2 Scott Dodson 2018-01-19 14:55:07 UTC

Dan,

Can I get your opinion on the best way forward here?

Comment 3 Dan Mace 2018-01-19 19:34:11 UTC

Okay, through talking with Jordan and Scott, here's what we found:

* As of Go 1.8, the SAN list is (correctly) authoritative, and the absence of the etcd hostname in the SAN list results in the rejection we're seeing
* Older origin builds predating Go 1.8 worked by coincidence (incorrectly falling back to the CN even in the presence of SAN; i.e. not treating SAN as authoritative)

Bottom line is our cert SAN sections are essentially invalid and need to be updated to include the hostname.
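
The difference between the two behaviors can be sketched in shell. This is only an illustrative model of the decision logic, not Go's actual crypto/x509 code. The function names (san_dns, in_san_dns, strict_match, legacy_match) are hypothetical, the SAN list is passed as the comma-separated text that `openssl x509 -text` prints, wildcard matching is ignored, and the "legacy" rule is an assumption about how the older fallback behaved.

```shell
#!/bin/sh
# Illustrative model only; not Go's crypto/x509 implementation.
# Arguments to the *_match functions: CN, SAN list, hostname to match.
# The SAN list looks like "IP Address:1.2.3.4, DNS:host.example.com".

# Print the DNS entries of a SAN list, one per line.
san_dns() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^ *DNS://p'
}

# Succeed if the hostname exactly equals one of the DNS SAN entries.
in_san_dns() {
    san_dns "$1" | grep -Fxq -- "$2"
}

# Strict rule: if any SAN is present, the SAN list is authoritative
# and the CN is never consulted.
strict_match() {
    cn=$1 san=$2 host=$3
    if [ -n "$san" ]; then
        in_san_dns "$san" "$host"
    else
        [ "$cn" = "$host" ]
    fi
}

# Lenient rule (assumed model of the older behavior): the CN is still
# consulted whenever the SAN list contains no DNS entries.
legacy_match() {
    cn=$1 san=$2 host=$3
    if [ -n "$(san_dns "$san")" ]; then
        in_san_dns "$san" "$host"
    else
        [ "$cn" = "$host" ]
    fi
}

# A cert with CN=hostname but only an IP in the SAN list.
CERT_CN=ose3-master.example.com
CERT_SAN="IP Address:1.2.3.4"
legacy_match "$CERT_CN" "$CERT_SAN" "$CERT_CN" && echo "legacy: accepted"
strict_match "$CERT_CN" "$CERT_SAN" "$CERT_CN" || echo "strict: rejected"
```

Under this model a certificate whose SAN holds only an IP address passes the lenient rule through its CN but fails the strict rule, which matches the failure seen after the upgrade.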

Comment 4 Andrew Butcher 2018-01-23 18:41:40 UTC

SAN was corrected in https://github.com/openshift/openshift-ansible/commit/33e181c39d5024ecd226567139a7b0d36683bf2c so clusters created prior to 3.6 have incorrect etcd serving certificates.

Comment 5 openshift-github-bot 2018-01-26 01:27:49 UTC

Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/a97704f4db140037a67aeb2ca45254e4ecffed6e
Merge pull request #6859 from abutcher/bz1536217

Automatic merge from submit-queue.

Bug 1536217: Need to validate etcd serving certs before 3.9 upgrade

Comment 7 liujia 2018-01-31 09:25:51 UTC

@Andrew

I tried to reproduce the issue but failed. Going through the comments in this bug, I gathered the following. Could you correct anything I got wrong so that I can proceed with this bug? Thanks!

1) In a 3.7 to 3.9 upgrade, openshift-ansible needs a check in upgrade_control_plane to ensure that etcd's hostname is present in server.crt; otherwise, openshift-ansible will redeploy certificates to generate a new server.crt that includes the hostname (as in PR 6859 from comment 5).

2) In a 3.7 to 3.9 upgrade from a fresh 3.7 installation, the "validates the CN/SAN of the endpoint" issue from Scott's description will not occur, because since 3.6 openshift-ansible has generated etcd's server.crt with the hostname (as in the commit from comment 4).

So, if I need to build a 3.7 cluster with an etcd cert signed with only an IP, how do I do that?

Comment 8 Scott Dodson 2018-02-01 14:21:02 UTC

@liujia

The easiest way to test this is probably to install a single master + etcd environment using 3.7, then re-generate a bad certificate like this:

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
-config /etc/etcd/ca/openssl.cnf \
-out server.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
-config /etc/etcd/ca/openssl.cnf \
-out server.crt \
-in server.csr \
-extensions etcd_v3_ca_server -batch

Verify that the SAN only has the IP address:
openssl x509 -in server.crt -text -noout

Then, you can perform a 3.9 upgrade using old playbooks, which should fail because the 3.9 API server cannot connect to etcd due to a cert error.

Using new playbooks, the upgrade should automatically add the hostname to the SAN, and the upgrade should be successful.
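
For checking the result before or after an upgrade, a small helper can test whether a given hostname appears as a DNS entry in the SAN. This is a minimal sketch, not what the playbooks do: cert_text_has_dns_san is a hypothetical name, it assumes the two-line "X509v3 Subject Alternative Name" layout that `openssl x509 -text` prints, it relies on `grep -A` (GNU/BSD grep), and it only does exact matches (no wildcards).

```shell
#!/bin/sh
# Hypothetical helper, not part of openshift-ansible.
# Reads `openssl x509 -noout -text` output on stdin and succeeds only if
# the Subject Alternative Name extension lists exactly DNS:<hostname>.
cert_text_has_dns_san() {
    # Keep the header line plus the value line, then take the value line,
    # split it on commas, trim spaces, and look for an exact DNS entry.
    grep -A 1 'X509v3 Subject Alternative Name' \
        | tail -n 1 \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//' \
        | grep -Fxq -- "DNS:$1"
}

# Example usage (path and hostname taken from the steps above):
#   openssl x509 -in /etc/etcd/server.crt -noout -text \
#       | cert_text_has_dns_san ose3-master.example.com \
#       || echo "hostname missing from SAN"
```

With the bad certificate generated above, the helper should fail for ose3-master.example.com because the SAN contains only the IP address.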

Comment 9 liujia 2018-02-02 08:09:12 UTC

(In reply to Scott Dodson from comment #8)

Scott, thanks a lot for your helpful message.

Comment 10 liujia 2018-02-02 08:15:08 UTC

Reproduced on openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7.noarch.

Steps:
1. Install OCP 3.7 with external etcd.
2. Re-sign the etcd server cert with only an IP address in the SAN, following Scott's steps in comment 8.
3. Trigger the upgrade. The restart of atomic-openshift-master-api failed:

aster-etcd-1:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match qe-jliu-rp-master-etcd-1"; please retry.

Comment 11 liujia 2018-02-02 11:00:32 UTC

Version: openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

Steps:
1. Install OCP 3.7 with external etcd.
2. Re-sign the etcd server cert with only an IP address in the SAN, following Scott's steps in comment 8.
3. Trigger the upgrade: succeeded.
4. Restart atomic-openshift-master-api: succeeded.
5. Check that the hostname was added to the etcd server cert's SAN:
# cat server.crt |grep -A 1 "Subject Alternative Name"
            X509v3 Subject Alternative Name: 
                IP Address:10.240.0.37, DNS:qe-jliu-rpmr-master-etcd-1

Comment 12 liujia 2018-02-05 04:42:39 UTC

Added case ocp-17925

Comment 15 errata-xmlrpc 2018-03-28 14:20:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 16 Scott Dodson 2018-04-10 17:08:59 UTC

This also apparently happens when simply upgrading etcd to etcd-3.2.15-2, which was rebuilt using golang 1.9.