Bug 1536217 - Need to validate etcd serving certs before 3.9 upgrade
Summary: Need to validate etcd serving certs before 3.9 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.0
Assignee: Andrew Butcher
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-01-18 22:27 UTC by Scott Dodson
Modified: 2018-06-15 15:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:20:40 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2018:0489 | 0 | None | None | None | 2018-03-28 14:20:59 UTC

Description Scott Dodson 2018-01-18 22:27:09 UTC
It looks like the etcd client in 3.9 now validates the CN/SAN of the endpoint against the client URL that's specified. Certs that we've generated in the past apparently were not signed with the hostname, so they stop validating upon upgrade.

A workaround is to change from hostnames to IP addresses, but we need to look into this more.

Comment 1 Scott Dodson 2018-01-18 22:58:22 UTC
Example sanitized log message encountered after restarting with 3.9 api server.

Jan 18 22:10:20 ip-1-3-5-1.ec2.internal openshift[28067]: Failed to dial ip-1-3-5-1.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-1-3-5-1.ec2.internal"; please retry.

We need to validate the history of how we've configured the client and the history of the SANs we've been applying to the etcd serving certs, but I imagine ultimately the best path forward is to update all client config to connect to the ip rather than the hostname.
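
To see which names the API server failed to match, the hostname can be pulled out of the log message. This is a minimal sketch that assumes the exact message format shown in the log line above. The function name wanted_hostnames is hypothetical and not part of any OpenShift tooling.

```shell
# Print each distinct hostname that the etcd client wanted to match but
# could not find in the serving certificate. Reads log lines on stdin.
# Assumes the message format from the sanitized log line in this comment.
wanted_hostnames() {
  sed -n 's/.*x509: certificate is not valid for any names, but wanted to match \([^"]*\)".*/\1/p' \
    | sort -u
}
```

On a master it could be fed from the journal, for example `journalctl -u atomic-openshift-master-api | wanted_hostnames`.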

Comment 2 Scott Dodson 2018-01-19 14:55:07 UTC
Dan,

Can I get your opinion on the best way forward here?

Comment 3 Dan Mace 2018-01-19 19:34:11 UTC
Okay, through talking with Jordan and Scott, here's what we found:

* As of Go 1.8, the SAN list is (correctly) authoritative, and the absence of the etcd hostname in the SAN list results in the rejection we're seeing
* Older origin builds predating Go 1.8 worked by coincidence (incorrectly falling back to the CN even in the presence of SAN; i.e. not treating SAN as authoritative)

Bottom line is our cert SAN sections are essentially invalid and need to be updated to include the hostname.
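
The rule described above, that only the SAN list counts once it is present, can be approximated with a small check on the text dump of a certificate. This is a rough sketch, not the Go verifier's logic: it does an exact, case-insensitive match on DNS entries only and does not handle wildcards. The function name san_lists_dns is hypothetical.

```shell
# Succeed only if HOST ($1) appears as a DNS entry in the certificate's
# Subject Alternative Name list. The Subject CN is deliberately ignored,
# mirroring the "SAN is authoritative" behaviour described above.
# Reads the output of `openssl x509 -text -noout` on stdin.
# Limitation: exact match only, no wildcard handling.
san_lists_dns() {
  grep -A 1 'X509v3 Subject Alternative Name' \
    | tail -n 1 \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qixF "DNS:$1"
}
```

For example, `openssl x509 -in /etc/etcd/server.crt -text -noout | san_lists_dns "$(hostname)"`, where the name to test should be whatever hostname appears in the etcd client URL.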

Comment 4 Andrew Butcher 2018-01-23 18:41:40 UTC
SAN was corrected in https://github.com/openshift/openshift-ansible/commit/33e181c39d5024ecd226567139a7b0d36683bf2c so clusters created prior to 3.6 have incorrect etcd serving certificates.

Comment 5 openshift-github-bot 2018-01-26 01:27:49 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/a97704f4db140037a67aeb2ca45254e4ecffed6e
Merge pull request #6859 from abutcher/bz1536217

Automatic merge from submit-queue.

Bug 1536217: Need to validate etcd serving certs before 3.9 upgrade

Comment 7 liujia 2018-01-31 09:25:51 UTC
@Andrew

I tried to reproduce the issue but failed. Going through your comments in this bug,
I gathered the following; could you correct anything I have wrong so that I can proceed on this bug? Thanks!

1) In a 3.7 to 3.9 upgrade, openshift-ansible needs a check in upgrade_control_plane to ensure that etcd's hostname is present in server.crt; otherwise openshift-ansible will redeploy certificates to generate a new server.crt that includes the hostname (as in PR 6859 from comment 5).

2) In a 3.7 to 3.9 upgrade of a freshly installed 3.7 cluster, the "validates the CN/SAN of the endpoint" issue from Scott's description will not occur, because since 3.6 openshift-ansible has generated etcd's server.crt with the hostname (as in the commit from comment 4).

So, how can I build a 3.7 cluster whose etcd cert is signed with only an IP?

Comment 8 Scott Dodson 2018-02-01 14:21:02 UTC
@liujia

The easiest way to test this is probably to install a single master + etcd environment using 3.7, then re-generate a bad certificate like this

cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
-config /etc/etcd/ca/openssl.cnf \
-out server.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
-config /etc/etcd/ca/openssl.cnf \
-out server.crt \
-in server.csr \
-extensions etcd_v3_ca_server -batch

Verify that the SAN contains only the IP address:
openssl x509 -in server.crt -text -noout

Then, you can perform a 3.9 upgrade using the old playbooks, which should fail with a cert error because the 3.9 API server cannot connect to etcd.

Using new playbooks, the upgrade should automatically add the hostname to the SAN, and the upgrade should be successful.
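
To look at what an IP-only SAN looks like without touching a cluster, a throwaway self-signed certificate can be generated in a scratch directory. This sketch is not a substitute for the steps above, since the result is not signed by the etcd CA. It assumes OpenSSL 1.1.1 or later for the -addext option, and the function name make_ip_only_cert is hypothetical.

```shell
# Create a throwaway self-signed certificate in directory $1 whose SAN
# holds only an IP address, with a hostname in the CN (same shape as the
# bad certificate above). Assumes OpenSSL 1.1.1+ (for -addext).
# The result is NOT signed by the etcd CA; use it for inspection only.
make_ip_only_cert() {
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "$1/server.key" -out "$1/server.crt" \
    -subj /CN=ose3-master.example.com \
    -addext "subjectAltName=IP:1.2.3.4" 2>/dev/null
}
```

After running it against a directory created with `mktemp -d`, `openssl x509 -in server.crt -text -noout` in that directory should show an IP Address entry and no DNS entry under Subject Alternative Name.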

Comment 9 liujia 2018-02-02 08:09:12 UTC
(In reply to Scott Dodson from comment #8)

Scott, Thx a lot for your helpful message.

Comment 10 liujia 2018-02-02 08:15:08 UTC
Reproduced on openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7.noarch.

Steps:
1, install OCP 3.7 with external etcd.
2, re-sign the etcd server cert with an IP-only SAN, following Scott's steps in comment 8.
3, trigger the upgrade; the restart of atomic-openshift-master-api failed.

aster-etcd-1:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match qe-jliu-rp-master-etcd-1"; please retry.

Comment 11 liujia 2018-02-02 11:00:32 UTC
Version:openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

Steps:
1, install OCP 3.7 with external etcd.
2, re-sign the etcd server cert with an IP-only SAN, following Scott's steps in comment 8.
3, trigger the upgrade; it succeeded.
4, restart atomic-openshift-master-api; it succeeded.
5, check that the hostname was added to the etcd server cert's SAN:
# cat server.crt |grep -A 1 "Subject Alternative Name"
            X509v3 Subject Alternative Name: 
                IP Address:10.240.0.37, DNS:qe-jliu-rpmr-master-etcd-1

Comment 12 liujia 2018-02-05 04:42:39 UTC
Added case ocp-17925

Comment 15 errata-xmlrpc 2018-03-28 14:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 16 Scott Dodson 2018-04-10 17:08:59 UTC
This also apparently happens when simply upgrading etcd to etcd-3.2.15-2, which was rebuilt using golang 1.9.

