Bug 1565762 - Need to validate etcd server certs have proper SAN prior to upgrading to atomic-openshift-3.7.44
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.z
Assignee: Vadim Rutkovsky
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks: 1572763
Reported: 2018-04-10 17:40 UTC by Scott Dodson
Modified: 2021-09-09 13:39 UTC (History)
12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The original build of atomic-openshift-3.7.44 was built using Golang 1.9, which enforces tighter certificate validation than previous versions. In environments upgraded from 3.5 to later versions, certificates may not pass that validation. atomic-openshift-3.7.44 has been rebuilt using Golang 1.8.3, which does not apply the more restrictive validation rules, avoiding this problem.
Clone Of:
: 1572763 (view as bug list)
Environment:
Last Closed: 2018-04-30 00:27:43 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Full ansible log when upgrading to ocp-3.7 and etcd-3.2.15-2 (535.25 KB, text/plain), 2018-04-26 11:12 UTC, Gaoyun Pei
Log for 3.6 -> 3.7 upgrade (152.86 KB, application/x-gzip), 2018-04-26 19:33 UTC, Mike Fiedler


Links
Red Hat Product Errata RHBA-2018:1261, last updated 2018-04-30 00:27:53 UTC

Description Scott Dodson 2018-04-10 17:40:25 UTC
Description of problem:
etcd-3.2.15-2.el7, which shipped in RHEL 7.5, was rebuilt using golang 1.9, which has stricter certificate validation rules. Namely, the cert must have not only a CN but also a SAN that matches.

This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1536217 where the API server in OCP 3.9 is rebuilt using golang-1.9 and had similar problems.

Version-Release number of the following components:
etcd-3.2.15-2.el7

How reproducible:
Requires an environment provisioned using playbooks prior to OCP 3.6 where we did not properly add the hostname to the SAN.

You can verify this is the case by looking at the serving cert

BAD
# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:10.10.10.10

GOOD
# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:10.10.10.10, DNS:master.example.com
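The same check can be wrapped in a small shell helper; this is a hedged sketch only (`check_etcd_san` is an illustrative name, not part of openshift-ansible):

```shell
# Sketch only: report whether an etcd serving cert's SAN covers a hostname.
# check_etcd_san is an illustrative helper, not part of the playbooks.
check_etcd_san() {
  cert=$1; host=$2
  # Grab the line of SAN entries following the "Subject Alternative Name" header
  san=$(openssl x509 -in "$cert" -noout -text \
        | grep -A1 'Subject Alternative Name' | tail -n1)
  case "$san" in
    *"DNS:$host"*) echo "OK: SAN covers $host" ;;
    *)             echo "BAD: SAN '$san' lacks DNS:$host"; return 1 ;;
  esac
}
# e.g. check_etcd_san /etc/etcd/server.crt master.example.com
```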



Steps to Reproduce:
1. Provision a cluster using 3.5
2. Upgrade to 3.6
3. Upgrade to 3.7

Alternatively, you could provision a 3.6 cluster, modify the certs to only have the IP address in the SAN like so, then perform a 3.7 upgrade.
cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
-config /etc/etcd/ca/openssl.cnf \
-out server.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
-config /etc/etcd/ca/openssl.cnf \
-out server.crt \
-in server.csr \
-extensions etcd_v3_ca_server -batch

Actual results:
Playbooks abort because etcdctl cannot verify cluster health

TASK [etcd : Restart etcd] *****************************************************
Tuesday 10 April 2018  16:17:33 +0000 (0:00:00.800)       0:06:07.732 ********* 
changed: [host.example.com]

TASK [etcd : Verify cluster is healthy] ****************************************
Tuesday 10 April 2018  16:17:34 +0000 (0:00:01.321)       0:06:09.054 ********* 
FAILED - RETRYING: Verify cluster is healthy (3 retries left).
FAILED - RETRYING: Verify cluster is healthy (2 retries left).
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [host.example.com]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://host.example.com:2379", "cluster-health"], "delta": "0:00:00.065643", "end": "2018-04-10 16:18:06.654776", "msg": "non-zero return code", "rc": 4, "start": "2018-04-10 16:18:06.589133", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-61-120.eu-west-1.compute.internal\n\nerror #0: x509: certificate is not valid for any names, but wanted to match host.example.com", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match host.example.com", "", "error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-61-120.eu-west-1.compute.internal"], "stdout": "cluster may be unhealthy: failed to list members", "stdout_lines": ["cluster may be unhealthy: failed to list members"]}

Expected results:
Successful upgrade

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Scott Dodson 2018-04-10 17:41:51 UTC
The fix is to backport this to 3.7; it ensures that during the upgrade we check for bad certs and correct them.

https://github.com/openshift/openshift-ansible/pull/6859

Comment 2 Scott Dodson 2018-04-10 19:48:28 UTC
Workaround:

On each host, downgrade/upgrade to etcd-3.2.15-1.el7:

yum downgrade etcd-3.2.15-1.el7 && systemctl restart etcd
yum upgrade etcd-3.2.15-1.el7 && systemctl restart etcd

Also, note that this does not affect actual operation of OCP 3.7 or earlier versions, but it will prevent /usr/bin/etcdctl from interacting with the cluster the way the upgrade playbooks do.
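In a multi-master cluster the workaround has to be applied on every etcd host. A hedged sketch that only prints the per-host commands for review before running them (the helper name and host list are placeholders, not part of the playbooks):

```shell
# Sketch only: emit the Comment 2 workaround command for each etcd host
# so it can be reviewed before running. Host names are placeholders.
workaround_cmds() {
  for h in "$@"; do
    printf "ssh root@%s 'yum -y downgrade etcd-3.2.15-1.el7 && systemctl restart etcd'\n" "$h"
  done
}
# e.g. workaround_cmds master1.example.com master2.example.com
```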

Comment 3 Scott Dodson 2018-04-11 02:27:41 UTC
We should also backport https://github.com/openshift/openshift-ansible/pull/6926 so that it automatically determines which host is the etcd CA host.

Comment 7 Gaoyun Pei 2018-04-26 11:10:41 UTC
Reproduced this issue with the released 3.7 openshift-ansible (openshift-ansible-3.7.42-1.git.2.9ee4e71.el7.noarch.rpm).

Reproduce steps:
1. Prepare a v3.5.5.31.67 OCP cluster with etcd-3.1.9-2 installed; the SAN of the etcd server.crt contains only an IP address:
[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:172.18.7.242
[root@ip-172-18-7-242 ~]# rpm -q etcd
etcd-3.1.9-2.el7.x86_64

2. Upgrade it to v3.6.173.0.113 without upgrading etcd

3. Upgrade it to v3.7.42, also upgrade etcd to etcd-3.2.15-2 using openshift-ansible-3.7.42-1.git.2.9ee4e71.el7.noarch.rpm. 
Upgrade fails as below:
TASK [etcd : Verify cluster is healthy] *************************************************************************************************************************************
FAILED - RETRYING: Verify cluster is healthy (3 retries left).
FAILED - RETRYING: Verify cluster is healthy (2 retries left).
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [ec2-52-90-234-197.compute-1.amazonaws.com]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://ip-172-18-7-242.ec2.internal:2379", "cluster-health"], "delta": "0:00:00.050016", "end": "2018-04-26 04:46:52.074893", "failed": true, "msg": "non-zero return code", "rc": 4, "start": "2018-04-26 04:46:52.024877", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal\n\nerror #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal", "", "error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-7-242.ec2.internal"], "stdout": "cluster may be unhealthy: failed to list members", "stdout_lines": ["cluster may be unhealthy: failed to list members"]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry



Verified this bug with the above steps using openshift-ansible-3.7.44-1.git.6.f5308bb.el7.
The upgrade fails with the same error; some post-checks on the etcd host:
[root@ip-172-18-7-242 ~]# rpm -q etcd
etcd-3.2.15-2.el7.x86_64
[root@ip-172-18-7-242 ~]# ll /etc/etcd/
total 56
drwx------. 5 root root  212 Apr 26 05:01 ca
-rw-------. 1 etcd etcd 1895 Apr 26 01:17 ca.crt
-rw-r--r--. 1 root root  654 Apr 26 05:05 etcd.conf
-rw-r--r--. 1 root root  539 Apr 26 01:17 etcd.conf.23499.2018-04-26@04:28:09~
-rw-r--r--. 1 root root  567 Apr 26 04:28 etcd.conf.23646.2018-04-26@04:28:19~
-rw-r--r--. 1 root root 1341 Feb 21  2017 etcd.conf.5198.2018-04-26@01:17:27~
-rw-r--r--. 1 root root 1686 Feb  6 09:23 etcd.conf.rpmnew
drwx------. 4 root root  221 Apr 26 01:20 generated_certs
-rw-------. 1 etcd etcd 5902 Apr 26 01:17 peer.crt
-rw-r--r--. 1 root root 1001 Apr 26 01:17 peer.csr
-rw-------. 1 etcd etcd 1704 Apr 26 01:17 peer.key
-rw-------. 1 etcd etcd 5858 Apr 26 01:17 server.crt
-rw-r--r--. 1 root root 1001 Apr 26 01:17 server.csr
-rw-------. 1 etcd etcd 1704 Apr 26 01:17 server.key
[root@ip-172-18-7-242 ~]# ll /etc/etcd/generated_certs/
total 36
-rw-r--r--. 1 root root 14415 Apr 26 01:17 etcd_ca.tgz
drwx------. 2 root root   122 Apr 26 01:17 etcd-ip-172-18-7-242.ec2.internal
-rw-r--r--. 1 root root 10743 Apr 26 01:17 etcd-ip-172-18-7-242.ec2.internal.tgz
drwx------. 2 root root   122 Apr 26 01:20 openshift-master-ip-172-18-6-109.ec2.internal
-rw-r--r--. 1 root root  6621 Apr 26 01:20 openshift-master-ip-172-18-6-109.ec2.internal.tgz
[root@ip-172-18-7-242 ~]# ll /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/
total 36
-rw-r--r--. 3 root root 1895 Apr 26 01:17 ca.crt
-rw-r--r--. 1 root root 5976 Apr 26 05:01 peer.crt
-rw-r--r--. 1 root root 1041 Apr 26 05:01 peer.csr
-rw-r--r--. 1 root root 1708 Apr 26 05:01 peer.key
-rw-r--r--. 1 root root 5933 Apr 26 05:01 server.crt
-rw-r--r--. 1 root root 1041 Apr 26 05:01 server.csr
-rw-r--r--. 1 root root 1704 Apr 26 05:01 server.key
[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:172.18.7.242, DNS:ip-172-18-7-242.ec2.internal
[root@ip-172-18-7-242 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:172.18.7.242

The /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal/server.crt was generated as expected, but /etc/etcd/server.crt was not replaced by the new one.

It seems the tarball of newly generated etcd certs was not updated during etcd certificate redeployment.

TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
ok: [ec2-52-90-234-197.compute-1.amazonaws.com -> ec2-52-90-234-197.compute-1.amazonaws.com] => {"changed": false, "cmd": "tar -czvf /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz\n -C /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal .", "failed": false, "rc": 0, "stdout": "skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz exists", "stdout_lines": ["skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-7-242.ec2.internal.tgz exists"]}

Full upgrade log attached.

Comment 8 Gaoyun Pei 2018-04-26 11:12:23 UTC
Created attachment 1427146 [details]
Full ansible log when upgrading to ocp-3.7 and etcd-3.2.15-2

Comment 9 Vadim Rutkovsky 2018-04-26 11:56:17 UTC
Not entirely sure if these steps are correct - 3.5 should be installed with etcd v2.x. This is probably why existing certs are not being replaced.

I'll prepare a PR to make sure certs from /etc/etcd/generated_certs are always copied to /etc/etcd/server.crt

Scott, does that seem correct to you?

Comment 10 Scott Dodson 2018-04-26 12:34:41 UTC
It should be fine if etcd-3.1.9 was the starting state. The problem is how the certificates were generated by openshift-ansible in the 3.5 playbooks which it looks like QE has accurately reproduced based on the first output in comment 7.

Yes, I think making sure that the server.crt is always updated when we call the playbook is fine. When I tested your PR, that was a problem I ran into, and I added this commit, which resolved it for me.

https://github.com/openshift/openshift-ansible/pull/7914/commits/1c96402177ddc2c9887780047e7a6ed1554b603b

BTW, the build openshift-ansible-3.7.44-1.git.6.f5308bb.el7 comes from this branch https://github.com/openshift/openshift-ansible/tree/openshift-ansible-3.7-fix because while the pull request referenced here has been merged into release-3.7 we had to 'hotfix' this atop the 3.7.44 tag.

Comment 11 Vadim Rutkovsky 2018-04-26 12:38:45 UTC
(In reply to Scott Dodson from comment #10)
> Yes, I think making sure that the server.crt is always updated when we call
> the playbook is fine. When I'd tested your PR that was a problem I ran into
> and I added this commit which resolved the problem for me.
> 
> https://github.com/openshift/openshift-ansible/pull/7914/commits/
> 1c96402177ddc2c9887780047e7a6ed1554b603b

Right, that should do the trick.

> BTW, the build openshift-ansible-3.7.44-1.git.6.f5308bb.el7 comes from this
> branch
> https://github.com/openshift/openshift-ansible/tree/openshift-ansible-3.7-
> fix because while the pull request referenced here has been merged into
> release-3.7 we had to 'hotfix' this atop the 3.7.44 tag.

Okay, I'll check if 1c96402177ddc2c9887780047e7a6ed1554b603b fixes this. Moving to MODIFIED, as it's not yet released.

Comment 13 Scott Dodson 2018-04-26 16:48:07 UTC
Additional fix in openshift-ansible-3.7.44-1.git.7.49cfcd8.el7	

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=678900

Comment 16 Mike Fiedler 2018-04-26 19:33:06 UTC
First attempt at upgrading failed with this in the master-api log:

Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840114    2930 start_master.go:530] Starting master on 0.0.0.0:8443 (v3.7.44)
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840131    2930 start_master.go:531] Public master address is https://ec2-34-226-216-106.compute-1.amazonaws.com:8443
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: I0426 15:01:33.840145    2930 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 15:01:33 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 15:01:39 ip-172-18-4-245.ec2.internal openshift[2930]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: F0426 15:02:09.549256    2930 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[2930]: , grpc: timed out when dialing]
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: Failed to start Atomic OpenShift Master API.
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Apr 26 15:02:09 ip-172-18-4-245.ec2.internal systemd[1]: atomic-openshift-master-api.service failed.


Attaching ansible log (it is the last run in this log).

Going to try again with a fresh cluster.   This cluster went through some hardship getting to the point where it was ready to upgrade to 3.7.

Comment 17 Mike Fiedler 2018-04-26 19:33:50 UTC
Created attachment 1427394 [details]
Log for 3.6 -> 3.7 upgrade

Comment 18 Mike Fiedler 2018-04-26 21:38:57 UTC
I am seeing the error message from the api logs in comment 16 (x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal) during the 3.5->3.6 upgrade - hit it in each of 2 attempts.   Opened bug 1572377 to track that separately from this bug.

Comment 19 Vikas Laad 2018-04-27 01:34:03 UTC
I also hit the same problem as mentioned in comment #16 while upgrading from 3.6 to 3.7

Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: I0426 21:26:34.322911   10578 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:26:34 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:26:39 ip-172-31-55-174.us-west-2.compute.internal openshift[10578]: Failed to dial ip-172-31-55-174.us-west-2.compute.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal"; please retry.
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: F0426 21:27:09.944535   10578 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-31-55-174.us-west-2.compute.internal
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal atomic-openshift-master-api[10578]: , grpc: timed out when dialing]
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Apr 26 21:27:09 ip-172-31-55-174.us-west-2.compute.internal systemd[1]: Failed to start Atomic OpenShift Master API.

Comment 20 Gaoyun Pei 2018-04-27 06:04:16 UTC
Tried again with the new build - openshift-ansible-3.7.44-1.git.7.49cfcd8.el7. Still failed with the same error; /etc/etcd/server.crt was not updated before the etcd upgrade.


Key steps in the upgrade log:

[root@gpei-preserved 0427]# grep "Create a tarball of the etcd certs" upgrade_1 -nA1
2288:TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
2289-ok: [ec2-52-55-105-116.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": false, "cmd": "tar -czvf /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz\n -C /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal .", "failed": false, "rc": 0, "stdout": "skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz exists", "stdout_lines": ["skipped, since /etc/etcd/generated_certs/etcd-ip-172-18-12-112.ec2.internal.tgz exists"]}
--
2378:TASK [etcd : Create a tarball of the etcd certs] ****************************************************************************************************************************
2379-changed: [ec2-54-161-17-75.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": true, "cmd": ["tar", "-czvf", "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal.tgz", "-C", "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal", "."], "delta": "0:00:00.014403", "end": "2018-04-26 22:23:19.793260", "failed": false, "rc": 0, "start": "2018-04-26 22:23:19.778857", "stderr": "", "stderr_lines": [], "stdout": "./\n./master.etcd-client.key\n./master.etcd-client.csr\n./master.etcd-client.crt\n./master.etcd-ca.crt", "stdout_lines": ["./", "./master.etcd-client.key", "./master.etcd-client.csr", "./master.etcd-client.crt", "./master.etcd-ca.crt"]}

[root@gpei-preserved 0427]# grep -r "Delete existing certificate tarball if certs need to be regenerated" upgrade_1 -nA1
2375:TASK [etcd : Delete existing certificate tarball if certs need to be regenerated] *******************************************************************************************
2376-changed: [ec2-54-161-17-75.compute-1.amazonaws.com -> ec2-52-55-105-116.compute-1.amazonaws.com] => {"changed": true, "failed": false, "path": "/etc/etcd/generated_certs/openshift-master-ip-172-18-13-152.ec2.internal.tgz", "state": "absent"}


The proposed fix in https://github.com/openshift/openshift-ansible/commit/4a2c8415f8b45c897be72a3610a8ed0fd0509c68 only removed the client cert tarball used for masters; it didn't delete the etcd server cert tarball.


After checking the code, my thought is that, compared with the real etcd cert redeploy playbook[1], the way we call etcd cert redeployment during upgrade[2] lacks an include of playbooks/common/openshift-cluster/redeploy-certificates/etcd-backup.yml, which would remove the generated certificates.

Alternatively, we could simply add the same check in roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml as https://github.com/openshift/openshift-ansible/commit/4a2c8415f8b45c897be72a3610a8ed0fd0509c68.


[1]https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.7-fix/playbooks/byo/openshift-cluster/redeploy-etcd-certificates.yml#L10-L18
[2]https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.7-fix/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml#L38-L42
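The skip behavior described above can be sketched in shell: the tarball task only creates the archive when it is absent, so a stale archive from an earlier run keeps shipping old certs until it is deleted. This is a hedged illustration mirroring the "skipped, since ... exists" message in the logs, not the playbook code itself:

```shell
# Sketch only: mirrors the playbook's creates-if-absent tarball behavior
# that let stale certs be redistributed. Paths are placeholders.
make_cert_tarball() {
  dir=$1; tgz=$dir.tgz
  if [ -e "$tgz" ]; then
    # A leftover archive short-circuits the task, so newly generated
    # certs in $dir never reach the tarball that gets unpacked later.
    echo "skipped, since $tgz exists"
  else
    tar -czf "$tgz" -C "$dir" . && echo "created $tgz"
  fi
}
```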

Comment 21 Gaoyun Pei 2018-04-27 08:02:20 UTC
@Mike and Vikas, the error you hit in comment 16 and comment 19 should be another scenario caused by this etcd server.crt SAN issue, which is more similar to https://bugzilla.redhat.com/show_bug.cgi?id=1536217.

We could simply reproduce it in a normal 3.7.44 ocp env.

1. Modify /etc/etcd/server.crt to only have the IP address in the SAN
cd /etc/etcd/
SAN="IP: 1.2.3.4" openssl req -new -keyout server.key \
-config /etc/etcd/ca/openssl.cnf \
-out server.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=ose3-master.example.com

SAN="IP: 1.2.3.4" openssl ca -name etcd_ca \
-config /etc/etcd/ca/openssl.cnf \
-out server.crt \
-in server.csr \
-extensions etcd_v3_ca_server -batch

2. Restart etcd and atomic-openshift-master-api. The master-api service then fails as below:
Apr 27 02:25:38 ip-172-18-8-229.ec2.internal atomic-openshift-master-api[21534]: I0427 02:25:38.616637   21534 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 27 02:25:38 ip-172-18-8-229.ec2.internal openshift[21534]: Failed to dial ip-172-18-1-152.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-152.ec2.internal"; please retry.


The root cause is a bad SAN in the cluster's etcd server certs, which originates from a cert-generation problem in 3.5 fresh installs.

If we perform the 3.5 -> 3.6 upgrade with etcd also upgraded to etcd-3.2.15-2.el7, we hit BZ#1572377, so I did not update etcd when upgrading to 3.6 while verifying this 3.7 upgrade bug.

After an incomplete upgrade to 3.6 (a second attempt at upgrading to 3.6 skips the etcd upgrade since the etcd rpm package is already the latest one), the etcd server certs are still not fixed, so the 3.6 -> 3.7 upgrade fails in the end.

Comment 23 Vadim Rutkovsky 2018-04-27 11:50:41 UTC
(In reply to Gaoyun Pei from comment #20)
> Tried again with the new build -
> openshift-ansible-3.7.44-1.git.7.49cfcd8.el7. Still failed with the same
> error. /etc/etcd/server.crt is not got updated before etcd upgrade.

...

> The proposed fix in
> https://github.com/openshift/openshift-ansible/commit/
> 4a2c8415f8b45c897be72a3610a8ed0fd0509c68 only removed the client cert
> tarball used for masters, didn't delete the etcd server cert tarball.


Right, removing the tarball for the server certs too seems to have done the trick.
Created https://github.com/openshift/openshift-ansible/pull/8173 to merge into hotfix branch.

These changes will later be cherry-picked to master and down to the 3.6 branches.

Comment 24 Brenton Leanhardt 2018-04-27 18:21:58 UTC
We're addressing this bug by rebuilding 3.7 with golang 1.8.3 instead of 1.9.

Comment 25 Vikas Laad 2018-04-27 18:24:18 UTC
Verified on 3.7.44-2.git.0.6b061d4.el7

Installed 3.7.42 directly and upgraded the rpms to 3.7.44. Was able to start master api.

Comment 28 Mike Fiedler 2018-04-27 20:17:08 UTC
I was able to do a successful full 3.5-3.6-3.7 upgrade with no errors like this:

1. Install 3.5.5.31.67 + etcd 3.1.9
2. Upgrade using 3.6.173.0.113 + the PR from https://bugzilla.redhat.com/show_bug.cgi?id=1572377#c9
3. Migrate etcd storage using the openshift-ansible from comment 13 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13)
4. Upgrade to 3.7.44-2 (build from comment 27 of this bz) using the openshift-ansible from comment 13 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13)

No cert errors; successful upgrade all the way.

Comment 29 Mike Fiedler 2018-04-27 20:25:07 UTC
Comment 28 is rpm only; I did not run the containerized equivalent.

Comment 31 errata-xmlrpc 2018-04-30 00:27:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1261

