Bug 1572377

Summary: 3.5->3.6 Upgrade fails: Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match <hostname>
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Installer Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Gaoyun Pei <gpei>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.6.0 CC: aos-bugs, dlbewley, gpei, jokerman, mmccomas, vlaad, vrutkovs, wmeng
Target Milestone: ---   
Target Release: 3.6.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: etcd 3.2.15 is compiled with Go 1.9, which tightens certificate checks.
Consequence: Certificates created without a SAN entry are treated as invalid by etcd 3.2.15.
Fix: Certificates are regenerated during the 3.5 -> 3.6 upgrade so that they are compatible with the new etcd.
Result: openshift-ansible generates valid certificates for etcd 3.2.15.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-07 20:20:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
ansible log, additional /etc capture
none
3.7 upgrade failure ansible log and /etc files
none
replacement: inventory, /etc/ files and ansible log for original failure on 3.5 to 3.6 upgrade
none

Description Mike Fiedler 2018-04-26 21:31:32 UTC
Description of problem:

I've hit this twice on upgrading 3.5 to 3.6 and once upgrading 3.6 to 3.7 as reported here:  https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c16

Opening this bz to capture the issue separately from bz1565762 since it happens in the 3.5->3.6 upgrade. 

The full error message is below and the following items are in the attached tarball:

- inventory used for upgrade
- ansible log for upgrade (not -vvv unfortunately, sorry)
- tar of /etc/origin and /etc/etcd before the upgrade
- tar of /etc/origin and /etc/etcd after the failed upgrade

The upgrade was performed with the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1567857 installed.

Version-Release number of the following components:
[root@ip-172-18-4-245 logs_config]# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.113-1.git.0.8a42ef5.el7.noarch
[root@ip-172-18-4-245 logs_config]# rpm -q ansible
ansible-2.4.2.0-2.el7.noarch
[root@ip-172-18-4-245 logs_config]# ansible --version
ansible 2.4.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]

How reproducible:  twice in 2 attempts

Steps to Reproduce:
1. Single node cluster on AWS - install 3.5.5.31.67.  Verify the cluster is operational.
2. Clone openshift-ansible release-1.5 to get the fix for bug 1567857
3. ansible-playbook -i inv openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml 

Actual results:
2018-04-26 17:12:46,300 p=64174 u=root |  TASK [etcd_upgrade : Verify cluster is healthy] ******************************************************************************************************************************************************************************************************************************************
2018-04-26 17:12:48,814 p=64174 u=root |  FAILED - RETRYING: Verify cluster is healthy (3 retries left).
2018-04-26 17:12:59,350 p=64174 u=root |  FAILED - RETRYING: Verify cluster is healthy (2 retries left).
2018-04-26 17:13:09,893 p=64174 u=root |  FAILED - RETRYING: Verify cluster is healthy (1 retries left).
2018-04-26 17:13:20,453 p=64174 u=root |  fatal: [ec2-34-226-216-106.compute-1.amazonaws.com]: FAILED! => {
    "attempts": 3, 
    "changed": true, 
    "cmd": [
        "etcdctl", 
        "--cert-file", 
        "/etc/etcd/peer.crt", 
        "--key-file", 
        "/etc/etcd/peer.key", 
        "--ca-file", 
        "/etc/etcd/ca.crt", 
        "-C", 
        "https://ip-172-18-4-245.ec2.internal:2379", 
        "cluster-health"
    ], 
    "delta": "0:00:00.047958", 
    "end": "2018-04-26 17:13:20.391596", 
    "rc": 4, 
    "start": "2018-04-26 17:13:20.343638"
}

STDOUT:

cluster may be unhealthy: failed to list members


STDERR:

Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal

error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal


MSG:

non-zero return code



Expected results:

Additional info:

Comment 1 Mike Fiedler 2018-04-26 21:33:48 UTC
Created attachment 1427401 [details]
inventory, ansible log, backups of /etc dirs

Comment 2 Mike Fiedler 2018-04-26 22:03:18 UTC
Created attachment 1427402 [details]
ansible log, additional /etc capture

In both upgrade attempts where this error was encountered, the 2nd attempt at upgrading to 3.6 succeeded.  Attaching the ansible log and /etc/origin, /etc/etcd tars after the successful attempt.

Comment 3 Mike Fiedler 2018-04-26 23:00:35 UTC
As with https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c16, after the 2nd attempt to upgrade to 3.6 succeeded, a further upgrade of the cluster to 3.7 failed.

The master-api service failed to restart:

TASK [Restart master API] ****************************************************************************************************************************************************************************************************************************************************************
fatal: [ec2-34-226-216-106.compute-1.amazonaws.com]: FAILED! => {"changed": false, "msg": "Unable to restart service atomic-openshift-master-api: Job for atomic-openshift-master-api.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-api.service\" and \"journalctl -xe\" for details.\n"}


with the subject messages in the api-server log:

Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413123    3359 start_master.go:530] Starting master on 0.0.0.0:8443 (v3.7.44)
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413131    3359 start_master.go:531] Public master address is https://ec2-34-226-216-106.compute-1.amazonaws.com:8443
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal atomic-openshift-master-api[3359]: I0426 18:58:48.413146    3359 start_master.go:538] Using images from "registry.reg-aws.openshift.com:443/openshift3/ose-<component>:v3.7.44"
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal openshift[3359]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 18:58:48 ip-172-18-4-245.ec2.internal openshift[3359]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.
Apr 26 18:58:54 ip-172-18-4-245.ec2.internal openshift[3359]: Failed to dial ip-172-18-4-245.ec2.internal:2379: connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match ip-172-18-4-245.ec2.internal"; please retry.

I'll leave the cluster in this state and attach the same ansible log and /etc archives

Comment 4 Mike Fiedler 2018-04-26 23:04:18 UTC
Created attachment 1427403 [details]
3.7 upgrade failure ansible log and /etc files

Comment 5 Mike Fiedler 2018-04-26 23:05:31 UTC
Created attachment 1427404 [details]
replacement:  inventory, /etc/ files and ansible log for original failure on 3.5 to 3.6 upgrade

Comment 6 Gaoyun Pei 2018-04-27 07:53:26 UTC
After etcd is upgraded to etcd-3.2.15-2, if only an IP address is configured in the SAN of the etcd server.crt, then only the IP address is accepted as an endpoint in the etcdctl command.

[root@ip-172-18-1-165 ~]# rpm -q etcd
etcd-3.2.15-2.el7.x86_64

[root@ip-172-18-1-165 ~]# openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 Alternative
            X509v3 Subject Alternative Name: 
                IP Address:172.18.1.165

[root@ip-172-18-1-165 ~]# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://ip-172-18-1-165.ec2.internal:2379 cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-165.ec2.internal

error #0: x509: certificate is not valid for any names, but wanted to match ip-172-18-1-165.ec2.internal

[root@ip-172-18-1-165 ~]# etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://172.18.1.165:2379 cluster-health
member 8e9e05c52164694d is healthy: got healthy result from https://172.18.1.165:2379
cluster is healthy


So we need to back-port the fix for BZ#1565762 to 3.6 ASAP to make sure the etcd server.crt SAN gets fixed during the 3.5 -> 3.6 upgrade.
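
The behaviour shown in this comment can be modelled with a short Python sketch. It is an illustration only, not etcd's or Go's actual verification code: it assumes the client matches the endpoint host strictly against SAN entries of the same type (IP endpoints against `IP Address` entries, hostnames against `DNS` entries) and ignores the certificate's Common Name. The function names `parse_san` and `endpoint_matches` are hypothetical, and wildcard DNS entries are not handled.

```python
import ipaddress


def parse_san(openssl_text):
    """Extract DNS and IP SAN entries from `openssl x509 -noout -text` output."""
    dns_names, ip_addrs = [], []
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        if "Subject Alternative Name" in line and i + 1 < len(lines):
            # The entries follow on the next line, separated by commas.
            for entry in lines[i + 1].split(","):
                entry = entry.strip()
                if entry.startswith("DNS:"):
                    dns_names.append(entry[len("DNS:"):].lower())
                elif entry.startswith("IP Address:"):
                    ip_addrs.append(ipaddress.ip_address(entry[len("IP Address:"):]))
    return dns_names, ip_addrs


def endpoint_matches(host, dns_names, ip_addrs):
    """Return True if the endpoint host is covered by a SAN entry of its own type."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: only DNS entries can match (assumed: no CN fallback).
        return host.lower() in dns_names
    return ip in ip_addrs


# SAN section as printed by the openssl command in this comment.
SAMPLE = """\
            X509v3 Subject Alternative Name:
                IP Address:172.18.1.165
"""
dns_names, ip_addrs = parse_san(SAMPLE)
print(endpoint_matches("ip-172-18-1-165.ec2.internal", dns_names, ip_addrs))  # False
print(endpoint_matches("172.18.1.165", dns_names, ip_addrs))                  # True
```

With only an IP entry in the SAN, the hostname endpoint is rejected and the IP endpoint is accepted, which mirrors the two etcdctl runs in this comment.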

Comment 7 Vadim Rutkovsky 2018-04-27 12:14:07 UTC
Upgrade from 3.5 to 3.6 cleanly (3.6 upgrade did fail though). I'll certainly cherry-pick the fix from BZ#1565762 to 3.6; I'm just concerned that it's a different problem.

Not sure why this is happening. Why was a new etcd installed in 3.6? Was it a containerized / system-containers / all-RPM install?

Comment 8 Mike Fiedler 2018-04-27 12:37:27 UTC
The install is all RPM.  I ensured etcd 3.1.9 was installed at the beginning, but I did not check it after the first upgrade to 3.6 failed or after the second one succeeded.   I will watch that on the next attempt.   Let me know if it is critical to try that to proceed with the fix.

Comment 9 Vadim Rutkovsky 2018-04-27 15:08:40 UTC
This happens when the 3.5 cluster is set up with etcd < 3.2.15.

Created PR https://github.com/openshift/openshift-ansible/pull/8181 and tested this using:

* 3.5 cluster with 'etcd_version=3.1.9'
* commented etcd_version, upgraded to 3.6
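
As a rough aid for spotting clusters that match this condition, the sketch below compares an etcd version string against 3.2.15. The helper names `etcd_version` and `older_than_3_2_15` are hypothetical, and the parsing assumes the first `x.y.z` triple in the string is the etcd version, as in `rpm -q etcd` output.

```python
import re


def etcd_version(rpm_name):
    """Parse the first x.y.z version triple out of an `rpm -q etcd` style string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", rpm_name)
    if match is None:
        raise ValueError("no version found in %r" % rpm_name)
    return tuple(int(part) for part in match.groups())


def older_than_3_2_15(rpm_name):
    """Return True if the parsed etcd version is lower than 3.2.15."""
    return etcd_version(rpm_name) < (3, 2, 15)


print(older_than_3_2_15("3.1.9"))                     # True
print(older_than_3_2_15("etcd-3.2.15-2.el7.x86_64"))  # False
```

Comparing integer tuples avoids the string-comparison pitfall where "3.2.9" would sort after "3.2.15".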

Comment 10 Mike Fiedler 2018-04-27 20:19:30 UTC
I was able to do a successful full 3.5-3.6-3.7 upgrade with no errors like this:

1. Install 3.5.5.31.67 + etcd 3.1.9

2. Upgrade using 3.6.173.0.113 + the PR from comment 9 of this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1572377#c9)

3. Migrate etcd storage using the openshift-ansible from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13

4. Upgrade to 3.7.44-2 (build from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c26) using the openshift-ansible from https://bugzilla.redhat.com/show_bug.cgi?id=1565762#c13

no cert errors - successful upgrade all of the way

Comment 11 Mike Fiedler 2018-04-27 20:24:53 UTC
Comment 10 is RPM only. I did not run the containerized equivalent.

Comment 12 Scott Dodson 2018-05-01 01:08:32 UTC
Merged, fixes in openshift-ansible-3.6.173.0.113-1.git.13.f3b3b1d.el7

Comment 14 Gaoyun Pei 2018-05-02 07:28:02 UTC
Verified this bug with openshift-ansible-3.6.173.0.113-1.git.13.f3b3b1d.el7.noarch

Upgraded 3.5.5.31.67 + etcd 3.1.9 to 3.6.173.0.113 + etcd 3.2.15. During the etcd upgrade tasks, the etcd server.crt was replaced with a new one that has both the IP address and the DNS name in the SAN. No errors occurred during the upgrade.
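
For illustration, a SAN that names the node both ways can be written in the OpenSSL `subjectAltName` syntax (`DNS:` and `IP:` prefixes). The sketch below builds such a value. The helper name `san_value` is hypothetical, the hostname and address are borrowed from the reproduction in comment 6, and this does not describe how openshift-ansible generates its certificates.

```python
import ipaddress


def san_value(hostnames, ip_addrs):
    """Return an OpenSSL-style subjectAltName value covering names and IPs."""
    entries = ["DNS:" + name for name in hostnames]
    # Validate and normalise each address before emitting it.
    entries += ["IP:" + str(ipaddress.ip_address(ip)) for ip in ip_addrs]
    if not entries:
        raise ValueError("at least one hostname or IP address is required")
    return ",".join(entries)


print("subjectAltName = " + san_value(["ip-172-18-1-165.ec2.internal"], ["172.18.1.165"]))
# subjectAltName = DNS:ip-172-18-1-165.ec2.internal,IP:172.18.1.165
```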

Comment 17 errata-xmlrpc 2018-05-07 20:20:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1335