Bug 1772161 - OCP 3.9: etcd remains unhealthy with cert errors after trying to scaleup using AWS provisioning playbooks
Summary: OCP 3.9: etcd remains unhealthy with cert errors after trying to scaleup using AWS provisioning playbooks
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-13 19:30 UTC by Aja Lightner
Modified: 2023-03-24 16:00 UTC
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-21 12:56:34 UTC
Target Upstream Version:
Embargoed:


Attachments
"Bad node" - ip-172-28-249-121 (40.00 KB, application/x-tar), 2019-11-13 19:30 UTC, Aja Lightner
Leader of etcd (tarball) - 172.28.248.35 (290.00 KB, application/x-tar), 2019-11-13 19:35 UTC, Aja Lightner
Healthy etcd peer - ip-172-28-248-183 (190.00 KB, application/x-tar), 2019-11-13 19:38 UTC, Aja Lightner
Sosreport (15.98 MB, application/x-xz), 2019-11-13 19:46 UTC, Aja Lightner

Description Aja Lightner 2019-11-13 19:30:32 UTC
Created attachment 1635901 [details]
"Bad node" - ip-172-28-249-121

Description of problem:

The etcd scaleup play fails to add the 3rd node (ip-172-28-249-121) to the cluster.

After restoring a failed etcd cluster [0] and reducing it to a single-peer cluster [1], etcd was up and running and oc commands were working. We were able to scale up one of the etcd peers, leaving us with 2 healthy etcd peers.

We then tried to scale up the 3rd member and ran into certificate issues. Rather than redeploying the client and server CA, we decided the easier path was to delete the AWS guest and let the autoscaler scale it back up. However, the scaleup failed with the error below [2].

Why is the etcd-signer different on the ip-172-28-249-121 node? And why did the scaleup work for the 2nd etcd node, which was also scaled up yesterday?


[0]
https://access.redhat.com/solutions/4013381

[1]
https://docs.openshift.com/container-platform/3.9/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-single-node_restore-etcd-quorum

[2]
fatal: [ip-172-28-249-121.eu-central-1.compute.internal -> ip-172-28-248-183.eu-central-1.compute.internal]: FAILED! => {
    "attempts": 3,
    "changed": true,
    "cmd": [
        "etcdctl",
        "--cert-file",
        "/etc/etcd/peer.crt",
        "--key-file",
        "/etc/etcd/peer.key",
        "--ca-file",
        "/etc/etcd/ca.crt",
        "-C",
        "https://172.28.248.183:2379",
        "member",
        "add",
        "ip-172-28-249-121.eu-central-1.compute.internal",
        "https://172.28.249.121:2380"
    ],
    "delta": "0:00:00.048407",
    "end": "2019-11-13 00:17:32.010531",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "etcdctl\n --cert-file /etc/etcd/peer.crt\n --key-file /etc/etcd/peer.key\n --ca-file /etc/etcd/ca.crt\n -C https://172.28.248.183:2379\n member add ip-172-28-249-121.eu-central-1.compute.internal https://172.28.249.121:2380",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-11-13 00:17:31.962124",
    "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.28.248.35:2379 has no leader\n; error #1: client: etcd member https://172.28.248.183:2379 has no leader",
    "stderr_lines": [
        "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.28.248.35:2379 has no leader",
        "; error #1: client: etcd member https://172.28.248.183:2379 has no leader"
    ],
    "stdout": "",
    "stdout_lines": []
}
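
The "has no leader" errors indicate the member add was attempted while the cluster had lost quorum. Before retrying, quorum can be confirmed directly from a surviving member; a minimal sketch using the same v2 etcdctl flags as the failed task above:
~~~
# Expect "cluster is healthy" before attempting another member add.
etcdctl --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key \
        --ca-file /etc/etcd/ca.crt \
        -C https://172.28.248.183:2379 \
        cluster-health
~~~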



Version-Release number of selected component (if applicable):
OCP 3.9



How reproducible:
100%



Steps to Reproduce:

- Scaled up etcd node "172.28.249.121", but it failed because the etcd service did not start:
~~~
[root@ip-172-28-248-35 ~]#  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table  member list
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
|        ID        |  STATUS   |                      NAME                       |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
| 1338325f965575c7 | unstarted |                                                 | https://172.28.249.121:2380 |                             |
| 3bfe46eea399c0d5 |   started | ip-172-28-248-183.eu-central-1.compute.internal | https://172.28.248.183:2380 | https://172.28.248.183:2379 |
| 7c724f31b1875832 |   started |  ip-172-28-248-35.eu-central-1.compute.internal |  https://172.28.248.35:2380 |  https://172.28.248.35:2379 |
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
~~~
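
When a failed add leaves an unstarted entry like the one above, the stale member usually has to be removed before the scaleup can be retried. A minimal sketch, assuming ETCDCTL_API=3 and the same ETCD_* variables as the member list command above:
~~~
# Remove the unstarted member by the ID shown in the table, then
# re-run the scaleup play so it can add the member cleanly.
ETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE \
  --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS \
  member remove 1338325f965575c7
~~~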

- The etcd service did not start because it could not contact the other etcd peers: TLS verification failed with "certificate signed by unknown authority":
~~~
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal etcd[24108]: could not get cluster response from https://172.28.248.183:2380: Get https://172.28.248.183:2380/members: x509: certificate signed by unknown authority
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal etcd[24108]: could not get cluster response from https://172.28.248.35:2380: Get https://172.28.248.35:2380/members: x509: certificate signed by unknown authority
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
~~~
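
The same trust failure can be reproduced outside of etcd (a hedged sketch; the /tmp path is illustrative): capture the certificate a healthy peer presents and verify it against the failing node's local CA bundle.
~~~
# On the failing node: grab the peer certificate served by a healthy
# member, then check it against the local trust bundle. An "unable to
# get local issuer certificate" error matches the x509 failure above.
echo | openssl s_client -connect 172.28.248.183:2380 \
    -cert /etc/etcd/peer.crt -key /etc/etcd/peer.key 2>/dev/null \
  | openssl x509 -outform PEM > /tmp/peer-183.crt
openssl verify -CAfile /etc/etcd/ca.crt /tmp/peer-183.crt
~~~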

- On the ip-172-28-249-121 node, we see that the etcd signer is different, which is what causes the issue:
~~~
[root@ip-172-28-249-121 etc]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1573601807
                DirName:/CN=etcd-signer@1573601807

[root@ip-172-28-248-35 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310

[root@ip-172-28-248-183 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310
~~~
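
To confirm the mismatch across all three members in one pass, a small convenience sketch (assumes root SSH access to the nodes):
~~~
# Print the peer.crt issuer on every etcd host; the failing node shows
# a different etcd-signer timestamp than the two healthy members.
for h in ip-172-28-249-121 ip-172-28-248-35 ip-172-28-248-183; do
  echo "== $h =="
  ssh "$h" 'openssl x509 -in /etc/etcd/peer.crt -noout -issuer'
done
~~~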


Actual results:
The signers of the peer.crt are different (see the openssl output above). On -35 we can see that the peer.crt was generated back in August. On the 'new' etcd peer that was successfully scaled up, the peer.crt was created yesterday when we ran the playbook (Nov 12 18:20 peer.crt). On the 'bad' node that fails to scale up, the peer.crt was also generated during the run, but it has a new/different signer [4].


[4]
(The listings below come from the attached tarballs, examined on the host ibm-p8-kvm-03-guest-02; the unusual owner/group names such as openvpn, chrony, pipewire, and polkitd are most likely that host's UID/GID mappings rather than the original owners.)

121:
~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 48
drwxr-xr-x. 2 root    root   4096 Nov 13 06:11 ca
-rw-------. 1 openvpn chrony 1895 Nov 12 18:37 ca.crt
-rw-r--r--. 1 root    root   1657 Nov 13 06:11 etcd.conf
-rw-r--r--. 1 root    root   1686 Feb 13  2019 etcd.conf.24018.2019-11-13@11:11:57~
-rw-------. 1 openvpn chrony 6069 Nov 13 06:11 peer.crt
-rw-------. 1 root    root   1090 Nov 13 06:11 peer.csr
-rw-------. 1 openvpn chrony 1704 Nov 13 06:11 peer.key
-rw-------. 1 openvpn chrony 6022 Nov 13 06:11 server.crt
-rw-------. 1 root    root   1090 Nov 13 06:11 server.csr
-rw-------. 1 openvpn chrony 1704 Nov 13 06:11 server.key
~~~

35:
~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 96
drwx------. 5 root     root     4096 Nov 12 18:20 ca
-rw-------. 1 pipewire openvpn  5685 Aug  8 05:25 ca.crt
-rw-r--r--. 1 root     root    38543 Aug  8 08:26 etcdca.tar.gz
-rw-r--r--. 1 root     root     1652 Aug  8 08:44 etcd.conf
-rw-r--r--. 1 root     root     1686 Feb 13  2019 etcd.conf.29232.2019-08-08@13:44:35~
drwx------. 3 root     root     4096 Nov 12 18:21 generated_certs
-rw-------. 1 pipewire openvpn  6063 Aug  8 08:43 peer.crt
-rw-r--r--. 1 root     root     1090 Aug  8 08:43 peer.csr
-rw-------. 1 pipewire openvpn  1704 Aug  8 08:43 peer.key
-rw-------. 1 pipewire openvpn  6020 Aug  8 08:43 server.crt
-rw-r--r--. 1 root     root     1090 Aug  8 08:43 server.csr
-rw-------. 1 pipewire openvpn  1704 Aug  8 08:43 server.key
~~~

183:
~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 56
drwxr-xr-x. 5 root    root         4096 Nov 13 06:11 ca
-rw-------. 1 polkitd pulse-access 5685 Aug  8 05:25 ca.crt
-rw-r--r--. 1 root    root         1581 Nov 12 18:21 etcd.conf
-rw-r--r--. 1 root    root         1686 Feb 13  2019 etcd.conf.6213.2019-11-12@23:21:43~
drwx------. 4 root    root         4096 Nov 13 06:11 generated_certs
-rw-------. 1 polkitd pulse-access 6070 Nov 12 18:20 peer.crt
-rw-------. 1 root    root         1090 Nov 12 18:20 peer.csr
-rw-------. 1 polkitd pulse-access 1704 Nov 12 18:20 peer.key
-rw-------. 1 polkitd pulse-access 6023 Nov 12 18:20 server.crt
-rw-------. 1 root    root         1090 Nov 12 18:20 server.csr
-rw-------. 1 polkitd pulse-access 1704 Nov 12 18:20 server.key
~~~
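
One more detail from the listings: ca.crt is 5685 bytes on the two healthy members but only 1895 bytes on the failing node, which suggests the healthy members carry a bundle of several CA certificates while the new node received a single one. A quick hedged check on each host:
~~~
# Count the certificates in the etcd CA bundle; the counts should match
# across members of the same cluster.
grep -c 'BEGIN CERTIFICATE' /etc/etcd/ca.crt
~~~
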
Expected results:
The 3rd etcd member should have scaled up properly, without certificate issues.

Additional info:

Comment 1 Aja Lightner 2019-11-13 19:35:30 UTC
Created attachment 1635904 [details]
Leader of etcd (tarball) - 172.28.248.35

Comment 2 Aja Lightner 2019-11-13 19:38:53 UTC
Created attachment 1635905 [details]
Healthy etcd peer - ip-172-28-248-183

Comment 3 Aja Lightner 2019-11-13 19:46:29 UTC
Created attachment 1635924 [details]
Sosreport

Comment 4 Aja Lightner 2019-11-14 23:22:44 UTC
The customer found a workaround, but would like an RCA covering the following:

- Clarity about the etcd cluster CA/certificate issuer, the etcd leader, and the first etcd host in the Ansible inventory: what is the role of each of these in an etcd cluster, and how do they relate to each other?

CURRENT WORKAROUND:

The customer updated the playbook below to ensure etcd_ca_host pointed to the correct CA issuer node (the first node in the cluster, ip-172-28-248-35).
It looks like the section of the playbook that sets etcd_ca_host is not the right one for this situation, since the cluster was already running 2 other nodes. Also, where the comment says "we will default to the first member of the etcd host group", the first member it actually points to is ip-172-28-248-179, which is neither the first member in the Ansible hosts file nor the right member to issue the certificate [0]. So the customer changed groups[etcd_ca_host_group].0 to groups[etcd_ca_host_group].1 to ensure the certificate was issued by ip-172-28-248-35, and the scaleup succeeded [1].


[0]
~~~
private/roles/openshift_etcd_facts/defaults/main.yml:etcd_ca_host_group: "oo_etcd_to_config"


TASK [Evaluate oo_etcd_to_config] **********************************************************************************************************************************************************************************
ok: [localhost] => (item=ip-172-28-248-179.eu-central-1.compute.internal)
ok: [localhost] => (item=ip-172-28-248-35.eu-central-1.compute.internal)


ansible/hosts
    etcd:
      hosts:
        ip-172-28-248-35.eu-central-1.compute.internal:
          instance_id: i-0bb72297f5cd90084
        ip-172-28-248-179.eu-central-1.compute.internal:
          instance_id: i-081338b724bfe2111
~~~



[1]
~~~
/usr/share/ansible/openshift-ansible/playbooks/openshift-etcd
[root@ip-172-28-248-17 openshift-etcd]# cat private/roles/openshift_etcd_facts/tasks/set_etcd_ca_host.yml
(...)
# No etcd_ca_host was found in __etcd_ca_hosts. This is probably a
# fresh installation so we will default to the first member of the
# etcd host group.
- set_fact:
    etcd_ca_host: "{{ groups[etcd_ca_host_group].1 }}"   <<======================================= This was initially groups[etcd_ca_host_group].0
  when:
  - etcd_ca_host is not defined
~~~
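
Since the task only defaults etcd_ca_host when it is undefined, an alternative to patching the role is to pin the variable from the outside. A minimal sketch, assuming the 3.9 scaleup entry point at playbooks/openshift-etcd/scaleup.yml and a placeholder inventory path:
~~~
# Extra-vars take precedence over the set_fact default, so the role
# file itself does not need to be edited. /path/to/hosts is a
# placeholder for the cluster inventory.
ansible-playbook -i /path/to/hosts \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-etcd/scaleup.yml \
  -e etcd_ca_host=ip-172-28-248-35.eu-central-1.compute.internal
~~~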

Comment 5 Stephen Cuppett 2019-11-21 12:56:34 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking un-triaged bugs CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Version to the appropriate version where reproduced.

[1]: https://access.redhat.com/support/policy/updates/openshift

