Created attachment 1635901 [details]
"Bad node" - ip-172-28-249-121

Description of problem:

The scale-up play for etcd fails to add the 3rd node (ip-172-28-249-121) to the cluster.

After restoring a failed etcd cluster [0] and making it a single-peer cluster [1], etcd was up and running and oc commands were working. We were able to scale up one of the etcd peers, giving us 2 healthy etcd peers. We then tried to scale up the 3rd member and hit certificate issues. Rather than redeploying the client and server CA, we decided the easier path was to delete the AWS guest and let the autoscaler scale it back up. However, the scale-up failed with the error below [2].

Why is the etcd-signer different on the 172-28-249-121 node? Why did it work for the 2nd etcd node, which was also scaled up yesterday?

[0] https://access.redhat.com/solutions/4013381
[1] https://docs.openshift.com/container-platform/3.9/admin_guide/assembly_restore-etcd-quorum.html#cluster-restore-etcd-quorum-single-node_restore-etcd-quorum
[2]
~~~
fatal: [ip-172-28-249-121.eu-central-1.compute.internal -> ip-172-28-248-183.eu-central-1.compute.internal]: FAILED! => {
    "attempts": 3,
    "changed": true,
    "cmd": ["etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://172.28.248.183:2379", "member", "add", "ip-172-28-249-121.eu-central-1.compute.internal", "https://172.28.249.121:2380"],
    "delta": "0:00:00.048407",
    "end": "2019-11-13 00:17:32.010531",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "etcdctl\n --cert-file /etc/etcd/peer.crt\n --key-file /etc/etcd/peer.key\n --ca-file /etc/etcd/ca.crt\n -C https://172.28.248.183:2379\n member add ip-172-28-249-121.eu-central-1.compute.internal https://172.28.249.121:2380",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-11-13 00:17:31.962124",
    "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.28.248.35:2379 has no leader\n; error #1: client: etcd member https://172.28.248.183:2379 has no leader",
    "stderr_lines": [
        "client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://172.28.248.35:2379 has no leader",
        "; error #1: client: etcd member https://172.28.248.183:2379 has no leader"
    ],
    "stdout": "",
    "stdout_lines": []
}
~~~

Version-Release number of selected component (if applicable):

OCP 3.9

How reproducible:

100%

Steps to Reproduce:

- Scaled up etcd node 172.28.249.121, but it failed because the etcd service did not start:
~~~
[root@ip-172-28-248-35 ~]# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table member list
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
|        ID        |  STATUS   |                      NAME                       |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
| 1338325f965575c7 | unstarted |                                                 | https://172.28.249.121:2380 |                             |
| 3bfe46eea399c0d5 | started   | ip-172-28-248-183.eu-central-1.compute.internal | https://172.28.248.183:2380 | https://172.28.248.183:2379 |
| 7c724f31b1875832 | started   | ip-172-28-248-35.eu-central-1.compute.internal  | https://172.28.248.35:2380  | https://172.28.248.35:2379  |
+------------------+-----------+-------------------------------------------------+-----------------------------+-----------------------------+
~~~

- The etcd service did not start because it could not contact the other etcd peers; their certificates are signed by an unknown authority:
~~~
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal etcd[24108]: could not get cluster response from https://172.28.248.183:2380: Get https://172.28.248.183:2380/members: x509: certificate signed by unknown authority
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal etcd[24108]: could not get cluster response from https://172.28.248.35:2380: Get https://172.28.248.35:2380/members: x509: certificate signed by unknown authority
Nov 13 11:11:59 ip-172-28-249-121.eu-central-1.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
~~~

- On the 172-28-249-121 node, the etcd signer is different, which is causing the issue:
~~~
[root@ip-172-28-249-121 etc]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1573601807
                DirName:/CN=etcd-signer@1573601807

[root@ip-172-28-248-35 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310

[root@ip-172-28-248-183 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310
~~~

Actual results:

The signers of the peer.crt files are different [3]. On -35 the peer.crt was generated back in August. On the 'new' etcd peer that was successfully scaled up, the peer.crt was created yesterday when we ran the playbook (Nov 12 18:20 peer.crt). On the 'bad' node that fails to scale up, the peer.crt is also generated during the run, but it has a new/different signer [4].

[3]
~~~
[root@ip-172-28-249-121 etc]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1573601807
                DirName:/CN=etcd-signer@1573601807

[root@ip-172-28-248-35 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310

[root@ip-172-28-248-183 ~]# openssl x509 -in /etc/etcd/peer.crt -text -noout | grep etcd-signer
        Issuer: CN=etcd-signer@1565256310
                DirName:/CN=etcd-signer@1565256310
~~~

[4]
121:
~~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 48
drwxr-xr-x. 2 root    root   4096 Nov 13 06:11 ca
-rw-------. 1 openvpn chrony 1895 Nov 12 18:37 ca.crt
-rw-r--r--. 1 root    root   1657 Nov 13 06:11 etcd.conf
-rw-r--r--. 1 root    root   1686 Feb 13  2019 etcd.conf.24018.2019-11-13@11:11:57~
-rw-------. 1 openvpn chrony 6069 Nov 13 06:11 peer.crt
-rw-------. 1 root    root   1090 Nov 13 06:11 peer.csr
-rw-------. 1 openvpn chrony 1704 Nov 13 06:11 peer.key
-rw-------. 1 openvpn chrony 6022 Nov 13 06:11 server.crt
-rw-------. 1 root    root   1090 Nov 13 06:11 server.csr
-rw-------. 1 openvpn chrony 1704 Nov 13 06:11 server.key
~~~~

35:
~~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 96
drwx------. 5 root     root     4096 Nov 12 18:20 ca
-rw-------. 1 pipewire openvpn  5685 Aug  8 05:25 ca.crt
-rw-r--r--. 1 root     root    38543 Aug  8 08:26 etcdca.tar.gz
-rw-r--r--. 1 root     root     1652 Aug  8 08:44 etcd.conf
-rw-r--r--. 1 root     root     1686 Feb 13  2019 etcd.conf.29232.2019-08-08@13:44:35~
drwx------. 3 root     root     4096 Nov 12 18:21 generated_certs
-rw-------. 1 pipewire openvpn  6063 Aug  8 08:43 peer.crt
-rw-r--r--. 1 root     root     1090 Aug  8 08:43 peer.csr
-rw-------. 1 pipewire openvpn  1704 Aug  8 08:43 peer.key
-rw-------. 1 pipewire openvpn  6020 Aug  8 08:43 server.crt
-rw-r--r--. 1 root     root     1090 Aug  8 08:43 server.csr
-rw-------. 1 pipewire openvpn  1704 Aug  8 08:43 server.key
~~~~

183:
~~~~
[root@ibm-p8-kvm-03-guest-02 etcd]# ll
total 56
drwxr-xr-x. 5 root    root         4096 Nov 13 06:11 ca
-rw-------. 1 polkitd pulse-access 5685 Aug  8 05:25 ca.crt
-rw-r--r--. 1 root    root         1581 Nov 12 18:21 etcd.conf
-rw-r--r--. 1 root    root         1686 Feb 13  2019 etcd.conf.6213.2019-11-12@23:21:43~
drwx------. 4 root    root         4096 Nov 13 06:11 generated_certs
-rw-------. 1 polkitd pulse-access 6070 Nov 12 18:20 peer.crt
-rw-------. 1 root    root         1090 Nov 12 18:20 peer.csr
-rw-------. 1 polkitd pulse-access 1704 Nov 12 18:20 peer.key
-rw-------. 1 polkitd pulse-access 6023 Nov 12 18:20 server.crt
-rw-------. 1 root    root         1090 Nov 12 18:20 server.csr
-rw-------. 1 polkitd pulse-access 1704 Nov 12 18:20 server.key
~~~~

Expected results:

The 3rd etcd member should have scaled up properly, without certificate issues.

Additional info:
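The issuer comparison above can be scripted so it is easy to repeat before (re)running the scale-up play. A minimal sketch, assuming openssl is on the PATH; the `issuer_of`/`same_signer` helper names and the ssh loop are illustrative, not part of the playbook, and the host IPs and /etc/etcd paths are the ones from this report:

```shell
# Sketch: a scale-up can only succeed when every node's peer.crt chains to the
# same etcd-signer CA, so compare issuer lines across certs first.
issuer_of() {
  # Print only the issuer line of a PEM certificate.
  openssl x509 -in "$1" -noout -issuer
}

# True (exit 0) when two certs were issued by the same signer.
same_signer() {
  [ "$(issuer_of "$1")" = "$(issuer_of "$2")" ]
}

# Example, run from a bastion (illustrative loop over the nodes in this report):
# for h in 172.28.248.35 172.28.248.183 172.28.249.121; do
#   echo -n "$h: "; ssh root@"$h" 'openssl x509 -in /etc/etcd/peer.crt -noout -issuer'
# done
```

In this case the bad node would report CN=etcd-signer@1573601807 while the two healthy peers report CN=etcd-signer@1565256310, so `same_signer` would fail for any pair involving 172.28.249.121.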
Created attachment 1635904 [details] Leader of etcd (tarball) - 172.28.248.35
Created attachment 1635905 [details] Healthy etcd peer - ip-172-28-248-183
Created attachment 1635924 [details] Sosreport
The customer was able to provide a workaround, but would like an RCA on the following:

- Clarity about the etcd cluster CA/certificate issuer, the etcd leader, and the first etcd host in the Ansible inventory: what is the role of each in an etcd cluster, and how do they relate to each other?

CURRENT WORKAROUND:

The customer updated the playbook below to ensure etcd_ca_host was pointing to the correct CA issuer node (the first node in the cluster, ip-172-28-248-35). The section of the playbook where etcd_ca_host is set does not appear to be the right one for this situation, since the cluster was already running 2 other nodes. Also, where the comment says "we will default to the first member of the etcd host group", the first member it points to is ip-172-28-248-179, which is neither the first member in the Ansible hosts file nor the right member to issue the certificate [0]. The customer therefore changed groups[etcd_ca_host_group].0 to groups[etcd_ca_host_group].1 so that the certificate was issued by ip-172-28-248-35, and the scale-up succeeded [1].

[0]
~~~
private/roles/openshift_etcd_facts/defaults/main.yml:etcd_ca_host_group: "oo_etcd_to_config"

TASK [Evaluate oo_etcd_to_config] **********************************************
ok: [localhost] => (item=ip-172-28-248-179.eu-central-1.compute.internal)
ok: [localhost] => (item=ip-172-28-248-35.eu-central-1.compute.internal)

ansible/hosts:
etcd:
  hosts:
    ip-172-28-248-35.eu-central-1.compute.internal:
      instance_id: i-0bb72297f5cd90084
    ip-172-28-248-179.eu-central-1.compute.internal:
      instance_id: i-081338b724bfe2111
~~~

[1]
~~~
/usr/share/ansible/openshift-ansible/playbooks/openshift-etcd
[root@ip-172-28-248-17 openshift-etcd]# cat private/roles/openshift_etcd_facts/tasks/set_etcd_ca_host.yml
(...)
# No etcd_ca_host was found in __etcd_ca_hosts. This is probably a
# fresh installation so we will default to the first member of the
# etcd host group.
- set_fact:
    etcd_ca_host: "{{ groups[etcd_ca_host_group].1 }}"   # <<== was groups[etcd_ca_host_group].0
  when:
  - etcd_ca_host is not defined
~~~
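Since the set_fact task quoted above is guarded by "when: etcd_ca_host is not defined", an alternative to patching the role may be to define etcd_ca_host up front so the default never fires. This is an untested sketch: the scaleup.yml path follows the 3.9 playbook layout seen in this report and the inventory path is a placeholder; verify both against your installer version before relying on it.

```shell
# Hypothetical invocation: pin the CA host to the node that still holds the
# original etcd CA (ip-172-28-248-35 in this cluster) instead of editing
# set_etcd_ca_host.yml. An extra var counts as "defined", so the set_fact
# default ("first member of the etcd host group") should be skipped.
ansible-playbook -i /path/to/ansible/hosts \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-etcd/scaleup.yml \
  -e etcd_ca_host=ip-172-28-248-35.eu-central-1.compute.internal
```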
OCP 3.6-3.10 is no longer on full support [1]. Marking un-triaged bugs CLOSED DEFERRED.

If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Version to the appropriate version where the bug was reproduced.

[1]: https://access.redhat.com/support/policy/updates/openshift