Bug 1772161
| Summary: | OCP 3.9: etcd remains unhealthy with cert errors after trying to scaleup using AWS provisioning playbooks | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Aja Lightner <alightne> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DEFERRED | QA Contact: | ge liu <geliu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-21 12:56:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Aja Lightner
2019-11-13 19:30:32 UTC
Created attachment 1635904: Leader of etcd (tarball) - 172.28.248.35
Created attachment 1635905: Healthy etcd peer - ip-172-28-248-183
Created attachment 1635924: Sosreport
The customer was able to provide a workaround, but would like an RCA covering the following:
- Clarity on the etcd cluster CA/certificate issuer, the etcd leader, and the first etcd host in the Ansible hosts file: what role does each of these play in an etcd cluster, and how do they relate to each other?
CURRENT WORKAROUND:
The customer updated the playbook below so that etcd_ca_host points to the correct CA issuer node (the first node in the cluster, ip-172-28-248-35).
The section of the playbook that sets etcd_ca_host does not appear to be the correct one for this situation, because the cluster was already running 2 other nodes. In addition, where the comment says "we will default to the first member of the etcd host group", the first member it resolves to is ip-172-28-248-179, which is neither the first member in the Ansible hosts file nor the right member to issue the certificate [0]. The customer therefore changed groups[etcd_ca_host_group].0 to groups[etcd_ca_host_group].1 so that the certificate was issued by ip-172-28-248-35, and the scaleup succeeded [1].
[0]
~~~
private/roles/openshift_etcd_facts/defaults/main.yml:etcd_ca_host_group: "oo_etcd_to_config"

TASK [Evaluate oo_etcd_to_config] **********************************************************************************************************************************************************************************
ok: [localhost] => (item=ip-172-28-248-179.eu-central-1.compute.internal)
ok: [localhost] => (item=ip-172-28-248-35.eu-central-1.compute.internal)

ansible/hosts
etcd:
  hosts:
    ip-172-28-248-35.eu-central-1.compute.internal:
      instance_id: i-0bb72297f5cd90084
    ip-172-28-248-179.eu-central-1.compute.internal:
      instance_id: i-081338b724bfe2111
~~~
[1]
~~~
/usr/share/ansible/openshift-ansible/playbooks/openshift-etcd
[root@ip-172-28-248-17 openshift-etcd]# cat private/roles/openshift_etcd_facts/tasks/set_etcd_ca_host.yml
(...)
# No etcd_ca_host was found in __etcd_ca_hosts. This is probably a
# fresh installation so we will default to the first member of the
# etcd host group.
- set_fact:
    etcd_ca_host: "{{ groups[etcd_ca_host_group].1 }}"  # <<== This was initially groups[etcd_ca_host_group].0
  when:
    - etcd_ca_host is not defined
~~~
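To illustrate the positional lookup described in the workaround, here is a minimal Python sketch. It only models the `groups[etcd_ca_host_group].N` indexing using the host order reported in this bug. It is not openshift-ansible code: the function name `pick_default_ca_host` and the two list variables are hypothetical, and the sketch assumes the default is taken purely by position in the evaluated group.

```python
# Hypothetical model of "{{ groups[etcd_ca_host_group].N }}": a plain
# positional lookup into an ordered list of group members.
def pick_default_ca_host(group_members, index=0):
    return group_members[index]


# Order printed by "TASK [Evaluate oo_etcd_to_config]" in this report.
oo_etcd_to_config = [
    "ip-172-28-248-179.eu-central-1.compute.internal",
    "ip-172-28-248-35.eu-central-1.compute.internal",
]

# Order of the etcd hosts as listed in ansible/hosts in this report.
inventory_etcd = [
    "ip-172-28-248-35.eu-central-1.compute.internal",
    "ip-172-28-248-179.eu-central-1.compute.internal",
]

# Original task (.0) versus the customer's workaround (.1).
default_pick = pick_default_ca_host(oo_etcd_to_config, 0)
workaround_pick = pick_default_ca_host(oo_etcd_to_config, 1)

print("default (.0):   ", default_pick)
print("workaround (.1):", workaround_pick)
print("first in ansible/hosts:", inventory_etcd[0])
```

Under that assumption, index 0 of the evaluated group differs from the first host in ansible/hosts, and index 1 selects the host the customer identified as the CA issuer. A hardcoded index like this depends on the group order, which may differ in other environments.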
OCP 3.6-3.10 is no longer on full support [1]. Marking un-triaged bugs CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Version to the appropriate version where reproduced.
[1]: https://access.redhat.com/support/policy/updates/openshift