Bug 1852309

Summary: [Webscale] Degraded etcd-operator because etcd-bootstrap member was not removed
Product: OpenShift Container Platform
Component: Etcd
Version: 4.4
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: low
Status: CLOSED NOTABUG
Reporter: Pedro Ibáñez <pibanezr>
Assignee: Dan Mace <dmace>
QA Contact: ge liu <geliu>
CC: dahernan, dmace, wlewis
Last Closed: 2020-08-18 15:00:59 UTC
Type: Bug

Description Pedro Ibáñez 2020-06-30 06:10:56 UTC
Description of problem:

When I installed Baremetal OpenShift 4.4.6 with IPv4, the installer would fail because the Bootstrap VM kept looping on the following line in "journalctl":
Jun 09 19:19:57 localhost bootkube.sh[17524]: E0609 19:19:57.363437       1 reflector.go:153] k8s.io/client-go.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.ocp1.vio-sea.pd.f5net.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: dial tcp: lookup api-int.ocp1.vio-sea.pd.f5net.com on 172.27.1.1:53: no such host
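
The error above is a DNS lookup failure for the api-int record against the resolver at 172.27.1.1. A quick sanity check, assuming shell access to the bootstrap host (generic commands, not from the original transcript):

$ dig +short api-int.ocp1.vio-sea.pd.f5net.com @172.27.1.1
$ nslookup api-int.ocp1.vio-sea.pd.f5net.com 172.27.1.1

If neither returns the expected api-int VIP, the installer failure points at the DNS configuration rather than etcd itself.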

I continued to wait, and about 1-2 hours after the installer failed the OCP cluster did get configured successfully. Since then there has been a lot of development activity (from the SPK project) and I've upgraded the cluster to 4.4.7. I took a look at the cluster operators (oc get co) and noticed the etcd operator is in a degraded state.
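
The degraded condition can be inspected directly (a generic check; the exact message text varies, so output is omitted here):

sh-4.2# oc get co etcd
sh-4.2# oc get co etcd -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'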

When checking the etcd members:
sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |                     NAME                     |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
| 4402c520d4d28fc7 | started |                               etcd-bootstrap | https://10.146.134.208:2380 | https://10.146.134.208:2379 |
| 7753e9ceaf62e826 | started | openshift-master-0.ocp1.vio-sea.pd.f5net.com | https://10.146.134.219:2380 | https://10.146.134.219:2379 |
| 7c8d3ba3336c1c07 | started | openshift-master-1.ocp1.vio-sea.pd.f5net.com | https://10.146.134.218:2380 | https://10.146.134.218:2379 |
| eef2863f4b4e4b71 | started | openshift-master-2.ocp1.vio-sea.pd.f5net.com | https://10.146.134.217:2380 | https://10.146.134.217:2379 |
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+

When I removed the etcd-bootstrap member, the etcd operator went back to its normal healthy state.
sh-4.2# etcdctl endpoint health --cluster
https://10.146.134.219:2379 is healthy: successfully committed proposal: took = 18.29364ms
https://10.146.134.217:2379 is healthy: successfully committed proposal: took = 24.846313ms
https://10.146.134.218:2379 is healthy: successfully committed proposal: took = 28.421781ms
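
For anyone following along, removing a member uses the standard etcdctl command with the member ID from the table above (shown as a sketch, then re-listing to verify):

sh-4.2# etcdctl member remove 4402c520d4d28fc7
sh-4.2# etcdctl member list -w table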


When does the behavior occur? Frequency? Repeatedly? At certain times?
I've only seen this once but have yet to reproduce it.

Version-Release number of selected component (if applicable):
4.4.6, and the 4.4.6 to 4.4.7 upgrade

How reproducible:
Install OCP 4.4.6 on bare metal, upgrade to 4.4.7

Steps to Reproduce:
1. Install OCP 4.4.6 on bare metal
2. Upgrade to 4.4.7 (see the command sketch below)
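
A typical invocation for step 2, assuming the 4.4.7 payload is available in the cluster's upgrade channel:

$ oc adm upgrade --to=4.4.7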


Actual results:
The etcd co is degraded because the etcd-bootstrap member wasn't removed.

Expected results:
The etcd co is healthy.

Additional info:
A case has been opened with the must-gather attached: https://access.redhat.com/support/cases/#/case/02690705
This bug may be related: https://bugzilla.redhat.com/show_bug.cgi?id=1832986
Following the KCS, my cluster doesn't have the annotation it describes (checked after removing the etcd-bootstrap member): https://access.redhat.com/solutions/5161361

Comment 7 Dan Mace 2020-08-18 15:00:59 UTC
Closing because there's no reproducer, no evidence of an etcd issue, and the latest info from the customer reaffirms suspicions from the original report that there was an invalid or problematic DNS configuration at play confounding apiserver connectivity.

If there's some evidence this is still happening with the latest 4.4.z releases, or if there's a reproducer, please let us know.