1852309 – [Webscale] Degraded etcd-operator because etcd-bootstrap member

Summary: [Webscale] Degraded etcd-operator because etcd-bootstrap member

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Dan Mace
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-06-30 06:10 UTC by Pedro Ibáñez
Modified:	2020-08-18 15:00 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-08-18 15:00:59 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1832986	0	high	CLOSED	EtcdMembersDegraded false alarms	2023-12-15 17:51:28 UTC

Internal Links: 1832923

Description Pedro Ibáñez 2020-06-30 06:10:56 UTC

Description of problem:

When I installed bare-metal OpenShift 4.4.6 (IPv4), the installer failed because the bootstrap VM kept looping on the following line, as seen in "journalctl":
Jun 09 19:19:57 localhost bootkube.sh[17524]: E0609 19:19:57.363437       1 reflector.go:153] k8s.io/client-go.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.ocp1.vio-sea.pd.f5net.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: dial tcp: lookup api-int.ocp1.vio-sea.pd.f5net.com on 172.27.1.1:53: no such host
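
For triage, the failing hostname and the DNS server it was looked up on can be pulled out of a log line like the one above. The following is only a minimal sketch: the function names `parse_lookup_failure` and `resolves` are hypothetical helpers, the regular expression assumes the "lookup HOST on SERVER:PORT: no such host" wording shown in the log, and `resolves` uses whatever resolver the local machine is configured with, which is not necessarily 172.27.1.1.

```python
import re
import socket

# Matches the resolver error text seen in the log line above:
# "lookup <host> on <server>:<port>: no such host"
LOOKUP_FAILURE = re.compile(
    r"lookup (?P<host>\S+) on (?P<server>[\d.]+):(?P<port>\d+): no such host"
)


def parse_lookup_failure(line):
    """Return (host, dns_server) if the line reports a failed lookup, else None."""
    match = LOOKUP_FAILURE.search(line)
    if match is None:
        return None
    return match.group("host"), match.group("server")


def resolves(host):
    """Return True if the locally configured resolver can resolve host.

    Hypothetical helper: it does not query the DNS server named in the log.
    """
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Running `resolves("api-int.ocp1.vio-sea.pd.f5net.com")` from the bootstrap VM would indicate whether the api-int record is visible to that host's resolver.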

I waited about 1 to 2 hours after the installer failed, and the OCP cluster was eventually configured successfully. Since then there has been a lot of development activity (from the SPK project) and I've upgraded the cluster to 4.4.7. I checked the cluster operators (oc get co) and noticed that the etcd operator is in a degraded state.

When checking for the etcd-members:
sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |                     NAME                     |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
| 4402c520d4d28fc7 | started |                               etcd-bootstrap | https://10.146.134.208:2380 | https://10.146.134.208:2379 |
| 7753e9ceaf62e826 | started | openshift-master-0.ocp1.vio-sea.pd.f5net.com | https://10.146.134.219:2380 | https://10.146.134.219:2379 |
| 7c8d3ba3336c1c07 | started | openshift-master-1.ocp1.vio-sea.pd.f5net.com | https://10.146.134.218:2380 | https://10.146.134.218:2379 |
| eef2863f4b4e4b71 | started | openshift-master-2.ocp1.vio-sea.pd.f5net.com | https://10.146.134.217:2380 | https://10.146.134.217:2379 |
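
A leftover bootstrap member can be spotted by parsing the table output above. This is a rough sketch, not an official tool: `parse_member_table` and `leftover_bootstrap_members` are hypothetical helper names, and the parser assumes the five-column layout shown in this report.

```python
def parse_member_table(text):
    """Parse `etcdctl member list -w table` output into a list of dicts."""
    members = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as +-----+
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "ID":
            continue  # skip the header row and anything malformed
        member_id, status, name, peer_addrs, client_addrs = cells
        members.append(
            {
                "id": member_id,
                "status": status,
                "name": name,
                "peer_addrs": peer_addrs,
                "client_addrs": client_addrs,
            }
        )
    return members


def leftover_bootstrap_members(members, bootstrap_name="etcd-bootstrap"):
    """Return the members whose name matches the bootstrap member name."""
    return [m for m in members if m["name"] == bootstrap_name]
```

The ID returned for the bootstrap member (4402c520d4d28fc7 in the table above) is the value that `etcdctl member remove` expects.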

When I removed the etcd-bootstrap member, the etcd operator went back to a normal, healthy state.
sh-4.2# etcdctl endpoint health --cluster
https://10.146.134.219:2379 is healthy: successfully committed proposal: took = 18.29364ms
https://10.146.134.217:2379 is healthy: successfully committed proposal: took = 24.846313ms
https://10.146.134.218:2379 is healthy: successfully committed proposal: took = 28.421781ms
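
The health output above can be checked mechanically, for example to confirm that only the three master endpoints remain and that all of them report healthy. This is a sketch under assumptions: `parse_endpoint_health` is a hypothetical helper, and it assumes unhealthy endpoints are reported with the same "ENDPOINT is unhealthy" wording.

```python
import re

# One line per endpoint, e.g. "https://10.146.134.219:2379 is healthy: ..."
HEALTH_LINE = re.compile(r"^(?P<endpoint>\S+) is (?P<state>healthy|unhealthy)\b")


def parse_endpoint_health(text):
    """Map each endpoint URL to True (healthy) or False (unhealthy)."""
    result = {}
    for line in text.splitlines():
        match = HEALTH_LINE.match(line.strip())
        if match:
            result[match.group("endpoint")] = match.group("state") == "healthy"
    return result
```

With the output from this report, the returned mapping has three entries, all `True`, and no entry for the bootstrap address 10.146.134.208.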


When does the behavior occur? Frequency? Repeatedly? At certain times?
I've seen this only once and have not been able to reproduce it.

Version-Release number of selected component (if applicable):
4.4.6, and the 4.4.6 to 4.4.7 upgrade

How reproducible:
Install OCP 4.4.6 on bare metal, then upgrade to 4.4.7.

Steps to Reproduce:
1. Install OCP 4.4.6 on bare metal
2. Upgrade to 4.4.7


Actual results:
The etcd co is degraded because the etcd-bootstrap member wasn't removed.

Expected results:
The etcd co is healthy.

Additional info:
A case has been opened with the must-gather attached: https://access.redhat.com/support/cases/#/case/02690705
This bug may be related: https://bugzilla.redhat.com/show_bug.cgi?id=1832986
I followed the KCS article, but my cluster doesn't have that annotation (checked after removing the etcd-bootstrap member): https://access.redhat.com/solutions/5161361

Comment 7 Dan Mace 2020-08-18 15:00:59 UTC

Closing because there's no reproducer, no evidence of an etcd issue, and the latest info from the customer reaffirms suspicions from the original report that there was an invalid or problematic DNS configuration at play confounding apiserver connectivity.

If there's some evidence this is still happening with the latest 4.4.z releases, or if there's a reproducer, please let us know.
