Description of problem:

When I installed bare-metal OpenShift 4.4.6 (IPv4), the installer failed because the bootstrap VM kept looping on the following line in the "journalctl" output:

Jun 09 19:19:57 localhost bootkube.sh[17524]: E0609 19:19:57.363437 1 reflector.go:153] k8s.io/client-go.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.ocp1.vio-sea.pd.f5net.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: dial tcp: lookup api-int.ocp1.vio-sea.pd.f5net.com on 172.27.1.1:53: no such host

If I waited about 1-2 hours after the installer failed, the OCP cluster did eventually get configured successfully.

Since then there has been a lot of development activity (from the SPK project) and I've upgraded the cluster to 4.4.7. I checked the cluster operators (oc get co) and noticed that the etcd operator was in a degraded state. Listing the etcd members showed:

sh-4.2# etcdctl member list -w table
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |                     NAME                     |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+
| 4402c520d4d28fc7 | started | etcd-bootstrap                               | https://10.146.134.208:2380 | https://10.146.134.208:2379 |
| 7753e9ceaf62e826 | started | openshift-master-0.ocp1.vio-sea.pd.f5net.com | https://10.146.134.219:2380 | https://10.146.134.219:2379 |
| 7c8d3ba3336c1c07 | started | openshift-master-1.ocp1.vio-sea.pd.f5net.com | https://10.146.134.218:2380 | https://10.146.134.218:2379 |
| eef2863f4b4e4b71 | started | openshift-master-2.ocp1.vio-sea.pd.f5net.com | https://10.146.134.217:2380 | https://10.146.134.217:2379 |
+------------------+---------+----------------------------------------------+-----------------------------+-----------------------------+

After I removed the etcd-bootstrap member, the etcd operator went back to a normal, healthy state:

sh-4.2# etcdctl endpoint health --cluster
https://10.146.134.219:2379 is healthy: successfully committed proposal: took = 18.29364ms
https://10.146.134.217:2379 is healthy: successfully committed proposal: took = 24.846313ms
https://10.146.134.218:2379 is healthy: successfully committed proposal: took = 28.421781ms

When does the behavior occur? Frequency? Repeatedly? At certain times?
I've only seen this once and have not yet been able to reproduce it.

Version-Release number of selected component (if applicable):
4.4.6, 4.4.6 to 4.4.7 upgrade

How reproducible:
Install OCP 4.4.6 on bare metal, then upgrade to 4.4.7.

Steps to Reproduce:
1. Install OCP 4.4.6 on bare metal
2. Upgrade to 4.4.7

Actual results:
The etcd cluster operator is degraded because the etcd-bootstrap member wasn't removed.

Expected results:
The etcd cluster operator is healthy.

Additional info:
A case has been opened with the must-gather attachment:
https://access.redhat.com/support/cases/#/case/02690705

This bug may be related:
https://bugzilla.redhat.com/show_bug.cgi?id=1832986

I followed the KCS article below, but my cluster doesn't have the annotation it describes (checked after removing the etcd-bootstrap member):
https://access.redhat.com/solutions/5161361

Closing because there's no reproducer, no evidence of an etcd issue, and the latest info from the customer reaffirms suspicions from the original report that there was an invalid or problematic DNS configuration at play confounding apiserver connectivity. If there's some evidence this is still happening with the latest 4.4.z releases, or if there's a reproducer, please let us know.