Description of problem:

After a fresh installation of 4.4.3 or an upgrade from 4.3 to 4.4.3, the etcd operator consistently shows events like these:

~~~
57m  Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node4.example.com,node5.example.com members are available, have not started, are unhealthy, are unknown" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node5.example.com members are available, have not started, node4.example.com are unhealthy, are unknown"
~~~

The etcd cluster is healthy and all etcd members are available.
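For reference, member health can be verified directly from one of the etcd pods and comes back clean despite the events above. A minimal sketch of that check (the pod name etcd-<node-name> and the etcdctl container are 4.4 conventions and may differ in your cluster):

~~~
# Check health and membership from a running etcd pod
$ oc rsh -n openshift-etcd -c etcdctl etcd-<node-name> etcdctl endpoint health
$ oc rsh -n openshift-etcd -c etcdctl etcd-<node-name> etcdctl member list -w table
~~~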
*** Bug 1834976 has been marked as a duplicate of this bug. ***
Hello,

The customer has provided more information on the issue. The cluster is experiencing regular failures for various operators, and the affected operator pods often redeploy shortly afterwards. Example messages:

~~~
community-operators-57bcd7676c-v4pw2   Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
catalog-operator-6c6c549cf7-8p542      Liveness probe failed: Get http://10.130.0.19:8080/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
~~~

We presume this could be related to etcd. We have a different cluster running on vSphere infrastructure (rather than OpenStack), deployed using UPI (rather than IPI), which appears to have this identical issue. We have been unable to deploy any 4.4 cluster without experiencing this issue.
*** Bug 1835368 has been marked as a duplicate of this bug. ***
Hello,

The customer reported back with useful information regarding the issue.

`````````````````
This issue is becoming urgent because our largest customer has requested an OpenShift 4.4 cluster.

I have noticed something odd which could be related - in all clusters (OpenStack IPI and VMware UPI), the logs for the etcd-operator pod (in the openshift-etcd-operator project) show continual and repeated connection errors, trying to connect to the IP address of the bootstrap VM.

From the VMware-based UPI cluster:

$ oc logs etcd-operator-5d97b6445f-6xjhh | grep dialing
...
W0514 14:33:04.546487 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...
W0514 14:33:07.345430 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: operation was canceled". Reconnecting...
W0514 14:33:08.642433 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...

And in the OpenStack IPI cluster (where the earlier logs uploaded to this case came from):

$ oc logs etcd-operator-59cf47554b-m95s9 | grep dialing
W0514 14:52:00.115637 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.422120 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.436588 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.976723 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: connect: no route to host". Reconnecting...

As we know, one of the final steps of an install is to delete the bootstrap VM; manually in UPI and automatically in IPI. In our UPI install, we delete the bootstrap about 30 seconds after "openshift-install wait-for bootstrap-complete" completes. Therefore, the etcd-operator should NOT be trying to connect to the bootstrap.

Note - I have confirmed that the unhealthy member issue (and the bootstrap etcd-operator behaviour) remains after the cluster is updated to version 4.4.4.
`````````````````
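A quick way to see whether the operator is still chasing the old bootstrap endpoint is to grep its logs for the bootstrap IP and review the most recent operator events. A sketch (the IP 10.1.1.250 is taken from the logs above; substitute your own bootstrap address):

~~~
# Is the operator still dialing the bootstrap IP?
$ oc logs -n openshift-etcd-operator deployment/etcd-operator | grep 10.1.1.250 | tail -n 5

# Most recent operator events, including member-health messages
$ oc get events -n openshift-etcd-operator --sort-by=.lastTimestamp | tail -n 20
~~~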
I also tried an upgrade to 4.4.3 from 4.3.19 last night. The upgrade is stalled because of a degraded etcd operator:

~~~
[ocp41@dhmaapp286 ocp4upc]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          10h     Working towards 4.4.3: 11% complete

[ocp41@dhmaapp286 ocp4upc]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.19    True        False         False      121d
cloud-credential                           4.3.19    True        False         False      121d
cluster-autoscaler                         4.3.19    True        False         False      121d
console                                    4.3.19    True        False         False      11h
dns                                        4.3.19    True        False         False      23d
etcd                                       4.4.3     False       True          True       10h
image-registry                             4.3.19    True        False         False      11h
ingress                                    4.3.19    True        False         False      22h
insights                                   4.3.19    True        False         False      120d
kube-apiserver                             4.3.19    True        False         False      121d
kube-controller-manager                    4.3.19    True        False         False      121d
kube-scheduler                             4.3.19    True        False         False      121d
machine-api                                4.3.19    True        False         False      121d
machine-config                             4.3.19    True        False         False      19h
marketplace                                4.3.19    True        False         False      11h
monitoring                                 4.3.19    True        False         False      11h
network                                    4.3.19    True        False         False      121d
node-tuning                                4.3.19    True        False         False      5h32m
openshift-apiserver                        4.3.19    True        False         False      11h
openshift-controller-manager               4.3.19    True        False         False      23d
openshift-samples                          4.3.19    True        False         False      2m19s
operator-lifecycle-manager                 4.3.19    True        False         False      121d
operator-lifecycle-manager-catalog         4.3.19    True        False         False      121d
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      11h
service-ca                                 4.3.19    True        False         False      121d
service-catalog-apiserver                  4.3.19    True        False         False      121d
service-catalog-controller-manager         4.3.19    True        False         False      121d
storage                                    4.3.19    True        False         False      11h
~~~

I am getting similar messages in the etcd operator logs:

~~~
[ocp41@dhmaapp286 ocp4upc]$ oc logs etcd-operator-59cf47554b-shccm |grep dial
W0516 14:16:58.777781 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:20:56.916372 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:23:13.636242 1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:25:13.039698 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:27:13.618487 1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
W0516 14:30:38.613126 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:35:49.848968 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:13.557082 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:54.288387 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:45:13.597490 1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:46:00.858231 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:46:13.720890 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:47:01.127274 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:56:13.578309 1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
E0516 15:01:13.590141 1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr001.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.26:2379 with maintenance client: context canceled"}
~~~
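When the upgrade stalls like this, it helps to separate what the operator is reporting from what etcd itself is doing. A minimal sketch of the two checks (nothing here is specific to this cluster):

~~~
# Full condition messages explaining why the etcd clusteroperator is Degraded
$ oc describe clusteroperator etcd

# Recent errors from the operator itself
$ oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=50 | grep -iE 'error|degraded'
~~~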
Same on a 4.4.3 fresh install, but be aware that it is not only a false flag: the etcd operator keeps restarting randomly, roughly every 2 hours, and it can be stuck in a degraded status for hours. In addition, etcd and etcd-operator still keep dialing the BOOTSTRAP machine, even hours or days after the setup.
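To quantify the restarts described above, the restart counter and the previous container's log are the quickest data points. A sketch (substitute the actual operator pod name):

~~~
# The RESTARTS column shows how often the operator container has restarted
$ oc get pods -n openshift-etcd-operator

# Log of the previously restarted operator container, if any
$ oc logs -n openshift-etcd-operator <etcd-operator-pod> --previous --tail=50
~~~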
We have a fix for health checks that is going through the last review stages. The plan is to backport to 4.4.z. https://github.com/openshift/cluster-etcd-operator/pull/343
I would like to clarify the impact of this bug. Currently, we have a bug in cluster-etcd-operator that results in false negatives with our etcd health checks. etcd is still monitored by Prometheus and etcd-quorum-guard. In the event of an actual outage, you will receive alerts against `etcdMemberDown`. So unless you are observing this alert, these events can safely be ignored.
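In practice that means: as long as etcd-quorum-guard is fully available and no etcdMemberDown alert is firing, the events can be ignored. A hedged sketch of how one might check both (the quorum-guard namespace and the availability of curl inside the prometheus container are assumptions that can vary by version):

~~~
# etcd-quorum-guard should show all replicas available (managed under openshift-machine-config-operator in 4.4)
$ oc get deployment -n openshift-machine-config-operator etcd-quorum-guard

# Check whether etcdMemberDown is among the currently firing alerts
# (assumes curl is available in the prometheus container)
$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \
    curl -s http://localhost:9090/api/v1/alerts | grep -c etcdMemberDown
~~~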
*** Bug 1820734 has been marked as a duplicate of this bug. ***
Can confirm that it is still present in 4.4.4 release.
Fix will land in 4.4.6
Is there a workaround for this until the new version is released?
> Is there a workaround for this until the new version is released?

From the 4.4.4 release notes [1]:

When updating an OpenShift Container Platform cluster from version 4.3 to 4.4, the etcd Operator sometimes fails to upgrade because it is in a degraded state. This is caused by an InstallerPod failure. As a workaround, you must force a new revision on etcd to overcome the InstallerPod failure, which allows the etcd Operator to recover.

Force a new revision on etcd:

~~~
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
~~~

Verify the nodes are at the latest revision:

~~~
$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
~~~

[1] https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html
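Once the forced rollout has finished, the verification command should report that all nodes have converged. The exact revision number will differ per cluster, but on a healthy three-node control plane the output looks roughly like:

~~~
AllNodesAtLatestRevision
3 nodes are at revision 7
~~~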
To be clear, we are talking about 2 bugs. This BZ covers a bug that reports invalid health checks, for example etcd eventing an unhealthy member when the member is actually up. If you are seeing the issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c19 you are most likely seeing 1830510 [1]. That fix landed in 4.4.4.

Invalid events such as reported in the description [2]:

~~~
$ oc get events -n openshift-etcd-operator
LAST SEEN   TYPE      REASON                OBJECT                     MESSAGE
3s          Warning   UnhealthyEtcdMember   deployment/etcd-operator   unhealthy members: ip-10-0-137-159.us-west-1.compute.internal, ip-10-0-17-119.us-west-1.compute.internal
~~~

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1830510
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c0
Hello Team,

Updating to 4.4.6 does not seem to have resolved this issue. Can you confirm whether the proposed fix was actually included? At the moment, the errata document attached to 4.4.6 appears to be a 404 - https://access.redhat.com/errata/RHBA-2020:2310
Looks like it missed 4.4.6, so it will be in 4.4.7: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.4.6
Hello Sam, please take a look at comment 28.
*** Bug 1844501 has been marked as a duplicate of this bug. ***
What is the probability this will land in 4.4.7? Thanks!
> What is the probability this will land in 4.4.7? Thanks!

100%
We still see the following switchover on 4.4.8, is it expected?

~~~
NAMESPACE                 LAST SEEN   TYPE     REASON                  OBJECT                     MESSAGE
openshift-etcd-operator   2m          Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
openshift-etcd-operator   2m          Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
~~~
We could explore whether the transient failure should result in a degraded state, but the underlying message is unrelated to this bug.
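A quick way to confirm that the Degraded condition is only flapping briefly rather than sticking is to pull it straight from the clusteroperator status. A minimal sketch:

~~~
$ oc get clusteroperator etcd -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{" "}{.message}{"\n"}{end}'
~~~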
Hi,

Which version number of the etcd container contains the fix? Is this already fixed in 4.4.7 or 4.4.8?

Thanks,
Andreas
4.4.7 https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c35
Hello,

An upgrade from 4.4.6 to 4.4.7 left an etcd cluster with 4 members; the 4th member was etcd-bootstrap. A manual removal with the "etcdctl member remove <memberid>" command solved the issue with the etcd operator, which was in a degraded status. Is there anything else to be done?

Thanks
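For reference, the sequence used was roughly the following (a sketch; the pod name is an example, and the member should only be removed after confirming it really is the stale etcd-bootstrap entry):

~~~
# List current members from inside one of the running etcd pods
$ oc rsh -n openshift-etcd -c etcdctl etcd-<node-name> etcdctl member list -w table

# Remove the stale etcd-bootstrap member by its ID
$ oc rsh -n openshift-etcd -c etcdctl etcd-<node-name> etcdctl member remove <memberid>
~~~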
> An upgrade from 4.4.6 to 4.4.7 got an etcd cluster with 4 members, the 4th member was the etcd-bootstrap. No idea why etcd-bootstrap would still be around by the time you initiate an update. But sounds like a different bug from this one. Can you file a new bug and provide more details (e.g. linking to a must-gather, or giving steps to reproduce)?
No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.
> No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.

Well, no, that is not true: this is 100% a bug if the report is accurate. I have never seen this in CI myself.
Thanks W. Trevor, Andreas and Sam. Could it be that the etcd-bootstrap member was there since installation? I'm trying to get a must-gather tarball and I'll open a new BZ.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409