The customer upgraded their cluster to version 4.4.10, but they have still received the following alert 7 times in the last 2 days:

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"

--------------------------------------------------------------------------------------
lastTimestamp: "2020-07-01T05:25:48Z"
message: 'Status for clusteroperator/etcd changed: Degraded message changed from
  "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy
  members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc
  = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes
  are ready\nEtcdMembersDegraded: No unhealthy members found"'
metadata:
  creationTimestamp: "2020-07-01T00:26:20Z"
  name: etcd-operator.161b801bdd566a58
  namespace: openshift-etcd-operator
  resourceVersion: "108066779"
  selfLink: /api/v1/namespaces/openshift-etcd-operator/events/etcd-operator.161b801bdd566a58
  uid: 9467eedf-2e02-4995-9106-4e0d84c15c83
reason: OperatorStatusChanged
--------------------------------------------------------------------------------------

I have checked the etcd members and they are healthy:
================================================================================================
sh-4.2# etcdctl member list -w table
+------------------+---------+---------+-------------------------+-------------------------+
|        ID        | STATUS  |  NAME   |       PEER ADDRS        |      CLIENT ADDRS       |
+------------------+---------+---------+-------------------------+-------------------------+
| 49290302d1cf0689 | started | master1 | https://10.30.3.38:2380 | https://10.30.3.38:2379 |
| 6d76a590df68149f | started | master2 | https://10.30.3.39:2380 | https://10.30.3.39:2379 |
| bc6d62c4102e12b2 | started | master0 | https://10.30.3.37:2380 | https://10.30.3.37:2379 |
+------------------+---------+---------+-------------------------+-------------------------+

sh-4.2# etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.30.3.37:2379 | bc6d62c4102e12b2 |  3.3.18 |  209 MB |      true |       118 |  316201143 |
| https://10.30.3.38:2379 | 49290302d1cf0689 |  3.3.18 |  209 MB |     false |       118 |  316201143 |
| https://10.30.3.39:2379 | 6d76a590df68149f |  3.3.18 |  209 MB |     false |       118 |  316201143 |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+

sh-4.2# etcdctl endpoint health
https://10.30.3.37:2379 is healthy: successfully committed proposal: took = 9.707936ms
https://10.30.3.39:2379 is healthy: successfully committed proposal: took = 9.497324ms
https://10.30.3.38:2379 is healthy: successfully committed proposal: took = 9.972507ms

sh-4.2# etcdctl alarm list
sh-4.2#

sh-4.2# etcdctl check perf
 60 / 60 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.097818s
PASS: Stddev is 0.003208s
PASS
===============================================================================

I asked the customer to force a new revision on etcd, as described in this Bugzilla [1]:

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

and then to verify the nodes are at the latest revision:

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

but this did not fix the issue. Do you have any recommendations on how to fix it? You can find the must-gather and other relevant information at this link: https://cutt.ly/vpOCWd2

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1832986
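In case it is useful, the etcd-operator logs can also be checked for the gRPC cancellation that shows up in the Degraded message. This is a sketch of the command; it assumes the default etcd-operator deployment name in the openshift-etcd-operator namespace:

# Pull the last 48h of etcd-operator logs and look for the reported gRPC error
$ oc -n openshift-etcd-operator logs deployment/etcd-operator --since=48h \
    | grep "the client connection is closing"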
OK, thanks. Waiting for your feedback.
Although I'm digging a little further to understand why that EtcdMemberIPMigratorDegraded status could be mis-reporting, I'm downgrading the priority and severity of this bug because there's no evidence of any functional problem here. The event itself is emitted at a "Normal" severity, and according to the must-gather the cluster and etcd are healthy. There are no Prometheus alerts firing (that I know of). So far, at worst, this seems like some undesirable event spam. Is there a functional issue here that would justify a higher-severity bug?
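For anyone who wants to double-check the alert situation on the live cluster, the firing alerts can be listed against the Alertmanager API with something like the following. This is a sketch; it assumes the default alertmanager-main route in openshift-monitoring and a logged-in oc session with monitoring access:

# List the alertname of every currently firing alert via the Alertmanager v2 API
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v2/alerts" | jq -r '.[].labels.alertname'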
Another question: does the count of `OperatorStatusChanged` events in the `openshift-etcd-operator` namespace matching this pattern continue to increase over time, or does it stop at the given 22 events? The events I'm interested in have a message containing "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing". A query that should answer this is sketched below.
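Something like this can count the matching events (a sketch; the message substring filter is just the text quoted above, and each event object may aggregate repeats in its `count` field):

# Count OperatorStatusChanged events whose message mentions the gRPC cancellation
$ oc -n openshift-etcd-operator get events --field-selector reason=OperatorStatusChanged -o json \
    | jq '[.items[] | select(.message | contains("EtcdMemberIPMigratorDegraded: rpc error: code = Canceled"))] | length'

Re-running the same query a few hours apart should show whether the count keeps growing or has plateaued.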
I've learned this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1851351 — we can move the discussion over there. *** This bug has been marked as a duplicate of bug 1851351 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days