Bug 1856272

Summary: NodeControllerDegraded after upgrade to version 4.4.10
Product: OpenShift Container Platform
Reporter: mchebbi <mchebbi>
Component: Etcd Operator
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: low
Priority: low
Version: 4.4
CC: dmace, wlewis
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-12 19:45:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description mchebbi@redhat.com 2020-07-13 08:58:53 UTC
The customer upgraded his cluster to version 4.4.10, but he has still received the following alert 7 times in the last 2 days.

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"



--------------------------------------------------------------------------------------
 lastTimestamp: "2020-07-01T05:25:48Z"
  message: 'Status for clusteroperator/etcd changed: Degraded message changed from
    "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy
    members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc
    = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes
    are ready\nEtcdMembersDegraded: No unhealthy members found"'
  metadata:
    creationTimestamp: "2020-07-01T00:26:20Z"
    name: etcd-operator.161b801bdd566a58
    namespace: openshift-etcd-operator
    resourceVersion: "108066779"
    selfLink: /api/v1/namespaces/openshift-etcd-operator/events/etcd-operator.161b801bdd566a58
    uid: 9467eedf-2e02-4995-9106-4e0d84c15c83
  reason: OperatorStatusChanged
-------------------------------------------------------------------------------
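For reference, events like the one above can be listed straight from the cluster. A minimal sketch, assuming you are logged in with `oc` and have `jq` installed:

```shell
# List OperatorStatusChanged events in the etcd operator namespace,
# printing last-seen timestamp and message for each.
oc get events -n openshift-etcd-operator \
  --field-selector reason=OperatorStatusChanged -o json \
  | jq -r '.items[] | "\(.lastTimestamp)  \(.message)"'
```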

I have checked the etcd members and they were healthy.

================================================================================================
sh-4.2# etcdctl member list -w table
+------------------+---------+---------+-------------------------+-------------------------+
|        ID        | STATUS  |  NAME   |       PEER ADDRS        |      CLIENT ADDRS       |
+------------------+---------+---------+-------------------------+-------------------------+
| 49290302d1cf0689 | started | master1 | https://10.30.3.38:2380 | https://10.30.3.38:2379 |
| 6d76a590df68149f | started | master2 | https://10.30.3.39:2380 | https://10.30.3.39:2379 |
| bc6d62c4102e12b2 | started | master0 | https://10.30.3.37:2380 | https://10.30.3.37:2379 |
+------------------+---------+---------+-------------------------+-------------------------+

sh-4.2# etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.30.3.37:2379 | bc6d62c4102e12b2 |  3.3.18 |  209 MB |      true |       118 |  316201143 |
| https://10.30.3.38:2379 | 49290302d1cf0689 |  3.3.18 |  209 MB |     false |       118 |  316201143 |
| https://10.30.3.39:2379 | 6d76a590df68149f |  3.3.18 |  209 MB |     false |       118 |  316201143 |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+

sh-4.2# etcdctl endpoint health
https://10.30.3.37:2379 is healthy: successfully committed proposal: took = 9.707936ms
https://10.30.3.39:2379 is healthy: successfully committed proposal: took = 9.497324ms
https://10.30.3.38:2379 is healthy: successfully committed proposal: took = 9.972507ms

sh-4.2# etcdctl alarm list
sh-4.2# 

sh-4.2# etcdctl check perf
 60 / 60 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.097818s
PASS: Stddev is 0.003208s
PASS
===============================================================================

I have asked him to force a new revision on etcd, as described in bugzilla [1].

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Verify the nodes are at the latest revision:

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
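If `jq` is available, the per-node revisions can also be compared directly against the latest available revision reported by the operator. A sketch, using the `nodeStatuses` and `latestAvailableRevision` fields from the static pod operator status:

```shell
# Print each master's current etcd revision next to the latest
# available revision; all nodes should converge on the latest.
oc get etcd cluster -o json \
  | jq -r '.status.latestAvailableRevision as $latest
           | .status.nodeStatuses[]
           | "\(.nodeName): \(.currentRevision) (latest: \($latest))"'
```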


However, this did not fix it. Do you have any recommendation on how to fix this issue?

You can find the must-gather and other relevant information at this link: https://cutt.ly/vpOCWd2

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1832986

Comment 2 mchebbi@redhat.com 2020-07-13 12:58:55 UTC
Ok thanks.
waiting for your feedback.

Comment 4 Dan Mace 2020-08-11 19:32:47 UTC
Although I'm digging a little further to understand why that EtcdMemberIPMigratorDegraded status could be mis-reporting, I'm downgrading the priority and severity of the bug because there's no evidence of any functional problem here.

The event itself is emitted at a "Normal" severity, and according to the must-gather the cluster and etcd are healthy. There are no Prometheus alerts firing (that I know of). So far, at worst, this seems like some undesirable event spam.

Is there a functional issue here to justify a higher severity bug?

Comment 5 Dan Mace 2020-08-12 14:57:13 UTC
Another question: does the count of `OperatorStatusChanged` in the `openshift-etcd-operator` namespace matching this pattern continue to increase over time? Or does it stop at the given 22 events? The events I'm interested in have a message containing "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing".
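To count those events, something like the following should work (a sketch; assumes `jq` is installed):

```shell
# Count OperatorStatusChanged events whose message matches the
# EtcdMemberIPMigratorDegraded pattern Dan is asking about.
oc get events -n openshift-etcd-operator \
  --field-selector reason=OperatorStatusChanged -o json \
  | jq '[.items[]
         | select(.message
                  | contains("EtcdMemberIPMigratorDegraded: rpc error"))]
        | length'
```

Running this periodically would show whether the count keeps climbing past 22 or has plateaued.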

Comment 6 Dan Mace 2020-08-12 19:45:47 UTC
I've learned this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1851351 — we can move the discussion over there.

*** This bug has been marked as a duplicate of bug 1851351 ***

Comment 7 Red Hat Bugzilla 2023-09-14 06:03:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days