Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1856272

Summary:	NodeControllerDegraded after upgrade to version 4.4.10
Product:	OpenShift Container Platform	Reporter:	mchebbi <mchebbi>
Component:	Etcd Operator	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED DUPLICATE	QA Contact:	ge liu <geliu>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.4	CC:	dmace, wlewis
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-12 19:45:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description mchebbi@redhat.com 2020-07-13 08:58:53 UTC

The customer upgraded his cluster to version 4.4.10, but he is still receiving the following alert (7 times in the last 2 days).

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"



--------------------------------------------------------------------------------------
 lastTimestamp: "2020-07-01T05:25:48Z"
  message: 'Status for clusteroperator/etcd changed: Degraded message changed from
    "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy
    members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc
    = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes
    are ready\nEtcdMembersDegraded: No unhealthy members found"'
  metadata:
    creationTimestamp: "2020-07-01T00:26:20Z"
    name: etcd-operator.161b801bdd566a58
    namespace: openshift-etcd-operator
    resourceVersion: "108066779"
    selfLink: /api/v1/namespaces/openshift-etcd-operator/events/etcd-operator.161b801bdd566a58
    uid: 9467eedf-2e02-4995-9106-4e0d84c15c83
  reason: OperatorStatusChanged
-------------------------------------------------------------------------------

I have checked the etcd members and they are healthy.

================================================================================================
sh-4.2# etcdctl member list -w table
+------------------+---------+---------+-------------------------+-------------------------+
|        ID        | STATUS  |  NAME   |       PEER ADDRS        |      CLIENT ADDRS       |
+------------------+---------+---------+-------------------------+-------------------------+
| 49290302d1cf0689 | started | master1 | https://10.30.3.38:2380 | https://10.30.3.38:2379 |
| 6d76a590df68149f | started | master2 | https://10.30.3.39:2380 | https://10.30.3.39:2379 |
| bc6d62c4102e12b2 | started | master0 | https://10.30.3.37:2380 | https://10.30.3.37:2379 |
+------------------+---------+---------+-------------------------+-------------------------+

sh-4.2# etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.30.3.37:2379 | bc6d62c4102e12b2 |  3.3.18 |  209 MB |      true |       118 |  316201143 |
| https://10.30.3.38:2379 | 49290302d1cf0689 |  3.3.18 |  209 MB |     false |       118 |  316201143 |
| https://10.30.3.39:2379 | 6d76a590df68149f |  3.3.18 |  209 MB |     false |       118 |  316201143 |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+

sh-4.2# etcdctl endpoint health
https://10.30.3.37:2379 is healthy: successfully committed proposal: took = 9.707936ms
https://10.30.3.39:2379 is healthy: successfully committed proposal: took = 9.497324ms
https://10.30.3.38:2379 is healthy: successfully committed proposal: took = 9.972507ms

sh-4.2# etcdctl alarm list
sh-4.2# 

sh-4.2# etcdctl check perf
 60 / 60 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.097818s
PASS: Stddev is 0.003208s
PASS
===============================================================================
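
The health check above can also be scripted. The following is a minimal sketch and not part of the original report: `check_endpoint_health` is a hypothetical helper name, and it assumes each endpoint is reported on one line containing " is healthy: ", as in the output shown above. Any other non-empty line is counted as not healthy.

```shell
# Hypothetical helper: reads `etcdctl endpoint health` output on stdin,
# prints a one-line summary, and returns non-zero if any endpoint line
# is not reported as healthy (or if no endpoint was seen at all).
check_endpoint_health() {
  awk '
    NF == 0          { next }                      # skip blank lines
    / is healthy: /  { ok++; next }                # format taken from the output above
                     { bad++; print "not healthy: " $0 }
    END {
      printf "healthy=%d other=%d\n", ok, bad
      exit (bad > 0 || ok == 0)
    }'
}

# Possible usage inside the etcd container (assumption: some etcdctl
# versions write the health lines to stderr, hence the 2>&1):
#   etcdctl endpoint health 2>&1 | check_endpoint_health
```

A non-zero exit status makes the helper usable in a loop or a periodic check without parsing the summary line.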

I have asked him to force a new etcd revision, as described in this Bugzilla [1].

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Verify the nodes are at the latest revision:

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'


However, this did not fix it. Do you have any recommendation on how to fix this issue?

You can find the must-gather and other relevant information at this link: https://cutt.ly/vpOCWd2

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1832986

Comment 2 mchebbi@redhat.com 2020-07-13 12:58:55 UTC

OK, thanks.
Waiting for your feedback.

Comment 4 Dan Mace 2020-08-11 19:32:47 UTC

Although I'm digging a little further to understand why that EtcdMemberIPMigratorDegraded status could be mis-reporting, I'm downgrading the priority and severity of the bug because there's no evidence of any functional problem here.

The event itself is emitted at a "Normal" severity, and according to the must-gather, the cluster and etcd are healthy. There are no Prometheus alerts firing (that I know of). So far, at worst, this seems like some undesirable event spam.

Is there a functional issue here to justify a higher severity bug?

Comment 5 Dan Mace 2020-08-12 14:57:13 UTC

Another question: does the count of `OperatorStatusChanged` events in the `openshift-etcd-operator` namespace matching this pattern continue to increase over time? Or does it stop at the given 22 events? The events I'm interested in have a message containing "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing".
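
One way to answer this is to tally the matching events at two points in time and compare. The following is a minimal sketch and not part of the original comment: `summarize_migrator_events` is a hypothetical helper name, and it assumes each input line starts with the event's `count` field followed by its message, with one event per line.

```shell
# Hypothetical helper: reads "COUNT MESSAGE" lines on stdin and prints how
# many events match the pattern and the sum of their count fields.
summarize_migrator_events() {
  awk -v pat='EtcdMemberIPMigratorDegraded: rpc error: code = Canceled' '
    index($0, pat) { events++; total += $1 }   # a non-numeric count adds 0
    END { printf "events=%d occurrences=%d\n", events, total }'
}

# Possible usage against a cluster (assumption: the event objects carry a
# populated .count field; run it again later and compare the numbers):
#   oc get events -n openshift-etcd-operator \
#     --field-selector reason=OperatorStatusChanged --no-headers \
#     -o custom-columns=COUNT:.count,MESSAGE:.message | summarize_migrator_events
```

If both numbers stay the same between two runs, the event is no longer being emitted. A growing `occurrences` value with a constant `events` value would suggest that an existing event is still being re-emitted and aggregated.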

Comment 6 Dan Mace 2020-08-12 19:45:47 UTC

I've learned this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1851351 — we can move the discussion over there.

*** This bug has been marked as a duplicate of bug 1851351 ***

Comment 7 Red Hat Bugzilla 2023-09-14 06:03:47 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days