Bug 1856272 - NodeControllerDegraded after upgrade to version 4.4.10
Summary: NodeControllerDegraded after upgrade to version 4.4.10
Keywords:
Status: CLOSED DUPLICATE of bug 1851351
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-13 08:58 UTC by mchebbi@redhat.com
Modified: 2023-09-14 06:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-12 19:45:47 UTC
Target Upstream Version:
Embargoed:



Description mchebbi@redhat.com 2020-07-13 08:58:53 UTC
The customer upgraded his cluster to version 4.4.10 but is still receiving the following alert; it has fired 7 times in the last 2 days.

Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"



--------------------------------------------------------------------------------------
 lastTimestamp: "2020-07-01T05:25:48Z"
  message: 'Status for clusteroperator/etcd changed: Degraded message changed from
    "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy
    members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc
    = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes
    are ready\nEtcdMembersDegraded: No unhealthy members found"'
  metadata:
    creationTimestamp: "2020-07-01T00:26:20Z"
    name: etcd-operator.161b801bdd566a58
    namespace: openshift-etcd-operator
    resourceVersion: "108066779"
    selfLink: /api/v1/namespaces/openshift-etcd-operator/events/etcd-operator.161b801bdd566a58
    uid: 9467eedf-2e02-4995-9106-4e0d84c15c83
  reason: OperatorStatusChanged
-------------------------------------------------------------------------------

I have checked the etcd members and they are healthy.

================================================================================================
sh-4.2# etcdctl member list -w table
+------------------+---------+---------+-------------------------+-------------------------+
|        ID        | STATUS  |  NAME   |       PEER ADDRS        |      CLIENT ADDRS       |
+------------------+---------+---------+-------------------------+-------------------------+
| 49290302d1cf0689 | started | master1 | https://10.30.3.38:2380 | https://10.30.3.38:2379 |
| 6d76a590df68149f | started | master2 | https://10.30.3.39:2380 | https://10.30.3.39:2379 |
| bc6d62c4102e12b2 | started | master0 | https://10.30.3.37:2380 | https://10.30.3.37:2379 |
+------------------+---------+---------+-------------------------+-------------------------+

sh-4.2# etcdctl endpoint status -w table
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.30.3.37:2379 | bc6d62c4102e12b2 |  3.3.18 |  209 MB |      true |       118 |  316201143 |
| https://10.30.3.38:2379 | 49290302d1cf0689 |  3.3.18 |  209 MB |     false |       118 |  316201143 |
| https://10.30.3.39:2379 | 6d76a590df68149f |  3.3.18 |  209 MB |     false |       118 |  316201143 |
+-------------------------+------------------+---------+---------+-----------+-----------+------------+

sh-4.2# etcdctl endpoint health
https://10.30.3.37:2379 is healthy: successfully committed proposal: took = 9.707936ms
https://10.30.3.39:2379 is healthy: successfully committed proposal: took = 9.497324ms
https://10.30.3.38:2379 is healthy: successfully committed proposal: took = 9.972507ms

sh-4.2# etcdctl alarm list
sh-4.2# 

sh-4.2# etcdctl check perf
 60 / 60 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.097818s
PASS: Stddev is 0.003208s
PASS
===============================================================================

I asked him to force a new etcd revision, as described in this Bugzilla [1]:

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Verify the nodes are at the latest revision:

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'


However, this did not fix it. Do you have any recommendation on how to fix this issue?

You can find the must-gather and other relevant information at this link: https://cutt.ly/vpOCWd2

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1832986

Comment 2 mchebbi@redhat.com 2020-07-13 12:58:55 UTC
OK, thanks.
Waiting for your feedback.

Comment 4 Dan Mace 2020-08-11 19:32:47 UTC
Although I'm digging a little further to understand why that EtcdMemberIPMigratorDegraded status could be mis-reporting, I'm downgrading the priority and severity of the bug because there's no evidence of any functional problem here.

The event itself is emitted at a "Normal" severity, and according to the must-gather the cluster and etcd are healthy. There are no Prometheus alerts firing (that I know of). So far, at worst, this seems like some undesirable event spam.

Is there a functional issue here to justify a higher severity bug?

Comment 5 Dan Mace 2020-08-12 14:57:13 UTC
Another question: does the count of `OperatorStatusChanged` in the `openshift-etcd-operator` namespace matching this pattern continue to increase over time? Or does it stop at the given 22 events? The events I'm interested in have a message containing "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing".
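
One way to answer this is to sum the `count` field of the matching events, then re-run the same check later and compare the totals. The following is a minimal sketch, not a procedure verified against this cluster: the helper name `summarize_events` and its three-column input layout (count, last timestamp, message) are assumptions made for illustration, and the pattern string is copied from the message quoted above.

```shell
#!/bin/sh
# Pattern copied verbatim from the event message quoted in this bug.
PATTERN='EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing'

# summarize_events (hypothetical helper): reads lines of the form
#   <count> <lastTimestamp> <message...>
# on stdin and prints how many event objects contain PATTERN and the sum of
# their counts. A non-numeric count (for example "<none>") is treated as 1.
summarize_events() {
  awk -v pat="$PATTERN" '
    index($0, pat) > 0 {
      objects++
      total += ($1 ~ /^[0-9]+$/ ? $1 : 1)
    }
    END { printf "objects=%d total=%d\n", objects, total }
  '
}

# Assumed invocation on a live cluster (left commented out so that the
# snippet can be sourced without cluster access):
#
#   oc get events -n openshift-etcd-operator \
#     --field-selector reason=OperatorStatusChanged --no-headers \
#     -o custom-columns=COUNT:.count,LAST:.lastTimestamp,MESSAGE:.message \
#     | summarize_events
```

If `total` is unchanged between two runs some hours apart, the events have stopped. If it keeps growing, the operator is still emitting them. Note that events expire from the API after their retention period, so a falling total does not by itself show that the condition has cleared.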

Comment 6 Dan Mace 2020-08-12 19:45:47 UTC
I've learned this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1851351 — we can move the discussion over there.

*** This bug has been marked as a duplicate of bug 1851351 ***

Comment 7 Red Hat Bugzilla 2023-09-14 06:03:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
