Bug 1990041 - [sig-arch] events should not repeat pathologically - event happened more than 20 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/UnhealthyEtcdMember
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.9.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
Reported: 2021-08-04 15:41 UTC by Jan Chaloupka
Modified: 2021-08-18 19:14 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-18 19:14:10 UTC
Target Upstream Version:
Embargoed:

Links:
Github: openshift origin pull 26409 (Status: None, Last Updated: 2021-08-18 19:03:09 UTC)

Description Jan Chaloupka 2021-08-04 15:41:43 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422826822621466624:

```
[sig-arch] events should not repeat pathologically
1 events happened too frequently

event happened 25 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/UnhealthyEtcdMember unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422826822621466624/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json:

```
        {
            "apiVersion": "v1",
            "count": 25,
            "eventTime": null,
            "firstTimestamp": "2021-08-04T08:41:56Z",
            "involvedObject": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "etcd-operator",
                "namespace": "openshift-etcd-operator",
                "uid": "293339f4-ba33-430e-a5eb-f4a3b2dd3575"
            },
            "kind": "Event",
            "lastTimestamp": "2021-08-04T09:54:48Z",
            "message": "unhealthy members: ip-10-0-250-95.us-west-1.compute.internal",
            "metadata": {
                "creationTimestamp": "2021-08-04T08:41:56Z",
                "name": "etcd-operator.16980ddefe68fadd",
                "namespace": "openshift-etcd-operator",
                "resourceVersion": "82387",
                "uid": "2c78dbb5-71b5-404c-961c-1902f360e23a"
            },
            "reason": "UnhealthyEtcdMember",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "openshift-cluster-etcd-operator-etcd-client"
            },
            "type": "Warning"
        },

```

The incident occurred between 2021-08-04T08:41:56Z and 2021-08-04T09:54:48Z.
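
For anyone re-checking such a dump by hand, here is a minimal sketch (not the origin test itself) that scans a gather-extra events.json file like the one above and prints every event that repeated more than 20 times. It assumes the file is a JSON object with an "items" array, as in the excerpt.

```go
// pathological-events: scan a gather-extra events.json dump for events that
// repeated more than 20 times. Standalone sketch, not the origin test code.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type event struct {
	Count          int    `json:"count"`
	Reason         string `json:"reason"`
	Message        string `json:"message"`
	FirstTimestamp string `json:"firstTimestamp"`
	LastTimestamp  string `json:"lastTimestamp"`
	InvolvedObject struct {
		Kind      string `json:"kind"`
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"involvedObject"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pathological-events <events.json>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Assumes the dump is a list object with an "items" array of events.
	var list struct {
		Items []event `json:"items"`
	}
	if err := json.Unmarshal(data, &list); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range list.Items {
		if e.Count > 20 {
			fmt.Printf("event happened %d times: ns/%s %s/%s - reason/%s %s (%s .. %s)\n",
				e.Count, e.InvolvedObject.Namespace, e.InvolvedObject.Kind,
				e.InvolvedObject.Name, e.Reason, e.Message,
				e.FirstTimestamp, e.LastTimestamp)
		}
	}
}
```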

Checking the Loki logs from https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221628028000000%22,%221628114399000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade%2F1422826822621466624%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-etcd-operator%5C%22%20%7C%20container_name%3D%5C%22etcd-operator%5C%22%22,%22instant%22:false,%22range%22:true,%22hide%22:false%7D%5D (the URL encodes a LogQL query over the etcd-operator container logs in the openshift-etcd-operator namespace for this job invocation):

```
2021-08-04 10:41:56
I0804 08:41:56.696750       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
...
2021-08-04 11:52:39
I0804 09:52:38.938714       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
...
2021-08-04 11:52:41
E0804 09:52:41.393702       1 base_controller.go:266] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy
2021-08-04 11:49:41

I0804 09:49:41.234408       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy"

2021-08-04 11:49:26
I0804 09:49:26.138075       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-04T08:11:56Z","message":"NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-08-04T08:42:48Z","message":"NodeInstallerProgressing: 3 nodes are at revision 4\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-08-04T08:11:09Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-04T08:08:54Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

2021-08-04 11:48:44
I0804 09:48:44.580577       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal

2021-08-04 11:23:22
I0804 09:23:22.741757       1 defragcontroller.go:189] etcd member "ip-10-0-187-35.us-west-1.compute.internal" backend store fragmented: 6.29 %, dbSize: 115445760
2021-08-04 11:23:22
I0804 09:23:22.741752       1 defragcontroller.go:189] etcd member "ip-10-0-128-86.us-west-1.compute.internal" backend store fragmented: 5.64 %, dbSize: 114696192
2021-08-04 11:23:22
I0804 09:23:22.741733       1 defragcontroller.go:189] etcd member "ip-10-0-250-95.us-west-1.compute.internal" backend store fragmented: 5.61 %, dbSize: 114618368

2021-08-04 10:56:22
I0804 08:56:22.777603       1 defragcontroller.go:189] etcd member "ip-10-0-187-35.us-west-1.compute.internal" backend store fragmented: 42.24 %, dbSize: 115445760
2021-08-04 10:56:22
I0804 08:56:22.777595       1 defragcontroller.go:189] etcd member "ip-10-0-128-86.us-west-1.compute.internal" backend store fragmented: 41.85 %, dbSize: 114696192
2021-08-04 10:56:22
I0804 08:56:22.777567       1 defragcontroller.go:189] etcd member "ip-10-0-250-95.us-west-1.compute.internal" backend store fragmented: 41.80 %, dbSize: 114618368

2021-08-04 10:46:23
I0804 08:46:22.927737       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdLeaderChangeMetrics' Detected leader change increase of 1.25 over 5 minutes on "AWS"; disk metrics are: loki-promtail-t757w=NaN,loki-promtail-k664z=NaN,loki-promtail-tqmxs=NaN,loki-promtail-d48lg=NaN,loki-promtail-h9z8m=NaN,loki-promtail-95tkw=NaN,etcd-ip-10-0-128-86.us-west-1.compute.internal=0.006495000000000001,etcd-ip-10-0-187-35.us-west-1.compute.internal=0.01728000000000013,etcd-ip-10-0-250-95.us-west-1.compute.internal=0.003990000000000001
2021-08-04 10:43:22
I0804 08:43:22.658070       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available"

2021-08-04 10:43:22
I0804 08:43:22.645010       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available"

2021-08-04 10:42:49
I0804 08:42:49.015798       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 4\nEtcdMembersProgressing: No unstarted etcd members found"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 3; 2 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy"
```

The logs report that the ip-10-0-250-95.us-west-1.compute.internal instance is unhealthy. Checking the kubelet logs from the time of the incident:

```
Aug 04 08:43:21.282671 ip-10-0-250-95 hyperkube[1621]: I0804 08:43:21.282673    1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
Aug 04 08:44:21.283595 ip-10-0-250-95 hyperkube[1621]: I0804 08:44:21.283568    1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
...
Aug 04 09:47:21.331877 ip-10-0-250-95 hyperkube[1621]: I0804 09:47:21.331893    1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
Aug 04 09:48:21.332218 ip-10-0-250-95 hyperkube[1621]: I0804 09:48:21.332229    1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
```

There is no "Probe failed" line in the kubelet logs between 08:43:21.282671 and 09:47:21.331877. Unfortunately, there are no logs from the etcd-ip-10-0-250-95.us-west-1.compute.internal pod itself in Loki.

Based on the implementation at https://github.com/openshift/cluster-etcd-operator/blob/c2226ad8bae2a24f5aec070684d3f6854582f322/pkg/etcdcli/etcdcli.go#L312-L339, where the UnhealthyEtcdMember reason is defined, the etcd container may be running properly (probes not failing) while the etcd member itself is still unhealthy.
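
As a simplified illustration of that idea (this is not the operator's actual code; the real logic is at the linked etcdcli.go lines, and memberHealthy below is a hypothetical helper): the operator asks each member to serve a request through the etcd client API, independently of the kubelet's container probes, so a Running container can still be reported as an unhealthy member.

```go
// Simplified illustration of an etcd member health check, not the operator's code:
// a member counts as "healthy" only if it can actually serve a client request,
// which is a stricter check than the kubelet's container probes.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// memberHealthy is a hypothetical helper: dial one member's client URL and try a
// short read. The operator's real check also treats certain error codes as healthy
// and uses the cluster's TLS certificates; both are omitted here for brevity.
func memberHealthy(ctx context.Context, clientURL string) bool {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{clientURL},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return false
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	_, err = cli.Get(ctx, "health")
	return err == nil
}

func main() {
	// Illustrative endpoint; the operator discovers endpoints from the member list.
	endpoints := []string{"http://127.0.0.1:2379"}
	for _, url := range endpoints {
		if !memberHealthy(context.Background(), url) {
			fmt.Printf("unhealthy members: %s\n", url)
		}
	}
}
```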

Checking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360:
```
        {
            "apiVersion": "v1",
            "count": 25,
            "eventTime": null,
            "firstTimestamp": "2021-08-04T12:56:13Z",
            "involvedObject": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "etcd-operator",
                "namespace": "openshift-etcd-operator",
                "uid": "09374f05-acc1-4b12-883b-8ca500c6e6bc"
            },
            "kind": "Event",
            "lastTimestamp": "2021-08-04T14:06:33Z",
            "message": "unhealthy members: ip-10-0-212-15.us-west-1.compute.internal",
            "metadata": {
                "creationTimestamp": "2021-08-04T12:56:13Z",
                "name": "etcd-operator.16981bbf4b52a910",
                "namespace": "openshift-etcd-operator",
                "resourceVersion": "81661",
                "uid": "9886dd81-a3d2-436f-b5b9-38097ce2feb6"
            },
            "reason": "UnhealthyEtcdMember",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "openshift-cluster-etcd-operator-etcd-client"
            },
            "type": "Warning"
        },
        {
            "apiVersion": "v1",
            "count": 22,
            "eventTime": null,
            "firstTimestamp": "2021-08-04T12:57:38Z",
            "involvedObject": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "etcd-operator",
                "namespace": "openshift-etcd-operator",
                "uid": "09374f05-acc1-4b12-883b-8ca500c6e6bc"
            },
            "kind": "Event",
            "lastTimestamp": "2021-08-04T14:15:21Z",
            "message": "unhealthy members: ip-10-0-139-118.us-west-1.compute.internal",
            "metadata": {
                "creationTimestamp": "2021-08-04T12:57:38Z",
                "name": "etcd-operator.16981bd321d3dfdb",
                "namespace": "openshift-etcd-operator",
                "resourceVersion": "89631",
                "uid": "079f5b63-2fb4-4397-a00c-14050e3632d6"
            },
            "reason": "UnhealthyEtcdMember",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "openshift-cluster-etcd-operator-etcd-client"
            },
            "type": "Warning"
        },

```

ip-10-0-212-15.us-west-1.compute.internal <2021-08-04T12:56:13Z; 2021-08-04T14:06:33Z>
ip-10-0-139-118.us-west-1.compute.internal <2021-08-04T12:57:38Z; 2021-08-04T14:15:21Z>

Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes/ip-10-0-212-15.us-west-1.compute.internal/ over the <2021-08-04T12:56:13Z; 2021-08-04T14:06:33Z> interval, the etcd pod is Running. There are no Loki logs for the etcd pods either.

Checking the etcd logs provided in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ip-10-0-212-15.us-west-1.compute.internal_etcd.log, one can see:
```
{"level":"warn","ts":"2021-08-04T14:09:44.247Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:09:44.248Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d47948009c5bb3e4","rtt":"1.908424ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"info","ts":"2021-08-04T14:09:44.699Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":80099}
{"level":"info","ts":"2021-08-04T14:09:44.994Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":80099,"took":"287.323286ms"}
{"level":"warn","ts":"2021-08-04T14:09:49.247Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:09:49.248Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-
...
{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d47948009c5bb3e4","rtt":"1.908424ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
```

This may be what was happening at the time of the incidents.

Other jobs from https://search.ci.openshift.org/?search=events+should+not+repeat+pathologically&maxAge=48h&context=3&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job (searching for the UnhealthyEtcdMember keyword) report the same issue. The biggest offenders:
- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade
- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade
- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade

This happens more often during upgrade jobs than non-upgrade jobs.

Comment 1 Stephen Benjamin 2021-08-18 19:14:10 UTC
So far we've only seen this on upgrades, and it's expected -- this was added to the skip cases in https://github.com/openshift/origin/commit/de8df4de72a40bc6bc561dc9562d49d50ec45e54
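
For illustration only (this is not the code added in the linked commit), the skip mechanism can be thought of as an allow-list of event patterns that are permitted to repeat; a hypothetical matcher for this particular event key might look like:

```go
// Hypothetical sketch of an allow-list entry for a repeating event, illustrating
// the idea behind the skip case; it is not the code from the linked origin commit.
package allowedevents

import "regexp"

var allowedUnhealthyEtcdMember = regexp.MustCompile(
	`ns/openshift-etcd-operator deployment/etcd-operator - reason/UnhealthyEtcdMember unhealthy members: .+`)

// isAllowedToRepeat reports whether a repeating event key matches the allow-list.
func isAllowedToRepeat(eventKey string) bool {
	return allowedUnhealthyEtcdMember.MatchString(eventKey)
}
```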

