From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422826822621466624:

```
[sig-arch] events should not repeat pathologically

1 events happened too frequently

event happened 25 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/UnhealthyEtcdMember unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422826822621466624/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json:

```
{
  "apiVersion": "v1",
  "count": 25,
  "eventTime": null,
  "firstTimestamp": "2021-08-04T08:41:56Z",
  "involvedObject": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "name": "etcd-operator",
    "namespace": "openshift-etcd-operator",
    "uid": "293339f4-ba33-430e-a5eb-f4a3b2dd3575"
  },
  "kind": "Event",
  "lastTimestamp": "2021-08-04T09:54:48Z",
  "message": "unhealthy members: ip-10-0-250-95.us-west-1.compute.internal",
  "metadata": {
    "creationTimestamp": "2021-08-04T08:41:56Z",
    "name": "etcd-operator.16980ddefe68fadd",
    "namespace": "openshift-etcd-operator",
    "resourceVersion": "82387",
    "uid": "2c78dbb5-71b5-404c-961c-1902f360e23a"
  },
  "reason": "UnhealthyEtcdMember",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "openshift-cluster-etcd-operator-etcd-client"
  },
  "type": "Warning"
},
```

The incident happened between 2021-08-04T08:41:56Z and 2021-08-04T09:54:48Z.
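The test behind this failure flags events whose `count` crosses a repetition threshold. A minimal Python sketch of that filter, using the event above as sample data (the helper name and the threshold of 20 are assumptions here, not the exact openshift/origin implementation):

```python
# Threshold assumed from the test output ("event happened 25 times, something
# is wrong"); the exact constant in openshift/origin may differ.
DUPLICATE_EVENT_THRESHOLD = 20

def pathological_events(events, threshold=DUPLICATE_EVENT_THRESHOLD):
    """Return descriptions of events that repeated more than `threshold` times."""
    hits = []
    for ev in events:
        if ev.get("count", 0) > threshold:
            obj = ev["involvedObject"]
            hits.append(
                f'ns/{obj["namespace"]} {obj["kind"].lower()}/{obj["name"]}'
                f' - reason/{ev["reason"]} {ev["message"]} (x{ev["count"]})'
            )
    return hits

# The UnhealthyEtcdMember event from events.json above, abbreviated:
sample = {
    "count": 25,
    "involvedObject": {
        "kind": "Deployment",
        "name": "etcd-operator",
        "namespace": "openshift-etcd-operator",
    },
    "reason": "UnhealthyEtcdMember",
    "message": "unhealthy members: ip-10-0-250-95.us-west-1.compute.internal",
}
print(pathological_events([sample])[0])
```

With `count: 25` the sample event crosses the threshold, matching the junit failure text.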
Checking the Loki logs from https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221628028000000%22,%221628114399000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade%2F1422826822621466624%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-etcd-operator%5C%22%20%7C%20container_name%3D%5C%22etcd-operator%5C%22%22,%22instant%22:false,%22range%22:true,%22hide%22:false%7D%5D (the query decodes to `{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422826822621466624"} | unpack | namespace="openshift-etcd-operator" | container_name="etcd-operator"`):

```
2021-08-04 10:41:56 I0804 08:41:56.696750 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
...
2021-08-04 11:52:39 I0804 09:52:38.938714 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
...
2021-08-04 11:52:41 E0804 09:52:41.393702 1 base_controller.go:266] DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy
2021-08-04 11:49:41 I0804 09:49:41.234408 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy"
2021-08-04 11:49:26 I0804 09:49:26.138075 1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-04T08:11:56Z","message":"NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-250-95.us-west-1.compute.internal\" not ready since 2021-08-04 09:49:26 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-08-04T08:42:48Z","message":"NodeInstallerProgressing: 3 nodes are at revision 4\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-08-04T08:11:09Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-04T08:08:54Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2021-08-04 11:48:44 I0804 09:48:44.580577 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-250-95.us-west-1.compute.internal
2021-08-04 11:23:22 I0804 09:23:22.741757 1 defragcontroller.go:189] etcd member "ip-10-0-187-35.us-west-1.compute.internal" backend store fragmented: 6.29 %, dbSize: 115445760
2021-08-04 11:23:22 I0804 09:23:22.741752 1 defragcontroller.go:189] etcd member "ip-10-0-128-86.us-west-1.compute.internal" backend store fragmented: 5.64 %, dbSize: 114696192
2021-08-04 11:23:22 I0804 09:23:22.741733 1 defragcontroller.go:189] etcd member "ip-10-0-250-95.us-west-1.compute.internal" backend store fragmented: 5.61 %, dbSize: 114618368
2021-08-04 10:56:22 I0804 08:56:22.777603 1 defragcontroller.go:189] etcd member "ip-10-0-187-35.us-west-1.compute.internal" backend store fragmented: 42.24 %, dbSize: 115445760
2021-08-04 10:56:22 I0804 08:56:22.777595 1 defragcontroller.go:189] etcd member "ip-10-0-128-86.us-west-1.compute.internal" backend store fragmented: 41.85 %, dbSize: 114696192
2021-08-04 10:56:22 I0804 08:56:22.777567 1 defragcontroller.go:189] etcd member "ip-10-0-250-95.us-west-1.compute.internal" backend store fragmented: 41.80 %, dbSize: 114618368
2021-08-04 10:46:23 I0804 08:46:22.927737 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdLeaderChangeMetrics' Detected leader change increase of 1.25 over 5 minutes on "AWS"; disk metrics are: loki-promtail-t757w=NaN,loki-promtail-k664z=NaN,loki-promtail-tqmxs=NaN,loki-promtail-d48lg=NaN,loki-promtail-h9z8m=NaN,loki-promtail-95tkw=NaN,etcd-ip-10-0-128-86.us-west-1.compute.internal=0.006495000000000001,etcd-ip-10-0-187-35.us-west-1.compute.internal=0.01728000000000013,etcd-ip-10-0-250-95.us-west-1.compute.internal=0.003990000000000001
2021-08-04 10:43:22 I0804 08:43:22.658070 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available"
2021-08-04 10:43:22 I0804 08:43:22.645010 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 3 members are available"
2021-08-04 10:42:49 I0804 08:42:49.015798 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"293339f4-ba33-430e-a5eb-f4a3b2dd3575", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 4\nEtcdMembersProgressing: No unstarted etcd members found"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 3; 2 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4\nEtcdMembersAvailable: 2 of 3 members are available, ip-10-0-250-95.us-west-1.compute.internal is unhealthy"
```

The logs report the ip-10-0-250-95.us-west-1.compute.internal instance as unhealthy. Checking the kubelet logs from the time of the incident:

```
Aug 04 08:43:21.282671 ip-10-0-250-95 hyperkube[1621]: I0804 08:43:21.282673 1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
Aug 04 08:44:21.283595 ip-10-0-250-95 hyperkube[1621]: I0804 08:44:21.283568 1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
...
Aug 04 09:47:21.331877 ip-10-0-250-95 hyperkube[1621]: I0804 09:47:21.331893 1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
Aug 04 09:48:21.332218 ip-10-0-250-95 hyperkube[1621]: I0804 09:48:21.332229 1621 kubelet_getters.go:176] "Pod status updated" pod="openshift-etcd/etcd-ip-10-0-250-95.us-west-1.compute.internal" status=Running
```

There is no "Probe failed" line in the logs between 08:43:21.282671 and 09:47:21.331877. Unfortunately, there are no logs from the etcd-ip-10-0-250-95.us-west-1.compute.internal pod itself in Loki. Based on the implementation at https://github.com/openshift/cluster-etcd-operator/blob/c2226ad8bae2a24f5aec070684d3f6854582f322/pkg/etcdcli/etcdcli.go#L312-L339, where the UnhealthyEtcdMember reason is produced, the etcd container may be running properly (probes not failing) while the etcd member itself is unhealthy.

Checking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360:

```
{
  "apiVersion": "v1",
  "count": 25,
  "eventTime": null,
  "firstTimestamp": "2021-08-04T12:56:13Z",
  "involvedObject": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "name": "etcd-operator",
    "namespace": "openshift-etcd-operator",
    "uid": "09374f05-acc1-4b12-883b-8ca500c6e6bc"
  },
  "kind": "Event",
  "lastTimestamp": "2021-08-04T14:06:33Z",
  "message": "unhealthy members: ip-10-0-212-15.us-west-1.compute.internal",
  "metadata": {
    "creationTimestamp": "2021-08-04T12:56:13Z",
    "name": "etcd-operator.16981bbf4b52a910",
    "namespace": "openshift-etcd-operator",
    "resourceVersion": "81661",
    "uid": "9886dd81-a3d2-436f-b5b9-38097ce2feb6"
  },
  "reason": "UnhealthyEtcdMember",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "openshift-cluster-etcd-operator-etcd-client"
  },
  "type": "Warning"
},
{
  "apiVersion": "v1",
  "count": 22,
  "eventTime": null,
  "firstTimestamp": "2021-08-04T12:57:38Z",
  "involvedObject": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "name": "etcd-operator",
    "namespace": "openshift-etcd-operator",
    "uid": "09374f05-acc1-4b12-883b-8ca500c6e6bc"
  },
  "kind": "Event",
  "lastTimestamp": "2021-08-04T14:15:21Z",
  "message": "unhealthy members: ip-10-0-139-118.us-west-1.compute.internal",
  "metadata": {
    "creationTimestamp": "2021-08-04T12:57:38Z",
    "name": "etcd-operator.16981bd321d3dfdb",
    "namespace": "openshift-etcd-operator",
    "resourceVersion": "89631",
    "uid": "079f5b63-2fb4-4397-a00c-14050e3632d6"
  },
  "reason": "UnhealthyEtcdMember",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "openshift-cluster-etcd-operator-etcd-client"
  },
  "type": "Warning"
},
```

The unhealthy windows are:

- ip-10-0-212-15.us-west-1.compute.internal: <2021-08-04T12:56:13Z; 2021-08-04T14:06:33Z>
- ip-10-0-139-118.us-west-1.compute.internal: <2021-08-04T12:57:38Z; 2021-08-04T14:15:21Z>

Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes/ip-10-0-212-15.us-west-1.compute.internal/ in the <2021-08-04T12:56:13Z; 2021-08-04T14:06:33Z> interval, the etcd pod is Running. There are no Loki logs for the etcd pods either.
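The two windows overlap almost entirely, i.e. both members were flagged unhealthy at the same time for over an hour. A quick check in plain Python, with the timestamps copied from the events above:

```python
from datetime import datetime, timezone

def parse(ts: str) -> datetime:
    """Parse the RFC 3339 timestamps used in the events above."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Unhealthy windows from the two UnhealthyEtcdMember events above.
win_212_15 = (parse("2021-08-04T12:56:13Z"), parse("2021-08-04T14:06:33Z"))
win_139_118 = (parse("2021-08-04T12:57:38Z"), parse("2021-08-04T14:15:21Z"))

overlap_start = max(win_212_15[0], win_139_118[0])
overlap_end = min(win_212_15[1], win_139_118[1])
overlap = overlap_end - overlap_start
print(overlap)  # 1:08:55 of simultaneous unhealthiness
```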
Checking the etcd logs provided in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1422886137982095360/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ip-10-0-212-15.us-west-1.compute.internal_etcd.log, one can see:

```
{"level":"warn","ts":"2021-08-04T14:09:44.247Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:09:44.248Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d47948009c5bb3e4","rtt":"1.908424ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"info","ts":"2021-08-04T14:09:44.699Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":80099}
{"level":"info","ts":"2021-08-04T14:09:44.994Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":80099,"took":"287.323286ms"}
{"level":"warn","ts":"2021-08-04T14:09:49.247Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:09:49.248Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer- ...
...
{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d47948009c5bb3e4","rtt":"1.908424ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}
```

This might be what is happening at the time of the incidents: the peer at 10.0.139.118:2380 refuses connections, so the other members' rafthttp probers mark it unhealthy. Other jobs from https://search.ci.openshift.org/?search=events+should+not+repeat+pathologically&maxAge=48h&context=3&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job (searching for the UnhealthyEtcdMember keyword) report the same; the biggest offenders:

- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade
- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade
- https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade

This happens more often during upgrade jobs than non-upgrade ones.
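Since etcd logs are newline-delimited JSON, the prober warnings can be tallied per remote peer to see which member the others cannot reach. A small hypothetical helper (not part of any tooling above), fed two of the log lines from the etcd-ip-10-0-212-15 log:

```python
import json
from collections import Counter

def unhealthy_peers(log_lines):
    """Count 'prober detected unhealthy status' warnings per remote-peer-id."""
    hits = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if entry.get("msg") == "prober detected unhealthy status":
            hits[entry.get("remote-peer-id")] += 1
    return hits

# Two of the warnings from the etcd log above:
sample = [
    '{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d47948009c5bb3e4","rtt":"11.990056ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}',
    '{"level":"warn","ts":"2021-08-04T14:15:19.258Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d47948009c5bb3e4","rtt":"1.908424ms","error":"dial tcp 10.0.139.118:2380: connect: connection refused"}',
]
print(unhealthy_peers(sample))  # Counter({'d47948009c5bb3e4': 2})
```

In this run every warning points at the same peer id, d47948009c5bb3e4, which matches the single unreachable member.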
So far we've only seen this on upgrades, and it's expected -- this was added to the skip cases in https://github.com/openshift/origin/commit/de8df4de72a40bc6bc561dc9562d49d50ec45e54