Bug 1946784

Summary: EtcdMembers_UnhealthyMembers in 4.8 update CI
Product: OpenShift Container Platform
Component: Etcd
Version: 4.8
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: low
Hardware: Unspecified
OS: Unspecified
Reporter: W. Trevor King <wking>
Assignee: Suresh Kolichala <skolicha>
QA Contact: ge liu <geliu>
CC: skolicha
Keywords: Upgrades
Flags: mfojtik: needinfo?
Whiteboard: LifecycleReset
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2021-05-07 16:58:58 UTC

Description W. Trevor King 2021-04-06 20:07:40 UTC
Lots of these in 4.8 and later CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=reason/EtcdMembers_UnhealthyMembers' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 18 runs, 56% failed, 160% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 16 runs, 94% failed, 107% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
...
pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-17190-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 7 runs, 71% failed, 40% of failures match = 29% impact
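
A note on reading these summary lines: the impact figure appears to be the share of all runs that matched the search, which is why "of failures match" can exceed 100% when more runs match than failed overall. The following is a minimal Python sketch that reconstructs the underlying run counts from one line. It assumes the line format shown above; the regex and the function name `reconstruct` are illustrative and not part of search.ci.

```python
import re

# Pattern for one summary line as printed above (assumed format).
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def reconstruct(line):
    """Recover approximate run counts from the rounded percentages.

    Returns None when the line does not look like a summary line.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    runs = int(m["runs"])
    # Percentages are rounded in the output, so round back to whole runs.
    failed = round(runs * int(m["failed"]) / 100)
    matching = round(failed * int(m["match"]) / 100)
    return {
        "job": m["job"],
        "runs": runs,
        "failed": failed,
        "matching": matching,
        # Impact recomputed as matching runs over all runs.
        "impact_pct": round(100 * matching / runs),
        "reported_impact_pct": int(m["impact"]),
    }


line = (
    "periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - "
    "18 runs, 56% failed, 160% of failures match = 89% impact"
)
print(reconstruct(line))
```

For the gcp-upgrade line this works out to 10 failed runs and 16 matching runs out of 18, so 16/18 rounds to the reported 89%.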

Picking [1] to dig into (4.8.0-0.ci-2021-04-03-201542 -> 4.8.0-0.ci-2021-04-05-224633):

  : [bz-Etcd] clusteroperator/etcd should not change condition/Degraded
  Run #0: Failed	0s
  4 unexpected clusteroperator state transitions during e2e test run 

  Apr 06 03:36:29.819 - 45s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-147-58.us-west-2.compute.internal is unhealthy
  Apr 06 04:22:20.784 - 62s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-147-58.us-west-2.compute.internal is unhealthy\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-147-58.us-west-2.compute.internal" not ready since 2021-04-06 04:20:15 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
  Apr 06 04:28:27.306 - 98s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-162-101.us-west-2.compute.internal is unhealthy\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-162-101.us-west-2.compute.internal" not ready since 2021-04-06 04:26:27 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
  Apr 06 04:34:55.469 - 66s   E clusteroperator/etcd condition/Degraded status/True reason/EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-195-226.us-west-2.compute.internal is unhealthy\nNodeControllerDegraded: The master nodes not ready: node "ip-10-0-195-226.us-west-2.compute.internal" not ready since 2021-04-06 04:32:55 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)

Checking the lovely new interval chart [2], we can see that those correspond to the three control-plane nodes rebooting towards the end of the update, and also that there are a whole lot of other sad things going on.  I'm filing this against etcd, but obviously feel free to redirect if the underlying issue is in another component.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1379265480375668736
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1379265480375668736/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e-intervals.html

Comment 1 Michal Fojtik 2021-05-06 20:14:28 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 W. Trevor King 2021-05-06 20:37:16 UTC
Not as bad as when I opened the bug, but we still see these in CI today:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=reason/EtcdMembers_UnhealthyMembers' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18322-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-18336-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-18383-periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact

Comment 3 Michal Fojtik 2021-05-06 21:14:35 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 5 Suresh Kolichala 2021-05-07 16:58:58 UTC
This problem is addressed by the recently implemented fix for bug https://bugzilla.redhat.com/show_bug.cgi?id=1952268. 

It appears that rebooting a node and getting it to the point where the kubelet is running static pods can take more than the default 2 minutes. This causes the etcd operator to set Degraded=True during otherwise healthy machine-config node reboots.

As a workaround, we set a custom inertia of 5 minutes on the NodeControllerDegraded and EtcdMembersDegraded conditions.
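
To illustrate the idea of inertia on a condition, here is a minimal Python sketch. It is not the cluster-etcd-operator code; `InertialCondition` and `ever_reported` are hypothetical names. A raw Degraded=True observation is only reported once it has held continuously for the configured duration, so an outage that recovers inside the window never surfaces. The sample times approximate the second node's entry in the description, assuming the raw condition began when the node was marked not ready (04:26:27) and cleared about 98s after Degraded was first reported.

```python
from datetime import datetime, timedelta


class InertialCondition:
    """Report Degraded=True only after it has held for `inertia` (hypothetical helper)."""

    def __init__(self, inertia):
        self.inertia = inertia
        self._since = None  # start of the current unbroken run of raw Degraded=True

    def observe(self, degraded, now):
        """Return the status to report for a raw observation made at `now`."""
        if not degraded:
            self._since = None  # any healthy observation resets the clock
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.inertia


def ever_reported(inertia, start, end, step=timedelta(seconds=1)):
    """Replay a raw outage over [start, end) and say whether Degraded was ever reported."""
    cond = InertialCondition(inertia)
    now = start
    while now < end:
        if cond.observe(True, now):
            return True
        now += step
    return False


# Approximate timings from the log above (assumption: about 120s + 98s of raw outage).
not_ready = datetime(2021, 4, 6, 4, 26, 27)
recovered = not_ready + timedelta(seconds=120 + 98)

print(ever_reported(timedelta(minutes=2), not_ready, recovered))  # True
print(ever_reported(timedelta(minutes=5), not_ready, recovered))  # False
```

With a 2 minute window the reboot is reported as Degraded; with a 5 minute window the same reboot stays below the threshold.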

Closing as a duplicate of the above bug.

*** This bug has been marked as a duplicate of bug 1952268 ***

Comment 6 W. Trevor King 2021-05-08 02:40:31 UTC
With the fix for bug 1952268 having landed a few days ago, I can confirm that CI isn't seeing this reason as often:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=reason/EtcdMembers_UnhealthyMembers' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 90% failed, 6% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade (all) - 32 runs, 84% failed, 4% of failures match = 3% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade (all) - 29 runs, 90% failed, 4% of failures match = 3% impact
rehearse-18322-pull-ci-openshift-cluster-etcd-operator-release-4.9-e2e-gcp-disruptive-ovn (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-18409-pull-ci-openshift-cluster-etcd-operator-release-4.9-e2e-gcp-disruptive-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

So some rehearsals and PR presubmits (where all sorts of things can break), some 4.7->4.8 updates (4.7 hasn't been fixed yet), and... maybe periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade is still in here because we haven't accepted a candidate CI build since the fix landed in master/4.8?  Anyhow, looks much better now, and possibly completely fixed in 4.8; thanks :)