Similar to the 4.5 bug 1832986, recent 4.9.0 -> 4.9.nightly -> 4.9.0 job failed with [1]:

: [sig-arch] events should not repeat pathologically 0s
2 events happened too frequently

event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"

event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"

So in those, the flapping bit is whether EtcdEndpointsDegraded is included. This job picks a random point to abort the first leg of the update, and in this case, etcd had made it through to the nightly target, and we'd gotten far enough past that to ask the machine API operator to update to the nightly, before we turned around and started heading back to 4.9.0. I dunno if that's relevant or not.

Pulling the changes themselves out of the full build log:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade-rollback-oldest-supported/1450920777342783488/build-log.txt | grep 'Status for clusteroperator/etcd'
...
Oct 20 21:39:27.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-210-141.us-west-1.compute.internal is unhealthy\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-210-141.us-west-1.compute.internal is unhealthy\nStaticPodsDegraded: pod/etcd-ip-10-0-210-141.us-west-1.compute.internal container \"etcd\" started at 2021-10-20 21:38:38 +0000 UTC is still not ready\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 21:39:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-210-141.us-west-1.compute.internal is unhealthy\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 21:39:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-210-141.us-west-1.compute.internal is unhealthy\nStaticPodsDegraded: pod/etcd-ip-10-0-210-141.us-west-1.compute.internal container \"etcd\" started at 2021-10-20 21:38:38 +0000 UTC is still not ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, ip-10-0-210-141.us-west-1.compute.internal is unhealthy\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 21:39:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 5\nEtcdMembersProgressing: No unstarted etcd members found"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 4; 2 nodes are at revision 5\nEtcdMembersAvailable: 3 members are available" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: 3 members are available"
Oct 20 22:07:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 22:07:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.210.141:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" (3 times)
Oct 20 22:07:35.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (2 times)
Oct 20 22:07:37.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.210.141:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 22:07:38.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
Oct 20 22:07:38.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (3 times)
...
Oct 20 22:21:50.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (6 times)
Oct 20 22:21:52.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (7 times)
Oct 20 22:21:52.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (7 times)
Oct 20 22:22:09.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (29 times)
Oct 20 22:22:09.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (32 times)
Oct 20 22:22:10.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (8 times)
Oct 20 22:22:10.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nBootstrapTeardownDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (8 times)
Oct 20 22:22:26.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (14 times)
Oct 20 22:22:26.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.210.141:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (12 times)
Oct 20 22:22:26.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (12 times)
Oct 20 22:22:26.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.210.141:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" (13 times)
Oct 20 22:23:21.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (15 times)
Oct 20 22:23:21.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" (13 times)
Oct 20 22:29:27.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.191.254:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" (3 times)
Oct 20 22:29:39.000 I ns/openshift-etcd-operator deployment/etcd-operator reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nDefragControllerDegraded: failed to dial endpoint https://10.0.191.254:2379 with maintenance client: context canceled\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" (2 times)

So a few rounds of churn, but they all quieted down, until around 22:07:35Z, after which we were noisy for the next 20+ minutes. 22:07 corresponds to the start of the conformance suite we ran after returning to 4.9.0, so it's possible that some of the silence was the short gap between the update suite exiting and the post-update conformance suite launching.
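
To see at a glance which Degraded sub-conditions are doing the flapping in the events above, something like the following rough Python sketch could be run over the grep output. The function names (`conditions`, `tally_flaps`) are made up for illustration, and it assumes the trailing "(N times)" is the running repeat count for that event and that the build log keeps message newlines as a literal backslash followed by n, as in the excerpt above.

```python
import re
from collections import Counter

# Matches 'Degraded message changed from "<old>" to "<new>"', with an
# optional trailing "(N times)" repeat count at the end of the line.
EVENT_RE = re.compile(
    r'Degraded message changed from "(?P<old>.*?)" to "(?P<new>.*?)"'
    r'(?: \((?P<count>\d+) times\))?\s*$'
)


def conditions(message):
    """Return the condition names (text before the first colon) in a message.

    The build log stores newlines as the two characters backslash + n,
    so split on that literal sequence rather than on a real newline.
    """
    return {part.split(":", 1)[0] for part in message.split("\\n") if part}


def tally_flaps(lines):
    """Map each toggled condition to the highest repeat count seen for it."""
    flaps = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # e.g. Progressing/Available changes, or unrelated lines
        # Conditions present on only one side of the change are the ones
        # that were added or removed by this event.
        toggled = conditions(match["old"]) ^ conditions(match["new"])
        repeats = int(match["count"] or 1)
        for name in toggled:
            flaps[name] = max(flaps[name], repeats)
    return flaps
```

Feeding it the lines of build-log.txt and printing `tally_flaps(lines).most_common()` would list the noisiest conditions first. Events that only change a condition's message text (like the 22:07:37 DefragControllerDegraded one) toggle nothing and so are not counted.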
Searching for similar symptoms in other jobs, in case that's helpful:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 102 runs, 69% failed, 1% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 358 runs, 91% failed, 0% of failures match = 0% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-azure (all) - 16 runs, 31% failed, 20% of failures match = 6% impact
pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
rehearse-22260-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-techpreview (all) - 8 runs, 75% failed, 17% of failures match = 13% impact

So this seems like a 4.9+ thing (but it's also possible that's an artifact of the test suite; I'm not sure when it learned to monitor for repeated events).
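
For ranking the jobs in that search output by impact rather than by name, a small Python sketch along these lines might help. `JobImpact`, `parse_impact`, and `rank_by_impact` are hypothetical names, and the line format is inferred only from the rows above. In those rows the impact figure is close to the failed percentage times the match percentage (31% x 20% is about 6%), but that is just a reading of this output, not a documented formula.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class JobImpact(NamedTuple):
    job: str
    runs: int
    failed_pct: int
    match_pct: int
    impact_pct: int


# Line shape inferred from the search output quoted above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_impact(line: str) -> Optional[JobImpact]:
    """Parse one 'failures match' row; return None if the line differs."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return JobImpact(
        job=m["job"],
        runs=int(m["runs"]),
        failed_pct=int(m["failed"]),
        match_pct=int(m["match"]),
        impact_pct=int(m["impact"]),
    )


def rank_by_impact(lines: Iterable[str]) -> List[JobImpact]:
    """Return the parsed rows sorted by reported impact, highest first."""
    jobs = [job for job in map(parse_impact, lines) if job is not None]
    return sorted(jobs, key=lambda job: job.impact_pct, reverse=True)
```

Keeping `runs` in the result matters when reading the ranking: the 100% impact row comes from a single run, so it says less than the 6% row that comes from 16 runs.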
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.