Created attachment 1789310 [details]
double-TERM stack trace from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096

Seen in 4.8.0-0.ci-2021-06-07-190057 CI [1]:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Run #0: Failed 11s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 7 20:04:11.747: Unexpected alerts fired or pending after the test run:

alert KubePodCrashLooping fired for 120 seconds with labels: {container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-f7749cd86-txqns", service="kube-state-metrics", severity="warning"}

From PromeCIeus [2]:

ALERTS{alertname="KubePodCrashLooping",alertstate="firing"}

shows:

ALERTS{alertname="KubePodCrashLooping", alertstate="firing", container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-f7749cd86-txqns", service="kube-state-metrics", severity="warning"}

from ~19:40Z to ~19:41Z. Checking events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | (.firstTimestamp // .eventTime) + " " + .reason + ": " + .message' | sort | grep -B1 failed
...
2021-06-07T19:24:13Z PayloadLoaded: payload loaded version="4.8.0-0.ci-2021-06-07-190057" image="registry.build01.ci.openshift.org/ci-op-m1d1ty3z/release@sha256:f3ff2028f144ab39114209ca0938cd9ad0fc89bc597e8de76f5b3fd37da6a144"
2021-06-07T19:24:57Z BackOff: Back-off restarting failed container

That's just a bit too old for us to have gathered the logs.
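Even without that container's log, the gathered pod status can at least confirm the restart and how the container exited. This is just a sketch: it assumes gather-extra also dropped a pods.json alongside the events.json above (path guessed from the same layout), and prints the CVO container's restart count and last termination state:

# pods.json path is an assumption, built from the gather-extra layout used for events.json above
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .status.containerStatuses[] | .name + " restarts=" + (.restartCount | tostring) + " lastExit=" + ((.lastState.terminated.exitCode // "n/a") | tostring) + " finishedAt=" + (.lastState.terminated.finishedAt // "n/a")'

A nonzero restartCount with a finishedAt near 19:24Z would tie the pod status to the BackOff event even though the crashed container's log is gone.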
The previous.log we did gather starts a few seconds later:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-f7749cd86-txqns_cluster-version-operator_previous.log | head -n1
I0607 19:25:08.928389 1 start.go:21] ClusterVersionOperator v1.0.0-545-g706a0062-dirty

Seems fairly common:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=KubePodCrashLooping+fired+for.*seconds+with+labels.*cluster-version-operator' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 16 runs, 44% failed, 29% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 100% failed, 5% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 89% failed, 13% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi (all) - 10 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi-serial (all) - 10 runs, 70% failed, 14% of failures match = 10% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-vsphere (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
pull-ci-cri-o-cri-o-master-e2e-gcp (all) - 24 runs, 13% failed, 33% of failures match = 4% impact
pull-ci-openshift-cluster-network-operator-release-4.8-e2e-aws-sdn-multi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-insights-operator-master-e2e (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
pull-ci-openshift-openshift-apiserver-master-e2e-cmd (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 15 runs, 87% failed, 23% of failures match = 20% impact
rehearse-18638-pull-ci-openshift-sdn-release-4.9-e2e-aws-multitenant (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
rehearse-19055-release-openshift-origin-installer-e2e-gcp-shared-vpc-4.9 (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-ppc64le-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

Seems like it's fairly old too:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=KubePodCrashLooping+fired+for.*seconds+with+labels.*cluster-version-operator' | grep -o ' [0-9]* days ago' | sort -n | tail -n1
13 days ago

From CI search, [3] has both:

alert KubePodCrashLooping fired for 95 seconds with labels:
{container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-55956b8778-tdqpk", service="kube-state-metrics", severity="warning"} and hosted Loki logs. Checking its events: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | (.firstTimestamp // .eventTime) + " " + .reason + ": " + .message' | grep BackOff 2021-06-07T21:52:31Z BackOff: Back-off restarting failed container From Loki [4]: {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096"} | unpack | pod_name="cluster-version-operator-55956b8778-tdqpk" != "client-side throttling" ending at 21:53Z and scrolling down a bit: E0607 21:52:02.765241 1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ci-op-6q598pg9-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded I0607 21:52:02.765300 1 leaderelection.go:278] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition I0607 21:52:02.765349 1 request.go:857] Error in request: resource name may not be empty E0607 21:52:02.765366 1 leaderelection.go:301] Failed to release lock: resource name may not be empty I0607 21:52:02.765376 1 start.go:256] Stopped leading; shutting down. ... E0607 21:52:02.765456 1 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".16866b53a864dddd", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"LeaderElection", Message:"ip-10-0-128-153_90c7daae-06ce-4fd3-a50a-93eb355630d7 stopped leading", Source:v1.EventSource{Component:"openshift-cluster-version", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc027c320ad9d49dd, ext:342206815723, loc:(*time.Location)(0x2aab9e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc027c320ad9d49dd, ext:342206815723, loc:(*time.Location)(0x2aab9e0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version''(may retry after sleeping) ... I0607 21:52:02.765790 1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded] ... E0607 21:52:02.765831 1 sync_worker.go:354] unable to synchronize image (waiting 21.369562456s): Cluster operator openshift-apiserver is degraded ... 
I0607 21:52:02.765932 1 cvo.go:373] Waiting on 4 outstanding goroutines.
I0607 21:52:42.353584 1 start.go:21] ClusterVersionOperator v1.0.0-545-g706a0062-dirty

So that's a 21:52Z 'context deadline exceeded' on the leader lock putting us into the shutdown cycle, while at the same time the openshift-apiserver cluster operator is degraded. It's not clear to me why we were having trouble reaching the lock. Passing:

{invoker!=""} | unpack | container_name="cluster-version-operator" |= "error retrieving resource lock openshift-cluster-version/version" != "127.0.0.1:6443: connect: connection refused"

to Loki turns up a number of other conditions, and they are fairly frequent. It's not clear to me whether they really deserve a KubePodCrashLooping alert [5] or not.

From [6]:

kube_pod_container_status_restarts_total{namespace="openshift-cluster-version",job="kube-state-metrics"}

shows restarts at 21:52Z (the one we found via events above), but also earlier at 21:46Z and 21:43Z. That's somewhat in the window of [3]:

Jun 07 21:41:23.192 - 71ms E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

And indeed, back to [4]:

E0607 21:45:59.455992 1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ci-op-6q598pg9-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded

I don't see anything quite at 21:43Z, but there is a double-TERM with a long stack trace that I'll attach here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096
[4]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%221623102780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade%2F1401977716080644096%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22cluster-version-operator-55956b8778-tdqpk%5C%22%20!%3D%20%5C%22client-side%20throttling%5C%22%22%7D%5D
[5]: https://github.com/openshift/cluster-monitoring-operator/blame/b728aa6c986e51c4fea9f1111038fb2b3af7a723/assets/control-plane/prometheus-rule.yaml#L15-L21
[6]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096
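As a follow-up, one more cross-check that doesn't need Loki at all, reusing the events.json already fetched above (just a sketch): repeated back-offs are usually deduplicated into a single BackOff event with a count and lastTimestamp, so those fields should roughly line up with the three restart bumps.

# count/lastTimestamp are standard v1.Event fields; this reuses the events.json URL from the BackOff grep above
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version" and .reason == "BackOff") | (.firstTimestamp // .eventTime) + " -> " + (.lastTimestamp // .eventTime) + " count=" + ((.count // 0) | tostring) + " " + .message'

A count of 3 or more with a lastTimestamp around 21:52Z would match the restart metric and keep the focus on the repeated leader-lock timeouts rather than a single crash.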
Trevor, is this an issue in CI?

*** This bug has been marked as a duplicate of bug 1985802 ***