Bug 1969257 - Occasional KubePodCrashLooping cluster-version-operator, possibly leader election
Keywords:
Status: CLOSED DUPLICATE of bug 1985802
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Over the Air Updates
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-08 05:25 UTC by W. Trevor King
Modified: 2022-05-06 12:29 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-29 16:43:58 UTC
Target Upstream Version:
Embargoed:


Attachments
double-TERM stack trace from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096 (99.09 KB, text/plain)
2021-06-08 05:25 UTC, W. Trevor King

Description W. Trevor King 2021-06-08 05:25:17 UTC
Created attachment 1789310 [details]
double-TERM stack trace from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096

Seen in 4.8.0-0.ci-2021-06-07-190057 CI [1]:

  [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
  Run #0: Failed	11s
  fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun  7 20:04:11.747: Unexpected alerts fired or pending after the test run:

    alert KubePodCrashLooping fired for 120 seconds with labels: {container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-f7749cd86-txqns", service="kube-state-metrics", severity="warning"}

From PromeCIeus [2]:

  ALERTS{alertname="KubePodCrashLooping",alertstate="firing"}

shows:

  ALERTS{alertname="KubePodCrashLooping", alertstate="firing", container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-f7749cd86-txqns", service="kube-state-metrics", severity="warning"}

from ~19:40Z to ~19:41Z.  Checking events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | (.firstTimestamp // .eventTime) + " " + .reason + ": " + .message' | sort | grep -B1 failed
  ...
  2021-06-07T19:24:13Z PayloadLoaded: payload loaded version="4.8.0-0.ci-2021-06-07-190057" image="registry.build01.ci.openshift.org/ci-op-m1d1ty3z/release@sha256:f3ff2028f144ab39114209ca0938cd9ad0fc89bc597e8de76f5b3fd37da6a144"
  2021-06-07T19:24:57Z BackOff: Back-off restarting failed container

That BackOff is just a bit too old for the crashing container's logs to have been gathered.  The previous.log we did gather starts a few seconds later:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-f7749cd86-txqns_cluster-version-operator_previous.log | head -n1
  I0607 19:25:08.928389       1 start.go:21] ClusterVersionOperator v1.0.0-545-g706a0062-dirty
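
If the logs are gone, the restart counts are still recoverable from the gathered pod status.  A sketch, assuming the same gather-extra artifacts also include a pods.json alongside the events.json used above:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136/artifacts/e2e-gcp/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .status.containerStatuses[]? | .name + " restartCount=" + (.restartCount | tostring)'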

Seems fairly common:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=KubePodCrashLooping+fired+for.*seconds+with+labels.*cluster-version-operator' | grep 'failures match' | sort
  periodic-ci-openshift-release-master-ci-4.8-e2e-gcp (all) - 16 runs, 44% failed, 29% of failures match = 13% impact
  periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
  periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 100% failed, 5% of failures match = 5% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-aws (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 89% failed, 13% of failures match = 11% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi (all) - 10 runs, 80% failed, 25% of failures match = 20% impact
  periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-upi-serial (all) - 10 runs, 70% failed, 14% of failures match = 10% impact
  periodic-ci-openshift-release-master-okd-4.8-e2e-aws (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
  periodic-ci-openshift-release-master-okd-4.8-e2e-vsphere (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
  pull-ci-cri-o-cri-o-master-e2e-gcp (all) - 24 runs, 13% failed, 33% of failures match = 4% impact
  pull-ci-openshift-cluster-network-operator-release-4.8-e2e-aws-sdn-multi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
  pull-ci-openshift-insights-operator-master-e2e (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
  pull-ci-openshift-openshift-apiserver-master-e2e-cmd (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
  pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
  pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 15 runs, 87% failed, 23% of failures match = 20% impact
  rehearse-18638-pull-ci-openshift-sdn-release-4.9-e2e-aws-multitenant (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
  rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
  rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
  rehearse-19055-release-openshift-origin-installer-e2e-gcp-shared-vpc-4.9 (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
  release-openshift-ocp-installer-e2e-remote-libvirt-compact-ppc64le-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

Seems like it's fairly old too; the oldest hit is right at the edge of CI search's 336h (14-day) window:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=KubePodCrashLooping+fired+for.*seconds+with+labels.*cluster-version-operator' | grep -o ' [0-9]* days ago' | sort -n | tail -n1
   13 days ago

From CI search, [3] has both:

  alert KubePodCrashLooping fired for 95 seconds with labels: {container="cluster-version-operator", endpoint="https-main", job="kube-state-metrics", namespace="openshift-cluster-version", pod="cluster-version-operator-55956b8778-tdqpk", service="kube-state-metrics", severity="warning"}

and hosted Loki logs.  Checking its events:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | (.firstTimestamp // .eventTime) + " " + .reason + ": " + .message' | grep BackOff
  2021-06-07T21:52:31Z BackOff: Back-off restarting failed container

From Loki [4]:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096"} | unpack | pod_name="cluster-version-operator-55956b8778-tdqpk" != "client-side throttling"

ending at 21:53Z and scrolling down a bit:

  E0607 21:52:02.765241       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ci-op-6q598pg9-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded
  I0607 21:52:02.765300       1 leaderelection.go:278] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
  I0607 21:52:02.765349       1 request.go:857] Error in request: resource name may not be empty
  E0607 21:52:02.765366       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
  I0607 21:52:02.765376       1 start.go:256] Stopped leading; shutting down.
  ...
  E0607 21:52:02.765456       1 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".16866b53a864dddd", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"LeaderElection", Message:"ip-10-0-128-153_90c7daae-06ce-4fd3-a50a-93eb355630d7 stopped leading", Source:v1.EventSource{Component:"openshift-cluster-version", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc027c320ad9d49dd, ext:342206815723, loc:(*time.Location)(0x2aab9e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc027c320ad9d49dd, ext:342206815723, loc:(*time.Location)(0x2aab9e0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version''(may retry after sleeping)
  ...
  I0607 21:52:02.765790       1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]
  ...
  E0607 21:52:02.765831       1 sync_worker.go:354] unable to synchronize image (waiting 21.369562456s): Cluster operator openshift-apiserver is degraded
  ...
  I0607 21:52:02.765932       1 cvo.go:373] Waiting on 4 outstanding goroutines.
  I0607 21:52:42.353584       1 start.go:21] ClusterVersionOperator v1.0.0-545-g706a0062-dirty

So a 21:52Z 'context deadline exceeded' on the leader lock put us into the shutdown cycle, while at the same time the openshift-apiserver cluster operator was degraded.  It's not clear to me why we had trouble reaching the lock.  Passing:

  {invoker!=""} | unpack | container_name="cluster-version-operator" |= "error retrieving resource lock openshift-cluster-version/version" != "127.0.0.1:6443: connect: connection refused"

to Loki turns up a number of other failure modes, and they are fairly frequent.  It's not clear to me whether they really deserve to fire KubePodCrashLooping [5].  From [6]:

  kube_pod_container_status_restarts_total{namespace="openshift-cluster-version",job="kube-state-metrics"}

shows restarts at 21:52Z (the one we found via events above), but also earlier at 21:46Z and 21:43Z.
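
For context, the rule in [5] is driven by that same restart counter; the condition is roughly along these lines (a sketch from the upstream kubernetes-mixin; the exact rate window, multiplier, and 'for' duration are whatever the pinned commit in [5] carries):

  rate(kube_pod_container_status_restarts_total{namespace="openshift-cluster-version",job="kube-state-metrics"}[15m]) * 60 * 5 > 0

so a handful of isolated restarts inside the rate window is enough to keep the expression non-zero, and therefore the alert pending or firing, well after the pod has settled.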

That's somewhat in the window of [3]:

  Jun 07 21:41:23.192 - 71ms  E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

And indeed, back to [4]:

  E0607 21:45:59.455992       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.ci-op-6q598pg9-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context deadline exceeded

I don't see anything quite at 21:43Z, but there is a double-TERM with a long stack trace that I'll attach here.  
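
For anyone re-running this triage, the Loki query in [4] can be narrowed to just the leader-election traffic (a sketch; pick a time range around 21:40Z-21:55Z in the Grafana UI):

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096"} | unpack | pod_name="cluster-version-operator-55956b8778-tdqpk" |= "leaderelection" != "client-side throttling"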

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1401977716122587136
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096
[4]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%221623102780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade%2F1401977716080644096%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22cluster-version-operator-55956b8778-tdqpk%5C%22%20!%3D%20%5C%22client-side%20throttling%5C%22%22%7D%5D
[5]: https://github.com/openshift/cluster-monitoring-operator/blame/b728aa6c986e51c4fea9f1111038fb2b3af7a723/assets/control-plane/prometheus-rule.yaml#L15-L21
[6]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1401977716080644096

Comment 1 Lalatendu Mohanty 2021-09-29 16:43:58 UTC
Trevor, is this an issue in CI?

*** This bug has been marked as a duplicate of bug 1985802 ***

