Bug 1842002 - KubePodCrashLooping kube-controller-manager cluster-policy-controller: 6443: connect: connection refused
Summary: KubePodCrashLooping kube-controller-manager cluster-policy-controller: 6443: connect: connection refused
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1852974
Depends On:
Blocks: 1881534 1887431
 
Reported: 2020-05-30 05:09 UTC by W. Trevor King
Modified: 2023-10-06 20:19 UTC
CC List: 19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1881534
Environment:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over two minutes
[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes
Last Closed: 2020-10-27 16:01:56 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-kube-controller-manager-operator pull 431 (closed): Bug 1842002: use internal LB to avoid outages during kube-apiserver rollout (last updated 2021-02-16 12:22:27 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:02:28 UTC)

Description W. Trevor King 2020-05-30 05:09:09 UTC
A bunch of recent 4.5 CI jobs have failed:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

with errors like [1]:

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.3:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1590802563.871,\"8\"]}]",
        },
    }
to be empty
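
For anyone re-checking by hand, the query from the failure above can be replayed against the in-cluster Prometheus. This is a minimal sketch only, assuming an authenticated `oc` session against the cluster under test and the usual thanos-querier route in openshift-monitoring (route name and auth flow are assumptions, not what the origin test itself does):

# Replay the alert query against the cluster's Prometheus API.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
QUERY='count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh",alertstate="firing",severity!="info"}[2h]) >= 1'
curl -sk -G -H "Authorization: Bearer ${TOKEN}" --data-urlencode "query=${QUERY}" "https://${HOST}/api/v1/query" | jq '.data.result'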

In the case of [1], the underlying reason seems to have been a local API-server outage:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.containerStatuses[] | select(.name == "cluster-policy-controller").lastState.terminated | (.exitCode | tostring) + "\n" + .message'
255
m_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/networking.k8s.io/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=13379&timeout=6m23s&timeoutSeconds=383&watch=true: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:35.447019       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ServiceAccount: Get https://localhost:6443/api/v1/serviceaccounts?allowWatchBookmarks=true&resourceVersion=17548&timeout=9m29s&timeoutSeconds=569&watch=true: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:35.448120       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.DeploymentConfig: Get https://localhost:6443/apis/apps.openshift.io/v1/deploymentconfigs?allowWatchBookmarks=true&resourceVersion=24025&timeout=5m20s&timeoutSeconds=320&watch=true: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:35.449101       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.Route: Get https://localhost:6443/apis/route.openshift.io/v1/routes?allowWatchBookmarks=true&resourceVersion=24025&timeout=6m23s&timeoutSeconds=383&watch=true: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:35.450198       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.BuildConfig: Get https://localhost:6443/apis/build.openshift.io/v1/buildconfigs?allowWatchBookmarks=true&resourceVersion=24025&timeout=9m36s&timeoutSeconds=576&watch=true: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:35.451437       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=24823&timeout=7m43s&timeoutSeconds=463&watch=true: dial tcp [::1]:6443: connect: connection refused
I0530 01:11:35.602515       1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
F0530 01:11:35.602590       1 policy_controller.go:94] leaderelection lost

[2] suggests the crashlooping cluster-policy-controller error is really hurting 4.5 acceptance:

  release-openshift-origin-installer-e2e-gcp-4.5 - 70 runs, 46% failed, 38% of failures match 

although I guess it's possible that the container crashloops for different reasons in different jobs.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428
[2]: https://search.apps.build01.ci.devcluster.openshift.com/?search=promQL+query%3A+count_over_time.*KubePodCrashLooping.*cluster-policy-controller&maxAge=48h&context=1&type=bug%2Bjunit&name=release-openshift.*gcp-4.5&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 W. Trevor King 2020-06-05 03:41:15 UTC
release-openshift-ocp-installer-e2e-gcp-4.5 - 11 runs, 36% failed, 75% of failures match

From comment 0's [2].  Are we sure we want to punt to 4.6?

Comment 2 W. Trevor King 2020-06-07 05:38:06 UTC
A few more length-of-outage notes on the job I picked as the original example:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep '6443: connect: connection refused' | head -n1
E0530 01:10:53.321223       1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?allowWatchBookmarks=true&resourceVersion=13376&timeout=9m24s&timeoutSeconds=564&watch=true: dial tcp [::1]:6443: connect: connection refused
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep '6443: connect: connection refused' | cut -b -11 | uniq -c
    245 E0530 01:10
   1264 E0530 01:11
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep leaderelection
I0530 01:06:40.071544       1 leaderelection.go:242] attempting to acquire leader lease  openshift-kube-controller-manager/cluster-policy-controller...
I0530 01:06:40.094608       1 leaderelection.go:252] successfully acquired lease openshift-kube-controller-manager/cluster-policy-controller
E0530 01:11:00.603333       1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:10.604609       1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:20.604717       1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
E0530 01:11:30.604897       1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
I0530 01:11:35.602515       1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
F0530 01:11:35.602590       1 policy_controller.go:94] leaderelection lost

So 42 seconds with no local API server before the pod gives up and stops waiting.  [1] shows the API-server pod starting to log another ~23s later, at 01:11:58.  Events from this vicinity:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .firstTimestamp + " " + .lastTimestamp + " " + .message' | sort
...
2020-05-30T01:10:53Z 2020-05-30T01:10:53Z All pending requests processed
2020-05-30T01:10:53Z 2020-05-30T01:10:53Z Server has stopped listening
2020-05-30T01:10:53Z 2020-05-30T01:10:53Z The minimal shutdown duration of 1m10s finished
...
2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Created container kube-apiserver-cert-regeneration-controller
2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Created container kube-apiserver-cert-syncer
2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Started container kube-apiserver
...

Trigger for the Kube-API outage seems to have been the gen-6 rollout:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .metadata.creationTimestamp + " " + .metadata.name + " " + .status.phase' | sort | grep master-2 | tail -n4
2020-05-30T00:50:09Z revision-pruner-4-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded
2020-05-30T01:09:32Z installer-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded
2020-05-30T01:09:43Z kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 Running
2020-05-30T01:12:24Z revision-pruner-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded

Poking at the API-server pod some more:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status' | sort
2020-05-30T00:45:37Z PodScheduled=True
2020-05-30T01:11:57Z Initialized=True
2020-05-30T01:12:09Z ContainersReady=True
2020-05-30T01:12:09Z Ready=True
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.initContainerStatuses[].state.terminated | .startedAt + " " + .finishedAt + " " + (.exitCode | tostring) + " " + .reason'
2020-05-30T01:11:30Z 2020-05-30T01:11:56Z 0 Completed

So the timeline is something like:

* 1:06:40: cluster-policy-controller starts logging and becomes the leader.
* 1:09:32: installer-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 starts doing whatever it does.
* 1:09:43: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 created.
* 1:10:53: cluster-policy-controller starts failing to connect to 6443.
* 1:11:30: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2's init container starts up.
* 1:11:35: cluster-policy-controller gives up and exits.
* 1:11:56: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2's init container completes.
* 1:11:58: kube-apiserver container starts logging.
* 1:12:09: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 goes Ready=True.

I'm not clear on what triggered the installer-6 rollout, but it's possible that cluster-policy-controller could avoid a restart by growing its leader lease-renew timeout from its current ~42s to ~1m30s.  Or maybe there's something we can do about explicit leader handoffs before a coming API-server outage?  Or maybe we could tighten down API-server version bumps by keeping the old pod listening on 6443 and having the new pod come in with SO_REUSEPORT and pick up cleanly without dropping traffic (no idea if we can SO_REUSEPORT node ports with CRI-O containers)?  Or something...

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-apiserver_kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2_kube-apiserver.log
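
For anyone poking at the lease timing, the leader lock is the configmap named in the leaderelection log lines above. A small sketch for inspecting its holder and renew time, assuming the record sits in the standard client-go leader annotation (field names per that convention, an assumption that it applies here):

# Dump the cluster-policy-controller leader-election record from its lock configmap.
oc -n openshift-kube-controller-manager get configmap cluster-policy-controller -o json \
  | jq '.metadata.annotations["control-plane.alpha.kubernetes.io/leader"] | fromjson | {holderIdentity, leaseDurationSeconds, renewTime}'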

Comment 3 W. Trevor King 2020-06-09 23:40:51 UTC
Poking this one again after two more days of silence:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?maxAge=48h&search=promQL+query%3A+count_over_time.*KubePodCrashLooping.*cluster-policy-controller&name=release-openshift-' | grep 'failures match'
release-openshift-ocp-installer-e2e-aws-4.5 - 15 runs, 67% failed, 60% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 28 runs, 79% failed, 36% of failures match
release-openshift-ocp-installer-e2e-azure-4.5 - 8 runs, 88% failed, 14% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 4 runs, 100% failed, 25% of failures match
release-openshift-ocp-installer-e2e-vsphere-upi-4.5 - 5 runs, 80% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.5 - 1 runs, 100% failed, 200% of failures match
release-openshift-ocp-installer-e2e-gcp-4.5 - 7 runs, 57% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.5 - 6 runs, 50% failed, 33% of failures match
release-openshift-origin-installer-e2e-gcp-4.6 - 12 runs, 33% failed, 75% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-4.5 - 1 runs, 100% failed, 400% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 4 runs, 100% failed, 25% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 86 runs, 100% failed, 2% of failures match

Comment 4 Ben Parees 2020-06-10 21:09:26 UTC
The last five 4.5 CI payloads were rejected due to failures when running on GCP; all of the failures are some form of this crashlooping operator pod issue, where the pod appears to be crashlooping because of leader-election failures caused by being unable to reach the apiserver.

This probably isn't specific to the kube-controller-manager, but it certainly doesn't seem deferrable.


examples:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463

example logs:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464/artifacts/e2e-gcp/pods/openshift-cluster-version_cluster-version-operator-5b598dd6c9-pdpxx_cluster-version-operator_previous.log

E0608 01:01:23.779127       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ConfigMap: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=23567&timeout=9m50s&timeoutSeconds=590&watch=true: dial tcp 127.0.0.1:6443: connect: connection refused

I0608 01:01:29.222281       1 leaderelection.go:288] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
W0608 01:01:29.222543       1 start.go:206] leaderelection lost
E0608 01:01:29.222362       1 event.go:272] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I0608 01:01:29.222373       1 event.go:281] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ci-op-5s7c0frh-2aad9-89p48-master-0_0df5f8e3-f2da-4a84-ab76-1e166d3188ea stopped leading

Comment 5 Ben Parees 2020-06-10 21:40:05 UTC
(This is being discussed elsewhere; it sounds like the cause is that tests start running before the control plane is stable, so there is not a product issue here, but we do need to find a way to get our CI results back to green.)

Comment 6 W. Trevor King 2020-06-10 21:55:09 UTC
> ...it sounds like the cause is that tests start running before the control plane is stable...

Possibly related discussion in [1].  The fix would be the control plane not going Available=True until it was actually installed.  Or more narrowly, to add a sleeper step in the test templates to wait after install until the control plane had drunk its coffee.  Although it's not clear to me why a new API-server roll-out destabilizing callers who are sensitive to 127.0.0.1:6443 down-time would be unique to install.  Wouldn't that happen on updates too?

[1]: https://github.com/openshift/origin/pull/24992
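
For what it's worth, the "sleeper step" idea above could amount to something like the following settle loop before kicking off the suite. This is a sketch of the idea only, not what the test templates actually do:

# Wait until no ClusterOperator reports Progressing=True or Degraded=True before starting tests
# (a rough settle heuristic, not the adopted fix).
until oc get clusteroperators -o json \
  | jq -e '[.items[].status.conditions[] | select((.type=="Progressing" and .status=="True") or (.type=="Degraded" and .status=="True"))] | length == 0' >/dev/null; do
  echo "control plane still settling; retrying in 30s"
  sleep 30
done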

Comment 7 Ben Parees 2020-06-10 22:11:49 UTC
PR to ignore these alerts for now:
https://github.com/openshift/origin/pull/25094

It will need to be reverted when this is resolved.  I will be cherry-picking that change to 4.5 as well, so it will need to be reverted in 4.5 also.

Comment 8 Ben Parees 2020-06-12 16:33:06 UTC
4.5 PR that will need to be reverted when this is resolved: https://github.com/openshift/origin/pull/25099

Comment 9 Maciej Szulik 2020-06-15 14:39:39 UTC
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/420 and https://github.com/openshift/cluster-kube-scheduler-operator/pull/254 might be reasonable fixes for this.
We'll need to backport them to 4.5 once this proves to work. I'll revert the origin PR Ben mentioned above when these two PRs land.
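
For context, the change that eventually closed this bug (per the Links section above, cluster-kube-controller-manager-operator pull 431) points the controllers at the internal load balancer instead of localhost. A rough way to compare the two endpoints on a live cluster might look like this; the kubeconfig configmap name is an assumption, adjust it to whatever the operator actually deploys:

# Internal API load-balancer endpoint published by the cluster:
oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}{"\n"}'
# Server URL the KCM kubeconfig points at (configmap name assumed):
oc -n openshift-kube-controller-manager get configmap controller-manager-kubeconfig \
  -o jsonpath='{.data.kubeconfig}' | grep 'server:'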

Comment 11 Maciej Szulik 2020-06-18 10:16:47 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 12 Ben Parees 2020-06-24 17:01:34 UTC
this is suspected to be the cause of failures in:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes

Comment 13 Gal Zaidman 2020-06-25 14:50:58 UTC
On oVirt master e2e tests we started seeing this on:
"Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]"
This caused PRs to fail because we run this test on every PR.
I have asked for the alert to be ignored on the [Early] test case as well.

Comment 14 Ben Parees 2020-06-25 15:05:09 UTC
sample for comment 13:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/3773/pull-ci-openshift-installer-master-e2e-ovirt/1275836884324454400

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1": {
            s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ovirt11-l2n6s-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1593029533.438,\"1\"]}]",
        },
    }
to be empty

Comment 15 Ben Parees 2020-06-25 15:35:20 UTC
also applies to test:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over two minutes

Comment 16 Maciej Szulik 2020-07-06 10:57:30 UTC
*** Bug 1852974 has been marked as a duplicate of this bug. ***

Comment 20 Lars Kellogg-Stedman 2020-07-24 19:36:11 UTC
*** Bug 1860190 has been marked as a duplicate of this bug. ***

Comment 21 Vadim Rutkovsky 2020-07-25 07:41:29 UTC
This bug seems to affect 4.4 -> 4.5.3 upgrades on oVirt / BareMetal IPI, potentially OpenStack too.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 23 Lars Kellogg-Stedman 2020-07-26 11:12:26 UTC
> This bug seems to affect 4.4 -> 4.5.3 upgrades on oVirt / BareMetal IPI, potentially Openstack too. 

This bug seems to be independent of upgrades: I am seeing the same behavior when directly installing 4.5.3 on baremetal.

Comment 24 Oren Cohen 2020-07-26 16:39:38 UTC
I'm hitting this BZ in an upgrade from 4.4.11 to 4.5.3 on an OCP-over-RHV cluster, in which the pods in question are CrashLooping as well. 
Please see attached must-gather info of that cluster.

https://drive.google.com/file/d/1n1Zt-SmTWXHM3WLzFnKN-h25TEPPZReN/view?usp=sharing

$ oc describe clusteroperators kube-controller-manager
Name:         kube-controller-manager
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-15T12:33:41Z
  Generation:          1
  Resource Version:    172858317
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-controller-manager
  UID:                 18e4aec4-7ca3-4377-bc68-15fb279bd554
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-26T16:04:52Z
    Message:               StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-2 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-2_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-2 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-2_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-1 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-1_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-1 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-1_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-0 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-0_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-0 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-0_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
    Reason:                StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-23T20:26:07Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 69
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-15T12:37:40Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 69
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-15T12:33:41Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubecontrollermanagers
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-controller-manager
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-controller-manager-operator
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.5.3
    Name:     kube-controller-manager
    Version:  1.18.3
    Name:     operator
    Version:  4.5.3
Events:       <none>

Comment 25 Oren Cohen 2020-07-27 13:16:52 UTC
I would like to mention that this bug also causes the kube-controller-manager_kube-controller-manager pod logs to explode (under /var/log/pods), causing DiskPressure and eventually bringing the master node down.
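
A quick way to check for that growth on an affected master might be the following; the node name is a placeholder, and the log directory glob assumes the usual /var/log/pods/<namespace>_<pod>_<uid> layout:

# Size of the KCM static pod's log directories on the node:
oc debug node/<master> -- chroot /host sh -c \
  'du -sh /var/log/pods/openshift-kube-controller-manager_kube-controller-manager-*'
# Node DiskPressure condition:
oc get node <master> -o json | jq -r '.status.conditions[] | select(.type=="DiskPressure") | .status'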

Comment 26 Tomáš Nožička 2020-07-27 14:35:15 UTC
This bug isn't fatal in any way and is just tracking the firing alert. The scenario goes like this: KAS rolls out, KCM can't talk to the local apiserver and can't renew its leader-election lease, which makes it forcefully exit(1). When it tries to listen on its serving port again, the port is held by the kernel (TIME_WAIT) and KCM exits again, possibly a few times with kubelet-added back-off, but the port eventually becomes available and it starts again after a bit.
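
To watch that cycle on an affected master, something like the following might do; the cluster-policy-controller serving port (10357) is an assumption and the node name is a placeholder:

# Look for the serving port lingering in TIME_WAIT on the node:
oc debug node/<master> -- chroot /host sh -c 'ss -tan | grep 10357'
# Watch the container restart count climb while kubelet backs off:
oc -n openshift-kube-controller-manager get pods -o json \
  | jq -r '.items[] | [.metadata.name, (.status.containerStatuses[]? | select(.name=="cluster-policy-controller") | .restartCount)] | @tsv'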

This isn't upgrade blocking as far as I can tell.

The reported customer cases don't seem related; as those are bare metal, I think they may be related to the haproxy conflict (https://bugzilla.redhat.com/show_bug.cgi?id=1858498). If that's not the case, please file a new BZ and attach a must-gather.

Comment 27 Lars Kellogg-Stedman 2020-07-27 14:47:56 UTC
Tomáš: I had previously filed https://bugzilla.redhat.com/show_bug.cgi?id=1860190 for the baremetal issue, but closed it as a dupe of this bug because the symptoms seem remarkably similar. There are some logs posted in the bz; if after looking at that you think it's actually a different problem let me know and I'm happy to reopen it.

Comment 28 Tomáš Nožička 2020-07-27 15:18:56 UTC
We talked with Lars on Slack; he is going to reopen https://bugzilla.redhat.com/show_bug.cgi?id=1860190 and attach a must-gather there so we can find the cause in his cluster.

Comment 30 Tomáš Nožička 2020-07-28 07:32:27 UTC
Neeraj please read my previous comment on this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1842002#c26

There really shouldn't be any customer bug attached to this BZ.

See https://bugzilla.redhat.com/show_bug.cgi?id=1860190#c6 but you are likely hitting https://bugzilla.redhat.com/show_bug.cgi?id=1858498

Comment 33 RamaKasturi 2020-08-10 05:41:43 UTC
Verified the bug against the payload below, using the following test steps.

4.6.0-0.nightly-2020-08-07-042746

Test Steps Followed:
=======================
1) Install the latest 4.6 cluster.
2) Use a master that holds the KCM leader lease; bring down the local kube-apiserver pod on that master and see whether the local KCM fails or keeps working.
3) To bring down the local kube-apiserver, ssh to the master and run: mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml
4) Terminate it gracefully by running the command "pid=$(ps aux | grep " kube-apiserver " | grep -v grep | awk 'NR==1 {print $2}'); kill $pid" (steps 3-5 are consolidated in the sketch after this list).
5) Keep checking the KCM and api-server logs.
6) Verify that the KCM logs do not show any of the errors described in the summary above.
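
Steps 3-5, consolidated as a sketch; the master name, ssh user, and KCM pod name are placeholders, and the commands themselves are the ones from the steps above:

# Stop the static kube-apiserver on the chosen master by moving its manifest,
# then terminate the running process gracefully.
ssh core@<master> 'sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml'
ssh core@<master> 'pid=$(ps aux | grep " kube-apiserver " | grep -v grep | awk "NR==1 {print \$2}"); sudo kill $pid'
# Follow the cluster-policy-controller logs for connection-refused or
# leader-election errors while the local apiserver is down.
oc -n openshift-kube-controller-manager logs -f <kcm-pod-on-that-master> -c cluster-policy-controller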

When verifying, I see that everything else is using the LB, but the webhook authenticator somehow doesn't have the master name replaced; KCM keeps running but hits the error below. I will raise a separate bug for that error.

E0807 13:48:12.608278       1 webhook.go:111] Failed to make webhook authenticator request: Post "https://localhost:6443/apis/authentication.k8s.io/v1/tokenreviews": dial tcp [::1]:6443: connect: connection refused
E0807 13:48:12.608328       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Post "https://localhost:6443/apis/authentication.k8s.io/v1/tokenreviews": dial tcp [::1]:6443: connect: connection refused]

Based on the above, moving the bug to verified state.

Comment 35 RamaKasturi 2020-08-10 16:01:24 UTC
Hi Yu Qi Zhang,

    I looked at the links pasted above in comment 34 and do not see any of the errors described in this bug. However, I will wait for Tomáš to also take a look, in case I missed anything important.

@Tomáš, could you please take a look?

Thanks
kasturi

Comment 36 Yu Qi Zhang 2020-08-10 16:14:39 UTC
Ah, sorry, I should have clarified. This is based on Ben's comment above:

this is suspected to be the cause of failures in:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes

As the build watcher today, I saw this error come up a decent amount and wanted to check whether this was the correct issue.

Comment 37 Tomáš Nožička 2020-08-10 17:02:22 UTC
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292725766252924928

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Run #0: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-infra/recycler-for-nfs-ll6bd was pending entire time: unknown error",
    ]
to be empty
Run #1: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-infra/recycler-for-nfs-ll6bd was pending entire time: unknown error",
    ]
to be empty

What leads you to think it's connected to KCM (KubePodCrashLooping kube-controller-manager cluster-policy-controller: 6443: connect: connection refused)?

Comment 38 Yu Qi Zhang 2020-08-10 17:43:44 UTC
OK, apologies, moving back to verified.

Comment 40 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

