A bunch of recent 4.5 CI jobs have failed:

  [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

with errors like [1]:

  fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
      "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
        s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.3:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1590802563.871,\"8\"]}]",
      },
    }
  to be empty

In the case of [1], the underlying reason seems to have been a local API-server outage:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.containerStatuses[] | select(.name == "cluster-policy-controller").lastState.terminated | (.exitCode | tostring) + "\n" + .message'
  255
  m_amd64.s:1357: Failed to watch *v1beta1.Ingress: Get https://localhost:6443/apis/networking.k8s.io/v1beta1/ingresses?allowWatchBookmarks=true&resourceVersion=13379&timeout=6m23s&timeoutSeconds=383&watch=true: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:35.447019 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ServiceAccount: Get https://localhost:6443/api/v1/serviceaccounts?allowWatchBookmarks=true&resourceVersion=17548&timeout=9m29s&timeoutSeconds=569&watch=true: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:35.448120 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.DeploymentConfig: Get https://localhost:6443/apis/apps.openshift.io/v1/deploymentconfigs?allowWatchBookmarks=true&resourceVersion=24025&timeout=5m20s&timeoutSeconds=320&watch=true: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:35.449101 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.Route: Get https://localhost:6443/apis/route.openshift.io/v1/routes?allowWatchBookmarks=true&resourceVersion=24025&timeout=6m23s&timeoutSeconds=383&watch=true: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:35.450198 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.BuildConfig: Get https://localhost:6443/apis/build.openshift.io/v1/buildconfigs?allowWatchBookmarks=true&resourceVersion=24025&timeout=9m36s&timeoutSeconds=576&watch=true: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:35.451437 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=24823&timeout=7m43s&timeoutSeconds=463&watch=true: dial tcp [::1]:6443: connect: connection refused
  I0530 01:11:35.602515 1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
  F0530 01:11:35.602590 1 policy_controller.go:94] leaderelection lost

[2] suggests the crashlooping cluster-policy-controller error is really hurting 4.5 acceptance:

  release-openshift-origin-installer-e2e-gcp-4.5 - 70 runs, 46% failed, 38% of failures match

although I guess it's possible that the container crashloops for different reasons in different jobs.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428
[2]: https://search.apps.build01.ci.devcluster.openshift.com/?search=promQL+query%3A+count_over_time.*KubePodCrashLooping.*cluster-policy-controller&maxAge=48h&context=1&type=bug%2Bjunit&name=release-openshift.*gcp-4.5&maxMatches=5&maxBytes=20971520&groupBy=job
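For reference, the failing check is just a promQL query against the cluster's Prometheus, so it can be re-run by hand. Below is a rough sketch of doing that with prometheus/client_golang; the route address and bearer token are placeholders, TLS/CA handling is omitted, and this is not the origin test's own helper code:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// bearerTransport adds the service-account token the monitoring route expects.
type bearerTransport struct {
	token string
	rt    http.RoundTripper
}

func (t *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req.Header.Set("Authorization", "Bearer "+t.token)
	return t.rt.RoundTrip(req)
}

func main() {
	client, err := api.NewClient(api.Config{
		// Placeholder: use the prometheus-k8s route from openshift-monitoring.
		Address:      "https://prometheus-k8s-openshift-monitoring.apps.example.com",
		RoundTripper: &bearerTransport{token: "<sa-token>", rt: http.DefaultTransport},
	})
	if err != nil {
		panic(err)
	}

	// Same shape of query the [Late] alert test runs.
	query := `count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing",severity!="info"}[2h]) >= 1`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Any returned series is an alert the test would flag, e.g. the KubePodCrashLooping series above.
	fmt.Println(result)
}
```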
release-openshift-ocp-installer-e2e-gcp-4.5 - 11 runs, 36% failed, 75% of failures match

From comment 0's [2]. Are we sure we want to punt to 4.6?
A few more length-of-outage notes on the job I picked as the original example:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep '6443: connect: connection refused' | head -n1
  E0530 01:10:53.321223 1 reflector.go:382] runtime/asm_amd64.s:1357: Failed to watch *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?allowWatchBookmarks=true&resourceVersion=13376&timeout=9m24s&timeoutSeconds=564&watch=true: dial tcp [::1]:6443: connect: connection refused

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep '6443: connect: connection refused' | cut -b -11 | uniq -c
    245 E0530 01:10
   1264 E0530 01:11

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-jq5di4gm-2aad9-94bbs-master-2_cluster-policy-controller_previous.log | grep leaderelection
  I0530 01:06:40.071544 1 leaderelection.go:242] attempting to acquire leader lease openshift-kube-controller-manager/cluster-policy-controller...
  I0530 01:06:40.094608 1 leaderelection.go:252] successfully acquired lease openshift-kube-controller-manager/cluster-policy-controller
  E0530 01:11:00.603333 1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:10.604609 1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:20.604717 1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
  E0530 01:11:30.604897 1 leaderelection.go:320] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller: dial tcp [::1]:6443: connect: connection refused
  I0530 01:11:35.602515 1 leaderelection.go:277] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
  F0530 01:11:35.602590 1 policy_controller.go:94] leaderelection lost

So 42 seconds with no local API server before the pod gives up and stops waiting. [1] shows the API-server pod starting to log another ~23s later, at 01:11:58.
Events from this vicinity:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .firstTimestamp + " " + .lastTimestamp + " " + .message' | sort
  ...
  2020-05-30T01:10:53Z 2020-05-30T01:10:53Z All pending requests processed
  2020-05-30T01:10:53Z 2020-05-30T01:10:53Z Server has stopped listening
  2020-05-30T01:10:53Z 2020-05-30T01:10:53Z The minimal shutdown duration of 1m10s finished
  ...
  2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Created container kube-apiserver-cert-regeneration-controller
  2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Created container kube-apiserver-cert-syncer
  2020-05-30T01:11:58Z 2020-05-30T01:11:58Z Started container kube-apiserver
  ...

Trigger for the Kube-API outage seems to have been the gen-6 rollout:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | .metadata.creationTimestamp + " " + .metadata.name + " " + .status.phase' | sort | grep master-2 | tail -n4
  2020-05-30T00:50:09Z revision-pruner-4-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded
  2020-05-30T01:09:32Z installer-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded
  2020-05-30T01:09:43Z kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 Running
  2020-05-30T01:12:24Z revision-pruner-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 Succeeded

Poking at the API-server pod some more:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status' | sort
  2020-05-30T00:45:37Z PodScheduled=True
  2020-05-30T01:11:57Z Initialized=True
  2020-05-30T01:12:09Z ContainersReady=True
  2020-05-30T01:12:09Z Ready=True

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2").status.initContainerStatuses[].state.terminated | .startedAt + " " + .finishedAt + " " + (.exitCode | tostring) + " " + .reason'
  2020-05-30T01:11:30Z 2020-05-30T01:11:56Z 0 Completed

So the timeline is something like:

* 1:06:40: cluster-policy-controller starts logging and becomes the leader.
* 1:09:32: installer-6-ci-op-jq5di4gm-2aad9-94bbs-master-2 starts doing whatever it does.
* 1:09:43: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 created.
* 1:10:53: cluster-policy-controller starts failing to connect to 6443.
* 1:11:30: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2's init container starts up.
* 1:11:35: cluster-policy-controller gives up and exits.
* 1:11:56: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2's init container completes.
* 1:11:58: kube-apiserver container starts logging.
* 1:12:09: kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2 goes Ready=True.

I'm not clear on what triggered the installer-6 rollout, but it's possible that cluster-policy-controller could avoid a restart by growing its leader lease-renew timeout from its current ~42s to ~1m30s. Or maybe there's something we can do about explicit leader handoffs before a coming API-server outage? Or maybe we could tighten down API-server version bumps by keeping the old pod listening on 6443 and having the new pod come in with SO_REUSEPORT and pick up cleanly without dropping traffic (no idea if we can SO_REUSEPORT node ports with CRI-O containers)? Or something...

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2428/artifacts/e2e-gcp/pods/openshift-kube-apiserver_kube-apiserver-ci-op-jq5di4gm-2aad9-94bbs-master-2_kube-apiserver.log
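On the lease-renew idea: the leaderelection.go lines in the logs come from client-go's leaderelection package, so "growing the timeout" amounts to passing larger LeaseDuration/RenewDeadline/RetryPeriod values. A minimal sketch, assuming that package; the 137s/107s/26s numbers are purely illustrative (long enough to ride out the ~65s of local API-server downtime in the timeline above), not values any PR is known to ship:

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	// Same ConfigMap lock the logs show: openshift-kube-controller-manager/cluster-policy-controller.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock,
		"openshift-kube-controller-manager",
		"cluster-policy-controller",
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Illustrative values: tolerate a longer local API-server outage
		// before the "failed to renew lease" path is taken.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the controllers here
			},
			OnStoppedLeading: func() {
				// this is the "leaderelection lost" fatal exit seen in the logs
				os.Exit(1)
			},
		},
	})
}
```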
Poking this one again after two more days of silence:

  $ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?maxAge=48h&search=promQL+query%3A+count_over_time.*KubePodCrashLooping.*cluster-policy-controller&name=release-openshift-' | grep 'failures match'
  release-openshift-ocp-installer-e2e-aws-4.5 - 15 runs, 67% failed, 60% of failures match
  release-openshift-ocp-installer-e2e-aws-4.6 - 28 runs, 79% failed, 36% of failures match
  release-openshift-ocp-installer-e2e-azure-4.5 - 8 runs, 88% failed, 14% of failures match
  release-openshift-ocp-installer-e2e-ovirt-4.6 - 4 runs, 100% failed, 25% of failures match
  release-openshift-ocp-installer-e2e-vsphere-upi-4.5 - 5 runs, 80% failed, 50% of failures match
  release-openshift-origin-installer-e2e-gcp-shared-vpc-4.5 - 1 runs, 100% failed, 200% of failures match
  release-openshift-ocp-installer-e2e-gcp-4.5 - 7 runs, 57% failed, 50% of failures match
  release-openshift-ocp-installer-e2e-aws-upi-4.5 - 6 runs, 50% failed, 33% of failures match
  release-openshift-origin-installer-e2e-gcp-4.6 - 12 runs, 33% failed, 75% of failures match
  release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 1 runs, 100% failed, 100% of failures match
  release-openshift-ocp-installer-e2e-gcp-rt-4.5 - 1 runs, 100% failed, 100% of failures match
  release-openshift-origin-installer-e2e-gcp-4.5 - 1 runs, 100% failed, 400% of failures match
  release-openshift-ocp-installer-e2e-openstack-4.5 - 4 runs, 100% failed, 25% of failures match
  promote-release-openshift-machine-os-content-e2e-aws-4.5 - 86 runs, 100% failed, 2% of failures match
the last 5 4.5-CI payloads were rejected due to failures when running on GCP. All the failures are some form of this crashlooping operator pod issue, where the pod seems to be crashlooping because of leader-election issues due to being unable to reach the apiserver. This probably isn't specific to the kube-controller-manager, but it certainly doesn't seem deferrable.

examples:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463

example logs:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464/artifacts/e2e-gcp/pods/openshift-cluster-version_cluster-version-operator-5b598dd6c9-pdpxx_cluster-version-operator_previous.log

E0608 01:01:23.779127 1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ConfigMap: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-config/configmaps?allowWatchBookmarks=true&resourceVersion=23567&timeout=9m50s&timeoutSeconds=590&watch=true: dial tcp 127.0.0.1:6443: connect: connection refused
I0608 01:01:29.222281 1 leaderelection.go:288] failed to renew lease openshift-cluster-version/version: timed out waiting for the condition
W0608 01:01:29.222543 1 start.go:206] leaderelection lost
E0608 01:01:29.222362 1 event.go:272] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I0608 01:01:29.222373 1 event.go:281] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ci-op-5s7c0frh-2aad9-89p48-master-0_0df5f8e3-f2da-4a84-ab76-1e166d3188ea stopped leading
(this is being discussed elsewhere, it sounds like the cause is that tests start running before the control plane is stable, so there's not a product issue here, but we do need to find a way to get our CI results back to green)
> ...it sounds like the cause is that tests start running before the control plane is stable...

Possibly related discussion in [1]. The fix would be the control plane not going Available=True until it was actually installed. Or, more narrowly, adding a sleep/wait step in the test templates that waits after install until the control plane has drunk its coffee. Although it's not clear to me why a new API-server roll-out destabilizing callers who are sensitive to 127.0.0.1:6443 down-time would be unique to install. Wouldn't that happen on updates too?

[1]: https://github.com/openshift/origin/pull/24992
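As a sketch of what such a post-install "wait until the control plane has settled" step could check, here is a rough loop over ClusterOperator conditions using openshift/client-go. The kubeconfig path, polling interval/timeout, and the exact stability criteria are placeholders, not whatever the test templates actually end up doing:

```go
package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"
)

// cond returns the status of the named condition, or Unknown if it is missing.
func cond(co configv1.ClusterOperator, t configv1.ClusterStatusConditionType) configv1.ConditionStatus {
	for _, c := range co.Status.Conditions {
		if c.Type == t {
			return c.Status
		}
	}
	return configv1.ConditionUnknown
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	err = wait.PollImmediate(30*time.Second, 30*time.Minute, func() (bool, error) {
		cos, err := client.ConfigV1().ClusterOperators().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil // API blip (e.g. a kube-apiserver rollout); keep polling
		}
		for _, co := range cos.Items {
			if cond(co, configv1.OperatorAvailable) != configv1.ConditionTrue ||
				cond(co, configv1.OperatorProgressing) != configv1.ConditionFalse ||
				cond(co, configv1.OperatorDegraded) != configv1.ConditionFalse {
				fmt.Printf("still waiting on clusteroperator/%s\n", co.Name)
				return false, nil
			}
		}
		return true, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("control plane looks settled")
}
```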
PR to ignore these alerts for now: https://github.com/openshift/origin/pull/25094. It needs to be reverted when this is resolved. I will be cherry-picking that change to 4.5 as well, so it will need to be reverted in 4.5 also.
4.5 PR that will need to be reverted when this is resolved: https://github.com/openshift/origin/pull/25099
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/420 and https://github.com/openshift/cluster-kube-scheduler-operator/pull/254 might be reasonable fixes for that. We'll need to backport them to 4.5 once this proves to work. I'll revert the origin PR Ben mentioned above when these 2 PRs land.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
this is suspected to be the cause of failures in: [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
In oVirt master e2e tests we started seeing this on:

  "Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]"

This causes PRs to fail because we run this test on every PR. I've asked for the alert to be ignored in the Early test case as well.
sample for comment 13: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/3773/pull-ci-openshift-installer-master-e2e-ovirt/1275836884324454400

  fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
      "ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1": {
        s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"cluster-policy-controller\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.9:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-controller-manager\",\"pod\":\"kube-controller-manager-ovirt11-l2n6s-master-0\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1593029533.438,\"1\"]}]",
      },
    }
  to be empty
also applies to test: [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over two minutes
*** Bug 1852974 has been marked as a duplicate of this bug. ***
*** Bug 1860190 has been marked as a duplicate of this bug. ***
This bug seems to affect 4.4 -> 4.5.3 upgrades on oVirt / BareMetal IPI, potentially Openstack too.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1
> This bug seems to affect 4.4 -> 4.5.3 upgrades on oVirt / BareMetal IPI, potentially Openstack too. This bug seems to be independent of upgrades: I am seeing the same behavior when directly installing 4.5.3 on baremetal.
I'm hitting this BZ in an upgrade from 4.4.11 to 4.5.3 on an OCP-over-RHV cluster, in which the pods in question are CrashLooping as well. Please see attached must-gather info of that cluster:
https://drive.google.com/file/d/1n1Zt-SmTWXHM3WLzFnKN-h25TEPPZReN/view?usp=sharing

$ oc describe clusteroperators kube-controller-manager
Name:         kube-controller-manager
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-15T12:33:41Z
  Generation:          1
  Resource Version:    172858317
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/kube-controller-manager
  UID:                 18e4aec4-7ca3-4377-bc68-15fb279bd554
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-26T16:04:52Z
    Message:               StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-2 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-2_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-2 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-2_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-1 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-1_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-1 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-1_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-0 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-0_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
StaticPodsDegraded: pod/kube-controller-manager-cnv-x4lcc-master-0 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-cnv-x4lcc-master-0_openshift-kube-controller-manager(adee5fa30eb0afa0d7589c759e75190b)
    Reason:                StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-23T20:26:07Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 69
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-15T12:37:40Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 69
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-15T12:33:41Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubecontrollermanagers
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-controller-manager
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-controller-manager-operator
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.5.3
    Name:     kube-controller-manager
    Version:  1.18.3
    Name:     operator
    Version:  4.5.3
Events:  <none>
I would like to mention that this bug also causes the kube-controller-manager_kube-controller-manager pod logs to explode (under /var/log/pods), causing DiskPressure and eventually brings the master node down.
This bug isn't fatal in any way and is just tracking firing the alert. The scenario goes like this: KAS rolls out, KCM can't talk to the local apiserver and can't renew its leader-election lease, which makes it forcefully exit(1). When it tries to listen on the serving port again, the port is still held by the kernel (TIME_WAIT) and KCM exits again, possibly a few times with kubelet-added back-off, but the port will become available and it will start after a bit. This isn't upgrade blocking as far as I can tell. The reported customer cases don't seem related - as those are bare metal, I think they may be related to the haproxy conflict https://bugzilla.redhat.com/show_bug.cgi?id=1858498 - if that's not the case, please file a new BZ and attach must-gather.
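On the "port held by the kernel" point and the earlier SO_REUSEPORT suggestion: in Go, opting a listener into SO_REUSEPORT takes an explicit socket option via net.ListenConfig.Control. A Linux-only sketch under that assumption, with a placeholder port; this is not how the actual KCM/KAS serving stacks are wired, just an illustration of the knob being discussed:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	// Listener that sets SO_REUSEPORT before bind(), so a replacement process
	// can bind the same port while the old one is still (or was just) serving,
	// instead of crash-looping on "address already in use".
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}

	ln, err := lc.Listen(context.Background(), "tcp", ":6443") // placeholder port
	if err != nil {
		panic(err)
	}
	fmt.Println("listening on", ln.Addr())

	_ = http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
}
```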
Tomáš: I had previously filed https://bugzilla.redhat.com/show_bug.cgi?id=1860190 for the baremetal issue, but closed it as a dupe of this bug because the symptoms seem remarkably similar. There are some logs posted in the bz; if after looking at that you think it's actually a different problem let me know and I'm happy to reopen it.
We talked with Lars on Slack, he's gonna reopen https://bugzilla.redhat.com/show_bug.cgi?id=1860190 and attach must-gather there so we can find the cause in his cluster.
Neeraj please read my previous comment on this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1842002#c26 There really shouldn't be any customer bug attached to this BZ. See https://bugzilla.redhat.com/show_bug.cgi?id=1860190#c6 but you are likely hitting https://bugzilla.redhat.com/show_bug.cgi?id=1858498
Verified the bug in the payload below, following the test steps below:

4.6.0-0.nightly-2020-08-07-042746

Test Steps Followed:
=======================
1) Install the latest 4.6 cluster.
2) On the same master, bring down the local kube-apiserver pod and see whether the local KCM fails or keeps working; use a master holding the KCM leader lease.
3) To bring down the local kube-apiserver, ssh to the master and run: mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /home/kube-apiserver-pod.yaml
4) Terminate it gracefully by running the command: pid=$(ps aux | grep " kube-apiserver " | grep -v grep | awk 'NR==1 {print $2}'); kill $pid
5) Now keep checking the KCM and api-server logs.
6) Verify that the KCM logs do not show any errors as described in the summary above.

While verifying, I see that all the other stuff is using the LB, but the webhook authenticator somehow doesn't have the master name replaced. KCM keeps running but hits the error below; I will raise a separate bug for that error.

E0807 13:48:12.608278 1 webhook.go:111] Failed to make webhook authenticator request: Post "https://localhost:6443/apis/authentication.k8s.io/v1/tokenreviews": dial tcp [::1]:6443: connect: connection refused
E0807 13:48:12.608328 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Post "https://localhost:6443/apis/authentication.k8s.io/v1/tokenreviews": dial tcp [::1]:6443: connect: connection refused]

Based on the above, moving the bug to verified state.
Aug 10 AWS release jobs still show flakes/failures on this job quite often, showing up on the build-watcher interface:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292725766252924928
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292762336389500928
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292733351848841216
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292686520171368448

Could you verify if these are related?
Hi Yu Qi Zhang, I looked at the links pasted above in comment 34 and do not see any errors like the ones described in this bug. However, I will wait for Tomas to take a look as well, in case I missed anything important. @tomas, could you please help take a look? Thanks, kasturi
Ah, sorry, I should have clarified. This is based on Ben's comment above:

> this is suspected to be the cause of failures in: [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes

As the build watcher today, I saw this error come up a decent amount and wanted to check whether this is the correct issue.
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1292725766252924928

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

Run #0: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
  <[]string | len:1, cap:1>: [
    "Pod openshift-infra/recycler-for-nfs-ll6bd was pending entire time: unknown error",
  ]
to be empty

Run #1: Failed (4m3s)
fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
  <[]string | len:1, cap:1>: [
    "Pod openshift-infra/recycler-for-nfs-ll6bd was pending entire time: unknown error",
  ]
to be empty

What leads you to think it's connected to KCM (KubePodCrashLooping kube-controller-manager cluster-policy-controller: 6443: connect: connection refused)?
Ok, apologies, moving back to verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196