Bug 1908145
Summary: | kube-scheduler-recovery-controller container crash loop when router pod is co-scheduled | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning>
Component: | kube-scheduler | Assignee: | Mike Dame <mdame>
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.7 | CC: | aos-bugs, jluhrsen, mfojtik, wking
Target Milestone: | --- | Keywords: | TestBlockerForLayeredProduct
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Last Closed: | 2021-02-24 15:44:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description Seth Jennings 2020-12-15 22:48:42 UTC
Both components should probably also register their ports in [1] or somewhere in that doc.

[1]: https://github.com/openshift/enhancements/blob/5f2529a2a02a73aad17620d643e89eed189f14e3/enhancements/network/host-port-registry.md#localhost-only

Mike, we'll probably need to pick a different port for the recovery controller; sync with Tomas if in doubt.

I'm marking this blocker+ since it affects the stability of the product when we're running in a schedulable-masters configuration.

FYI, the port used by the router has been updated in the port registry: https://github.com/openshift/enhancements/pull/568

I opened 2 PRs:

- https://github.com/openshift/cluster-kube-scheduler-operator/pull/311, to change the port in kube-scheduler to 11443 (just a guess, need to confirm this value works)
- https://github.com/openshift/enhancements/pull/569, to add that, and the kube-controller-manager port for the same controller, to the registry

*** Bug 1910417 has been marked as a duplicate of this bug. ***

Dropping a reference to at least one of the e2e test cases this kills (for compact clusters), to make this issue more discoverable in Sippy.

A similar issue was hit when an upgrade was performed from 4.2 to a 4.7 nightly build, and per the discussion with dev, the PR here should fix the issue.

Verified with the latest build below, and I see that the port has been changed to 11443 instead of 10443. Also tried an upgrade from 4.2 to 4.7, which was failing before the fix, and now it passes with the latest 4.7 build.

```
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2021-01-10-070949]$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-10-070949   True        False         6h7m    Cluster version is 4.7.0-0.nightly-2021-01-10-070949
```

Post action, `# oc get co`:

```
NAME                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication       4.7.0-0.nightly-2021-01-10-070949   True        False         False      3m42s
baremetal            4.7.0-0.nightly-2021-01-10-070949   True        False         False      25m
cloud-credential     4.7.0-0.nightly-2021-01-10-070949   True        False         False      4h19m
cluster-autoscaler   4.7.0-0.nightly-2021-01-10-070949   True        False         False      4h11m
config-operator      4.7.0-0.nightly-2021-01-10-070949   True        False         False      138m
console              4.7.0-0.nightly-2021-01-10-070949   True        False         False      23m
```

Port before the fix:

```yaml
          name: cert-dir
      - args:
        - |
          timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10443 \))" ]; do sleep 1; done'
          exec cluster-kube-scheduler-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:10443 -v=2
```

Port after the fix:

```yaml
      - args:
        - |
          timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 11443 \))" ]; do sleep 1; done'
          exec cluster-kube-scheduler-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:11443 -v=2
```

Based on comment 9, moving the bug to verified state.

(In reply to RamaKasturi from comment #10)
> Based on comment 9 moving the bug to verified state.

Also tried both UPI and IPI installs where the node has both master and worker roles, but could not reproduce the crash.
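For anyone triaging a similar collision on a compact cluster, the snippet below is a minimal sketch of how to confirm which process owns the contested port from a master node. It reuses the same `ss` filter that the cert-recovery-controller's wait loop runs (shown in the args above); `<master-node>` is a placeholder, and `oc debug` is just one way to get a host shell, not something prescribed by this bug.

```bash
# Minimal sketch: check who owns the contested port on a compact
# (schedulable-master) node. <master-node> is a placeholder; substitute
# a real node name from `oc get nodes`.
oc debug node/<master-node> -- chroot /host /bin/bash -c '
  # Before the fix: the router and the recovery controller both want
  # 10443, so this shows the current owner (if any).
  ss -Htanop "( sport = 10443 )"

  # After the fix: the recovery controller listens on 11443 instead,
  # so this should list only that listener.
  ss -Htanop "( sport = 11443 )"
'
```

If the first command reports a router process while the kube-scheduler recovery container is crash looping, that is the same collision this bug describes.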
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633