Description of problem:

On installs where the masters are schedulable and the ingress router can be scheduled on a master, the router already uses port 10443, leading to a crash loop in the new kube-scheduler-recovery-controller container.

https://github.com/openshift/router/blob/cca042b8b1ef6c3acc176c6c0a04908b5dd45b2e/images/router/haproxy/conf/haproxy-config.template#L357

  name: kube-scheduler-recovery-controller
  command:
    - /bin/bash
    - '-euxo'
    - pipefail
    - '-c'
  ...
  args:
    - >
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10443 \))" ]; do sleep 1; done'

      exec cluster-kube-scheduler-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:10443 -v=2

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-15-081329

How reproducible:
Always

Steps to Reproduce:
1. Install a 3-node cluster and make the masters schedulable (i.e. masters have both the master and worker roles).
2.
3.

Actual results:

$ oc get pod | grep kube-sche
openshift-kube-scheduler-master-0.ocp-dev.variantweb.net   2/3   CrashLoopBackOff   20   140m
openshift-kube-scheduler-master-1.ocp-dev.variantweb.net   3/3   Running            1    142m
openshift-kube-scheduler-master-2.ocp-dev.variantweb.net   2/3   CrashLoopBackOff   21   146m

$ oc get pod -n openshift-ingress -owide
NAME                              READY   STATUS    RESTARTS   AGE    IP             NODE                              NOMINATED NODE   READINESS GATES
router-default-86dcd458d8-7vbj2   1/1     Running   0          152m   10.42.11.118   master-0.ocp-dev.variantweb.net   <none>           <none>
router-default-86dcd458d8-vvxgb   1/1     Running   0          152m   10.42.11.120   master-2.ocp-dev.variantweb.net   <none>           <none>

Expected results:
The kube-scheduler pod should be able to run successfully on the same node as the router.

Additional info:
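For anyone debugging this, the conflict can be confirmed directly on an affected master by reusing the same ss filter the controller itself runs (a diagnostic sketch; the node name is taken from the "Actual results" output above, adjust for your cluster):

  # node name below is from the output above; -p needs root, which chroot /host provides
  $ oc debug node/master-0.ocp-dev.variantweb.net -- chroot /host ss -Htanop '( sport = 10443 )'

On an affected node this should show the router's haproxy listener bound to 10443; empty output means the port is free and the recovery controller can start.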
Both components should probably also register their ports in [1] or somewhere in that doc.

[1]: https://github.com/openshift/enhancements/blob/5f2529a2a02a73aad17620d643e89eed189f14e3/enhancements/network/host-port-registry.md#localhost-only
Mike, we'll probably need to pick a different port for the recovery controller; sync with Tomas if in doubt. I'm marking this blocker+ since it affects the stability of the product when we're running in a schedulable-masters configuration.
FYI, the port used by the router has been updated in the port registry: https://github.com/openshift/enhancements/pull/568
I opened 2 PRs:

- https://github.com/openshift/cluster-kube-scheduler-operator/pull/311, to change the port in kube-scheduler to 11443 (just a guess, need to confirm this value works)
- https://github.com/openshift/enhancements/pull/569, to add that port, and the kube-controller-manager port for the same controller, to the registry
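To help confirm the candidate value, nothing on the masters should already be listening on it; the same ss probe from the container args can be reused (a sketch, assuming the node names from this report; empty output means the port is free):

  # node names assumed from the report above
  for node in master-0 master-1 master-2; do
    oc debug node/${node}.ocp-dev.variantweb.net -- chroot /host ss -Htanop '( sport = 11443 )'
  done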
*** Bug 1910417 has been marked as a duplicate of this bug. ***
Dropping a reference to at least one of the e2e test-cases this kills (for compact clusters), to make this issue more discoverable in Sippy.
A similar issue was hit when an upgrade was performed from 4.2 to a 4.7 nightly build; per the discussion with dev, the PR here should fix the issue.
Verified with the latest build below, and I see that the port has been changed to 11443 instead of 10443. Also tried an upgrade from 4.2 to 4.7, which was failing before the fix, and now I can see that it passes with the latest 4.7 build.

[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2021-01-10-070949]$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-10-070949   True        False         6h7m    Cluster version is 4.7.0-0.nightly-2021-01-10-070949

Post action:
# oc get co
NAME                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication       4.7.0-0.nightly-2021-01-10-070949   True        False         False      3m42s
baremetal            4.7.0-0.nightly-2021-01-10-070949   True        False         False      25m
cloud-credential     4.7.0-0.nightly-2021-01-10-070949   True        False         False      4h19m
cluster-autoscaler   4.7.0-0.nightly-2021-01-10-070949   True        False         False      4h11m
config-operator      4.7.0-0.nightly-2021-01-10-070949   True        False         False      138m
console              4.7.0-0.nightly-2021-01-10-070949   True        False         False      23m

Port before the fix:
===========================
    name: cert-dir
  - args:
    - |
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10443 \))" ]; do sleep 1; done'

      exec cluster-kube-scheduler-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:10443 -v=2

Port after the fix:
=============================
  - args:
    - |
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 11443 \))" ]; do sleep 1; done'

      exec cluster-kube-scheduler-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:11443 -v=2
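The configured port can also be pulled straight from the running scheduler pods instead of reading the full pod spec (a sketch; the app=openshift-kube-scheduler label selector is an assumption about how the static pods are labelled):

  # label selector is an assumption about the static pod labels
  $ oc -n openshift-kube-scheduler get pods -l app=openshift-kube-scheduler -o yaml | grep -o -- '--listen=[^ ]*'

On a fixed build this should print --listen=0.0.0.0:11443 for each pod.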
Based on comment 9, moving the bug to the verified state.
(In reply to RamaKasturi from comment #10)
> Based on comment 9 moving the bug to verified state.

Also tried both UPI & IPI installs where the nodes have both the master & worker roles, but could not reproduce the crash.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633