+++ This bug was initially created as a clone of Bug #1851390 +++
+++ This bug was initially created as a clone of Bug #1851389 +++

The kcm pod crashloops because its port is already in use. I saw a case with the cluster-policy-controller container, but it is not limited to that one. Crashlooping triggers alerts and adds backoff for the pod, so it starts more slowly. A container can be restarted while the pod stays. For that reason, we need to check port availability in the same process that listens, not in an init container, which isn't re-run.
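The idea can be sketched as a bash entrypoint that waits for its own serving port before exec'ing the server. This is only an illustrative sketch of the pattern, not the exact manifest script; the port number, timeout, and final server command are assumptions for the example.

```shell
#!/bin/bash
# Sketch of the in-process port check: the same container process that will
# listen performs the wait, so kubelet re-runs it on every container restart.
# An init container is not re-run when only the main container crashes.
set -u

wait_for_port_free() {
  local port=$1
  local deadline=$(( SECONDS + 180 ))
  # Poll `ss` until no TCP socket reports this source port, up to 3 minutes.
  while [ -n "$(ss -Htan "( sport = :${port} )")" ]; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "port ${port} still in use after 3m" >&2
      return 1
    fi
    sleep 1
  done
  return 0
}

wait_for_port_free 10257 && echo "port 10257 free"
# The real entrypoint would now exec the server, e.g.:
# exec hyperkube kube-controller-manager ...
```

Because the wait and the listen happen in one process, any restart of that container repeats the check, which is exactly what the init-container variant could not guarantee.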
*** Bug 1851404 has been marked as a duplicate of this bug. ***
The mentioned fix (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/423) is now released as part of 4.5.3 and leads to a crashloop in the kube-controller-manager-cloud/kube-controller-manager-recovery-controller container because of a timeout. We are working on an oVirt IPI-installed cluster which uses port 9443 as the internal API LB (haproxy). Could you please clarify whether checking the local kube-api on port 9443 was intentional, and what we can do to work around this issue. Thanks!
> We are working on an oVirt IPI-installed cluster which uses port 9443 as the internal API LB (haproxy).

This was not introduced by that PR; KCM has always listened on 9443. That is a haproxy config bug that steals KCM's port, introduced in https://github.com/openshift/baremetal-runtimecfg/pull/61 and fixed in https://github.com/openshift/baremetal-runtimecfg/pull/73.
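For anyone triaging a similar collision, a quick way to confirm from the node which process actually owns the port before blaming KCM is a snippet like the following (a hypothetical triage check, not part of the fix; port 9443 is the recovery controller's port):

```shell
# List any TCP listener on port 9443. With enough privileges, -p also names
# the owning process; on an affected node that is haproxy rather than the
# cert-recovery-controller. An empty result means the port is free.
holders=$(ss -Htlnp "( sport = :9443 )" 2>/dev/null)
if [ -n "$holders" ]; then
  echo "port 9443 in use: ${holders}"
else
  echo "port 9443 is free"
fi
```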
Tried verifying the bug, but I see the kcm is in CrashLoopBackOff; on checking further I understood that haproxy had occupied port 9443, so kcm kept restarting. I am afraid this will break upgrade paths for customers who have IPI installs on 4.4. Checked with dev whether we can backport the fix to all other releases once we have a proper fix in 4.5; waiting for their input. Once I receive it, I will either mark this bug verified or move it back to the assigned state. Thanks!!
Bug verification blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1862898
Verified the bug with the payload below on the profile ipi-on-azure/versioned-installer-ovn-customer_vpc-http_proxy. I see that there are no init containers and the kube-controller-manager container itself checks port 10257.

[ramakasturinarra@dhcp35-60 verification-tests]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-08-01-220435   True        False         12h     Cluster version is 4.4.0-0.nightly-2020-08-01-220435

[ramakasturinarra@dhcp35-60 verification-tests]$ oc describe pod kube-controller-manager-sunilc-4416-rfr7c-master-2 -n openshift-kube-controller-manager
Name:                 kube-controller-manager-sunilc-4416-rfr7c-master-2
Namespace:            openshift-kube-controller-manager
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 sunilc-4416-rfr7c-master-2/10.0.0.7
Start Time:           Tue, 04 Aug 2020 22:55:29 +0530
Labels:               app=kube-controller-manager
                      kube-controller-manager=true
                      revision=8
Annotations:          kubectl.kubernetes.io/default-logs-container: kube-controller-manager
                      kubernetes.io/config.hash: 52f1486d6d8b71b916383c4ba76d666c
                      kubernetes.io/config.mirror: 52f1486d6d8b71b916383c4ba76d666c
                      kubernetes.io/config.seen: 2020-08-04T17:35:01.226149575Z
                      kubernetes.io/config.source: file
Status:               Running
IP:                   10.0.0.7
IPs:
  IP:  10.0.0.7
Controlled By:  Node/sunilc-4416-rfr7c-master-2
Containers:
  kube-controller-manager:
    Container ID:  cri-o://7400effdbf8a5ea362043a0ff811f98fe0851fe40f27141229450161becdeaf6
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
    Port:          10257/TCP
    Host Port:     10257/TCP
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10257 \))" ]; do sleep 1; done'

      if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
        echo "Copying system trust bundle"
        cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
      fi
      exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
        --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt \
        -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:55 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      80m
      memory:   200Mi
    Liveness:   http-get https://:10257/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get https://:10257/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
    Environment:
      HTTPS_PROXY:  http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:   http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  cluster-policy-controller:
    Container ID:  cri-o://43a6b4d6f439af34a99b280ea9c2b64ec1dba6c02287f012670dfa7c141fd484
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
    Port:          10357/TCP
    Host Port:     10357/TCP
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10357 \))" ]; do sleep 1; done'

      exec cluster-policy-controller start --config=/etc/kubernetes/static-pod-resources/configmaps/cluster-policy-controller-config/config.yaml
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:56 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  200Mi
    Liveness:   http-get https://:10357/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get https://:10357/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
    Environment:
      HTTPS_PROXY:  http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:   http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  kube-controller-manager-cert-syncer:
    Container ID:  cri-o://0c065a7eaa6ae2c6330dd36164ed6f71f53ac4c11ae69de313a43ea79899e854
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Port:          <none>
    Host Port:     <none>
    Command:
      cluster-kube-controller-manager-operator
      cert-syncer
    Args:
      --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig
      --namespace=$(POD_NAMESPACE)
      --destination-dir=/etc/kubernetes/static-pod-certs
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:57 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-controller-manager-sunilc-4416-rfr7c-master-2 (v1:metadata.name)
      POD_NAMESPACE:  openshift-kube-controller-manager (v1:metadata.namespace)
      HTTPS_PROXY:    http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:     http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:       .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  kube-controller-manager-recovery-controller:
    Container ID:  cri-o://890278d00312c0e8d1c93f7740c82dd28e6b7282f7665e48ae844e8d528d65e3
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 9443 \))" ]; do sleep 1; done'

      exec cluster-kube-controller-manager-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:9443 -v=2
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:58 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      POD_NAMESPACE:  openshift-kube-controller-manager (v1:metadata.namespace)
      HTTPS_PROXY:    http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:     http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:       .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  resource-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
    HostPathType:
  cert-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/kube-controller-manager-certs
    HostPathType:
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     op=Exists
Events:          <none>

Based on the above, moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.4.16 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3237