Bug 1851397
| Summary: | kcm pod crashloops because port is already in use | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | kube-controller-manager | Assignee: | Tomáš Nožička <tnozicka> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aos-bugs, knarra, mfojtik, michael.riedmann, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.4.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1851390 | | |
| : | 1851398 (view as bug list) | Environment: | |
| Last Closed: | 2020-08-06 19:07:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1851390 | | |
| Bug Blocks: | 1851398 | | |
Description
Tomáš Nožička
2020-06-26 12:25:14 UTC
*** Bug 1851404 has been marked as a duplicate of this bug. ***

The mentioned fix (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/423) is now released as part of 4.5.3 and leads to a crashloop in the kube-controller-manager-cloud/kube-controller-manager-recovery-controller container because of a timeout. We are working on an oVirt IPI-installed cluster which uses port 9443 as the internal API LB (haproxy). Could you please clarify whether checking the local kube-api on port 9443 was intentional, and what we can do to work around this issue? Thanks!

> We are working on an oVirt IPI-installed cluster which uses port 9443 as the internal API LB (haproxy).

This was not introduced by that PR; KCM has always listened on 9443. That is a haproxy config bug that steals the KCM port, introduced in https://github.com/openshift/baremetal-runtimecfg/pull/61 and fixed in https://github.com/openshift/baremetal-runtimecfg/pull/73.

Tried verifying the bug, but I see the KCM is in CrashLoopBackOff; on checking further, I understood that haproxy had occupied port 9443, so KCM kept restarting. I am afraid this will break the upgrade paths of customers who have IPI installs on 4.4. Checked with dev whether we can have the bug backported to all other releases once we have a proper fix in 4.5; waiting for their input. Once I receive it, I will either mark this bug as verified or move it back to the assigned state. Thanks!

Bug verification blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1862898

Verified the bug with the payload below on the profile ipi-on-azure/versioned-installer-ovn-customer_vpc-http_proxy; I see that there are no init containers and that the kube-controller-manager container itself checks port 10257 before starting.
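One way to confirm which process is bound to port 9443 on an affected master is sketched below; a minimal example assuming cluster-admin access, with the node name as a placeholder:

  # List the listener on port 9443 on a master node (node name is a placeholder).
  # If haproxy shows up here instead of the recovery controller, the
  # kube-controller-manager-recovery-controller container keeps waiting for the
  # port to free up and the pod crashloops.
  oc debug node/<master-node> -- chroot /host ss -Htlnp '( sport = 9443 )'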
[ramakasturinarra@dhcp35-60 verification-tests]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.0-0.nightly-2020-08-01-220435 True False 12h Cluster version is 4.4.0-0.nightly-2020-08-01-220435
[ramakasturinarra@dhcp35-60 verification-tests]$ oc describe pod kube-controller-manager-sunilc-4416-rfr7c-master-2 -n openshift-kube-controller-manager
Name: kube-controller-manager-sunilc-4416-rfr7c-master-2
Namespace: openshift-kube-controller-manager
Priority: 2000001000
Priority Class Name: system-node-critical
Node: sunilc-4416-rfr7c-master-2/10.0.0.7
Start Time: Tue, 04 Aug 2020 22:55:29 +0530
Labels: app=kube-controller-manager
kube-controller-manager=true
revision=8
Annotations: kubectl.kubernetes.io/default-logs-container: kube-controller-manager
kubernetes.io/config.hash: 52f1486d6d8b71b916383c4ba76d666c
kubernetes.io/config.mirror: 52f1486d6d8b71b916383c4ba76d666c
kubernetes.io/config.seen: 2020-08-04T17:35:01.226149575Z
kubernetes.io/config.source: file
Status: Running
IP: 10.0.0.7
IPs:
IP: 10.0.0.7
Controlled By: Node/sunilc-4416-rfr7c-master-2
Containers:
kube-controller-manager:
Container ID: cri-o://7400effdbf8a5ea362043a0ff811f98fe0851fe40f27141229450161becdeaf6
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
Port: 10257/TCP
Host Port: 10257/TCP
Command:
/bin/bash
-euxo
pipefail
-c
Args:
timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10257 \))" ]; do sleep 1; done'
if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
echo "Copying system trust bundle"
cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
fi
exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
--kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
--client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
--requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
State: Running
Started: Tue, 04 Aug 2020 23:34:55 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 80m
memory: 200Mi
Liveness: http-get https://:10257/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get https://:10257/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
Environment:
HTTPS_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
HTTP_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
NO_PROXY: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
Mounts:
/etc/kubernetes/static-pod-certs from cert-dir (rw)
/etc/kubernetes/static-pod-resources from resource-dir (rw)
cluster-policy-controller:
Container ID: cri-o://43a6b4d6f439af34a99b280ea9c2b64ec1dba6c02287f012670dfa7c141fd484
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
Port: 10357/TCP
Host Port: 10357/TCP
Command:
/bin/bash
-euxo
pipefail
-c
Args:
timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10357 \))" ]; do sleep 1; done'
exec cluster-policy-controller start --config=/etc/kubernetes/static-pod-resources/configmaps/cluster-policy-controller-config/config.yaml
State: Running
Started: Tue, 04 Aug 2020 23:34:56 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 10m
memory: 200Mi
Liveness: http-get https://:10357/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get https://:10357/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
Environment:
HTTPS_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
HTTP_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
NO_PROXY: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
Mounts:
/etc/kubernetes/static-pod-certs from cert-dir (rw)
/etc/kubernetes/static-pod-resources from resource-dir (rw)
kube-controller-manager-cert-syncer:
Container ID: cri-o://0c065a7eaa6ae2c6330dd36164ed6f71f53ac4c11ae69de313a43ea79899e854
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
Port: <none>
Host Port: <none>
Command:
cluster-kube-controller-manager-operator
cert-syncer
Args:
--kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig
--namespace=$(POD_NAMESPACE)
--destination-dir=/etc/kubernetes/static-pod-certs
State: Running
Started: Tue, 04 Aug 2020 23:34:57 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 5m
memory: 50Mi
Environment:
POD_NAME: kube-controller-manager-sunilc-4416-rfr7c-master-2 (v1:metadata.name)
POD_NAMESPACE: openshift-kube-controller-manager (v1:metadata.namespace)
HTTPS_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
HTTP_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
NO_PROXY: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
Mounts:
/etc/kubernetes/static-pod-certs from cert-dir (rw)
/etc/kubernetes/static-pod-resources from resource-dir (rw)
kube-controller-manager-recovery-controller:
Container ID: cri-o://890278d00312c0e8d1c93f7740c82dd28e6b7282f7665e48ae844e8d528d65e3
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
Port: <none>
Host Port: <none>
Command:
/bin/bash
-euxo
pipefail
-c
Args:
timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 9443 \))" ]; do sleep 1; done'
exec cluster-kube-controller-manager-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:9443 -v=2
State: Running
Started: Tue, 04 Aug 2020 23:34:58 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 5m
memory: 50Mi
Environment:
POD_NAMESPACE: openshift-kube-controller-manager (v1:metadata.namespace)
HTTPS_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
HTTP_PROXY: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
NO_PROXY: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
Mounts:
/etc/kubernetes/static-pod-certs from cert-dir (rw)
/etc/kubernetes/static-pod-resources from resource-dir (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
resource-dir:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
HostPathType:
cert-dir:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/static-pod-resources/kube-controller-manager-certs
HostPathType:
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: op=Exists
Events: <none>
Based on the above, moving the bug to verified state.
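For reference, the guard visible in each container's Args above follows a simple pattern: block until nothing is listening on the port, then exec the real process. A minimal standalone sketch of that pattern, with the final server command as a placeholder:

  # Wait (up to 3 minutes) until nothing is listening on port 9443, then start the server.
  # This mirrors the guard shown in the recovery-controller Args above; the exec'ed
  # command below is a placeholder for the real process.
  timeout 3m /bin/bash -exuo pipefail -c \
    'while [ -n "$(ss -Htanop \( sport = 9443 \))" ]; do sleep 1; done'
  exec my-server --listen=0.0.0.0:9443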
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.4.16 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3237