Bug 1851397 - kcm pod crashloops because port is already in use
Summary: kcm pod crashloops because port is already in use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1851404 (view as bug list)
Depends On: 1851390
Blocks: 1851398
 
Reported: 2020-06-26 12:25 UTC by Tomáš Nožička
Modified: 2020-08-06 19:08 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1851390
Clones: 1851398 (view as bug list)
Environment:
Last Closed: 2020-08-06 19:07:59 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 423 0 None closed [release-4.4] Bug 1851397: Fix port check 4.4 2020-12-22 19:22:57 UTC
Github openshift images pull 21 0 None closed [release-4.4] Bug 1851397: Add iproute package to get `ss` tool for port wait 2020-12-22 19:22:24 UTC
Red Hat Product Errata RHBA-2020:3237 0 None None None 2020-08-06 19:08:27 UTC

Description Tomáš Nožička 2020-06-26 12:25:14 UTC
+++ This bug was initially created as a clone of Bug #1851390 +++

+++ This bug was initially created as a clone of Bug #1851389 +++

kcm pod crashloops because its port is already in use. I saw a case with the cluster-policy-controller container, but it's not limited to it.

Crashlooping triggers alerts and adds backoff for the pod, so it starts slower.

A container can be restarted while the pod stays. For that reason, we need to check port availability in the same process we listen in, not in an init container, which isn't re-run on container restart.
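The fix pattern can be sketched as follows (a minimal illustration, not the exact PR change; the `wait_for_port_free` helper name and the 180-second budget are illustrative): gate the server's `exec` on the port being free, inside the same container process.

```shell
#!/bin/bash
# Sketch of the in-process port wait, assuming Linux with iproute2's `ss`.
# Because this runs in the same container as the server, it is re-executed on
# every container restart, unlike an init container, which runs once per pod.
wait_for_port_free() {
  local port="$1" deadline=$((SECONDS + 180))
  # `ss -Htan` prints one line per TCP socket matching the filter;
  # empty output means nothing holds the port.
  while [ -n "$(ss -Htan "sport = :${port}" 2>/dev/null)" ]; do
    [ "$SECONDS" -ge "$deadline" ] && return 1
    sleep 1
  done
}

wait_for_port_free 10257
# ...only now exec the real server binary, e.g.:
# exec hyperkube kube-controller-manager ...
```

Needing `ss` inside the image is also why the iproute package was added (the second linked PR).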

Comment 1 Tomáš Nožička 2020-06-26 12:58:15 UTC
*** Bug 1851404 has been marked as a duplicate of this bug. ***

Comment 3 Michael Riedmann 2020-07-22 15:08:48 UTC
The mentioned fix (https://github.com/openshift/cluster-kube-controller-manager-operator/pull/423) is now released as part of 4.5.3 and leads to a crashloop in the kube-controller-manager-cloud/kube-controller-manager-recovery-controller container because of a timeout.
We are working on an oVirt IPI-installed cluster which uses port 9443 for the internal API LB (haproxy). Could you please clarify whether checking the local kube-api on port 9443 was intentional, and what we can do to work around this issue?
Thx!

Comment 4 Tomáš Nožička 2020-07-23 08:11:40 UTC
> We are working an a oVirt IPI installed cluster which uses port 9443 as internal API LB (haproxy).

This was not introduced by that PR; KCM has always listened on 9443. That is a haproxy config bug that steals the KCM port - introduced in https://github.com/openshift/baremetal-runtimecfg/pull/61 and fixed in https://github.com/openshift/baremetal-runtimecfg/pull/73
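For anyone diagnosing this kind of conflict, a quick check of what holds the port looks like this (a generic sketch; `port_listeners` is a hypothetical helper, and it assumes iproute2's `ss` is installed):

```shell
#!/bin/bash
# Print any TCP listener on the given port; with root, `-p` also names the
# owning process. On the affected hosts this kind of check would show
# haproxy bound to 9443, the port KCM's recovery controller needs.
port_listeners() {
  ss -Htlnp "sport = :$1" 2>/dev/null
}

port_listeners 9443   # empty output means the port is free for KCM to bind
```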

Comment 8 RamaKasturi 2020-08-03 07:17:32 UTC
Tried verifying the bug, but I see that kcm is in CrashLoopBackOff; when I checked further, I understood that haproxy occupied port 9443, so kcm always restarted.
I am afraid this will break upgrade paths for customers who have IPI installs on 4.4. Checked with dev whether we can have the bug backported to all other releases once we have a proper fix in 4.5; waiting for their input. Once I receive it, I will either mark this bug verified or move it back to the assigned state. Thanks !!

Comment 9 RamaKasturi 2020-08-03 07:51:32 UTC
Bug verification blocked due to https://bugzilla.redhat.com/show_bug.cgi?id=1862898

Comment 10 RamaKasturi 2020-08-05 05:56:56 UTC
Verified the bug with the payload below on profile ipi-on-azure/versioned-installer-ovn-customer_vpc-http_proxy. I see that there are no init containers and that the kube-controller-manager container itself checks port 10257:

[ramakasturinarra@dhcp35-60 verification-tests]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-08-01-220435   True        False         12h     Cluster version is 4.4.0-0.nightly-2020-08-01-220435

[ramakasturinarra@dhcp35-60 verification-tests]$ oc describe pod kube-controller-manager-sunilc-4416-rfr7c-master-2 -n openshift-kube-controller-manager
Name:                 kube-controller-manager-sunilc-4416-rfr7c-master-2
Namespace:            openshift-kube-controller-manager
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 sunilc-4416-rfr7c-master-2/10.0.0.7
Start Time:           Tue, 04 Aug 2020 22:55:29 +0530
Labels:               app=kube-controller-manager
                      kube-controller-manager=true
                      revision=8
Annotations:          kubectl.kubernetes.io/default-logs-container: kube-controller-manager
                      kubernetes.io/config.hash: 52f1486d6d8b71b916383c4ba76d666c
                      kubernetes.io/config.mirror: 52f1486d6d8b71b916383c4ba76d666c
                      kubernetes.io/config.seen: 2020-08-04T17:35:01.226149575Z
                      kubernetes.io/config.source: file
Status:               Running
IP:                   10.0.0.7
IPs:
  IP:           10.0.0.7
Controlled By:  Node/sunilc-4416-rfr7c-master-2
Containers:
  kube-controller-manager:
    Container ID:  cri-o://7400effdbf8a5ea362043a0ff811f98fe0851fe40f27141229450161becdeaf6
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fed963f4a3d4fa81891976fcda8e08d970e1ddfb4076ee4e048b70c581c2c49b
    Port:          10257/TCP
    Host Port:     10257/TCP
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10257 \))" ]; do sleep 1; done'
      
      if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
        echo "Copying system trust bundle"
        cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
      fi
      exec hyperkube kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml \
        --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig \
        --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt \
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:55 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      80m
      memory:   200Mi
    Liveness:   http-get https://:10257/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get https://:10257/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
    Environment:
      HTTPS_PROXY:  http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:   http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  cluster-policy-controller:
    Container ID:  cri-o://43a6b4d6f439af34a99b280ea9c2b64ec1dba6c02287f012670dfa7c141fd484
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bc367e8cb993f0194ad8288a29bb00e9362f9f9d123fb94c7c85f8349cd3599c
    Port:          10357/TCP
    Host Port:     10357/TCP
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10357 \))" ]; do sleep 1; done'
      
      exec cluster-policy-controller start --config=/etc/kubernetes/static-pod-resources/configmaps/cluster-policy-controller-config/config.yaml
      
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:56 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      10m
      memory:   200Mi
    Liveness:   http-get https://:10357/healthz delay=45s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get https://:10357/healthz delay=10s timeout=10s period=10s #success=1 #failure=3
    Environment:
      HTTPS_PROXY:  http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:   http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:     .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  kube-controller-manager-cert-syncer:
    Container ID:  cri-o://0c065a7eaa6ae2c6330dd36164ed6f71f53ac4c11ae69de313a43ea79899e854
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Port:          <none>
    Host Port:     <none>
    Command:
      cluster-kube-controller-manager-operator
      cert-syncer
    Args:
      --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig
      --namespace=$(POD_NAMESPACE)
      --destination-dir=/etc/kubernetes/static-pod-certs
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:57 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-controller-manager-sunilc-4416-rfr7c-master-2 (v1:metadata.name)
      POD_NAMESPACE:  openshift-kube-controller-manager (v1:metadata.namespace)
      HTTPS_PROXY:    http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:     http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:       .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
  kube-controller-manager-recovery-controller:
    Container ID:  cri-o://890278d00312c0e8d1c93f7740c82dd28e6b7282f7665e48ae844e8d528d65e3
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b761bfa81fdb68866028109fb0092fc30147fa315ca17748e7d9b8c55ef5762d
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -euxo
      pipefail
      -c
    Args:
      timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 9443 \))" ]; do sleep 1; done'
      
      exec cluster-kube-controller-manager-operator cert-recovery-controller --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig --namespace=${POD_NAMESPACE} --listen=0.0.0.0:9443 -v=2
      
    State:          Running
      Started:      Tue, 04 Aug 2020 23:34:58 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     5m
      memory:  50Mi
    Environment:
      POD_NAMESPACE:  openshift-kube-controller-manager (v1:metadata.namespace)
      HTTPS_PROXY:    http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      HTTP_PROXY:     http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.99.4:3128
      NO_PROXY:       .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-0.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-1.sunilc-4416.qe.azure.devcluster.openshift.com,etcd-2.sunilc-4416.qe.azure.devcluster.openshift.com,localhost,test.no-proxy.com
    Mounts:
      /etc/kubernetes/static-pod-certs from cert-dir (rw)
      /etc/kubernetes/static-pod-resources from resource-dir (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  resource-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
    HostPathType:  
  cert-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/static-pod-resources/kube-controller-manager-certs
    HostPathType:  
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       op=Exists
Events:            <none>

Based on the above, moving the bug to the verified state.

Comment 12 errata-xmlrpc 2020-08-06 19:07:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.16 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3237

