1851390 – kcm pod crashloops because port is already in use

Summary: kcm pod crashloops because port is already in use

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Tomáš Nožička
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:	1851389
Blocks:	1851397 1851404

Reported:	2020-06-26 12:12 UTC by Tomáš Nožička
Modified:	2020-07-22 12:21 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1851389
Clones:	1851397 1851404
Environment:
Last Closed:	2020-07-22 12:20:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-controller-manager-operator pull 422	None	closed	[release-4.5] Bug 1851390: Fix port check	2020-09-11 07:08:55 UTC
Github	openshift images pull 20	None	closed	[release-4.5] Bug 1851390: Add iproute package to get `ss` tool for port wait	2020-09-11 07:08:56 UTC
Red Hat Product Errata	RHBA-2020:2956	None	None	None	2020-07-22 12:21:08 UTC

Description Tomáš Nožička 2020-06-26 12:12:04 UTC

+++ This bug was initially created as a clone of Bug #1851389 +++

The kcm pod crashloops because its port is already in use. I saw a case with the cluster-policy-manager container, but the problem is not limited to it.

Crashlooping triggers alerts and adds backoff for the pod, so it starts more slowly.

A container can be restarted while the pod stays. For that reason, we need to check port availability in the same process that listens, not in an init container, which isn't re-run.
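
As an illustration only (this is not the change made in the linked pull requests, which rely on the `ss` tool), the idea of waiting for the port inside the listening process can be sketched in Python. The function name `listen_when_free` and its parameters are hypothetical. The sketch retries the real bind instead of probing separately, so there is no gap between the check and the listen:

```python
import errno
import socket
import time


def listen_when_free(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Retry bind+listen until the port is free or the timeout expires.

    Hypothetical helper: because the wait happens in the same process that
    listens, it runs again on every container restart, unlike an init
    container, which runs only once per pod.
    """
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock  # caller now owns the listening socket
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # unrelated failure, e.g. permission denied
            if time.monotonic() >= deadline:
                raise TimeoutError(f"port {port} still in use after {timeout}s")
            time.sleep(interval)
```

Waiting this way trades an immediate crash for a bounded delay, which avoids the crashloop backoff while still failing if the port never frees up.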

Comment 5 Ke Wang 2020-07-21 07:02:21 UTC

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-07-20-152128   True        False         108m    Error while reconciling 4.5.0-0.nightly-2020-07-20-152128: the cluster operator kube-controller-manager is degraded

$ oc exec -n openshift-vsphere-infra haproxy-scheng-45-tqgsd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy.
Use 'oc describe pod/haproxy-scheng-45-tqgsd-master-0 -n openshift-vsphere-infra' to see all of the containers in this pod.
  bind :::9443 v4v6


haproxy occupied port 9443, so kcm kept restarting.

$ oc get pods -A |awk '$5 >10'
NAMESPACE                                          NAME                                                      READY   STATUS              RESTARTS   AGE
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-0          4/4     Running             18         121m
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-1          3/4     CrashLoopBackOff    21         120m
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-2          3/4     CrashLoopBackOff    18         120m

Comment 6 Ke Wang 2020-07-21 07:50:32 UTC

The problem above is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1858498

Comment 8 errata-xmlrpc 2020-07-22 12:20:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2956