Bug 1858498

Summary:	Haproxy 9443 port conflicts with KCM causing KCM in crashloopbackoff state (vSphere, RHV)
Product:	OpenShift Container Platform	Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Installer	Assignee:	Gal Zaidman <gzaidman>
Installer sub component:	OpenShift on RHV	QA Contact:	Guilherme Santos <gdeolive>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	adeshpan, aos-bugs, bperkins, dahernan, dbewley, ddreggor, fhirtz, hmarques, hpopal, igor.tiunov, jima, jmalde, jrosenta, jsafarik, kfryklun, knarra, lars, lleistne, lmartinh, maszulik, mfojtik, mifiedle, mrhodes, tnozicka, wjiang, wking, xtian, yprokule
Version:	4.4	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.5.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-17 20:05:57 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1853889
Bug Blocks:	1862898

Comment 1 RamaKasturi 2020-07-21 13:32:19 UTC

We have hit this bug twice on vSphere: haproxy binds port 9443, which conflicts with KCM and leaves KCM in CrashLoopBackOff state.

oc get pods -A |awk '$5 >10'
NAMESPACE                                          NAME                                                      READY   STATUS              RESTARTS   AGE
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-0          4/4     Running             18         121m
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-1          3/4     CrashLoopBackOff    21         120m
openshift-kube-controller-manager                  kube-controller-manager-scheng-45-tqgsd-master-2          3/4     CrashLoopBackOff    18         120m

$ oc exec -n openshift-vsphere-infra haproxy-scheng-45-tqgsd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy.
Use 'oc describe pod/haproxy-scheng-45-tqgsd-master-0 -n openshift-vsphere-infra' to see all of the containers in this pod.
  bind :::9443 v4v6

The payload where this bug was hit is 4.5.0-0.nightly-2020-07-20-152128, and the profile name is "ipi-on-vsphere/versioned-installer-6_7-disconnected-vsphere_slave-ci".
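
The bind check above can be scripted. The following is a minimal sketch, not part of the original report: the helper names haproxy_bind_ports and binds_port are hypothetical, and the only assumption about haproxy.cfg is that its bind lines look like the ones shown in this bug (":::9443 v4v6" or "127.0.0.1:50000").

```shell
# Hypothetical helpers (not from the original report).

# Print the port of every "bind" directive in an haproxy.cfg read from stdin.
# Handles the two forms seen in this bug: ":::9443 v4v6" and "127.0.0.1:50000".
haproxy_bind_ports() {
  # The port is whatever follows the last ":" in the bind address.
  awk '$1 == "bind" { n = split($2, parts, ":"); print parts[n] }'
}

# Exit 0 if the config on stdin has a bind directive on the given port.
binds_port() {
  haproxy_bind_ports | grep -qx "$1"
}
```

For example, piping the output of the "oc exec ... -- cat /etc/haproxy/haproxy.cfg" command above into "binds_port 9443" should succeed on an affected cluster.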

Comment 2 Mike Fiedler 2020-07-21 18:22:24 UTC

I reproduced this 2 out of 2 times in a disconnected vSphere cluster on 4.5.3 stable.

Comment 5 jima 2020-07-23 01:41:52 UTC

Verified the bug on OCP IPI on vSphere with nightly build 4.5.0-0.nightly-2020-07-21-232150; it passed.

The port used by the haproxy pod has been changed to 9445, and the KCM pods are not in CrashLoopBackOff state.
$ oc get pod -A | grep kube-controller | grep -v Completed
openshift-kube-controller-manager-operator         kube-controller-manager-operator-5c9d8bd7d4-cms6b            1/1     Running     1          111m
openshift-kube-controller-manager                  kube-controller-manager-jima-072203-6mfqb-master-0           4/4     Running     5          96m
openshift-kube-controller-manager                  kube-controller-manager-jima-072203-6mfqb-master-1           4/4     Running     0          97m
openshift-kube-controller-manager                  kube-controller-manager-jima-072203-6mfqb-master-2           4/4     Running     6          97m

Comment 6 Tomáš Nožička 2020-07-27 15:29:02 UTC

*** Bug 1860190 has been marked as a duplicate of this bug. ***

Comment 9 Tomáš Nožička 2020-07-28 14:06:06 UTC

*** Bug 1861275 has been marked as a duplicate of this bug. ***

Comment 10 Tomáš Nožička 2020-07-28 15:02:53 UTC

Please also verify that upgrading a previously broken cluster to a payload containing this fix actually works.

Comment 11 Lars Kellogg-Stedman 2020-07-28 15:41:31 UTC

I upgraded from 4.4.13 -> 4.5.4, and that seems to have resulted in a stable environment.

Comment 12 RamaKasturi 2020-07-28 15:47:02 UTC

(In reply to Lars Kellogg-Stedman from comment #11)
> I upgraded from 4.4.13 -> 4.5.4, and that seems to have resulted in a stable
> environment.

I think upgrading from 4.4.13 -> 4.5.4 will work, but upgrading a broken cluster to the payload that has the fix still needs to be checked.

Comment 20 Keith Fryklund 2020-07-30 19:36:46 UTC

Hey folks, 

I want to note that I hit this in four of my OpenShift on OpenStack 4.5.3 clusters. I followed this article [1] to fix them.


[1] https://access.redhat.com/solutions/5266321

Comment 25 Guilherme Santos 2020-08-07 14:53:35 UTC

Verified on:
openshift-4.5.4 upgrading from 4.5.3

Steps:
1. had a broken 4.5.3 cluster deployed:
# oc -n openshift-kube-controller-manager get pods | grep kube-controller
kube-controller-manager-secondary-42spd-master-0   4/4     Running            15         56m
kube-controller-manager-secondary-42spd-master-1   3/4     CrashLoopBackOff   12         55m
kube-controller-manager-secondary-42spd-master-2   3/4     CrashLoopBackOff   16         54m
# oc -n openshift-ovirt-infra exec haproxy-secondary-42spd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy.
Use 'oc describe pod/haproxy-secondary-42spd-master-0 -n openshift-ovirt-infra' to see all of the containers in this pod.
  bind :::9443 v4v6
  bind :::50936 v4v6
  bind 127.0.0.1:50000

2. upgraded the cluster
# oc adm upgrade --to=4.5.4 --force=true

Results:
broken cluster fixed on upgrade and running as expected
# oc -n openshift-kube-controller-manager get pods | grep kube-controller
kube-controller-manager-secondary-42spd-master-0   4/4     Running     4          156m
kube-controller-manager-secondary-42spd-master-1   4/4     Running     9          161m
kube-controller-manager-secondary-42spd-master-2   4/4     Running     0          158m
# oc -n openshift-ovirt-infra exec haproxy-secondary-42spd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy.
Use 'oc describe pod/haproxy-secondary-42spd-master-0 -n openshift-ovirt-infra' to see all of the containers in this pod.
  bind :::9445 v4v6
  bind :::50936 v4v6
  bind 127.0.0.1:50000

Additional info:
The upgrade took a while and failed a few times; however, despite the failures, it kept going and eventually finished everything by itself.
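
The before/after pod check in the steps above can be reduced to a small filter. This is a sketch only: the function name crashlooping_kcm_pods is hypothetical, and it assumes the namespaced "oc get pods" column layout shown in this comment (NAME READY STATUS RESTARTS AGE).

```shell
# Hypothetical helper (not from the original report).
# Read `oc -n openshift-kube-controller-manager get pods` output on stdin and
# print the names of kube-controller-manager pods in CrashLoopBackOff.
# Assumes the column layout NAME READY STATUS RESTARTS AGE (no -A flag).
crashlooping_kcm_pods() {
  awk '$1 ~ /^kube-controller-manager-/ && $3 == "CrashLoopBackOff" { print $1 }'
}
```

Piping "oc -n openshift-kube-controller-manager get pods" into this function should print nothing once the cluster has recovered.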

Comment 27 Mike Fedosin 2020-08-11 17:56:14 UTC

*** Bug 1865944 has been marked as a duplicate of this bug. ***

Comment 28 David Dreeggors 2020-08-12 14:29:13 UTC

As noted in the comments on the article linked earlier [1], the workaround gets reverted by the haproxy-monitor container, so sadly it is not a valid workaround.

[1] https://access.redhat.com/solutions/5266321

Comment 31 errata-xmlrpc 2020-08-17 20:05:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3330

Comment 32 Mike Fiedler 2020-08-25 17:46:20 UTC

This bug is closed as fixed in 4.5.  If you need a fix for 4.4 please open a bug with 4.4.z as the target release.  You can clone this bug to do that - upper right hand corner.

Comment 33 RamaKasturi 2020-08-26 05:23:44 UTC

(In reply to Mike Fiedler from comment #32)
> This bug is closed as fixed in 4.5.  If you need a fix for 4.4 please open a
> bug with 4.4.z as the target release.  You can clone this bug to do that -
> upper right hand corner.

@mike, the fix went into 4.4.z as well; here is the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1862898