We have hit this bug twice on vSphere, where haproxy binds port 9443 and conflicts with the kube-controller-manager (KCM), leaving KCM in a CrashLoopBackOff state.

$ oc get pods -A | awk '$5 > 10'
NAMESPACE                           NAME                                                READY   STATUS             RESTARTS   AGE
openshift-kube-controller-manager   kube-controller-manager-scheng-45-tqgsd-master-0    4/4     Running            18         121m
openshift-kube-controller-manager   kube-controller-manager-scheng-45-tqgsd-master-1    3/4     CrashLoopBackOff   21         120m
openshift-kube-controller-manager   kube-controller-manager-scheng-45-tqgsd-master-2    3/4     CrashLoopBackOff   18         120m

$ oc exec -n openshift-vsphere-infra haproxy-scheng-45-tqgsd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy. Use 'oc describe pod/haproxy-scheng-45-tqgsd-master-0 -n openshift-vsphere-infra' to see all of the containers in this pod.
  bind :::9443 v4v6

The payload where this bug was hit is 4.5.0-0.nightly-2020-07-20-152128 and the profile name is "ipi-on-vsphere/versioned-installer-6_7-disconnected-vsphere_slave-ci".
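For anyone triaging a similar cluster, a minimal check (a sketch only, assuming 'oc debug' access to the node; the node name below is one of the affected masters from the output above) to see which process currently owns the listener on port 9443:

# Run ss in the host namespace of an affected master to see who holds :9443;
# the grep runs on the client side after the debug command returns.
oc debug node/scheng-45-tqgsd-master-1 -- chroot /host ss -tlnp | grep ':9443'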
I reproduced this 2/2 times in a vSphere disconnected cluster on 4.5.3 stable.
Verified the bug on OCP IPI on vSphere with nightly build 4.5.0-0.nightly-2020-07-21-232150, and it passed. The port used by the haproxy pod has been changed to 9445, and the KCM pods are no longer in CrashLoopBackOff state.

$ oc get pod -A | grep kube-controller | grep -v Completed
openshift-kube-controller-manager-operator   kube-controller-manager-operator-5c9d8bd7d4-cms6b     1/1   Running   1   111m
openshift-kube-controller-manager            kube-controller-manager-jima-072203-6mfqb-master-0    4/4   Running   5   96m
openshift-kube-controller-manager            kube-controller-manager-jima-072203-6mfqb-master-1    4/4   Running   0   97m
openshift-kube-controller-manager            kube-controller-manager-jima-072203-6mfqb-master-2    4/4   Running   6   97m
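As a quick cross-check after the fix (a sketch only; the namespace and container name are taken from the vSphere output earlier in this bug, and grep runs on the client side), the bind port can be confirmed on every haproxy pod at once:

# List all haproxy pods in the infra namespace and print their bind lines.
for p in $(oc -n openshift-vsphere-infra get pods -o name | grep haproxy); do
  echo "== $p"
  oc -n openshift-vsphere-infra exec "$p" -c haproxy -- cat /etc/haproxy/haproxy.cfg | grep bind
done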
*** Bug 1860190 has been marked as a duplicate of this bug. ***
*** Bug 1861275 has been marked as a duplicate of this bug. ***
Please also verify that, if you previously had a broken cluster, upgrading to a payload that has this fix actually works.
I upgraded from 4.4.13 -> 4.5.4, and that seems to have resulted in a stable environment.
(In reply to Lars Kellogg-Stedman from comment #11)
> I upgraded from 4.4.13 -> 4.5.4, and that seems to have resulted in a stable
> environment.

I think upgrading from 4.4.13 -> 4.5.4 will work, but upgrading from a broken cluster to the payload which has the fix still needs to be checked.
Hey folks, I want to note that I hit this in four of my OpenShift on OpenStack 4.5.3 clusters. I followed this article [1] to fix them.

[1] https://access.redhat.com/solutions/5266321
Verified on: openshift-4.5.4, upgrading from 4.5.3

Steps:
1. Had a broken 4.5.3 cluster deployed:

# oc -n openshift-kube-controller-manager get pods | grep kube-controller
kube-controller-manager-secondary-42spd-master-0   4/4   Running            15   56m
kube-controller-manager-secondary-42spd-master-1   3/4   CrashLoopBackOff   12   55m
kube-controller-manager-secondary-42spd-master-2   3/4   CrashLoopBackOff   16   54m

# oc -n openshift-ovirt-infra exec haproxy-secondary-42spd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy. Use 'oc describe pod/haproxy-secondary-42spd-master-0 -n openshift-ovirt-infra' to see all of the containers in this pod.
  bind :::9443 v4v6
  bind :::50936 v4v6
  bind 127.0.0.1:50000

2. Upgraded the cluster:

# oc adm upgrade --to=4.5.4 --force=true

Results: the broken cluster was fixed on upgrade and is running as expected.

# oc -n openshift-kube-controller-manager get pods | grep kube-controller
kube-controller-manager-secondary-42spd-master-0   4/4   Running   4   156m
kube-controller-manager-secondary-42spd-master-1   4/4   Running   9   161m
kube-controller-manager-secondary-42spd-master-2   4/4   Running   0   158m

# oc -n openshift-ovirt-infra exec haproxy-secondary-42spd-master-0 -- cat /etc/haproxy/haproxy.cfg | grep bind
Defaulting container name to haproxy. Use 'oc describe pod/haproxy-secondary-42spd-master-0 -n openshift-ovirt-infra' to see all of the containers in this pod.
  bind :::9445 v4v6
  bind :::50936 v4v6
  bind 127.0.0.1:50000

Additional info: the upgrade took a while and failed a few times; however, even with the failures, it continued and in the end managed to finish everything by itself.
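For reference, since the upgrade failed a few times before completing on its own, the standard commands below (nothing specific to this bug) are enough to keep an eye on its progress and on which operators are still settling:

# Overall upgrade state and history
oc get clusterversion
oc adm upgrade

# Operators that are not yet Available=True / Progressing=False / Degraded=False
oc get clusteroperators | grep -v 'True.*False.*False'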
*** Bug 1865944 has been marked as a duplicate of this bug. ***
As mentioned in the comments on the previously linked article [1], the workaround gets reverted by the haproxy-monitor container, so unfortunately it is not a valid workaround.

[1] https://access.redhat.com/solutions/5266321
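If it helps anyone confirm that behaviour, the haproxy-monitor container logs can show whether the monitor is re-rendering the config (a sketch; the pod name is one of the vSphere pods from earlier in this bug, and the container name comes from the comment above):

# Tail the monitor container on an affected master's haproxy pod.
oc -n openshift-vsphere-infra logs haproxy-scheng-45-tqgsd-master-0 -c haproxy-monitor --tail=50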
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.6 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3330
This bug is closed as fixed in 4.5. If you need a fix for 4.4 please open a bug with 4.4.z as the target release. You can clone this bug to do that - upper right hand corner.
(In reply to Mike Fiedler from comment #32)
> This bug is closed as fixed in 4.5. If you need a fix for 4.4 please open a
> bug with 4.4.z as the target release. You can clone this bug to do that -
> upper right hand corner.

@mike, the fix went in for 4.4.z as well; here is the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1862898