Description of problem:

On oVirt CI the test case "Managed cluster should have no crashlooping pods in core namespaces over four minutes" fails 100% of the time because 'kube-controller-manager-recovery-controller' is crash looping:

  fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
      <[]string | len:3, cap:4>: [
          "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-2 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-2_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
          "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-0 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-0_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
          "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-1 is not healthy: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
      ]
  to be empty

In the logs we see "back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller":

  {
    "name": "kube-controller-manager-recovery-controller",
    "state": {
      "waiting": {
        "reason": "CrashLoopBackOff",
        "message": "back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)"
      }
    },
    "lastState": {
      "terminated": {
        "exitCode": 124,
        "reason": "Error",
        "message": " [::ffff:192.168.216.1]:35060 timer:(timewait,44sec,0)\nESTAB 0 0 [::1]:9443 [::1]:48768 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:35090 timer:(timewait,45sec,0)\nESTAB 0 0 [::1]:9443 [::1]:49620 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34220 timer:(timewait,24sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34010 timer:(timewait,6.938ms,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:35496 timer:(timewait,58sec,0)\nESTAB 0 0 [::1]:9443 [::1]:56764 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34750 timer:(timewait,37sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34638 timer:(timewait,35sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34530 timer:(timewait,31sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34142 timer:(timewait,14sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:33950 timer:(timewait,4.753ms,0)\nESTAB 0 0 [::1]:9443 [::1]:50332 \nESTAB 0 0 [::1]:9443 [::1]:42128 \nESTAB 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34778 \nESTAB 0 0 [::1]:9443 [::1]:54912 \nESTAB 0 0 [::1]:9443 [::1]:49762 ' ']'\n+ sleep 1\n",
        "startedAt": "2020-07-04T20:17:16Z",
        "finishedAt": "2020-07-04T20:20:16Z",
        "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca"
      }
    },
    "ready": false,
    "restartCount": 6,
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
    "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca",
    "started": false
  }

We started seeing these failures on 26-6, around the time this commit was merged:
https://github.com/openshift/cluster-kube-controller-manager-operator/commit/88dc303df2fd687540f9d80d3a2b32561fb22eb4

You can see this in any oVirt run since 26-6, for example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1279687231723802624
I know there is an open bug on that test case [1], but I believe this failure has a different cause.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1842002
This is just a wild guess because I don't know the code, but I see that the other containers define ports.containerPort, startupProbe, livenessProbe, and readinessProbe, while kube-controller-manager-recovery-controller doesn't. See the check sketched below.
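One quick way to confirm that observation is to print each container of the pod together with its livenessProbe; an empty second column means no probe is defined. Again, this is only an illustration, with the pod name taken from this report:

  oc get pod kube-controller-manager-ovirt16-4f6vr-master-1 \
      -n openshift-kube-controller-manager \
      -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.livenessProbe}{"\n"}{end}'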
This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389, and backports to older versions are on the way.

*** This bug has been marked as a duplicate of bug 1851389 ***
(In reply to Maciej Szulik from comment #4)
> This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389
> and backports to older versions are on the way.
>
> *** This bug has been marked as a duplicate of bug 1851389 ***

I think this failure is caused by the fix [2] for that bug [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851389
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421
Both are needed and both are in progress. The de-duplication still makes sense.

*** This bug has been marked as a duplicate of bug 1851389 ***
Sorry, I edited the fields and didn't see that you had closed it again.

*** This bug has been marked as a duplicate of bug 1851389 ***
Reopening this after a talk with Maciej Szulik. The new suspect is the combination of [1] and [2]: PR [1] added logic to the recovery controller that checks port availability, and PR [2] changed the HAProxy port to 9443 to work around yet another port conflict. Together, the two PRs cause kube-controller-manager-recovery-controller to crash loop.

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421
[2] https://github.com/openshift/baremetal-runtimecfg/pull/59
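The termination status in the description fits that explanation: exit code 124 is what `timeout` returns, the captured output is `ss` output for port 9443, the trace ends with "+ sleep 1", and the container ran for exactly 3 minutes (20:17:16 to 20:20:16). A minimal sketch of that kind of port-availability wait, not the actual script from PR [1], just an illustration of the pattern, would be:

  # Illustrative only: wait (up to 3 minutes) for port 9443 to be completely free
  # before starting the recovery controller.
  timeout 180 /bin/bash -c '
    while [[ -n "$(ss -Htan "( sport = :9443 )")" ]]; do
      sleep 1  # the "+ sleep 1" lines in the log above suggest the real script runs under set -x
    done
  '
  # If another process (here: HAProxy after PR [2]) keeps 9443 busy, the loop never
  # ends, timeout exits with 124, and kubelet restarts the container until it lands
  # in CrashLoopBackOff.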
Yeah, the links Gal pointed to in the previous comment are the main reason this is failing consistently. I wonder why this popped up only now, when cluster-policy-controller has been using port 9443 since at least version 4.3. I'm moving this to the oVirt team to fix.
My bad, it's the recovery controller that is using 9443, not cpc.
Verified with CI run results
Hello team, will the fix for this issue be backported to 4.4?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196