Bug 1853889
| Summary: | [ovirt] test case "Managed cluster should have no crashlooping pods in core namespaces over four minutes" 100% failure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gal Zaidman <gzaidman> |
| Component: | Installer | Assignee: | Gal Zaidman <gzaidman> |
| Installer sub component: | OpenShift on RHV | QA Contact: | Lucie Leistnerova <lleistne> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, hpopal, maszulik, mfojtik, ssonigra, wking, xtian |
| Version: | 4.4 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:12:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1858498 | | |
I know that there is an open bug on this test case [1], but I believe this failure has a different cause.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1842002

This is just a wild guess because I don't know the code, but I see that the other pods have `ports.containerPort`, `startupProbe`, `livenessProbe`, and `readinessProbe`, while kube-controller-manager-recovery-controller doesn't.

This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389 and backports to older versions are on the way.

*** This bug has been marked as a duplicate of bug 1851389 ***

(In reply to Maciej Szulik from comment #4)
> This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389
> and backports to older versions are on the way.
>
> *** This bug has been marked as a duplicate of bug 1851389 ***

I think that this failure is caused by the fix [2] for bug [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851389
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421

Both are needed and both are in progress. The de-duplication still makes sense.

*** This bug has been marked as a duplicate of bug 1851389 ***

Sorry, I edited the fields and didn't see you had closed it again.

*** This bug has been marked as a duplicate of bug 1851389 ***

Reopening this after a talk with Maciej Szulik. The new suspect is the combination of [1] and [2]. PR [1] added logic for checking port availability in the recovery-controller. PR [2] changed the HAProxy port to 9443 due to yet another port conflict. Together, the two PRs cause kube-controller-manager-recovery-controller to crash loop.

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421
[2] https://github.com/openshift/baremetal-runtimecfg/pull/59

Yeah, the links that Gal pointed to in the previous comment are the main reason this is failing consistently. I wonder why this only popped up now, when cluster-policy-controller has been using port 9443 since version 4.3, at least.
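The suspected collision can be checked directly on a master node. This is a hedged diagnostic sketch, not a command taken from the bug report or the operator's code; the `ss` invocation is an assumption about how one would verify the listener:

```shell
#!/bin/sh
# Hedged diagnostic sketch (assumed commands, not from the bug report):
# show which process, if any, is listening on TCP port 9443 on this host.
# With both HAProxy (after baremetal-runtimecfg PR #59) and the
# recovery-controller wanting 9443, a non-empty result here is the
# collision described above.
PORT="${1:-9443}"
# -H: no header, -t: TCP, -l: listening sockets, -n: numeric, -p: owning process
ss -Htlnp "sport = :${PORT}"
# Empty output means the port is free and the recovery-controller's
# port-availability check would succeed.
```

Running this on the affected masters should show the haproxy process holding 9443, matching the `ESTAB`/`TIME-WAIT` socket dump in the crash log below.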
I'm moving this to the oVirt team to fix.

My bad, it's the recovery-controller that is using 9443, not cluster-policy-controller.

Verified with CI run results.

Hello Team, will the solution for this issue be backported to 4.4?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Description of problem:

On oVirt CI we see the test case "Managed cluster should have no crashlooping pods in core namespaces over four minutes" failing 100% of the time because 'kube-controller-manager-recovery-controller' is crash looping:

```
fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
    <[]string | len:3, cap:4>: [
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-2 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-2_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-0 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-0_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-1 is not healthy: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
    ] to be empty
```

In the logs we see `back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller`:

```json
{
  "name": "kube-controller-manager-recovery-controller",
  "state": {
    "waiting": {
      "reason": "CrashLoopBackOff",
      "message": "back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)"
    }
  },
  "lastState": {
    "terminated": {
      "exitCode": 124,
      "reason": "Error",
      "message": " [::ffff:192.168.216.1]:35060 timer:(timewait,44sec,0)\nESTAB 0 0 [::1]:9443 [::1]:48768 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:35090 timer:(timewait,45sec,0)\nESTAB 0 0 [::1]:9443 [::1]:49620 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34220 timer:(timewait,24sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34010 timer:(timewait,6.938ms,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:35496 timer:(timewait,58sec,0)\nESTAB 0 0 [::1]:9443 [::1]:56764 \nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34750 timer:(timewait,37sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34638 timer:(timewait,35sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34530 timer:(timewait,31sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34142 timer:(timewait,14sec,0)\nTIME-WAIT 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:33950 timer:(timewait,4.753ms,0)\nESTAB 0 0 [::1]:9443 [::1]:50332 \nESTAB 0 0 [::1]:9443 [::1]:42128 \nESTAB 0 0 [::ffff:192.168.216.111]:9443 [::ffff:192.168.216.1]:34778 \nESTAB 0 0 [::1]:9443 [::1]:54912 \nESTAB 0 0 [::1]:9443 [::1]:49762 ' ']'\n+ sleep 1\n",
      "startedAt": "2020-07-04T20:17:16Z",
      "finishedAt": "2020-07-04T20:20:16Z",
      "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca"
    }
  },
  "ready": false,
  "restartCount": 6,
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
  "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
  "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca",
  "started": false
}
```

We started seeing failures from this on 26-6, around the time this change merged: https://github.com/openshift/cluster-kube-controller-manager-operator/commit/88dc303df2fd687540f9d80d3a2b32561fb22eb4

You can see this on any oVirt run since 26-6, for example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1279687231723802624
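The exit code 124 and the trailing `+ sleep 1` in the container's termination message are consistent with a wait-until-port-free shell loop being killed by `timeout`. A minimal sketch, assuming a loop of this shape (the actual script added by PR #421 may differ):

```shell
#!/bin/sh
# Hedged sketch of a wait-until-port-free loop like the one the
# recovery-controller appears to run (an assumption, not the operator's
# real script). If another process -- here, HAProxy after it moved to
# 9443 -- never releases the port, `timeout` kills the loop and the
# container exits with code 124, the exitCode recorded in the pod status
# above.
PORT="${PORT:-9443}"
export PORT
timeout "${WAIT_SECS:-180}" sh -c '
  until [ -z "$(ss -Htln "sport = :${PORT}")" ]; do
    ss -tan "sport = :${PORT}"   # dump matching sockets, as seen in the crash log
    sleep 1
  done
'
status=$?
echo "wait loop exit status: ${status}"   # 124 if the port never freed up, 0 if it was free
```

On a healthy node the loop exits immediately and the container proceeds; with haproxy pinned to 9443 it can never succeed, which is why every master crash loops rather than just one.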