The "[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes" CI test is failing frequently. In particular, the test fails often with pods in the openshift-marketplace namespace pending the entire time. See these search results:

https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2F%5B%5E+%5D%2B+was+pending+entire+time&maxAge=336h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job

For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1359736927032446976 has the following error:

    fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
        ]
    to be empty

Bug 1882839 has a report of earlier failures where community-operators pods were crashlooping.
Checking the logs for the given example, it looks like the test is failing pods that had started only seconds before the check and were still pending when it ran. Here the pod's startTime is 2021-02-11T06:17:45Z and its status was logged at 06:17:53:

    Feb 11 06:17:53.263: INFO: Pod status openshift-marketplace/community-operators-zxvzd:
    {
      "phase": "Pending",
      "conditions": [
        {
          "type": "Initialized",
          "status": "True",
          "lastProbeTime": null,
          "lastTransitionTime": "2021-02-11T06:17:45Z"
        },
        {
          "type": "Ready",
          "status": "False",
          "lastProbeTime": null,
          "lastTransitionTime": "2021-02-11T06:17:45Z",
          "reason": "ContainersNotReady",
          "message": "containers with unready status: [registry-server]"
        },
        {
          "type": "ContainersReady",
          "status": "False",
          "lastProbeTime": null,
          "lastTransitionTime": "2021-02-11T06:17:45Z",
          "reason": "ContainersNotReady",
          "message": "containers with unready status: [registry-server]"
        },
        {
          "type": "PodScheduled",
          "status": "True",
          "lastProbeTime": null,
          "lastTransitionTime": "2021-02-11T06:17:45Z"
        }
      ],
      "hostIP": "10.0.32.4",
      "startTime": "2021-02-11T06:17:45Z",
      "containerStatuses": [
        {
          "name": "registry-server",
          "state": {
            "waiting": {
              "reason": "ContainerCreating"
            }
          },
          "lastState": {},
          "ready": false,
          "restartCount": 0,
          "image": "registry.redhat.io/redhat/community-operator-index:latest",
          "imageID": ""
        }
      ],
      "qosClass": "Burstable"
    }
    Feb 11 06:17:53.272: INFO: Running AfterSuite actions on all nodes
    Feb 11 06:17:53.272: INFO: Running AfterSuite actions on node 1
    fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
        ]
    to be empty

Checking the code, it seems like release-4.6 is missing some logic, which exists in master, to prevent such pods from failing the test:

- master: https://github.com/openshift/origin/blob/7e958d0a1fddefe8f47c50c40c33a9c5096f2d75/test/extended/operators/cluster.go#L140
- release-4.6: https://github.com/openshift/origin/blob/ae4a31dc9325a685e050768d80670071c242e6d8/test/extended/operators/cluster.go#L134

It looks like there's already an open PR against the tests in release-4.6: https://github.com/openshift/origin/pull/25600

If it looks good, maybe we can push it through.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1153