test:

[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

is failing frequently in CI, see search results:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2Fcommunity-operators.*was+pending+entire+time%3A+unknown+error&maxAge=24h&groupBy=job' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy - 12 runs, 100% failed, 17% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 19 runs, 21% failed, 25% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 14 runs, 29% failed, 25% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 11 runs, 9% failed, 200% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 11 runs, 36% failed, 25% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 11 runs, 64% failed, 14% of failures match
release-openshift-ocp-installer-e2e-metal-4.6 - 11 runs, 73% failed, 13% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 12 runs, 50% failed, 17% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 23 runs, 35% failed, 13% of failures match

For example, [1]:

fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-n8c79 was pending entire time: unknown error",
    ]
to be empty

with the test-case's stdout including:

Sep 25 20:28:17.236: INFO: Pod status openshift-marketplace/community-operators-n8c79:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2020-09-25T20:28:11Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2020-09-25T20:28:11Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },

In that job, the replacement pod seems to have come up fine:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504/artifacts/e2e-aws-proxy/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("community-operators-")).status.phase'
Running

I dunno if we have much else to go on about why registry-server was slow to come up.  Any ideas?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504
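One more thing we could pull from the same gather-extra artifact, in case it helps: the container statuses, which should carry the waiting reason for registry-server if the stuck pod was still around when artifacts were gathered (a sketch against the same pods.json; the pending pod may already have been replaced, in which case this only shows the healthy replacement):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504/artifacts/e2e-aws-proxy/gather-extra/pods.json | jq '.items[] | select(.metadata.name | startswith("community-operators-")) | .status.containerStatuses[]? | {name, ready, state}'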
This was almost certainly occurring because the catalog image for community operators is managed and published by an outside team, which was having issues bootstrapping the image at the time. Since the image is up and running now, I do not believe this is still an issue. The image itself should start essentially instantly: it now only needs to establish a connection to a database and then expose and serve an API. If this starts happening more consistently, we can revisit and hopefully provide better guidelines for the community maintainers. For now, I'm closing this as WORKSFORME.
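If it does resurface, one way to confirm the registry-server is actually serving once the pod reports Running would be something like the following (a sketch, assuming the standard operator-registry gRPC port 50051, that the catalog pod carries the olm.catalogSource=community-operators label, and that grpcurl is available locally; none of that is taken from the logs above):

$ # find the current community-operators catalog pod (assumes the olm.catalogSource label)
$ POD=$(oc -n openshift-marketplace get pods -l olm.catalogSource=community-operators -o jsonpath='{.items[0].metadata.name}')
$ # forward the registry's gRPC port and ask it to list its packages
$ oc -n openshift-marketplace port-forward "$POD" 50051:50051 &
$ grpcurl -plaintext localhost:50051 api.Registry/ListPackages

If the listing hangs or errors while the container is otherwise up, that would point at the image or its database rather than the cluster.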