Bug 1882839 - Pod openshift-marketplace/community-operators-... was pending entire time: registry-server container not ready
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-25 21:34 UTC by W. Trevor King
Modified: 2020-10-28 18:00 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Last Closed: 2020-10-28 18:00:09 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-09-25 21:34:46 UTC
test:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]

is failing frequently in CI, see search results:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2Fcommunity-operators.*was+pending+entire+time%3A+unknown+error&maxAge=24h&groupBy=job' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy - 12 runs, 100% failed, 17% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 19 runs, 21% failed, 25% of failures match
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-4.6 - 14 runs, 29% failed, 25% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 11 runs, 9% failed, 200% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 11 runs, 36% failed, 25% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 11 runs, 64% failed, 14% of failures match
release-openshift-ocp-installer-e2e-metal-4.6 - 11 runs, 73% failed, 13% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 12 runs, 50% failed, 17% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 23 runs, 35% failed, 13% of failures match

For example, [1]:

fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-n8c79 was pending entire time: unknown error",
    ]
to be empty

with the test-case's stdout including:

Sep 25 20:28:17.236: INFO: Pod status openshift-marketplace/community-operators-n8c79:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2020-09-25T20:28:11Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2020-09-25T20:28:11Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },

In that job, the replacement pod seems to have come up fine:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504/artifacts/e2e-aws-proxy/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("community-operators-")).status.phase'
Running
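
The same pods.json should also show whether registry-server actually went ready in the replacement pod; something along these lines (a sketch reusing the query above, not verified against this run) drills into the container statuses:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504/artifacts/e2e-aws-proxy/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name | startswith("community-operators-")).status.containerStatuses[] | "\(.name) ready=\(.ready) restarts=\(.restartCount)"'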

I'm not sure we have much else to go on for why registry-server was slow to come up.  Any ideas?
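
One thing that might help, assuming the usual gather-extra layout (events.json alongside pods.json), is pulling the events recorded against the pending pod; a rough sketch:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504/artifacts/e2e-aws-proxy/gather-extra/events.json | jq -r '.items[] | select(.involvedObject.name == "community-operators-n8c79") | "\(.lastTimestamp) \(.reason): \(.message)"'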

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309569756491157504

Comment 3 Kevin Rizza 2020-10-28 18:00:09 UTC
This was almost certainly occurring because the catalog image for community operators is managed and published by an outside team that was having issues bootstrapping the image. Since the image is up and running now, I do not believe this is still an issue. The image itself should start almost instantly, since it now only needs to establish a connection to a database and expose and serve an API.
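
For reference, the catalog pod's readiness is gated on the registry's gRPC health endpoint; assuming the usual CatalogSource pod spec (gRPC on port 50051, grpc_health_probe shipped in the image, server reflection enabled), a manual check against a live pod would look roughly like:

$ oc -n openshift-marketplace exec community-operators-n8c79 -c registry-server -- grpc_health_probe -addr=:50051
$ oc -n openshift-marketplace port-forward community-operators-n8c79 50051 &
$ grpcurl -plaintext localhost:50051 api.Registry/ListPackages | head

(The pod name here is just the one from the CI run above; any current community-operators pod would do.)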

If this starts happening more consistently, we can revisit and hopefully provide some better guidelines for the community maintainers. As of now, I'm closing this as WORKSFORME.

