Bug 1930537

Summary:	[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Product:	OpenShift Container Platform	Reporter:	Miciah Dashiel Butler Masters <mmasters>
Component:	OLM	Assignee:	Joe Lanford <jlanford>
OLM sub component:	OperatorHub	QA Contact:	Tom Buskey <tbuskey>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	jdelft, nhale
Version:	4.6
Target Milestone:	---
Target Release:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1937167 (view as bug list)		Environment:	[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Last Closed:	2021-04-20 19:27:20 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1937170
Bug Blocks:

Description Miciah Dashiel Butler Masters 2021-02-19 05:43:45 UTC

The "[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes" CI test is failing frequently.  In particular, the test fails often with pods in the openshift-marketplace namespace pending the entire time.  See these search results:
https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2F%5B%5E+%5D%2B+was+pending+entire+time&maxAge=336h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job


For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1359736927032446976 has the following error:

    fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
        ]
    to be empty

Bug 1882839 has a report of earlier failures where community-operators pods were crashlooping.

Comment 1 Nick Hale 2021-03-10 02:58:03 UTC

Checking the logs for the given example, it looks like the test is failing pods that started only seconds before the check and are still pending when it runs (here the pod's startTime is 06:17:45Z and its status is dumped at 06:17:53):

Feb 11 06:17:53.263: INFO: Pod status openshift-marketplace/community-operators-zxvzd:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "ContainersReady",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    }
  ],
  "hostIP": "10.0.32.4",
  "startTime": "2021-02-11T06:17:45Z",
  "containerStatuses": [
    {
      "name": "registry-server",
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "registry.redhat.io/redhat/community-operator-index:latest",
      "imageID": ""
    }
  ],
  "qosClass": "Burstable"
}
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on all nodes
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
    ]
to be empty

Checking the code, it seems like release-4.6 is missing some logic, which exists in master, to prevent such pods from failing the test:
- master: https://github.com/openshift/origin/blob/7e958d0a1fddefe8f47c50c40c33a9c5096f2d75/test/extended/operators/cluster.go#L140
- release-4.6: https://github.com/openshift/origin/blob/ae4a31dc9325a685e050768d80670071c242e6d8/test/extended/operators/cluster.go#L134

It looks like there's already an open PR against the tests in release-4.6: https://github.com/openshift/origin/pull/25600

If it looks good, maybe we can push it through.

Comment 10 errata-xmlrpc 2021-04-20 19:27:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153