1930537 – [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Joe Lanford
QA Contact:	Tom Buskey
Docs Contact:
URL:
Whiteboard:
Depends On:	1937170
Blocks:

Reported:	2021-02-19 05:43 UTC by Miciah Dashiel Butler Masters
Modified:	2021-04-20 19:27 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1937167
Environment:	[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Last Closed:	2021-04-20 19:27:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25600	0	None	open	[release-4.6] test: improve handling of pending pods crashloop detection test	2021-03-10 02:58:59 UTC
Red Hat Product Errata	RHBA-2021:1153	0	None	None	None	2021-04-20 19:27:48 UTC

Description Miciah Dashiel Butler Masters 2021-02-19 05:43:45 UTC

The "[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes" CI test is failing frequently.  In particular, the test fails often with pods in the openshift-marketplace namespace pending the entire time.  See these search results:
https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2F%5B%5E+%5D%2B+was+pending+entire+time&maxAge=336h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job


For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1359736927032446976 has the following error:

    fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
        ]
    to be empty

Bug 1882839 has a report of earlier failures where community-operators pods were crashlooping.

Comment 1 Nick Hale 2021-03-10 02:58:03 UTC

Checking the logs for the given example, it looks like the test is failing on pods that started only seconds before the check and were still pending when it ran (here the pod's startTime is 06:17:45 and the status was logged at 06:17:53):

Feb 11 06:17:53.263: INFO: Pod status openshift-marketplace/community-operators-zxvzd:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "ContainersReady",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    }
  ],
  "hostIP": "10.0.32.4",
  "startTime": "2021-02-11T06:17:45Z",
  "containerStatuses": [
    {
      "name": "registry-server",
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "registry.redhat.io/redhat/community-operator-index:latest",
      "imageID": ""
    }
  ],
  "qosClass": "Burstable"
}
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on all nodes
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
    ]
to be empty

Checking the code, it seems like release-4.6 is missing some logic, which exists in master, to prevent such pods from failing the test:
- master: https://github.com/openshift/origin/blob/7e958d0a1fddefe8f47c50c40c33a9c5096f2d75/test/extended/operators/cluster.go#L140
- release-4.6: https://github.com/openshift/origin/blob/ae4a31dc9325a685e050768d80670071c242e6d8/test/extended/operators/cluster.go#L134

It looks like there's already an open PR against the tests in release-4.6: https://github.com/openshift/origin/pull/25600

If it looks good, maybe we can push it through.

Comment 10 errata-xmlrpc 2021-04-20 19:27:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153
