Bug 1930537 - [sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Joe Lanford
QA Contact: Tom Buskey
URL:
Whiteboard:
Depends On: 1937170
Blocks:
Reported: 2021-02-19 05:43 UTC by Miciah Dashiel Butler Masters
Modified: 2021-04-20 19:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1937167
Environment:
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes
Last Closed: 2021-04-20 19:27:20 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25600 | 0 | None | open | [release-4.6] test: improve handling of pending pods crashloop detection test | 2021-03-10 02:58:59 UTC
Red Hat Product Errata | RHBA-2021:1153 | 0 | None | None | None | 2021-04-20 19:27:48 UTC

Description Miciah Dashiel Butler Masters 2021-02-19 05:43:45 UTC
The "[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes" CI test is failing frequently.  In particular, the test fails often with pods in the openshift-marketplace namespace pending the entire time.  See these search results:
https://search.ci.openshift.org/?search=Pod+openshift-marketplace%2F%5B%5E+%5D%2B+was+pending+entire+time&maxAge=336h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job


For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1359736927032446976 has the following error:

    fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
        <[]string | len:1, cap:1>: [
            "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
        ]
    to be empty

Bug 1882839 has a report of earlier failures where community-operators pods were crashlooping.

Comment 1 Nick Hale 2021-03-10 02:58:03 UTC
Checking the logs for the given example, it looks like the test is failing pods that started -- and are still pending -- within seconds of the check:

Feb 11 06:17:53.263: INFO: Pod status openshift-marketplace/community-operators-zxvzd:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "ContainersReady",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [registry-server]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-02-11T06:17:45Z"
    }
  ],
  "hostIP": "10.0.32.4",
  "startTime": "2021-02-11T06:17:45Z",
  "containerStatuses": [
    {
      "name": "registry-server",
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "registry.redhat.io/redhat/community-operator-index:latest",
      "imageID": ""
    }
  ],
  "qosClass": "Burstable"
}
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on all nodes
Feb 11 06:17:53.272: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/operators/cluster.go:151]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-zxvzd was pending entire time: unknown error",
    ]
to be empty

Checking the code, it seems like release-4.6 is missing some logic, which exists in master, to prevent such pods from failing the test:
- master: https://github.com/openshift/origin/blob/7e958d0a1fddefe8f47c50c40c33a9c5096f2d75/test/extended/operators/cluster.go#L140
- release-4.6: https://github.com/openshift/origin/blob/ae4a31dc9325a685e050768d80670071c242e6d8/test/extended/operators/cluster.go#L134

It looks like there's already an open PR against this test in release-4.6: https://github.com/openshift/origin/pull/25600

If it looks good, maybe we can push it through.

Comment 10 errata-xmlrpc 2021-04-20 19:27:20 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153

