1949991 – openshift-marketplace pods are crashlooping

Summary: openshift-marketplace pods are crashlooping

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Anik
QA Contact:	Tom Buskey
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1949337 1951617
Depends On:
Blocks:

Reported:	2021-04-15 14:39 UTC by Oleg Bulatov
Modified:	2021-07-27 23:01 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over four minutes [Suite:openshift/conformance/parallel]
Last Closed:	2021-07-27 23:01:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 26113	0	None	open	Bug 1949991: Ignore Catalog update pods in openshift/conformance/parallel	2021-04-28 23:09:35 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:01:31 UTC

Description Oleg Bulatov 2021-04-15 14:39:15 UTC

The test "Managed cluster should have no crashlooping pods in core namespaces over four minutes" found that pods in the openshift-marketplace namespace are sometimes crashlooping.

https://triage.dptools.openshift.org/?text=openshift-marketplace&job=4.8&test=Managed%20cluster%20should%20have%20no%20crashlooping%20pods%20in%20core%20namespaces%20over%20four%20minutes

Comment 1 Daniel Sover 2021-04-20 16:57:33 UTC


*** This bug has been marked as a duplicate of bug 1949337 ***

Comment 2 Anik 2021-04-20 20:11:32 UTC

Closing 1949337 as the duplicate and re-opening this one for tracking.

Comment 3 Anik 2021-04-20 20:12:05 UTC

*** Bug 1949337 has been marked as a duplicate of this bug. ***

Comment 4 Anik 2021-04-20 20:18:22 UTC

Looks like this could have been caused by bad default CatalogSource images built in the pipeline. It appears to be transient, however, and the images have been fixed since the last case of crashlooping pods. Running the marketplace-operator from latest master in a 4.8.0-0.ci-2021-04-20-092252 cluster did not show any crashlooping pods in the openshift-marketplace namespace:

```
$ oc get pods 
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-ttbgw               1/1     Running   0          121m
community-operators-htdwg               1/1     Running   0          121m
marketplace-operator-7fd8f5d9fb-tbzld   1/1     Running   0          17m
redhat-marketplace-6trz7                1/1     Running   0          121m
redhat-operators-p6n9q                  1/1     Running   0          105m

```
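A listing like the one above can also be checked mechanically. The following is a minimal Python sketch, not part of the original comment: the function name `restarted_pods` is made up for illustration, and it assumes the default column layout of `oc get pods` shown here. It reports every pod whose RESTARTS count is above zero.

```python
def restarted_pods(table: str) -> list[tuple[str, int]]:
    """Return (pod name, restart count) for every pod with at least one restart."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    # Locate columns by header name instead of hard-coding positions.
    name_col = header.index("NAME")
    restarts_col = header.index("RESTARTS")
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        restarts = int(fields[restarts_col])
        if restarts > 0:
            flagged.append((fields[name_col], restarts))
    return flagged


# Output copied from the listing in this comment.
SAMPLE = """\
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-ttbgw               1/1     Running   0          121m
community-operators-htdwg               1/1     Running   0          121m
marketplace-operator-7fd8f5d9fb-tbzld   1/1     Running   0          17m
redhat-marketplace-6trz7                1/1     Running   0          121m
redhat-operators-p6n9q                  1/1     Running   0          105m
"""

print(restarted_pods(SAMPLE))  # prints []
```

An empty result only shows that no pod had restarted at the time of the listing. It does not rule out short-lived pods that were created and deleted in between.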

Comment 5 Oleg Bulatov 2021-04-21 13:52:33 UTC

The problem is still there, please don't close this BZ until the CI failure rate decreases:

https://triage.dptools.openshift.org/?text=openshift-marketplace&job=4.8&test=Managed%20cluster%20should%20have%20no%20crashlooping%20pods%20in%20core%20namespaces%20over%20four%20minutes

An example of a failed job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-azure/1384791615964450816

Failure message:

fail [github.com/openshift/origin/test/extended/operators/cluster.go:160]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-marketplace/community-operators-28lkj is not healthy: container registry-server exited with non-zero exit code",
    ]
to be empty

Either these pods should be fixed and shouldn't have non-zero exit codes, or an exception should be added for them [1]. As this pod is not present after e2e tests have finished, most likely this pod was created (and deleted) by a test.

[1]: https://github.com/openshift/origin/blob/e945cb88da780e21c021b6c8b430454bcfb881cf/test/extended/operators/cluster.go#L47
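To illustrate the second option (adding an exception), here is a rough Python sketch of filtering "not healthy" messages against an allow-list. It is not the code in origin's cluster.go, which is written in Go and is not reproduced in this report. The names `KNOWN_TRANSIENT` and `unexpected_failures` and the regular expression are assumptions made for illustration. The catalog pod name prefixes are taken from the pod listing in comment 4.

```python
import re

# Hypothetical allow-list: messages about the default catalog pods in
# openshift-marketplace. The pattern is an assumption, not the real exception.
KNOWN_TRANSIENT = [
    re.compile(
        r"^Pod openshift-marketplace/"
        r"(certified-operators|community-operators|redhat-marketplace|redhat-operators)"
        r"-[a-z0-9]+ is not healthy"
    ),
]


def unexpected_failures(messages, exceptions=KNOWN_TRANSIENT):
    """Keep only the messages that no exception pattern matches."""
    return [m for m in messages if not any(p.search(m) for p in exceptions)]


# The failure message quoted in this comment.
messages = [
    "Pod openshift-marketplace/community-operators-28lkj is not healthy: "
    "container registry-server exited with non-zero exit code",
]

print(unexpected_failures(messages))  # prints []
```

An exception like this hides the symptom without explaining why registry-server exits, so it only suits pods that are expected to be short-lived.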

Comment 6 Kevin Rizza 2021-04-21 19:09:50 UTC

Moving the priority of this BZ to urgent, given that it is blocking CI.

Comment 7 Anik 2021-04-21 20:53:57 UTC

Since the pods are deleted after the tests, the test logs give no indication of why the pods were being killed in a loop, apart from the log line stating that the registry container exited with exit code 2. Opened https://bugzilla.redhat.com/show_bug.cgi?id=1952238 to report logs from catalog pods back to the catalog operator on termination.
Once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1952238 goes in, we'll investigate the logs again.

Comment 8 Anik 2021-04-21 21:04:57 UTC

*** Bug 1951617 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2021-07-27 23:01:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
