Bug 1949279 - Default catalog sources not available after upgrade from 4.5 to 4.6
Summary: Default catalog sources not available after upgrade from 4.5 to 4.6
Keywords:
Status: CLOSED DUPLICATE of bug 2026343
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Alexander Greene
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-13 20:25 UTC by John McMeeking
Modified: 2022-01-05 19:45 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-05 19:45:58 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather output (8.57 MB, application/gzip)
2021-04-14 20:34 UTC, John McMeeking
control plane logs - apiserver, etc (7.45 MB, application/gzip)
2021-04-14 20:38 UTC, John McMeeking

Description John McMeeking 2021-04-13 20:25:40 UTC
Description of problem: After upgrading from 4.5 to 4.6, OperatorHub shows "No OperatorHub Items Found" or lists operators from only some of the default catalog sources. Deleting the openshift-marketplace pods associated with the missing content fixes the problem; sometimes a pod has to be deleted more than once.
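The pod-deletion workaround can be sketched roughly as below. This is a hypothetical sketch, not a command from this report; it assumes the catalog registry pods carry the `olm.catalogSource` label, which may vary by release.

```shell
# Hypothetical workaround sketch: delete the registry pod behind each default
# catalog source so OLM recreates it (repeat for any source still missing).
for cs in certified-operators community-operators redhat-marketplace redhat-operators; do
  oc delete pod -n openshift-marketplace -l "olm.catalogSource=${cs}"
done
```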

At this time 4.6 is the latest OpenShift release available on IBM Cloud.


Version-Release number of selected component (if applicable):
Red Hat OpenShift on IBM Cloud v4.6.22.  Also seen on earlier 4.6 releases and upgrades from various 4.5 releases.


How reproducible:
Seems easily reproduced. In 3 upgrade attempts, operators from one or more of the catalog sources were always missing.


Steps to Reproduce:
1. Deploy Red Hat OpenShift on IBM Cloud v4.5
2. Upgrade cluster to v4.6
3. Wait an hour to be sure everything has stabilized, then open the OpenShift Console and go to OperatorHub.

Actual results:
OperatorHub shows "No OperatorHub Items Found" or lists operators from only some of the default catalog sources.


Expected results:
OperatorHub shows content from all 4 catalog sources.

Additional info:
The packageserver pods in openshift-operator-lifecycle-manager log messages like:

time="2021-04-09T16:27:15Z" level=info msg="connecting to source" action="sync catalogsource" address="..svc:" name=certified-operators namespace=openshift-marketplace
time="2021-04-09T16:27:43Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="..svc:" name=certified-operators namespace=openshift-marketplace
time="2021-04-09T16:27:43Z" level=warning msg="error getting stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup ..svc: no such host\"" source="{certified-operators openshift-marketplace}"

The address "..svc:" seems to be the immediate cause. Once everything is working, these logs show the expected address: "certified-operators.openshift-marketplace.svc:50051".
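The malformed address is consistent with the endpoint being assembled from empty fields. As a rough illustration (an assumption about the pattern, not the actual OLM code, which is Go), the gRPC address appears to follow "<service>.<namespace>.svc:<port>", which collapses to "..svc:" when all three parts are empty:

```shell
# Sketch: with empty service name, namespace, and port, the address template
# "<service>.<namespace>.svc:<port>" degenerates to exactly "..svc:".
name=""; namespace=""; port=""
echo "${name}.${namespace}.svc:${port}"

# With the fields populated, it yields the healthy form seen in the logs.
name="certified-operators"; namespace="openshift-marketplace"; port="50051"
echo "${name}.${namespace}.svc:${port}"
```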


I don't see any errors in the "certified-operators" pod log (or the others).  Before and after they have the same single message:
time="2021-04-09T16:27:31Z" level=info msg="serving registry" database=/database/index.db port=50051


'oc get co' shows all clusteroperators successfully upgraded to 4.6.22.

A customized 'oc get catalogsource' shows this information from '.status.connectionState' before restarting pods:
NAME                  ADDRESS                                               STATE
certified-operators   certified-operators.openshift-marketplace.svc:50051   TRANSIENT_FAILURE
community-operators   ..svc:                                                TRANSIENT_FAILURE
redhat-marketplace    ..svc:                                                TRANSIENT_FAILURE
redhat-operators      ..svc:                                                TRANSIENT_FAILURE

Even after restarting all the pods and seeing proper content in OperatorHub, the connectionState info reports inconsistent addresses:
NAME                  ADDRESS                                               STATE
certified-operators   certified-operators.openshift-marketplace.svc:50051   READY
community-operators   ..svc:                                                READY
redhat-marketplace    redhat-marketplace.openshift-marketplace.svc:50051    READY
redhat-operators      redhat-operators.openshift-marketplace.svc:50051      READY
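Output like the tables above can be screened for malformed addresses with standard tooling. A sketch, using the before-restart table pasted as a heredoc for illustration (the `custom-columns` spec in the comment is an assumption reconstructed from the fields shown):

```shell
# The table likely comes from something like:
#   oc get catalogsource -n openshift-marketplace -o custom-columns=\
#     NAME:.metadata.name,ADDRESS:.status.connectionState.address,STATE:.status.connectionState.lastObservedState
# Print the names of sources whose ADDRESS does not match the expected
# "<name>.openshift-marketplace.svc:<port>" form.
awk 'NR > 1 && $2 !~ /\.openshift-marketplace\.svc:[0-9]+$/ { print $1 }' <<'EOF'
NAME                  ADDRESS                                               STATE
certified-operators   certified-operators.openshift-marketplace.svc:50051   TRANSIENT_FAILURE
community-operators   ..svc:                                                TRANSIENT_FAILURE
redhat-marketplace    ..svc:                                                TRANSIENT_FAILURE
redhat-operators      ..svc:                                                TRANSIENT_FAILURE
EOF
```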

Comment 1 Kevin Rizza 2021-04-14 13:00:58 UTC
It seems like this might be running into some kind of timing failure. In 4.6 the CatalogSources are backed by real catalog images rather than app-registry backed OperatorSources (which have been removed). I'm wondering if there is some transient network or registry connection problem that the CatalogSources are timing out on.

Can you please provide a must gather for this cluster?

Comment 2 John McMeeking 2021-04-14 15:09:00 UTC
@krizza Which control plane logs do you need (e.g. kube-apiserver, kube-controller-manager, openshift-apiserver, openshift-controller-manager, ...)? I'll need to collect those separately.

Comment 3 John McMeeking 2021-04-14 20:34:53 UTC
Created attachment 1771980 [details]
must-gather output

Comment 4 John McMeeking 2021-04-14 20:38:04 UTC
Created attachment 1771982 [details]
control plane logs - apiserver, etc

I have attached the must-gather output and control plane logs (kube-apiserver, openshift-apiserver, etc.). The cluster I created for this is currently displaying "No OperatorHub Items Found".

Comment 6 Marek Czernek 2021-09-09 14:48:25 UTC
We run into this issue from time to time in our OpenShift-based training. I have logged https://training-feedback.redhat.com/browse/DO328-30.

Let me know if I can help with this ticket. It seems like a race condition to me: our marketplace is healthy, but when users turn the environment off and on (and therefore restart all cluster nodes), we sometimes see that the marketplace does not come up. In the events, I can see logging about reinstalling the packageserver operator, which fails.

After manually removing the associated pods, the installation succeeds.

Comment 7 Anik 2021-11-01 19:21:16 UTC
Marek, 

Looks like there was an issue with the lab environment. I tried to reproduce the issue by launching a 4.5 cluster and then upgrading it to 4.6, but was unable to see any issue there. I'm assuming the lab environment has been fixed.

Closing this on our side.

Comment 8 Marek Czernek 2021-11-02 07:53:45 UTC
Anik,

I have closed the Jira I had linked because I implemented a workaround in our scripts (kill the pod, which means the pod restarts). I am not the original creator of this bug, so closing the entire issue based on a related Jira is a bit strange.

Facts of our environment:

1. The issue is not upgrade related. "oc api-resources" exited with 0 in 4.5, but exits with 1 in 4.6.
2. The issue may be IBM Cloud related. Our OCP environment runs on OSP on the IBM Cloud, AFAIK.
3. I don't know what you tried in our lab environment, but when I provision DO328, log in to OCP, and execute "oc api-resources", I still get the failure. Without the patch in our lab scripts, the env wouldn't be in a functional state. I have just replicated it to be sure:

- Provision environment
- Wait until OCP comes up to log in
- oc api-resources fails

Comment 9 Kevin Rizza 2022-01-05 19:45:58 UTC
This appears to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2026343. Closing as duplicate.

*** This bug has been marked as a duplicate of bug 2026343 ***

