Description of problem:
After upgrading from 4.5 to 4.6, OperatorHub shows "No OperatorHub Items Found" or lists operators from only some of the default catalog sources. Deleting the openshift-marketplace pods associated with the missing content fixes the problem. Sometimes a pod has to be deleted more than once. At this time 4.6 is the latest OpenShift release available on IBM Cloud.

Version-Release number of selected component (if applicable):
Red Hat OpenShift on IBM Cloud v4.6.22. Also seen on earlier 4.6 releases and upgrades from various 4.5 releases.

How reproducible:
Easily. In 3 upgrade attempts I was always missing operators from one or more of the catalog sources.

Steps to Reproduce:
1. Deploy Red Hat OpenShift on IBM Cloud v4.5
2. Upgrade the cluster to v4.6
3. Wait an hour to be sure everything has stabilized, then open the OpenShift Console and go to OperatorHub.

Actual results:
OperatorHub shows "No OperatorHub Items Found" or lists operators from only some of the default catalog sources.

Expected results:
OperatorHub shows content from all 4 catalog sources.

Additional info:
The packageserver pods in openshift-operator-lifecycle-manager log messages like:

time="2021-04-09T16:27:15Z" level=info msg="connecting to source" action="sync catalogsource" address="..svc:" name=certified-operators namespace=openshift-marketplace
time="2021-04-09T16:27:43Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="..svc:" name=certified-operators namespace=openshift-marketplace
time="2021-04-09T16:27:43Z" level=warning msg="error getting stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup ..svc: no such host\"" source="{certified-operators openshift-marketplace}"

The empty address "..svc:" seems to be the immediate cause.
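The pod-deletion workaround described above can be scripted. A minimal sketch, assuming the registry pods carry the standard `olm.catalogSource=<name>` label (an assumption about how OLM labels these pods, not something stated in this report):

```shell
# Hedged sketch of the workaround: delete the registry pod backing a catalog
# source so it is recreated, hopefully with a working service address.
# The olm.catalogSource label selector is an assumption.
restart_catalogsource() {
  src="$1"
  oc -n openshift-marketplace delete pod -l "olm.catalogSource=$src"
}

# Only attempt the deletion when a cluster client is actually available.
if command -v oc >/dev/null 2>&1; then
  for src in certified-operators community-operators redhat-marketplace redhat-operators; do
    restart_catalogsource "$src"
  done
fi
```

As noted above, a pod sometimes has to be deleted more than once before its catalog content appears.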
Once everything is working, those pods log the expected address - "certified-operators.openshift-marketplace.svc:50051".

I don't see any errors in the "certified-operators" pod log (or the others). Before and after they have the same single message:

time="2021-04-09T16:27:31Z" level=info msg="serving registry" database=/database/index.db port=50051

'oc get co' shows all clusteroperators successfully upgraded to 4.6.22.

A customized 'oc get catalogsource' shows this information from '.status.connectionState' before restarting pods:

NAME                  ADDRESS                                               STATE
certified-operators   certified-operators.openshift-marketplace.svc:50051   TRANSIENT_FAILURE
community-operators   ..svc:                                                TRANSIENT_FAILURE
redhat-marketplace    ..svc:                                                TRANSIENT_FAILURE
redhat-operators      ..svc:                                                TRANSIENT_FAILURE

Even after restarting all the pods and seeing proper content in OperatorHub, the connectionState info reports inconsistent addresses:

NAME                  ADDRESS                                               STATE
certified-operators   certified-operators.openshift-marketplace.svc:50051   READY
community-operators   ..svc:                                                READY
redhat-marketplace    redhat-marketplace.openshift-marketplace.svc:50051    READY
redhat-operators      redhat-operators.openshift-marketplace.svc:50051      READY
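The exact customized command isn't shown in the report; one way to produce equivalent NAME/ADDRESS/STATE output is a custom-columns query against the CatalogSource status (the column paths below are an assumption based on the `.status.connectionState` fields mentioned above):

```shell
# Sketch of a custom-columns query yielding output like the tables above.
# The JSONPath column specs are assumptions, not the reporter's exact command.
COLS='NAME:.metadata.name,ADDRESS:.status.connectionState.address,STATE:.status.connectionState.lastObservedState'

# Only query when a cluster client is actually available.
if command -v oc >/dev/null 2>&1; then
  oc -n openshift-marketplace get catalogsource -o custom-columns="$COLS"
fi
```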
It seems like this might be running into some kind of timing failure. In 4.6 the CatalogSources are backed by real catalog images rather than app-registry backed OperatorSources (which have been removed). I'm wondering if there is some transient network or registry connection problem that the CatalogSources are timing out on. Can you please provide a must-gather for this cluster?
@krizza What control plane logs do you need (i.e. kube-apiserver, kube-controller-manager, openshift-apiserver, openshift-controller-manager, ...) I'll need to collect any of those separately.
Created attachment 1771980 [details] must-gather output
Created attachment 1771982 [details] control plane logs - apiserver, etc

I have attached the must-gather output and control plane logs (kube-apiserver, openshift-apiserver, etc.). The cluster I created for this is currently displaying "No OperatorHub Items Found".
We run into this issue from time to time in our OpenShift-based training. I have logged https://training-feedback.redhat.com/browse/DO328-30. Let me know if I can help in this ticket. Seems like a race condition to me: our marketplace is healthy, but when users turn the environment off and on (and therefore restart all cluster nodes), we sometimes see that the marketplace does not come up. In the events, I can see logging about reinstalling the packageserver operator, which fails. After manually removing the associated pods, the installation succeeds.
Marek, it looks like there was an issue with the lab environment. I tried to reproduce the issue by launching a 4.5 cluster and then upgrading it to 4.6, but was unable to see any issue there. I'm assuming the lab environment has been fixed. Closing this on our side.
Anik, I have closed the Jira I had linked because I implemented a workaround in our scripts (kill the pod, which means the pod restarts). I am not the original creator of this bug, so closing the entire issue based on a related Jira is a bit strange.

Facts about our environment:
1. The issue is not upgrade related. "oc api-resources" exited with 0 in 4.5, and exits with 1 in 4.6.
2. The issue may be IBM Cloud related. Our OCP environment runs on OSP on the IBM cloud AFAIK.
3. I don't know what you tried in our lab environment; when I provision DO328, log in to OCP, and execute oc api-resources, it still fails. Without the patch in our lab scripts, the environment wouldn't be in a functional state.

I have just replicated it to be sure:
- Provision environment
- Wait until OCP comes up, then log in
- oc api-resources fails
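The "oc api-resources exits with 1" symptom from fact 1 can be turned into a health check in a provisioning script by polling the exit status. A minimal sketch; the attempt count and sleep interval are illustrative choices, not values from our scripts:

```shell
# Illustrative health check: oc api-resources exits non-zero while any
# aggregated API (e.g. the packageserver APIService) is unavailable, so
# retry until it succeeds or the attempt budget runs out.
wait_for_api_resources() {
  attempts="${1:-30}"   # arbitrary default: 30 tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if oc api-resources >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 10            # arbitrary interval between tries
  done
  return 1
}
```

A provisioning script could call `wait_for_api_resources || { restart the marketplace pods; wait_for_api_resources; }` to apply the pod-kill workaround only when the check fails.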
This appears to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2026343. Closing as duplicate. *** This bug has been marked as a duplicate of bug 2026343 ***