Description of problem:
IHAC reporting an issue with OperatorHub not showing all of the provider types. Checking the catalog source pods, we do see errors from the proxy while connecting to the sources mentioned in the CatalogSource, but no alert is fired to report this or to point out that the pods need a restart once the network issue is resolved. Initially there were alerts for OperatorSources, but those appear to be deprecated in 4.6, along with their alerts. Can we have equivalent alerts configured for CatalogSources in 4.6 and later versions?

Version-Release number of selected component (if applicable):
4.6.16

How reproducible:

Steps to Reproduce:
1. Start a 4.6 cluster.
2. Kill any of the registry pods (community-operators, redhat-marketplace, etc.) in the openshift-marketplace namespace.

Actual results:
No alerts fired.

Expected results:
Alerts show up in the dashboard about the problematic default source.

Additional info:
There were Prometheus rules for firing alerts in these scenarios up until 4.5:
https://github.com/operator-framework/operator-marketplace/blob/release-4.5/manifests/12_prometheus_rule.yaml
This is missing from the main branch right now:
https://github.com/operator-framework/operator-marketplace/blob/master/manifests/12_prometheus_rule.yaml
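For reference, the kind of per-catalog alert being requested could be expressed as a PrometheusRule along these lines. This is only a sketch: the `catalogsource_ready` metric name, the 10-minute window, and the severity are assumptions here, not necessarily what the restored manifest uses; check the actual 12_prometheus_rule.yaml in operator-framework/operator-marketplace for the real expression.

```yaml
# Sketch of a PrometheusRule for one of the default catalog sources.
# NOTE: catalogsource_ready is an ASSUMED metric name for illustration;
# the shipped manifest may use a different metric and expression.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: marketplace-alert-rules
  namespace: openshift-marketplace
spec:
  groups:
  - name: marketplace.rules
    rules:
    - alert: CommunityOperatorsCatalogError
      # Fires when the community-operators CatalogSource reports not-ready
      # for 10 consecutive minutes (window chosen here for illustration).
      expr: catalogsource_ready{name="community-operators"} == 0
      for: 10m
      labels:
        severity: warning
      annotations:
        message: The community-operators CatalogSource has not been ready for over 10 minutes.
```

Analogous rules would cover certified-operators, redhat-operators, and redhat-marketplace.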
Jatan,

> Checking the catalog sources pods, we do see errors from the proxy while connecting to the mentioned sources in the catalogsource but there is no alert with this reported mentioning the requirement of pod restart post the network issue getting resolved.

In 4.6, the "marketplace" catalogs included in OpenShift were switched from being sourced from AppRegistry to being sourced from a data store baked into the catalog images themselves. This means that the catalog pods don't make any external requests at runtime, so the issue doesn't seem to make much sense as described.

That being said, I'm going to close this out. If you feel there's been a misunderstanding, please re-open this BZ with reproduction steps, expected results, and actual results on a 4.6+ cluster.

Thanks!
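For context, the image-based catalogs described above are grpc-type CatalogSources: the catalog data ships inside the index image, and the pod serves it over gRPC without reaching out to AppRegistry at runtime. A minimal sketch of such a source (the image reference and tag here are illustrative, not authoritative):

```yaml
# Sketch of an image-based (grpc) CatalogSource as used by the default
# catalogs since 4.6. The image reference below is an example only.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: community-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc                # catalog is served over gRPC by the pod
  image: registry.redhat.io/redhat/community-operator-index:v4.6
  displayName: Community Operators
  publisher: Red Hat
```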
verify:
zhaoxia@xzha-mac bug-1945548 % oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-27-030333   True        False         62m     Cluster version is 4.8.0-0.nightly-2021-05-27-030333

1. Mark all worker nodes as unschedulable:
zhaoxia@xzha-mac bug-1945548 % oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-dfls75b-f76d1-lp7sf-master-0         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-master-1         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-master-2         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-b-j2q5q   Ready    worker   47m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-c-pv2n7   Ready    worker   44m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-d-llvnf   Ready    worker   44m   v1.21.0-rc.0+bb94839
zhaoxia@xzha-mac bug-1945548 % oc adm cordon ci-ln-dfls75b-f76d1-lp7sf-worker-b-j2q5q ci-ln-dfls75b-f76d1-lp7sf-worker-c-pv2n7 ci-ln-dfls75b-f76d1-lp7sf-worker-d-llvnf

2. Delete the certified-operators and community-operators pods:
zhaoxia@xzha-mac bug-1945548 % oc delete pod certified-operators-88s99
pod "certified-operators-88s99" deleted
zhaoxia@xzha-mac bug-1945548 % oc delete pod community-operators-wtzdp
pod "community-operators-wtzdp" deleted

3. Check pod status:
zhaoxia@xzha-mac bug-1945548 % oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-cplq7               0/1     Pending   0          26m
certified-operators-t84zp               0/1     Pending   0          29m
community-operators-nm2hp               0/1     Pending   0          21m
community-operators-rmwbs               0/1     Pending   0          29m
marketplace-operator-64586f9897-bcbwm   1/1     Running   0          88m
redhat-marketplace-dffn8                1/1     Running   0          86m
redhat-marketplace-qhmmc                0/1     Pending   0          21m
redhat-operators-qn8mw                  0/1     Pending   0          30m
redhat-operators-smmlq                  1/1     Running   0          32m

4. Check metrics and alerts.
Metrics:
https://user-images.githubusercontent.com/77608951/119802706-5d900180-bf11-11eb-948e-c1e719c4c8b6.png
https://user-images.githubusercontent.com/77608951/119802877-857f6500-bf11-11eb-9cb5-2420e7e1cbc7.png

The CommunityOperatorsCatalogError and CertifiedOperatorsCatalogError alerts are not firing, even though alert rules with those names exist:
https://user-images.githubusercontent.com/77608951/119803283-e313b180-bf11-11eb-8af3-f8eb442fb437.png
https://user-images.githubusercontent.com/77608951/119803550-240bc600-bf12-11eb-9fc3-6363322a260a.png

Verification failed.
Verify:
[root@preserve-olm-agent-test ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         10m     Cluster version is 4.8.0-0.nightly-2021-05-29-114625

1. Mark all worker nodes as unschedulable:
[root@preserve-olm-agent-test ~]# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-133-33.us-east-2.compute.internal    Ready    worker   33m   v1.21.0-rc.0+4b2b6ff
ip-10-0-148-42.us-east-2.compute.internal    Ready    master   36m   v1.21.0-rc.0+4b2b6ff
ip-10-0-172-186.us-east-2.compute.internal   Ready    worker   32m   v1.21.0-rc.0+4b2b6ff
ip-10-0-177-55.us-east-2.compute.internal    Ready    master   36m   v1.21.0-rc.0+4b2b6ff
ip-10-0-196-44.us-east-2.compute.internal    Ready    master   40m   v1.21.0-rc.0+4b2b6ff
ip-10-0-202-160.us-east-2.compute.internal   Ready    worker   33m   v1.21.0-rc.0+4b2b6ff
[root@preserve-olm-agent-test ~]# oc adm cordon ip-10-0-133-33.us-east-2.compute.internal ip-10-0-172-186.us-east-2.compute.internal ip-10-0-202-160.us-east-2.compute.internal
node/ip-10-0-133-33.us-east-2.compute.internal cordoned
node/ip-10-0-172-186.us-east-2.compute.internal cordoned
node/ip-10-0-202-160.us-east-2.compute.internal cordoned
[root@preserve-olm-agent-test ~]# oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-133-33.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   34m   v1.21.0-rc.0+4b2b6ff
ip-10-0-148-42.us-east-2.compute.internal    Ready                      master   37m   v1.21.0-rc.0+4b2b6ff
ip-10-0-172-186.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   33m   v1.21.0-rc.0+4b2b6ff
ip-10-0-177-55.us-east-2.compute.internal    Ready                      master   37m   v1.21.0-rc.0+4b2b6ff
ip-10-0-196-44.us-east-2.compute.internal    Ready                      master   41m   v1.21.0-rc.0+4b2b6ff
ip-10-0-202-160.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   33m   v1.21.0-rc.0+4b2b6ff

2. Delete the redhat-operators and community-operators pods:
[root@preserve-olm-agent-test ~]# oc get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
a044a04a2dc6f2e2698bbf09dcbb0874c8f72e00cd06b114c381d0437bnmzjx   0/1     Completed   0          9m12s
cd41f97077889e6ec45c6e342d9e32c7296fefc73c482440a60cfa4e48jwhwh   0/1     Completed   0          9m12s
certified-operators-vg7hk                                         1/1     Running     0          11m
community-operators-6nc2q                                         1/1     Running     0          11m
marketplace-operator-745fbf4bfd-hc4b6                             1/1     Running     0          11m
qe-app-registry-rb5ps                                             1/1     Running     0          9m
redhat-marketplace-bj4n8                                          1/1     Running     0          11m
redhat-operators-fktt9                                            1/1     Running     0          11m
[root@preserve-olm-agent-test ~]# oc delete pod community-operators-6nc2q
pod "community-operators-6nc2q" deleted
[root@preserve-olm-agent-test ~]# oc delete pod redhat-operators-fktt9
pod "redhat-operators-fktt9" deleted

3. Check metrics and alerts:
https://user-images.githubusercontent.com/77608951/120154328-596e2780-c222-11eb-9474-b3078a4d6e51.png
https://user-images.githubusercontent.com/77608951/120156670-c8e51680-c224-11eb-8ef8-2f8bf6ba70e5.png
https://user-images.githubusercontent.com/77608951/120156800-eb772f80-c224-11eb-9cb5-89c1ffbeb5a5.png

The alerts CommunityOperatorsCatalogError, CertifiedOperatorsCatalogError, RedhatOperatorsCatalogError, and RedhatMarketplaceCatalogError fired.

LGTM, verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438