Bug 1936585

Summary: configure alerts if the catalogsources are missing
Product: OpenShift Container Platform Reporter: Jatan Malde <jmalde>
Component: OLM Assignee: Anik <anbhatta>
OLM sub component: OperatorHub QA Contact: xzha
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: anbhatta, kuiwang, nhale
Version: 4.6 Keywords: Reopened, Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Up to OCP 4.5, the default catalogs deployed and managed by the marketplace-operator in the openshift-marketplace namespace were created by OperatorSources, the API exposed by the marketplace-operator. Metrics and alerting were instrumented to indicate errors encountered by the OperatorSources. In OCP 4.6, OperatorSources were removed (after being deprecated for several previous releases) and the marketplace-operator created OLM's CatalogSource resources directly instead. However, the same metrics and alerting instrumentation was not added for the CatalogSources deployed in the openshift-marketplace namespace.
Consequence: Errors encountered by these default CatalogSources were not surfaced through Prometheus alerting.
Fix: A new metric, `catalogsource_ready`, was introduced in OLM (https://github.com/operator-framework/operator-lifecycle-manager/pull/2152). The marketplace-operator uses it to fire an alert whenever the metric for a default CatalogSource indicates that the CatalogSource is in an unready state.
Result: Prometheus alerts fire for unready default CatalogSources in the openshift-marketplace namespace (https://github.com/operator-framework/operator-marketplace/pull/402#issue-647984868).
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:51:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1961320    
Bug Blocks:    

Description Jatan Malde 2021-03-08 19:21:32 UTC
Description of problem:

I have a customer (IHAC) reporting that OperatorHub is not showing all the provider types.

Checking the catalog source pods, we see errors from the proxy while connecting to the sources listed in the catalogsource, but no alert is raised for this, and nothing indicates that the pods need to be restarted once the network issue is resolved.

Initially there were alerts for operatorsources, but it looks like those were deprecated in 4.6, and the alerts went with them.

Can we have them configured for catalogsources in 4.6 and later versions?

Version-Release number of selected component (if applicable):

4.6.16

How reproducible:


Steps to Reproduce:
1. Start 4.6 cluster
2. Kill any of the registry pods (community-operators, redhat-marketplace, etc.) in the openshift-marketplace namespace

Actual results:

No alerts fired

Expected results:

Alerts show up in the dashboard about the problematic default source.

Additional info:
There were Prometheus rules for firing alerts in these scenarios up until 4.5: https://github.com/operator-framework/operator-marketplace/blob/release-4.5/manifests/12_prometheus_rule.yaml

This is missing from the master branch right now: https://github.com/operator-framework/operator-marketplace/blob/master/manifests/12_prometheus_rule.yaml

Comment 2 Nick Hale 2021-03-08 21:17:19 UTC
Jatan,


> Checking the catalog sources pods, we do see errors from the proxy while connecting to the mentioned sources in the catalogsource but there is no alert with this reported mentioning the requirement of pod restart post the network issue getting resolved. 

In 4.6, the "marketplace" catalogs included in OpenShift were switched from being sourced from AppRegistry, to being sourced from a data store baked into the catalog images themselves. This means that the catalog pods don't make any external requests at runtime, so the issue doesn't seem to make much sense as described.

That being said, I'm going to close this out. If you feel there's been a misunderstanding, please re-open this BZ with reproduction steps, expected, and actual results on a 4.6+ cluster.

Thanks!

Comment 8 xzha 2021-05-27 09:38:56 UTC
verify:

zhaoxia@xzha-mac bug-1945548 % oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-27-030333   True        False         62m     Cluster version is 4.8.0-0.nightly-2021-05-27-030333

1. mark all worker nodes as unschedulable
zhaoxia@xzha-mac bug-1945548 % oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-dfls75b-f76d1-lp7sf-master-0         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-master-1         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-master-2         Ready    master   55m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-b-j2q5q   Ready    worker   47m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-c-pv2n7   Ready    worker   44m   v1.21.0-rc.0+bb94839
ci-ln-dfls75b-f76d1-lp7sf-worker-d-llvnf   Ready    worker   44m   v1.21.0-rc.0+bb94839  
zhaoxia@xzha-mac bug-1945548 % oc adm cordon ci-ln-dfls75b-f76d1-lp7sf-worker-b-j2q5q ci-ln-dfls75b-f76d1-lp7sf-worker-c-pv2n7 ci-ln-dfls75b-f76d1-lp7sf-worker-d-llvnf
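
The worker names passed to `oc adm cordon` above can also be pulled out of the `oc get nodes` table rather than copied by hand. This is a minimal sketch, assuming the default column layout shown above (NAME, STATUS, ROLES, AGE, VERSION); the function name `worker_nodes` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: read `oc get nodes` output on stdin and print the
# names of nodes whose ROLES column is exactly "worker".
# Assumption: default column layout NAME STATUS ROLES AGE VERSION.
# Nodes with combined roles (for example "master,worker") are not matched.
worker_nodes() {
  awk 'NR > 1 && $3 == "worker" { print $1 }'
}

# Possible usage against a live cluster (not executed here):
#   oc get nodes | worker_nodes | xargs oc adm cordon
```

Matching on the ROLES column rather than on the node name keeps the filter independent of the naming scheme, which differs between the two clusters in this report.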

2. delete pod certified-operators and community-operators
zhaoxia@xzha-mac bug-1945548 % oc delete pod certified-operators-88s99 
pod "certified-operators-88s99" deleted
zhaoxia@xzha-mac bug-1945548 % oc delete pod community-operators-wtzdp 
pod "community-operators-wtzdp" deleted

3. check pod status
zhaoxia@xzha-mac bug-1945548 % oc get pod
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-cplq7               0/1     Pending   0          26m
certified-operators-t84zp               0/1     Pending   0          29m
community-operators-nm2hp               0/1     Pending   0          21m
community-operators-rmwbs               0/1     Pending   0          29m
marketplace-operator-64586f9897-bcbwm   1/1     Running   0          88m
redhat-marketplace-dffn8                1/1     Running   0          86m
redhat-marketplace-qhmmc                0/1     Pending   0          21m
redhat-operators-qn8mw                  0/1     Pending   0          30m
redhat-operators-smmlq                  1/1     Running   0          32m
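
To see at a glance which catalogs have a pod stuck in Pending, the table above can be reduced to catalog names. This is a rough sketch, assuming the registry pod names follow the `<catalog-name>-<suffix>` pattern visible in this output; `pending_catalogs` is a hypothetical name.

```shell
# Hypothetical helper: read `oc get pod` output on stdin and print the
# distinct catalog names that have at least one pod in Pending status.
# Assumption: pod names look like <catalog-name>-<random-suffix>, so the
# last dash-separated segment is stripped to recover the catalog name.
pending_catalogs() {
  awk 'NR > 1 && $3 == "Pending" {
         name = $1
         sub(/-[^-]+$/, "", name)   # drop the trailing "-xxxxx" suffix
         print name
       }' | LC_ALL=C sort -u
}

# Possible usage against a live cluster (not executed here):
#   oc get pod -n openshift-marketplace | pending_catalogs
```

A catalog can still be served by an older Running pod while a replacement pod is Pending, as with redhat-marketplace and redhat-operators above, so a name in this list does not by itself mean the catalog is unavailable.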

4. check Metrics and alert
Metrics:
https://user-images.githubusercontent.com/77608951/119802706-5d900180-bf11-11eb-948e-c1e719c4c8b6.png
https://user-images.githubusercontent.com/77608951/119802877-857f6500-bf11-11eb-9cb5-2420e7e1cbc7.png

The alert rules CommunityOperatorsCatalogError and CertifiedOperatorsCatalogError exist,
but neither alert fired.

https://user-images.githubusercontent.com/77608951/119803283-e313b180-bf11-11eb-8af3-f8eb442fb437.png
https://user-images.githubusercontent.com/77608951/119803550-240bc600-bf12-11eb-9fc3-6363322a260a.png

verify failed.

Comment 10 xzha 2021-05-31 07:30:11 UTC
Verify

[root@preserve-olm-agent-test ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         10m     Cluster version is 4.8.0-0.nightly-2021-05-29-114625

1. mark all worker nodes as unschedulable
[root@preserve-olm-agent-test ~]# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-133-33.us-east-2.compute.internal    Ready    worker   33m   v1.21.0-rc.0+4b2b6ff
ip-10-0-148-42.us-east-2.compute.internal    Ready    master   36m   v1.21.0-rc.0+4b2b6ff
ip-10-0-172-186.us-east-2.compute.internal   Ready    worker   32m   v1.21.0-rc.0+4b2b6ff
ip-10-0-177-55.us-east-2.compute.internal    Ready    master   36m   v1.21.0-rc.0+4b2b6ff
ip-10-0-196-44.us-east-2.compute.internal    Ready    master   40m   v1.21.0-rc.0+4b2b6ff
ip-10-0-202-160.us-east-2.compute.internal   Ready    worker   33m   v1.21.0-rc.0+4b2b6ff
[root@preserve-olm-agent-test ~]# oc adm cordon ip-10-0-133-33.us-east-2.compute.internal ip-10-0-172-186.us-east-2.compute.internal ip-10-0-202-160.us-east-2.compute.internal
node/ip-10-0-133-33.us-east-2.compute.internal cordoned
node/ip-10-0-172-186.us-east-2.compute.internal cordoned
node/ip-10-0-202-160.us-east-2.compute.internal cordoned
[root@preserve-olm-agent-test ~]# oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-133-33.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   34m   v1.21.0-rc.0+4b2b6ff
ip-10-0-148-42.us-east-2.compute.internal    Ready                      master   37m   v1.21.0-rc.0+4b2b6ff
ip-10-0-172-186.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   33m   v1.21.0-rc.0+4b2b6ff
ip-10-0-177-55.us-east-2.compute.internal    Ready                      master   37m   v1.21.0-rc.0+4b2b6ff
ip-10-0-196-44.us-east-2.compute.internal    Ready                      master   41m   v1.21.0-rc.0+4b2b6ff
ip-10-0-202-160.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   33m   v1.21.0-rc.0+4b2b6ff

2. delete pod redhat-operators and community-operators
[root@preserve-olm-agent-test ~]# oc get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
a044a04a2dc6f2e2698bbf09dcbb0874c8f72e00cd06b114c381d0437bnmzjx   0/1     Completed   0          9m12s
cd41f97077889e6ec45c6e342d9e32c7296fefc73c482440a60cfa4e48jwhwh   0/1     Completed   0          9m12s
certified-operators-vg7hk                                         1/1     Running     0          11m
community-operators-6nc2q                                         1/1     Running     0          11m
marketplace-operator-745fbf4bfd-hc4b6                             1/1     Running     0          11m
qe-app-registry-rb5ps                                             1/1     Running     0          9m
redhat-marketplace-bj4n8                                          1/1     Running     0          11m
redhat-operators-fktt9                                            1/1     Running     0          11m
[root@preserve-olm-agent-test ~]# oc delete pod community-operators-6nc2q   
pod "community-operators-6nc2q" deleted
[root@preserve-olm-agent-test ~]# oc delete pod redhat-operators-fktt9  
pod "redhat-operators-fktt9" deleted

3. check Metrics and alert

https://user-images.githubusercontent.com/77608951/120154328-596e2780-c222-11eb-9474-b3078a4d6e51.png
https://user-images.githubusercontent.com/77608951/120156670-c8e51680-c224-11eb-8ef8-2f8bf6ba70e5.png
https://user-images.githubusercontent.com/77608951/120156800-eb772f80-c224-11eb-9cb5-89c1ffbeb5a5.png

The alerts CommunityOperatorsCatalogError, CertifiedOperatorsCatalogError, RedhatOperatorsCatalogError, and RedhatMarketplaceCatalogError fired.

LGTM, verified.
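
The metric check in step 3 can also be done on raw metric text instead of screenshots. This is a minimal sketch, assuming samples in Prometheus text exposition format without trailing timestamps, and assuming that a `catalogsource_ready` value of 0 means "unready". The `name` and `namespace` labels in the usage comment are an assumption, not taken from this report, and `unready_catalogs` is a hypothetical name.

```shell
# Hypothetical helper: read Prometheus text-format samples on stdin and
# print the series (metric name plus labels) of every catalogsource_ready
# sample whose value is 0.
# Assumptions: no timestamp follows the value, label values contain no
# spaces, and 0 means the CatalogSource is not ready.
unready_catalogs() {
  grep '^catalogsource_ready{' | awk '$NF == 0 { print $1 }'
}

# Illustrative input line (label names are assumed):
#   catalogsource_ready{name="community-operators",namespace="openshift-marketplace"} 0
```

Comment lines such as `# HELP` and `# TYPE` are skipped because the filter only keeps lines that start with the metric name followed by a label set.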

Comment 13 errata-xmlrpc 2021-07-27 22:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438