Description of problem:
All alert rules should carry "summary" and "description" annotations that comply with the OpenShift alerting guidelines.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:04Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: olm-alert-rules
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1794"
    uid: 775f9fba-10e5-4e5e-94ad-fa1dd8878e73
  spec:
    groups:
    - name: olm.csv_abnormal.rules
      rules:
      - alert: CsvAbnormalFailedOver2Min
        annotations:
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}.
            Reason-{{ $labels.reason }}
        expr: csv_abnormal{phase=~"^Failed$"}
        for: 2m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
      - alert: CsvAbnormalOver30Min
        annotations:
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}.
            Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
        expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
        for: 30m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
    - name: olm.installplan.rules
      rules:
      - alert: InstallPlanStepAppliedWithWarnings
        annotations:
          message: The API server returned a warning during installation or upgrade of an operator.
            An Event with reason "AppliedWithWarnings" has been created with complete details,
            including a reference to the InstallPlan step that generated the warning.
        expr: sum(increase(installplan_warnings_total[5m])) > 0
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
All alert rules have "summary" and "description" annotations.

Additional info:
The "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1].

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
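The check described above can be automated. The following is a minimal sketch, not part of the original report: it assumes the rules were exported as JSON (for example with `oc get prometheusrules -n <namespace> -o json`) and lists every alert that lacks a "summary" or "description" annotation. The names `missing_annotations`, `REQUIRED`, and `SAMPLE` are hypothetical, and the sample data is a trimmed-down illustration rather than real cluster output.

```python
import json
from typing import List, Tuple

# Annotation keys this report expects on every alerting rule.
REQUIRED = ("summary", "description")


def missing_annotations(rules_json: str) -> List[Tuple[str, str, List[str]]]:
    """Return (PrometheusRule name, alert name, missing keys) for each offending alert."""
    doc = json.loads(rules_json)
    # `oc get ... -o json` returns a List object; a single PrometheusRule is accepted too.
    items = doc.get("items", [doc])
    findings = []
    for item in items:
        rule_name = item.get("metadata", {}).get("name", "<unnamed>")
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                alert = rule.get("alert")
                if alert is None:
                    # Recording rules have no "alert" field; skip them.
                    continue
                annotations = rule.get("annotations") or {}
                missing = [key for key in REQUIRED if not annotations.get(key)]
                if missing:
                    findings.append((rule_name, alert, missing))
    return findings


# Illustrative input only: one alert with just "message", one with both annotations.
SAMPLE = json.dumps({
    "kind": "List",
    "items": [{
        "kind": "PrometheusRule",
        "metadata": {"name": "olm-alert-rules"},
        "spec": {"groups": [{
            "name": "olm.csv_abnormal.rules",
            "rules": [
                {"alert": "CsvAbnormalFailedOver2Min",
                 "annotations": {"message": "Failed to install Operator"}},
                {"alert": "CsvAbnormalOver30Min",
                 "annotations": {"summary": "CSV abnormal for over 30 minutes",
                                 "description": "Fires whenever a CSV is abnormal."}},
            ],
        }]},
    }],
})

if __name__ == "__main__":
    for rule_name, alert, missing in missing_annotations(SAMPLE):
        print(f"{rule_name}/{alert}: missing {', '.join(missing)}")
```

To run it against a live cluster, one could replace `SAMPLE` with the text read from standard input and pipe the `oc` output into the script. The sketch only checks that the two annotations are present and non-empty; it does not judge whether their wording follows the guidelines.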
The following alert rules also have the issue:

$ oc get prometheusrules -n openshift-marketplace -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:15:02Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: marketplace-alert-rules
    namespace: openshift-marketplace
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "9455"
    uid: 86c14ae3-a174-4139-ab8a-dc697c29c170
  spec:
    groups:
    - name: marketplace.community_operators.rules
      rules:
      - alert: CommunityOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "community-operators" is in Non-Ready state
            for more than 10 mins.
        expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.certified_operators.rules
      rules:
      - alert: CertifiedOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "certified-operators" is in Non-Ready state
            for more than 10 mins.
        expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_operators.rules
      rules:
      - alert: RedhatOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "redhat-operators" is in Non-Ready state
            for more than 10 mins.
        expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_marketplace.rules
      rules:
      - alert: RedhatMarketplaceCatalogError
        annotations:
          message: Default OperatorHub source "redhat-marketplace" is in Non-Ready state
            for more than 10 mins.
        expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
The following rules also have the issue:

$ oc get prometheusrules -n openshift-cluster-samples-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:00Z"
    generation: 1
    labels:
      name: samples-operator-alerts
    name: samples-operator-alerts
    namespace: openshift-cluster-samples-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1695"
    uid: 809630a2-08c8-483b-830e-06698fff640e
  spec:
    groups:
    - name: SamplesOperator
      rules:
      - alert: SamplesRetriesMissingOnImagestreamImportFailing
        annotations:
          message: |
            Samples operator is detecting problems with imagestream image imports, and the periodic retries of those imports are not occurring.
            Contact support.
            You can look at the "openshift-samples" ClusterOperator object for details.
            Most likely there are issues with the external image registry hosting the images that need to be investigated.
            The list of ImageStreams that have failing imports are: {{ range query "openshift_samples_failed_imagestream_import_info > 0" }} {{ .Labels.name }} {{ end }}
            However, the list of ImageStreams for which samples operator is retrying imports is:
            retrying imports: {{ range query "openshift_samples_retry_imagestream_import_total > 0" }} {{ .Labels.imagestreamname }} {{ end }}
        expr: sum(openshift_samples_failed_imagestream_import_info) > sum(openshift_samples_retry_imagestream_import_total) - sum(openshift_samples_retry_imagestream_import_total offset 30m)
        for: 2h
        labels:
          severity: warning
      - alert: SamplesImagestreamImportFailing
        annotations:
          message: |
            Samples operator is detecting problems with imagestream image imports.
            You can look at the "openshift-samples" ClusterOperator object for details.
            Most likely there are issues with the external image registry hosting the images that needs to be investigated.
            Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available.
            The list of ImageStreams for which samples operator is retrying imports: {{ range query "openshift_samples_retry_imagestream_import_total > 0" }} {{ .Labels.imagestreamname }} {{ end }}
        expr: sum(openshift_samples_retry_imagestream_import_total) - sum(openshift_samples_retry_imagestream_import_total offset 30m) > sum(openshift_samples_failed_imagestream_import_info)
        for: 2h
        labels:
          severity: warning
      - alert: SamplesDegraded
        annotations:
          message: |
            Samples could not be deployed and the operator is degraded.
            Review the "openshift-samples" ClusterOperator object for further details.
        expr: openshift_samples_degraded_info == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesInvalidConfig
        annotations:
          message: |
            Samples operator has been given an invalid configuration.
        expr: openshift_samples_invalidconfig_info == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesMissingSecret
        annotations:
          message: |
            Samples operator cannot find the samples pull secret in the openshift namespace.
        expr: openshift_samples_invalidsecret_info{reason="missing_secret"} == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesMissingTBRCredential
        annotations:
          message: |
            The samples operator cannot find credentials for 'registry.redhat.io'.
            Many of the sample ImageStreams will fail to import unless the 'samplesRegistry' in the operator configuration is changed.
        expr: openshift_samples_invalidsecret_info{reason="missing_tbr_credential"} == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesTBRInaccessibleOnBoot
        annotations:
          message: |
            Samples operator could not access 'registry.redhat.io' during its initial installation and it bootstrapped as removed.
            If this is expected, and stems from installing in a restricted network environment, please note that if you plan on mirroring images associated with sample imagestreams into a registry available in your restricted network environment, and subsequently moving samples operator back to 'Managed' state, a list of the images associated with each image stream tag from the samples catalog is provided in the 'imagestreamtag-to-image' config map in the 'openshift-cluster-samples-operator' namespace to assist the mirroring process.
        expr: openshift_samples_tbr_inaccessible_info == 1
        for: 2d
        labels:
          severity: info
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
The third comment, about the cluster samples operator, is unrelated to the OLM component and should be split out into a separate BZ.
The fix for this got brought downstream via a sync PR: https://github.com/openshift/operator-framework-olm/pull/213
1. Install an OCP cluster that includes the fix PR.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-01-172551   True        False         108m    Cluster version is 4.11.0-0.nightly-2022-04-01-172551

$ oc exec olm-operator-5c95d8557c-s2f2l -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 55503f8eaa40ec8aa02d0b1310609c7551988554

2. Check the alert rules (all have summary and description fields).

$ oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
    creationTimestamp: "2022-04-05T14:56:42Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: olm-alert-rules
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
    resourceVersion: "1746"
    uid: 05e00325-8018-487a-8915-1b8d2cd77ecb
  spec:
    groups:
    - name: olm.csv_abnormal.rules
      rules:
      - alert: CsvAbnormalFailedOver2Min
        annotations:
          description: Fires whenever a CSV has been in the failed phase for more than 2 minutes.
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}.
            Reason-{{ $labels.reason }}
          summary: CSV failed for over 2 minutes
        expr: csv_abnormal{phase=~"^Failed$"}
        for: 2m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
      - alert: CsvAbnormalOver30Min
        annotations:
          description: Fires whenever a CSV is in the Replacing, Pending, Deleting, or Unkown
            phase for more than 30 minutes.
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}.
            Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
          summary: CSV abnormal for over 30 minutes
        expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
        for: 30m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
    - name: olm.installplan.rules
      rules:
      - alert: InstallPlanStepAppliedWithWarnings
        annotations:
          description: Fires whenever the API server returns a warning when attempting to
            modify an operator.
          message: The API server returned a warning during installation or upgrade of an operator.
            An Event with reason "AppliedWithWarnings" has been created with complete details,
            including a reference to the InstallPlan step that generated the warning.
          summary: API returned a warning when modifying an operator
        expr: sum(increase(installplan_warnings_total[5m])) > 0
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc get prometheusrules -n openshift-marketplace -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      capability.openshift.io/name: marketplace
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2022-04-05T15:02:23Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: marketplace-alert-rules
    namespace: openshift-marketplace
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
    resourceVersion: "10198"
    uid: 665cc52e-d664-4e2c-99d4-43a8ce7df05a
  spec:
    groups:
    - name: marketplace.community_operators.rules
      rules:
      - alert: CommunityOperatorsCatalogError
        annotations:
          description: Fires whenever the community-operators source is not ready for more
            than 10 mins.
          message: Default OperatorHub source "community-operators" is in Non-Ready state
            for more than 10 mins.
          summary: Community-operators not ready for 10 minutes
        expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.certified_operators.rules
      rules:
      - alert: CertifiedOperatorsCatalogError
        annotations:
          description: Fires whenever the certified-operators source is not ready for more
            than 10 mins.
          message: Default OperatorHub source "certified-operators" is in Non-Ready state
            for more than 10 mins.
          summary: Certified-operators not ready for more than 10 minutes
        expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_operators.rules
      rules:
      - alert: RedhatOperatorsCatalogError
        annotations:
          description: Fires whenever the redhat-operators source is not ready for more than
            10 mins.
          message: Default OperatorHub source "redhat-operators" is in Non-Ready state
            for more than 10 mins.
          summary: Redhat-operators not ready for more than 10 minutes
        expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_marketplace.rules
      rules:
      - alert: RedhatMarketplaceCatalogError
        annotations:
          description: Fires whenever the redhat-marketplace source is not ready for more
            than 10 mins.
          message: Default OperatorHub source "redhat-marketplace" is in Non-Ready state
            for more than 10 mins.
          summary: Redhat-marketplace not ready for more than 10 minutes
        expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Marked as VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.10.31 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6258