1992510 – all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Tyler Slaton
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-08-11 09:13 UTC by hongyan li
Modified:	2022-09-08 05:41 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-08 05:41:12 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift operator-framework-olm pull 213	0	None	Merged	Sync w/ upstream: operator-lifecycle-manager	2022-02-25 18:48:47 UTC
Red Hat Product Errata	RHSA-2022:6258	0	None	None	None	2022-09-08 05:41:32 UTC

Description hongyan li 2021-08-11 09:13:46 UTC

Description of problem:
All alert rules should have "summary" and "description" annotations that comply with the OpenShift alerting guidelines. The rules shown below only carry a "message" annotation.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
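1. Run: oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
2. Inspect the annotations of each alert rule in the output.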

Actual results:
$ oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:04Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: olm-alert-rules
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1794"
    uid: 775f9fba-10e5-4e5e-94ad-fa1dd8878e73
  spec:
    groups:
    - name: olm.csv_abnormal.rules
      rules:
      - alert: CsvAbnormalFailedOver2Min
        annotations:
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
            }}. Reason-{{ $labels.reason }}
        expr: csv_abnormal{phase=~"^Failed$"}
        for: 2m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
      - alert: CsvAbnormalOver30Min
        annotations:
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
            }}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
        expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
        for: 30m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
    - name: olm.installplan.rules
      rules:
      - alert: InstallPlanStepAppliedWithWarnings
        annotations:
          message: The API server returned a warning during installation or upgrade
            of an operator. An Event with reason "AppliedWithWarnings" has been created
            with complete details, including a reference to the InstallPlan step that
            generated the warning.
        expr: sum(increase(installplan_warnings_total[5m])) > 0
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



Expected results:
All alert rules have "summary" and "description" annotations.

Additional info:
The "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1].

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
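
A rough way to check the expected result automatically is sketched below. It assumes the rules are fetched as JSON (for example with `-o json` instead of `-oyaml`) and parsed into a dict. The function name `find_missing_annotations` and the trimmed `SAMPLE` data are illustrative only and are not part of OLM or of the guidelines. The sketch only checks that the two annotations are present and non-empty; it does not judge their wording.

```python
# Minimal sketch: report alert rules lacking "summary" or "description".
# Assumes the input has the shape of `oc get prometheusrules -o json` output.

REQUIRED_ANNOTATIONS = ("summary", "description")


def find_missing_annotations(rule_list):
    """Return {(namespace, alert): [missing annotation names]}."""
    missing = {}
    for item in rule_list.get("items", []):
        namespace = item.get("metadata", {}).get("namespace", "")
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                alert = rule.get("alert")
                if alert is None:
                    continue  # recording rules have no "alert" key
                annotations = rule.get("annotations") or {}
                absent = [k for k in REQUIRED_ANNOTATIONS if not annotations.get(k)]
                if absent:
                    missing[(namespace, alert)] = absent
    return missing


# Hypothetical sample, trimmed from the output in this report: the first rule
# has only "message", the second has all three annotations.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "openshift-operator-lifecycle-manager"},
            "spec": {
                "groups": [
                    {
                        "name": "olm.csv_abnormal.rules",
                        "rules": [
                            {
                                "alert": "CsvAbnormalFailedOver2Min",
                                "annotations": {"message": "Failed to install Operator"},
                            }
                        ],
                    },
                    {
                        "name": "olm.installplan.rules",
                        "rules": [
                            {
                                "alert": "InstallPlanStepAppliedWithWarnings",
                                "annotations": {
                                    "message": "The API server returned a warning",
                                    "summary": "API returned a warning when modifying an operator",
                                    "description": "Fires whenever the API server returns a warning",
                                },
                            }
                        ],
                    },
                ]
            },
        }
    ]
}

for (ns, alert), absent in sorted(find_missing_annotations(SAMPLE).items()):
    print(f"{ns}/{alert}: missing {', '.join(absent)}")
```

To run this against a live cluster, the JSON output of the `oc get prometheusrules` command could be loaded with `json.load` and passed to the function in place of `SAMPLE`.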

Comment 1 hongyan li 2021-08-11 09:27:32 UTC

The following alert rules have the same issue:
$ oc get prometheusrules -n openshift-marketplace -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:15:02Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: marketplace-alert-rules
    namespace: openshift-marketplace
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "9455"
    uid: 86c14ae3-a174-4139-ab8a-dc697c29c170
  spec:
    groups:
    - name: marketplace.community_operators.rules
      rules:
      - alert: CommunityOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "community-operators" is in Non-Ready
            state for more than 10 mins.
        expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"}
          == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.certified_operators.rules
      rules:
      - alert: CertifiedOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "certified-operators" is in Non-Ready
            state for more than 10 mins.
        expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"}
          == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_operators.rules
      rules:
      - alert: RedhatOperatorsCatalogError
        annotations:
          message: Default OperatorHub source "redhat-operators" is in Non-Ready state
            for more than 10 mins.
        expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"}
          == 0
        for: 10m
        labels:
          severity: warning
    - name: marketplace.redhat_marketplace.rules
      rules:
      - alert: RedhatMarketplaceCatalogError
        annotations:
          message: Default OperatorHub source "redhat-marketplace" is in Non-Ready
            state for more than 10 mins.
        expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"}
          == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 2 hongyan li 2021-08-11 10:13:15 UTC

The following rules have the same issue:
$ oc get prometheusrules -n openshift-cluster-samples-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:00Z"
    generation: 1
    labels:
      name: samples-operator-alerts
    name: samples-operator-alerts
    namespace: openshift-cluster-samples-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1695"
    uid: 809630a2-08c8-483b-830e-06698fff640e
  spec:
    groups:
    - name: SamplesOperator
      rules:
      - alert: SamplesRetriesMissingOnImagestreamImportFailing
        annotations:
          message: |
            Samples operator is detecting problems with imagestream image imports, and the periodic retries of those
            imports are not occurring.  Contact support.  You can look at the "openshift-samples" ClusterOperator object
            for details. Most likely there are issues with the external image registry hosting the images that need to
            be investigated.  The list of ImageStreams that have failing imports are:
            {{ range query "openshift_samples_failed_imagestream_import_info > 0" }}
              {{ .Labels.name }}
            {{ end }}
            However, the list of ImageStreams for which samples operator is retrying imports is:
            retrying imports:
            {{ range query "openshift_samples_retry_imagestream_import_total > 0" }}
               {{ .Labels.imagestreamname }}
            {{ end }}
        expr: sum(openshift_samples_failed_imagestream_import_info) > sum(openshift_samples_retry_imagestream_import_total)
          - sum(openshift_samples_retry_imagestream_import_total offset 30m)
        for: 2h
        labels:
          severity: warning
      - alert: SamplesImagestreamImportFailing
        annotations:
          message: |
            Samples operator is detecting problems with imagestream image imports.  You can look at the "openshift-samples"
            ClusterOperator object for details. Most likely there are issues with the external image registry hosting
            the images that needs to be investigated.  Or you can consider marking samples operator Removed if you do not
            care about having sample imagestreams available.  The list of ImageStreams for which samples operator is
            retrying imports:
            {{ range query "openshift_samples_retry_imagestream_import_total > 0" }}
               {{ .Labels.imagestreamname }}
            {{ end }}
        expr: sum(openshift_samples_retry_imagestream_import_total) - sum(openshift_samples_retry_imagestream_import_total
          offset 30m) > sum(openshift_samples_failed_imagestream_import_info)
        for: 2h
        labels:
          severity: warning
      - alert: SamplesDegraded
        annotations:
          message: |
            Samples could not be deployed and the operator is degraded. Review the "openshift-samples" ClusterOperator object for further details.
        expr: openshift_samples_degraded_info == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesInvalidConfig
        annotations:
          message: |
            Samples operator has been given an invalid configuration.
        expr: openshift_samples_invalidconfig_info == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesMissingSecret
        annotations:
          message: |
            Samples operator cannot find the samples pull secret in the openshift namespace.
        expr: openshift_samples_invalidsecret_info{reason="missing_secret"} == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesMissingTBRCredential
        annotations:
          message: |
            The samples operator cannot find credentials for 'registry.redhat.io'. Many of the sample ImageStreams will fail to import unless the 'samplesRegistry' in the operator configuration is changed.
        expr: openshift_samples_invalidsecret_info{reason="missing_tbr_credential"}
          == 1
        for: 2h
        labels:
          severity: warning
      - alert: SamplesTBRInaccessibleOnBoot
        annotations:
          message: |
            Samples operator could not access 'registry.redhat.io' during its initial installation and it bootstrapped as removed.
            If this is expected, and stems from installing in a restricted network environment, please note that if you
            plan on mirroring images associated with sample imagestreams into a registry available in your restricted
            network environment, and subsequently moving samples operator back to 'Managed' state, a list of the images
            associated with each image stream tag from the samples catalog is
            provided in the 'imagestreamtag-to-image' config map in the 'openshift-cluster-samples-operator' namespace to
            assist the mirroring process.
        expr: openshift_samples_tbr_inaccessible_info == 1
        for: 2d
        labels:
          severity: info
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 3 Kevin Rizza 2021-08-20 18:29:52 UTC

That third message, about the ClusterSamplesOperator, should be broken out into a separate bz and is unrelated to the OLM component.

Comment 4 Tyler Slaton 2022-02-25 18:48:47 UTC

The fix for this was brought downstream via a sync PR:

https://github.com/openshift/operator-framework-olm/pull/213

Comment 5 Bruno Andrade 2022-04-05 17:34:22 UTC

1. Install an OCP cluster that includes the fix PR.

oc get clusterversion          
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-01-172551   True        False         108m    Cluster version is 4.11.0-0.nightly-2022-04-01-172551

oc exec olm-operator-5c95d8557c-s2f2l  -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 55503f8eaa40ec8aa02d0b1310609c7551988554


2. Check Alert Rules (all have summary and description fields)

oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml                                                                               
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
    creationTimestamp: "2022-04-05T14:56:42Z"
    generation: 1
    labels:
      prometheus: alert-rules
      role: alert-rules
    name: olm-alert-rules
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
    resourceVersion: "1746"
    uid: 05e00325-8018-487a-8915-1b8d2cd77ecb
  spec:
    groups:
    - name: olm.csv_abnormal.rules
      rules:
      - alert: CsvAbnormalFailedOver2Min
        annotations:
          description: Fires whenever a CSV has been in the failed phase for more
            than 2 minutes.
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
            }}. Reason-{{ $labels.reason }}
          summary: CSV failed for over 2 minutes
        expr: csv_abnormal{phase=~"^Failed$"}
        for: 2m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
      - alert: CsvAbnormalOver30Min
        annotations:
          description: Fires whenever a CSV is in the Replacing, Pending, Deleting,
            or Unkown phase for more than 30 minutes.
          message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
            }}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
          summary: CSV abnormal for over 30 minutes
        expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
        for: 30m
        labels:
          namespace: '{{ $labels.namespace }}'
          severity: warning
    - name: olm.installplan.rules
      rules:
      - alert: InstallPlanStepAppliedWithWarnings
        annotations:
          description: Fires whenever the API server returns a warning when attempting
            to modify an operator.
          message: The API server returned a warning during installation or upgrade
            of an operator. An Event with reason "AppliedWithWarnings" has been created
            with complete details, including a reference to the InstallPlan step that
            generated the warning.
          summary: API returned a warning when modifying an operator
        expr: sum(increase(installplan_warnings_total[5m])) > 0
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

  oc get prometheusrules -n openshift-marketplace -oyaml
 apiVersion: v1
 items:
 - apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     annotations:
       capability.openshift.io/name: marketplace
       include.release.openshift.io/ibm-cloud-managed: "true"
       include.release.openshift.io/self-managed-high-availability: "true"
       include.release.openshift.io/single-node-developer: "true"
     creationTimestamp: "2022-04-05T15:02:23Z"
     generation: 1
     labels:
       prometheus: alert-rules
       role: alert-rules
     name: marketplace-alert-rules
     namespace: openshift-marketplace
     ownerReferences:
     - apiVersion: config.openshift.io/v1
       kind: ClusterVersion
       name: version
       uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
     resourceVersion: "10198"
     uid: 665cc52e-d664-4e2c-99d4-43a8ce7df05a
   spec:
     groups:
     - name: marketplace.community_operators.rules
       rules:
       - alert: CommunityOperatorsCatalogError
         annotations:
           description: Fires whenever the community-operators source is not ready
             for more than 10 mins.
           message: Default OperatorHub source "community-operators" is in Non-Ready
             state for more than 10 mins.
           summary: Community-operators not ready for 10 minutes
         expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"}
           == 0
         for: 10m
         labels:
           severity: warning
     - name: marketplace.certified_operators.rules
       rules:
       - alert: CertifiedOperatorsCatalogError
         annotations:
           description: Fires whenever the certified-operators source is not ready
             for more than 10 mins.
           message: Default OperatorHub source "certified-operators" is in Non-Ready
             state for more than 10 mins.
           summary: Certified-operators not ready for more than 10 minutes
         expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"}
           == 0
         for: 10m
         labels:
           severity: warning
     - name: marketplace.redhat_operators.rules
       rules:
       - alert: RedhatOperatorsCatalogError
         annotations:
           description: Fires whenever the redhat-operators source is not ready for
             more than 10 mins.
           message: Default OperatorHub source "redhat-operators" is in Non-Ready state
             for more than 10 mins.
           summary: Redhat-operators not ready for more than 10 minutes
         expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"}
           == 0
         for: 10m
         labels:
           severity: warning
     - name: marketplace.redhat_marketplace.rules
       rules:
       - alert: RedhatMarketplaceCatalogError
         annotations:
           description: Fires whenever the redhat-marketplace source is not ready for
             more than 10 mins.
           message: Default OperatorHub source "redhat-marketplace" is in Non-Ready
             state for more than 10 mins.
           summary: Redhat-marketplace not ready for more than 10 minutes
         expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"}
           == 0
         for: 10m
         labels:
           severity: warning
 kind: List
 metadata:
   resourceVersion: ""
   selfLink: ""


Marked as VERIFIED

Comment 10 errata-xmlrpc 2022-09-08 05:41:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.10.31 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6258
