Bug 1992510
| Summary: | all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | hongyan li <hongyli> |
| Component: | OLM | Assignee: | Tyler Slaton <tyslaton> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | bandrade, pegoncal, tyslaton, vdinh, vlaad |
| Version: | 4.9 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-08 05:41:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The following alert rules have the same issue:
$ oc get prometheusrules -n openshift-marketplace -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:15:02Z"
generation: 1
labels:
prometheus: alert-rules
role: alert-rules
name: marketplace-alert-rules
namespace: openshift-marketplace
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "9455"
uid: 86c14ae3-a174-4139-ab8a-dc697c29c170
spec:
groups:
- name: marketplace.community_operators.rules
rules:
- alert: CommunityOperatorsCatalogError
annotations:
message: Default OperatorHub source "community-operators" is in Non-Ready
state for more than 10 mins.
expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.certified_operators.rules
rules:
- alert: CertifiedOperatorsCatalogError
annotations:
message: Default OperatorHub source "certified-operators" is in Non-Ready
state for more than 10 mins.
expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.redhat_operators.rules
rules:
- alert: RedhatOperatorsCatalogError
annotations:
message: Default OperatorHub source "redhat-operators" is in Non-Ready state
for more than 10 mins.
expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.redhat_marketplace.rules
rules:
- alert: RedhatMarketplaceCatalogError
annotations:
message: Default OperatorHub source "redhat-marketplace" is in Non-Ready
state for more than 10 mins.
expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
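To make the gap in the output above easier to see, here is a minimal sketch that lists which annotation keys each alert defines. It assumes the rules were fetched as JSON (for example with `oc get prometheusrules -n openshift-marketplace -o json`) and parsed into a dictionary. The function name `annotation_keys` and the trimmed `SAMPLE` data are illustrative only and are not part of any OpenShift tooling.

```python
def annotation_keys(rule_list):
    """Map each alert name to the sorted annotation keys it defines."""
    inventory = {}
    for item in rule_list.get("items", []):
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" in rule:  # recording rules have no "alert" key
                    inventory[rule["alert"]] = sorted(rule.get("annotations") or {})
    return inventory


# Hypothetical, trimmed-down stand-in for the parsed JSON form of the output above.
SAMPLE = {
    "items": [{
        "spec": {"groups": [
            {"name": "marketplace.community_operators.rules", "rules": [{
                "alert": "CommunityOperatorsCatalogError",
                "annotations": {"message": "Default OperatorHub source ..."},
                "for": "10m",
                "labels": {"severity": "warning"},
            }]},
            {"name": "marketplace.certified_operators.rules", "rules": [{
                "alert": "CertifiedOperatorsCatalogError",
                "annotations": {"message": "Default OperatorHub source ..."},
                "for": "10m",
                "labels": {"severity": "warning"},
            }]},
        ]},
    }],
}

for alert, keys in annotation_keys(SAMPLE).items():
    print(alert, keys)
```

In this sample every alert carries only `message`, which matches what the report describes: neither `summary` nor `description` is present.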
The following rules have the same issue:
$ oc get prometheusrules -n openshift-cluster-samples-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:12:00Z"
generation: 1
labels:
name: samples-operator-alerts
name: samples-operator-alerts
namespace: openshift-cluster-samples-operator
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "1695"
uid: 809630a2-08c8-483b-830e-06698fff640e
spec:
groups:
- name: SamplesOperator
rules:
- alert: SamplesRetriesMissingOnImagestreamImportFailing
annotations:
message: |
Samples operator is detecting problems with imagestream image imports, and the periodic retries of those
imports are not occurring. Contact support. You can look at the "openshift-samples" ClusterOperator object
for details. Most likely there are issues with the external image registry hosting the images that need to
be investigated. The list of ImageStreams that have failing imports are:
{{ range query "openshift_samples_failed_imagestream_import_info > 0" }}
{{ .Labels.name }}
{{ end }}
However, the list of ImageStreams for which samples operator is retrying imports is:
retrying imports:
{{ range query "openshift_samples_retry_imagestream_import_total > 0" }}
{{ .Labels.imagestreamname }}
{{ end }}
expr: sum(openshift_samples_failed_imagestream_import_info) > sum(openshift_samples_retry_imagestream_import_total)
- sum(openshift_samples_retry_imagestream_import_total offset 30m)
for: 2h
labels:
severity: warning
- alert: SamplesImagestreamImportFailing
annotations:
message: |
Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples"
ClusterOperator object for details. Most likely there are issues with the external image registry hosting
the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not
care about having sample imagestreams available. The list of ImageStreams for which samples operator is
retrying imports:
{{ range query "openshift_samples_retry_imagestream_import_total > 0" }}
{{ .Labels.imagestreamname }}
{{ end }}
expr: sum(openshift_samples_retry_imagestream_import_total) - sum(openshift_samples_retry_imagestream_import_total
offset 30m) > sum(openshift_samples_failed_imagestream_import_info)
for: 2h
labels:
severity: warning
- alert: SamplesDegraded
annotations:
message: |
Samples could not be deployed and the operator is degraded. Review the "openshift-samples" ClusterOperator object for further details.
expr: openshift_samples_degraded_info == 1
for: 2h
labels:
severity: warning
- alert: SamplesInvalidConfig
annotations:
message: |
Samples operator has been given an invalid configuration.
expr: openshift_samples_invalidconfig_info == 1
for: 2h
labels:
severity: warning
- alert: SamplesMissingSecret
annotations:
message: |
Samples operator cannot find the samples pull secret in the openshift namespace.
expr: openshift_samples_invalidsecret_info{reason="missing_secret"} == 1
for: 2h
labels:
severity: warning
- alert: SamplesMissingTBRCredential
annotations:
message: |
The samples operator cannot find credentials for 'registry.redhat.io'. Many of the sample ImageStreams will fail to import unless the 'samplesRegistry' in the operator configuration is changed.
expr: openshift_samples_invalidsecret_info{reason="missing_tbr_credential"}
== 1
for: 2h
labels:
severity: warning
- alert: SamplesTBRInaccessibleOnBoot
annotations:
message: |
Samples operator could not access 'registry.redhat.io' during its initial installation and it bootstrapped as removed.
If this is expected, and stems from installing in a restricted network environment, please note that if you
plan on mirroring images associated with sample imagestreams into a registry available in your restricted
network environment, and subsequently moving samples operator back to 'Managed' state, a list of the images
associated with each image stream tag from the samples catalog is
provided in the 'imagestreamtag-to-image' config map in the 'openshift-cluster-samples-operator' namespace to
assist the mirroring process.
expr: openshift_samples_tbr_inaccessible_info == 1
for: 2d
labels:
severity: info
kind: List
metadata:
resourceVersion: ""
selfLink: ""
That third message, about the ClusterSamplesOperator, should be broken out into a separate bz and is unrelated to the OLM component. The fix for this got brought downstream via a sync PR: https://github.com/openshift/operator-framework-olm/pull/213
1. Install an OCP cluster that includes the fix PR.
oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-01-172551 True False 108m Cluster version is 4.11.0-0.nightly-2022-04-01-172551
oc exec olm-operator-5c95d8557c-s2f2l -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 55503f8eaa40ec8aa02d0b1310609c7551988554
2. Check Alert Rules (all have summary and description fields)
oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
creationTimestamp: "2022-04-05T14:56:42Z"
generation: 1
labels:
prometheus: alert-rules
role: alert-rules
name: olm-alert-rules
namespace: openshift-operator-lifecycle-manager
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
resourceVersion: "1746"
uid: 05e00325-8018-487a-8915-1b8d2cd77ecb
spec:
groups:
- name: olm.csv_abnormal.rules
rules:
- alert: CsvAbnormalFailedOver2Min
annotations:
description: Fires whenever a CSV has been in the failed phase for more
than 2 minutes.
message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
}}. Reason-{{ $labels.reason }}
summary: CSV failed for over 2 minutes
expr: csv_abnormal{phase=~"^Failed$"}
for: 2m
labels:
namespace: '{{ $labels.namespace }}'
severity: warning
- alert: CsvAbnormalOver30Min
annotations:
description: Fires whenever a CSV is in the Replacing, Pending, Deleting,
or Unkown phase for more than 30 minutes.
message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
}}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
summary: CSV abnormal for over 30 minutes
expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
for: 30m
labels:
namespace: '{{ $labels.namespace }}'
severity: warning
- name: olm.installplan.rules
rules:
- alert: InstallPlanStepAppliedWithWarnings
annotations:
description: Fires whenever the API server returns a warning when attempting
to modify an operator.
message: The API server returned a warning during installation or upgrade
of an operator. An Event with reason "AppliedWithWarnings" has been created
with complete details, including a reference to the InstallPlan step that
generated the warning.
summary: API returned a warning when modifying an operator
expr: sum(increase(installplan_warnings_total[5m])) > 0
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
oc get prometheusrules -n openshift-marketplace -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
capability.openshift.io/name: marketplace
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2022-04-05T15:02:23Z"
generation: 1
labels:
prometheus: alert-rules
role: alert-rules
name: marketplace-alert-rules
namespace: openshift-marketplace
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 27981357-7f25-4ddc-8f03-e3d9e51df244
resourceVersion: "10198"
uid: 665cc52e-d664-4e2c-99d4-43a8ce7df05a
spec:
groups:
- name: marketplace.community_operators.rules
rules:
- alert: CommunityOperatorsCatalogError
annotations:
description: Fires whenever the community-operators source is not ready
for more than 10 mins.
message: Default OperatorHub source "community-operators" is in Non-Ready
state for more than 10 mins.
summary: Community-operators not ready for 10 minutes
expr: catalogsource_ready{name="community-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.certified_operators.rules
rules:
- alert: CertifiedOperatorsCatalogError
annotations:
description: Fires whenever the certified-operators source is not ready
for more than 10 mins.
message: Default OperatorHub source "certified-operators" is in Non-Ready
state for more than 10 mins.
summary: Certified-operators not ready for more than 10 minutes
expr: catalogsource_ready{name="certified-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.redhat_operators.rules
rules:
- alert: RedhatOperatorsCatalogError
annotations:
description: Fires whenever the redhat-operators source is not ready for
more than 10 mins.
message: Default OperatorHub source "redhat-operators" is in Non-Ready state
for more than 10 mins.
summary: Redhat-operators not ready for more than 10 minutes
expr: catalogsource_ready{name="redhat-operators",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
- name: marketplace.redhat_marketplace.rules
rules:
- alert: RedhatMarketplaceCatalogError
annotations:
description: Fires whenever the redhat-marketplace source is not ready for
more than 10 mins.
message: Default OperatorHub source "redhat-marketplace" is in Non-Ready
state for more than 10 mins.
summary: Redhat-marketplace not ready for more than 10 minutes
expr: catalogsource_ready{name="redhat-marketplace",exported_namespace="openshift-marketplace"}
== 0
for: 10m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
Marked as VERIFIED
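The manual check in step 2 could also be scripted. The following is a rough sketch, assuming `oc` is on the PATH and logged in to the cluster; the helper names (`fetch_rules`, `alerts`, `verify`) are hypothetical, and the check only confirms that both annotations are present and non-empty, not that their wording follows the guidelines.

```python
import json
import subprocess

# Namespaces checked during verification in this bug.
NAMESPACES = ("openshift-operator-lifecycle-manager", "openshift-marketplace")
REQUIRED = ("summary", "description")


def fetch_rules(namespace):
    """Return the parsed output of `oc get prometheusrules -n <namespace> -o json`."""
    result = subprocess.run(
        ["oc", "get", "prometheusrules", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def alerts(rule_list):
    """Yield (alert name, annotations) for every alerting rule in the list."""
    for item in rule_list.get("items", []):
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" in rule:
                    yield rule["alert"], rule.get("annotations") or {}


def verify(namespaces=NAMESPACES, fetch=fetch_rules):
    """Return one line per missing annotation; an empty list means the check passed."""
    failures = []
    for namespace in namespaces:
        for name, annotations in alerts(fetch(namespace)):
            for key in REQUIRED:
                if not annotations.get(key):
                    failures.append(f"{namespace}/{name}: missing {key}")
    return failures


# Usage against a live cluster (not executed here):
#   problems = verify()
#   print("\n".join(problems) or "all alerts carry summary and description")
```

The `fetch` parameter is injectable so the logic can be exercised against saved JSON without cluster access.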
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.10.31 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6258
Description of problem:
all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-operator-lifecycle-manager -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:12:04Z"
generation: 1
labels:
prometheus: alert-rules
role: alert-rules
name: olm-alert-rules
namespace: openshift-operator-lifecycle-manager
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "1794"
uid: 775f9fba-10e5-4e5e-94ad-fa1dd8878e73
spec:
groups:
- name: olm.csv_abnormal.rules
rules:
- alert: CsvAbnormalFailedOver2Min
annotations:
message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}. Reason-{{ $labels.reason }}
expr: csv_abnormal{phase=~"^Failed$"}
for: 2m
labels:
namespace: '{{ $labels.namespace }}'
severity: warning
- alert: CsvAbnormalOver30Min
annotations:
message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
for: 30m
labels:
namespace: '{{ $labels.namespace }}'
severity: warning
- name: olm.installplan.rules
rules:
- alert: InstallPlanStepAppliedWithWarnings
annotations:
message: The API server returned a warning during installation or upgrade of an operator. An Event with reason "AppliedWithWarnings" has been created with complete details, including a reference to the InstallPlan step that generated the warning.
expr: sum(increase(installplan_warnings_total[5m])) > 0
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""

Expected results:
alert rules have annotations "summary" and "description"

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required