Bug 1916624 - CSV alerts inaccurate
Summary: CSV alerts inaccurate
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-15 09:57 UTC by Rick Rackow
Modified: 2022-11-19 05:30 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-03 22:47:22 UTC
Target Upstream Version:
Embargoed:
ankithom: needinfo-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-lifecycle-manager pull 1653 | 0 | None | open | Bug 1916624: Add CsvAbnornalReplacing alerts | 2021-02-15 15:14:23 UTC

Description Rick Rackow 2021-01-15 09:57:42 UTC
Description of problem:

Alerts regarding CSV failures are not very accurate and are not targeted to core namespaces.

Comment 1 Kevin Rizza 2021-02-03 22:47:22 UTC
This bug report is quite vague and doesn't really have any actionable information in it. I'm closing it as INSUFFICIENT_DATA. Feel free to reopen with a more explicit description or explanation of what the specific defect is or what is being asked for.

Comment 2 Rick Rackow 2021-02-08 16:11:42 UTC
CsvAbnormalReplacingOver30Min and CsvAbnormalReplacingOver4Hr should be added in order to get better insights into potentially bad behavior during CSV replacement.

Those alerts should additionally carry the namespace they originate from, so that they can be routed easily via Alertmanager.
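
For illustration only, the request above could look roughly like the following rule group. This is a hypothetical sketch, not the content of the linked pull request: the `csv_abnormal` metric and its `phase`, `name`, `version`, `reason` and `namespace` labels are taken from the rules shown later in this bug, while the message wording and the `critical` severity on the 4-hour alert are assumptions.

```yaml
# Hypothetical sketch of the two requested alerts; not the merged rules.
groups:
- name: olm.csv_abnormal.rules
  rules:
  - alert: CsvAbnormalReplacingOver30Min
    annotations:
      message: Operator {{ $labels.name }} version {{ $labels.version }} has been in phase Replacing for over 30 minutes. Reason-{{ $labels.reason }}
    # Fires only while the CSV stays in the Replacing phase.
    expr: csv_abnormal{phase="Replacing"}
    for: 30m
    labels:
      # Copy the originating namespace onto the alert so it can be routed.
      namespace: '{{ $labels.namespace }}'
      severity: warning
  - alert: CsvAbnormalReplacingOver4Hr
    annotations:
      message: Operator {{ $labels.name }} version {{ $labels.version }} has been in phase Replacing for over 4 hours. Reason-{{ $labels.reason }}
    expr: csv_abnormal{phase="Replacing"}
    for: 4h
    labels:
      namespace: '{{ $labels.namespace }}'
      # Assumption: the longer-running condition is escalated.
      severity: critical
```

Using two alerts on the same expression with different `for` durations lets a short stall surface as a warning while a prolonged one can be handled separately.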

Comment 5 Rick Rackow 2021-07-13 16:00:32 UTC
This has been fixed on master, but it is still an issue on 4.6 and 4.7.

Comment 9 Bruno Andrade 2021-10-05 16:35:34 UTC
LGTM, marking as VERIFIED.

OCP Version: 4.10.0-0.nightly-2021-10-05-121338

OLM version: 0.18.3
git commit: a768ef8e86e00e25fa8612dbf9f6984721449255



oc get prometheusrules.monitoring.coreos.com olm-alert-rules -n openshift-operator-lifecycle-manager -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-10-05T15:56:47Z"
  generation: 1
  labels:
    prometheus: alert-rules
    role: alert-rules
  name: olm-alert-rules
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 28bbc3d2-a454-4187-a57d-c0a07d220a76
  resourceVersion: "1757"
  uid: eabba1ca-69a0-4a8f-8999-1640d5fe72e3
spec:
  groups:
  - name: olm.csv_abnormal.rules
    rules:
    - alert: CsvAbnormalFailedOver2Min
      annotations:
        message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
          }}. Reason-{{ $labels.reason }}
      expr: csv_abnormal{phase=~"^Failed$"}
      for: 2m
      labels:
        namespace: '{{ $labels.namespace }}'
        severity: warning
    - alert: CsvAbnormalOver30Min
      annotations:
        message: Failed to install Operator {{ $labels.name }} version {{ $labels.version
          }}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
      expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
      for: 30m
      labels:
        namespace: '{{ $labels.namespace }}'
        severity: warning
  - name: olm.installplan.rules
    rules:
    - alert: InstallPlanStepAppliedWithWarnings
      annotations:
        message: The API server returned a warning during installation or upgrade
          of an operator. An Event with reason "AppliedWithWarnings" has been created
          with complete details, including a reference to the InstallPlan step that
          generated the warning.
      expr: sum(increase(installplan_warnings_total[5m])) > 0
      labels:
        severity: warning
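
The `namespace` label that the CSV rules above attach is what makes per-namespace routing possible. A minimal Alertmanager routing sketch follows, assuming that "core namespaces" means those matching `openshift-.*` and using hypothetical receiver names (`default`, `platform-team`); neither assumption comes from this bug.

```yaml
# Hypothetical Alertmanager configuration fragment.
route:
  receiver: default                 # fallback receiver (hypothetical name)
  group_by: ['alertname', 'namespace']
  routes:
  # Send CSV alerts from core namespaces to a dedicated receiver.
  - matchers:
    - 'alertname=~"CsvAbnormal.*"'
    - 'namespace=~"openshift-.*"'   # assumed definition of "core namespaces"
    receiver: platform-team         # hypothetical name
receivers:
- name: default
- name: platform-team
```

CSV alerts from any other namespace fall through to the top-level `default` receiver.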

