Description of problem: Alerts regarding CSV failures are not very accurate and are not targeted to core namespaces.
This bug report is quite vague and doesn't contain any actionable information. I'm closing it as INSUFFICIENT_DATA. Feel free to reopen it with a more explicit description of the specific defect or of what is being asked for.
CsvAbnormalReplacingOver30Min and CsvAbnormalReplacingOver4Hr should be added to give better insight into potentially bad behavior during CSV replacement. These alerts should also carry the namespace they originate from as a label, so that they can be routed easily via Alertmanager.
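For illustration only, one of the requested rules might look roughly like the fragment below. This is a sketch, not the rule that was eventually shipped: it assumes the existing csv_abnormal metric exposes phase, name, version, reason and namespace labels, and the message wording is made up for the example.

```yaml
# Hypothetical sketch of the requested rule; not copied from the OLM manifests.
- alert: CsvAbnormalReplacingOver30Min
  annotations:
    # Message text is illustrative only.
    message: Operator {{ $labels.name }} version {{ $labels.version }} has been in phase Replacing for more than 30 minutes. Reason-{{ $labels.reason }}
  # Assumes csv_abnormal carries a "phase" label.
  expr: csv_abnormal{phase=~"^Replacing$"}
  for: 30m
  labels:
    # Copy the originating namespace onto the alert so Alertmanager can route on it.
    namespace: '{{ $labels.namespace }}'
    severity: warning
```

A CsvAbnormalReplacingOver4Hr variant would presumably differ only in the alert name and in using "for: 4h".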
This has been fixed on master, but it is still an issue on 4.6 and 4.7.
LGTM, marking as VERIFIED.
OCP Version: 4.10.0-0.nightly-2021-10-05-121338
OLM version: 0.18.3
git commit: a768ef8e86e00e25fa8612dbf9f6984721449255
oc get prometheusrules.monitoring.coreos.com olm-alert-rules -n openshift-operator-lifecycle-manager -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-10-05T15:56:47Z"
  generation: 1
  labels:
    prometheus: alert-rules
    role: alert-rules
  name: olm-alert-rules
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 28bbc3d2-a454-4187-a57d-c0a07d220a76
  resourceVersion: "1757"
  uid: eabba1ca-69a0-4a8f-8999-1640d5fe72e3
spec:
  groups:
  - name: olm.csv_abnormal.rules
    rules:
    - alert: CsvAbnormalFailedOver2Min
      annotations:
        message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}. Reason-{{ $labels.reason }}
      expr: csv_abnormal{phase=~"^Failed$"}
      for: 2m
      labels:
        namespace: '{{ $labels.namespace }}'
        severity: warning
    - alert: CsvAbnormalOver30Min
      annotations:
        message: Failed to install Operator {{ $labels.name }} version {{ $labels.version }}. Phase-{{ $labels.phase }} Reason-{{ $labels.reason }}
      expr: csv_abnormal{phase=~"(^Replacing$|^Pending$|^Deleting$|^Unknown$)"}
      for: 30m
      labels:
        namespace: '{{ $labels.namespace }}'
        severity: warning
  - name: olm.installplan.rules
    rules:
    - alert: InstallPlanStepAppliedWithWarnings
      annotations:
        message: The API server returned a warning during installation or upgrade of an operator. An Event with reason "AppliedWithWarnings" has been created with complete details, including a reference to the InstallPlan step that generated the warning.
      expr: sum(increase(installplan_warnings_total[5m])) > 0
      labels:
        severity: warning
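With the namespace label present on the CsvAbnormal* alerts, routing by namespace could be sketched in an Alertmanager configuration roughly as follows. This is only an assumption-laden example: the receiver names are hypothetical, the "openshift-.*" pattern is a guess at what counts as a core namespace, and the matchers syntax needs a reasonably recent Alertmanager (older releases use match / match_re instead).

```yaml
route:
  receiver: default              # hypothetical catch-all receiver
  routes:
  # Send CSV alerts that originate from openshift-* namespaces to a dedicated receiver.
  - matchers:
    - alertname =~ "CsvAbnormal.*"
    - namespace =~ "openshift-.*"
    receiver: olm-core           # hypothetical receiver name
receivers:
- name: default
- name: olm-core
```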