Bug 1992531
Summary: all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

| Product: | OpenShift Container Platform | Reporter: | hongyan li \<hongyli\> |
|---|---|---|---|
| Component: | Machine Config Operator | Assignee: | MCO Team \<team-mco\> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu \<rioliu\> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, jerzhang, kgarriso, mkrejci |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-07 20:21:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (hongyan li, 2021-08-11 09:25:12 UTC)
The following alert rules have this issue (their annotations use `message` rather than `summary` and `description`):

```
$ oc get prometheusrules -n openshift-machine-api -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:15Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machine-api-operator-prometheus-rules
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "2116"
    uid: cbef9898-c41a-4f18-8256-2763485dc37d
  spec:
    groups:
    - name: machine-without-valid-node-ref
      rules:
      - alert: MachineWithoutValidNode
        annotations:
          message: machine {{ $labels.name }} does not have valid node reference
        expr: |
          (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-with-no-running-phase
      rules:
      - alert: MachineWithNoRunningPhase
        annotations:
          message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
        expr: |
          (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-not-yet-deleted
      rules:
      - alert: MachineNotYetDeleted
        annotations:
          message: machine {{ $labels.name }} has been in Deleting phase for more than 6 hours
        expr: |
          (mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
        for: 360m
        labels:
          severity: warning
    - name: machine-api-operator-metrics-collector-up
      rules:
      - alert: MachineAPIOperatorMetricsCollectionFailing
        annotations:
          message: 'machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api'
        expr: |
          mapi_mao_collector_up == 0
        for: 5m
        labels:
          severity: critical
    - name: machine-health-check-unterminated-short-circuit
      rules:
      - alert: MachineHealthCheckUnterminatedShortCircuit
        expr: |
          mapi_machinehealthcheck_short_circuit == 1
        for: 30m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

The following alert rules also have this issue:

```
$ oc get prometheusrules -n openshift-cluster-machine-approver -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:00Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machineapprover-rules
    namespace: openshift-cluster-machine-approver
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1700"
    uid: 9165d040-d214-4ea4-8099-46136348eb83
  spec:
    groups:
    - name: cluster-machine-approver.rules
      rules:
      - alert: MachineApproverMaxPendingCSRsReached
        annotations:
          message: max pending CSRs threshold reached.
        expr: |
          mapi_current_pending_csr > mapi_max_pending_csr
        for: 5m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

---

The MCO does have an alerting epic: https://issues.redhat.com/browse/GRPA-2741. Moving over to Kirsten to take a look at whether this is relevant as part of the epic.

---

Thanks, Jerry.

Hongyan: as a note, we are the Machine Config Operator. Please open bugs in the correct components to cover the alerts mentioned in comment 1 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c1) and comment 2 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c2). They are not MCO alerts (see the namespaces). Please confirm that you will be opening separate bugs in the relevant components.

---

I will file separate bugs; you can just fix the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c0.

---

Closing as this work will be tracked as part of https://issues.redhat.com/browse/MCO-1.
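For reference, a guideline-compliant rule replaces the single `message` annotation with a short `summary` and a more actionable `description`. A minimal sketch of what that could look like for one of the rules above; the wording is illustrative only, not the fix that eventually shipped:

```yaml
# Illustrative only: compliant annotations for MachineWithoutValidNode.
# "summary" is a short one-line overview; "description" carries the
# detail and any troubleshooting pointer.
- alert: MachineWithoutValidNode
  annotations:
    summary: Machine {{ $labels.name }} has no valid node reference.
    description: >-
      Machine {{ $labels.name }} has not been associated with a node for
      60 minutes. Check the machine-api controller logs in the
      openshift-machine-api namespace.
  expr: |
    (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
  for: 60m
  labels:
    severity: warning
```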
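To make the compliance criterion concrete, here is a small standalone sketch that flags alerts in a `PrometheusRule` object (e.g. parsed from `oc get prometheusrules -n <ns> -o json`) whose annotations lack `summary` or `description`. The helper name `find_noncompliant_alerts` is ours for illustration, not part of any official tooling:

```python
# Hypothetical helper: list alerts in a PrometheusRule dict whose
# annotations are missing the "summary" or "description" keys expected
# by the OpenShift alerting guidelines.
REQUIRED_ANNOTATIONS = ("summary", "description")

def find_noncompliant_alerts(prometheus_rule: dict) -> list[str]:
    """Return the names of alert rules missing a required annotation."""
    bad = []
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no alert annotations
            annotations = rule.get("annotations", {})
            if any(key not in annotations for key in REQUIRED_ANNOTATIONS):
                bad.append(rule["alert"])
    return bad

if __name__ == "__main__":
    # Minimal fixture mirroring the machine-approver rule from this bug:
    # it only sets "message", so it is flagged as non-compliant.
    rule_obj = {
        "spec": {
            "groups": [
                {
                    "name": "cluster-machine-approver.rules",
                    "rules": [
                        {
                            "alert": "MachineApproverMaxPendingCSRsReached",
                            "annotations": {"message": "max pending CSRs threshold reached."},
                            "expr": "mapi_current_pending_csr > mapi_max_pending_csr",
                        }
                    ],
                }
            ]
        }
    }
    print(find_noncompliant_alerts(rule_obj))  # -> ['MachineApproverMaxPendingCSRsReached']
```

Running the same check against a rule that sets both `summary` and `description` returns an empty list.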