Bug 1992531
| Summary: | all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | hongyan li <hongyli> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, jerzhang, kgarriso, mkrejci |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-07 20:21:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The following alert rules also have this issue:
$ oc get prometheusrules -n openshift-machine-api -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
exclude.release.openshift.io/internal-openshift-hosted: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:12:15Z"
generation: 1
labels:
prometheus: k8s
role: alert-rules
name: machine-api-operator-prometheus-rules
namespace: openshift-machine-api
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "2116"
uid: cbef9898-c41a-4f18-8256-2763485dc37d
spec:
groups:
- name: machine-without-valid-node-ref
rules:
- alert: MachineWithoutValidNode
annotations:
message: machine {{ $labels.name }} does not have valid node reference
expr: |
(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
for: 60m
labels:
severity: warning
- name: machine-with-no-running-phase
rules:
- alert: MachineWithNoRunningPhase
annotations:
message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
expr: |
(mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
for: 60m
labels:
severity: warning
- name: machine-not-yet-deleted
rules:
- alert: MachineNotYetDeleted
annotations:
message: machine {{ $labels.name }} has been in Deleting phase for more
than 6 hours
expr: |
(mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
for: 360m
labels:
severity: warning
- name: machine-api-operator-metrics-collector-up
rules:
- alert: MachineAPIOperatorMetricsCollectionFailing
annotations:
message: 'machine api operator metrics collection is failing. For more details: oc
logs <machine-api-operator-pod-name> -n openshift-machine-api'
expr: |
mapi_mao_collector_up == 0
for: 5m
labels:
severity: critical
- name: machine-health-check-unterminated-short-circuit
rules:
- alert: MachineHealthCheckUnterminatedShortCircuit
expr: |
mapi_machinehealthcheck_short_circuit == 1
for: 30m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
The following alert rules also have this issue:
$ oc get prometheusrules -n openshift-cluster-machine-approver -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
exclude.release.openshift.io/internal-openshift-hosted: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:12:00Z"
generation: 1
labels:
prometheus: k8s
role: alert-rules
name: machineapprover-rules
namespace: openshift-cluster-machine-approver
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "1700"
uid: 9165d040-d214-4ea4-8099-46136348eb83
spec:
groups:
- name: cluster-machine-approver.rules
rules:
- alert: MachineApproverMaxPendingCSRsReached
annotations:
message: max pending CSRs threshold reached.
expr: |
mapi_current_pending_csr > mapi_max_pending_csr
for: 5m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
The MCO does have an alerting epic: https://issues.redhat.com/browse/GRPA-2741. Moving over to Kirsten to take a look at whether this is relevant as part of the epic. Thanks Jerry.

Hongyan: As a note, we are the Machine Config Operator. Please open bugs in the correct components to cover the alerts mentioned in comment 1 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c1) and comment 2 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c2). They are not MCO alerts (see the namespaces). Please confirm that you will be opening separate bugs in the relevant components.

I will file separate bugs; you can just fix the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c0.

Closing as this work will be tracked as part of https://issues.redhat.com/browse/MCO-1.
Description of problem:
all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-machine-config-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:03Z"
    generation: 1
    labels:
      k8s-app: machine-config-daemon
    name: machine-config-daemon
    namespace: openshift-machine-config-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1792"
    uid: f565a74a-ebda-404a-85f0-0a317e2b3e49
  spec:
    groups:
    - name: mcd-reboot-error
      rules:
      - alert: MCDRebootError
        annotations:
          message: Reboot failed on {{ $labels.node }} , update may be blocked
        expr: |
          mcd_reboot_err > 0
        labels:
          severity: critical
    - name: mcd-drain-error
      rules:
      - alert: MCDDrainError
        annotations:
          message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon'
        expr: |
          mcd_drain_err > 0
        labels:
          severity: warning
    - name: mcd-pivot-error
      rules:
      - alert: MCDPivotError
        annotations:
          message: 'Error detected in pivot logs on {{ $labels.node }} '
        expr: |
          mcd_pivot_err > 0
        labels:
          severity: warning
    - name: mcd-kubelet-health-state-error
      rules:
      - alert: KubeletHealthState
        annotations:
          message: Kubelet health failure threshold reached
        expr: |
          mcd_kubelet_state > 2
        labels:
          severity: warning
    - name: system-memory-exceeds-reservation
      rules:
      - alert: SystemMemoryExceedsReservation
        annotations:
          message: System memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state).
        expr: |
          sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
        for: 15m
        labels:
          severity: warning
    - name: master-nodes-high-memory-usage
      rules:
      - alert: MasterNodesHighMemoryUsage
        annotations:
          message: Memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90%. Master nodes starved of memory could result in degraded performance of the control plane.
        expr: |
          ((sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" ))) / sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) * 100) > 90
        for: 15m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
alert rules have annotations "summary" and "description"

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
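For the MCO rules shown under "Actual results", the expected change is to replace the single "message" annotation with "summary" and "description". As one hedged example (the exact wording is up to the MCO team; this is only a sketch), MCDRebootError could become:

- alert: MCDRebootError
  annotations:
    # illustrative wording only, not the final annotation text
    summary: Reboot failed on a node, update may be blocked.
    description: 'Reboot failed on {{ $labels.node }}, update may be blocked. For more details check the machine-config-daemon logs: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon'
  expr: |
    mcd_reboot_err > 0
  labels:
    severity: critical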