Bug 1992531

Summary: all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines
Product: OpenShift Container Platform
Reporter: hongyan li <hongyli>
Component: Machine Config Operator
Sub Component: Machine Config Operator
Assignee: MCO Team <team-mco>
QA Contact: Rio Liu <rioliu>
Status: CLOSED DEFERRED
Docs Contact:
Severity: medium
Priority: low
CC: aos-bugs, jerzhang, kgarriso, mkrejci
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-07 20:21:31 UTC
Type: Bug
Regression: ---
Embargoed:

Description hongyan li 2021-08-11 09:25:12 UTC
Description of problem:
All alert rules' "summary" and "description" annotations should comply with the OpenShift alerting guidelines.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1. Run: oc get prometheusrules -n openshift-machine-config-operator -oyaml
2. Inspect the annotations of each alert rule.
3. Note that the rules only set a "message" annotation; none provide "summary" and "description".

Actual results:
$ oc get prometheusrules -n openshift-machine-config-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:03Z"
    generation: 1
    labels:
      k8s-app: machine-config-daemon
    name: machine-config-daemon
    namespace: openshift-machine-config-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1792"
    uid: f565a74a-ebda-404a-85f0-0a317e2b3e49
  spec:
    groups:
    - name: mcd-reboot-error
      rules:
      - alert: MCDRebootError
        annotations:
          message: Reboot failed on {{ $labels.node }} , update may be blocked
        expr: |
          mcd_reboot_err > 0
        labels:
          severity: critical
    - name: mcd-drain-error
      rules:
      - alert: MCDDrainError
        annotations:
          message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For
            more details:  oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }}
            -c machine-config-daemon'
        expr: |
          mcd_drain_err > 0
        labels:
          severity: warning
    - name: mcd-pivot-error
      rules:
      - alert: MCDPivotError
        annotations:
          message: 'Error detected in pivot logs on {{ $labels.node }} '
        expr: |
          mcd_pivot_err > 0
        labels:
          severity: warning
    - name: mcd-kubelet-health-state-error
      rules:
      - alert: KubeletHealthState
        annotations:
          message: Kubelet health failure threshold reached
        expr: |
          mcd_kubelet_state > 2
        labels:
          severity: warning
    - name: system-memory-exceeds-reservation
      rules:
      - alert: SystemMemoryExceedsReservation
        annotations:
          message: System memory usage of {{ $value | humanize }} on {{ $labels.node
            }} exceeds 90% of the reservation. Reserved memory ensures system processes
            can function even when the node is fully allocated and protects against
            workload out of memory events impacting the proper functioning of the
            node. The default reservation is expected to be sufficient for most configurations
            and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html)
            when running nodes with high numbers of pods (either due to rate of change
            or at steady state).
        expr: |
          sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
        for: 15m
        labels:
          severity: warning
    - name: master-nodes-high-memory-usage
      rules:
      - alert: MasterNodesHighMemoryUsage
        annotations:
          message: Memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds
            90%. Master nodes starved of memory could result in degraded performance
            of the control plane.
        expr: |
          ((sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" ))) / sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) * 100) > 90
        for: 15m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Expected results:
Alert rules have "summary" and "description" annotations.

Additional info:
the "summary" and "description" annotations comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
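
For illustration only, a guideline-compliant rule might look like the following rewrite of MCDRebootError (the exact summary/description wording here is an assumption; the guidelines only require that both annotations exist and explain the alert and its impact):

- alert: MCDRebootError
  annotations:
    summary: Reboot failed on a node managed by the machine config daemon.
    description: Reboot failed on {{ $labels.node }}; the machine config update
      on this node may be blocked. See the machine-config-daemon pod logs on the
      affected node for details.
  expr: |
    mcd_reboot_err > 0
  labels:
    severity: critical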

Comment 1 hongyan li 2021-08-11 09:33:23 UTC
The following alert rules have the same issue:
$ oc get prometheusrules -n openshift-machine-api -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:15Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machine-api-operator-prometheus-rules
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "2116"
    uid: cbef9898-c41a-4f18-8256-2763485dc37d
  spec:
    groups:
    - name: machine-without-valid-node-ref
      rules:
      - alert: MachineWithoutValidNode
        annotations:
          message: machine {{ $labels.name }} does not have valid node reference
        expr: |
          (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-with-no-running-phase
      rules:
      - alert: MachineWithNoRunningPhase
        annotations:
          message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
        expr: |
          (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-not-yet-deleted
      rules:
      - alert: MachineNotYetDeleted
        annotations:
          message: machine {{ $labels.name }} has been in Deleting phase for more
            than 6 hours
        expr: |
          (mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
        for: 360m
        labels:
          severity: warning
    - name: machine-api-operator-metrics-collector-up
      rules:
      - alert: MachineAPIOperatorMetricsCollectionFailing
        annotations:
          message: 'machine api operator metrics collection is failing. For more details:  oc
            logs <machine-api-operator-pod-name> -n openshift-machine-api'
        expr: |
          mapi_mao_collector_up == 0
        for: 5m
        labels:
          severity: critical
    - name: machine-health-check-unterminated-short-circuit
      rules:
      - alert: MachineHealthCheckUnterminatedShortCircuit
        expr: |
          mapi_machinehealthcheck_short_circuit == 1
        for: 30m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 2 hongyan li 2021-08-11 10:20:56 UTC
The following alert rules also have this issue:
$ oc get prometheusrules -n openshift-cluster-machine-approver -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:00Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machineapprover-rules
    namespace: openshift-cluster-machine-approver
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1700"
    uid: 9165d040-d214-4ea4-8099-46136348eb83
  spec:
    groups:
    - name: cluster-machine-approver.rules
      rules:
      - alert: MachineApproverMaxPendingCSRsReached
        annotations:
          message: max pending CSRs threshold reached.
        expr: |
          mapi_current_pending_csr > mapi_max_pending_csr
        for: 5m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
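
A quick way to list the alert rules that are still missing these annotations across all namespaces (an illustrative check based on the output format above, not the exact query used here) would be:

$ oc get prometheusrules --all-namespaces -o json \
    | jq -r '.items[] | .metadata.namespace as $ns | .spec.groups[].rules[]
        | select(.alert != null)
        | select(.annotations.summary == null or .annotations.description == null)
        | "\($ns) \(.alert)"'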

Comment 3 Yu Qi Zhang 2021-08-12 03:12:19 UTC
The MCO does have an alerting epic https://issues.redhat.com/browse/GRPA-2741. Moving over to Kirsten to take a look at whether this is relevant as part of the epic.

Comment 4 Kirsten Garrison 2021-08-12 16:33:01 UTC
Thanks, Jerry.

Hongyan:

As a note: we are the Machine Config Operator. Please open bugs against the correct components to cover the alerts mentioned in comment 1 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c1) and comment 2 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c2). They are not MCO alerts (see the namespaces).

Please confirm that you will be opening separate bugs in the relevant components.

Comment 5 hongyan li 2021-08-13 07:44:58 UTC
I will file separate bugs; you can just fix the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c0.

Comment 6 Kirsten Garrison 2022-03-07 20:21:31 UTC
Closing as this work will be tracked as part of https://issues.redhat.com/browse/MCO-1