Description of problem:
All alert rules should carry "summary" and "description" annotations that comply with the OpenShift alerting guidelines. Currently the rules below only set a "message" annotation (or none at all).

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-machine-config-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:03Z"
    generation: 1
    labels:
      k8s-app: machine-config-daemon
    name: machine-config-daemon
    namespace: openshift-machine-config-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1792"
    uid: f565a74a-ebda-404a-85f0-0a317e2b3e49
  spec:
    groups:
    - name: mcd-reboot-error
      rules:
      - alert: MCDRebootError
        annotations:
          message: Reboot failed on {{ $labels.node }} , update may be blocked
        expr: |
          mcd_reboot_err > 0
        labels:
          severity: critical
    - name: mcd-drain-error
      rules:
      - alert: MCDDrainError
        annotations:
          message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon'
        expr: |
          mcd_drain_err > 0
        labels:
          severity: warning
    - name: mcd-pivot-error
      rules:
      - alert: MCDPivotError
        annotations:
          message: 'Error detected in pivot logs on {{ $labels.node }} '
        expr: |
          mcd_pivot_err > 0
        labels:
          severity: warning
    - name: mcd-kubelet-health-state-error
      rules:
      - alert: KubeletHealthState
        annotations:
          message: Kubelet health failure threshold reached
        expr: |
          mcd_kubelet_state > 2
        labels:
          severity: warning
    - name: system-memory-exceeds-reservation
      rules:
      - alert: SystemMemoryExceedsReservation
        annotations:
          message: System memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state).
        expr: |
          sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
        for: 15m
        labels:
          severity: warning
    - name: master-nodes-high-memory-usage
      rules:
      - alert: MasterNodesHighMemoryUsage
        annotations:
          message: Memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90%. Master nodes starved of memory could result in degraded performance of the control plane.
        expr: |
          ((sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" ))) / sum(node_memory_MemTotal_bytes AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )) * 100) > 90
        for: 15m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
alert rules have "summary" and "description" annotations

Additional info:
the "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1]

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
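For reference, the guidelines ask for a short "summary" headline and a longer "description" with enough context to start debugging, instead of the single "message" annotation. A sketch of what MCDRebootError could look like; the exact wording is only a suggestion, not the text the MCO team has to use:

      - alert: MCDRebootError
        annotations:
          # summary: short, human-readable headline for the alert
          summary: Reboot failed on node {{ $labels.node }}.
          # description: longer text with enough context to start debugging
          description: Reboot failed on {{ $labels.node }}, updates may be blocked. Check the machine-config-daemon pod logs on that node for details.
        expr: |
          mcd_reboot_err > 0
        labels:
          severity: critical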
The following alert rules have the same issue:

$ oc get prometheusrules -n openshift-machine-api -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:15Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machine-api-operator-prometheus-rules
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "2116"
    uid: cbef9898-c41a-4f18-8256-2763485dc37d
  spec:
    groups:
    - name: machine-without-valid-node-ref
      rules:
      - alert: MachineWithoutValidNode
        annotations:
          message: machine {{ $labels.name }} does not have valid node reference
        expr: |
          (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-with-no-running-phase
      rules:
      - alert: MachineWithNoRunningPhase
        annotations:
          message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
        expr: |
          (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
        for: 60m
        labels:
          severity: warning
    - name: machine-not-yet-deleted
      rules:
      - alert: MachineNotYetDeleted
        annotations:
          message: machine {{ $labels.name }} has been in Deleting phase for more than 6 hours
        expr: |
          (mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
        for: 360m
        labels:
          severity: warning
    - name: machine-api-operator-metrics-collector-up
      rules:
      - alert: MachineAPIOperatorMetricsCollectionFailing
        annotations:
          message: 'machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api'
        expr: |
          mapi_mao_collector_up == 0
        for: 5m
        labels:
          severity: critical
    - name: machine-health-check-unterminated-short-circuit
      rules:
      - alert: MachineHealthCheckUnterminatedShortCircuit
        expr: |
          mapi_machinehealthcheck_short_circuit == 1
        for: 30m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
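Note that MachineHealthCheckUnterminatedShortCircuit ships with no annotations at all, so "summary" and "description" need to be added rather than renamed. An illustrative sketch only; the {{ $labels.name }} reference assumes mapi_machinehealthcheck_short_circuit carries a "name" label, which should be confirmed against the exported metric:

      - alert: MachineHealthCheckUnterminatedShortCircuit
        annotations:
          # both annotations are new; this rule currently has none
          summary: Machine health check {{ $labels.name }} is short-circuited.
          description: The machine health check {{ $labels.name }} has been short-circuited for more than 30 minutes, so unhealthy machines in that pool are not being remediated.
        expr: |
          mapi_machinehealthcheck_short_circuit == 1
        for: 30m
        labels:
          severity: warning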
The following alert rule also has the issue:

$ oc get prometheusrules -n openshift-cluster-machine-approver -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      exclude.release.openshift.io/internal-openshift-hosted: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:00Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: machineapprover-rules
    namespace: openshift-cluster-machine-approver
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1700"
    uid: 9165d040-d214-4ea4-8099-46136348eb83
  spec:
    groups:
    - name: cluster-machine-approver.rules
      rules:
      - alert: MachineApproverMaxPendingCSRsReached
        annotations:
          message: max pending CSRs threshold reached.
        expr: |
          mapi_current_pending_csr > mapi_max_pending_csr
        for: 5m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
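For this rule the change is again to replace "message" with the two required annotations and add some operational context, for example (wording is only a suggestion):

      - alert: MachineApproverMaxPendingCSRsReached
        annotations:
          summary: Maximum number of pending CSRs reached.
          # description should tell the responder what the impact is and where to look next
          description: The number of pending certificate signing requests (mapi_current_pending_csr) has exceeded the configured maximum (mapi_max_pending_csr). New nodes may be unable to join the cluster until the pending CSRs are reviewed and approved.
        expr: |
          mapi_current_pending_csr > mapi_max_pending_csr
        for: 5m
        labels:
          severity: warning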
The MCO does have an alerting epic https://issues.redhat.com/browse/GRPA-2741. Moving over to Kirsten to take a look at whether this is relevant as part of the epic.
Thanks, Jerry. Hongyan: as a note, we are the Machine Config Operator. Please open bugs in the correct components to cover the alerts mentioned in comment 1 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c1) and comment 2 (https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c2). They are not MCO alerts (see the namespaces). Please confirm that you will be opening separate bugs in the relevant components.
I will file separate bugs; you can just fix the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1992531#c0.
Closing, as this work will be tracked as part of https://issues.redhat.com/browse/MCO-1.