Bug 1992507

Summary: all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines
Product: OpenShift Container Platform Reporter: hongyan li <hongyli>
Component: Networking Assignee: Christoph Stäbler <cstabler>
Networking sub component: openshift-sdn QA Contact: Ying Wang <yingwang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: astoycos
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-10-18 17:45:51 UTC Type: Bug

Description hongyan li 2021-08-11 09:09:25 UTC
Description of problem:
All alert rules' "summary" and "description" annotations should comply with the OpenShift alerting guidelines.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
always

Steps to Reproduce:
1. Run the command shown under "Actual results" and inspect the annotations of each alert rule.

Actual results:
$ oc get prometheusrules -n openshift-sdn -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      networkoperator.openshift.io/ignore-errors: ""
    creationTimestamp: "2021-08-10T23:12:52Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: networking-rules
    namespace: openshift-sdn
    ownerReferences:
    - apiVersion: operator.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Network
      name: cluster
      uid: f3f79f33-0ad6-4115-bb39-3dbd18324808
    resourceVersion: "2834"
    uid: 92537d98-cb35-4f74-9291-e1b6f3952277
  spec:
    groups:
    - name: cluster-network-operator-sdn.rules
      rules:
      - alert: NodeWithoutSDNPod
        annotations:
          message: |
            All nodes should be running an sdn pod, {{ $labels.node }} is not.
        expr: |
          (kube_node_info unless on(node) topk by (node) (1, kube_pod_info{namespace="openshift-sdn",  pod=~"sdn.*"})) > 0
        for: 10m
        labels:
          severity: warning
      - alert: NodeProxyApplySlow
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is taking
            too long, on average, to apply kubernetes service rules to iptables.
        expr: "histogram_quantile(.95, kubeproxy_sync_proxy_rules_duration_seconds_bucket)
          \n* on(namespace, pod) group_right topk by (namespace, pod) (1, kube_pod_info{namespace=\"openshift-sdn\",
          \ pod=~\"sdn-[^-]*\"}) > 15\n"
        labels:
          severity: warning
      - alert: ClusterProxyApplySlow
        annotations:
          message: The cluster is taking too long, on average, to apply kubernetes
            service rules to iptables.
        expr: |
          histogram_quantile(0.95, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le)) > 10
        labels:
          severity: warning
      - alert: NodeProxyApplyStale
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale
            kubernetes service rules in iptables.
        expr: |
          (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds - kubeproxy_sync_proxy_rules_last_timestamp_seconds)
          * on(namespace, pod) group_right() topk by (namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
          > 30
        for: 5m
        labels:
          severity: warning
      - alert: SDNPodNotReady
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
        expr: |
          kube_pod_status_ready{namespace='openshift-sdn', condition='true'} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Expected results:
Alert rules have "summary" and "description" annotations.

Additional info:
The "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1].

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
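
The expected result can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that walks a PrometheusRule list in the shape returned by `oc get prometheusrules -o json` and reports every alert that lacks a "summary" or "description" annotation. The function name, the SAMPLE data, and the alert `ExampleCompliantAlert` are hypothetical; only `SDNPodNotReady` and its `message` annotation are taken from the output above.

```python
from typing import Dict, List

# Annotations this bug expects every alert rule to carry.
REQUIRED = ("summary", "description")


def find_noncompliant_rules(rule_list: dict) -> Dict[str, List[str]]:
    """Map "namespace/alert" to the required annotations that alert lacks."""
    problems: Dict[str, List[str]] = {}
    for item in rule_list.get("items", []):
        namespace = item.get("metadata", {}).get("namespace", "")
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                alert = rule.get("alert")
                if alert is None:
                    continue  # recording rules have no alert name
                annotations = rule.get("annotations") or {}
                missing = [key for key in REQUIRED if not annotations.get(key)]
                if missing:
                    problems[f"{namespace}/{alert}"] = missing
    return problems


# Hypothetical sample: one rule copied from the report, one made-up compliant rule.
SAMPLE = {
    "items": [{
        "metadata": {"namespace": "openshift-sdn"},
        "spec": {"groups": [{
            "name": "cluster-network-operator-sdn.rules",
            "rules": [
                {
                    "alert": "SDNPodNotReady",
                    "annotations": {"message": "SDN pod is not ready."},
                },
                {
                    "alert": "ExampleCompliantAlert",  # hypothetical
                    "annotations": {
                        "summary": "Short summary.",
                        "description": "Longer description.",
                    },
                },
            ],
        }]},
    }],
}

if __name__ == "__main__":
    for name, missing in find_noncompliant_rules(SAMPLE).items():
        print(f"{name}: missing {', '.join(missing)}")
```

To run it against a live cluster, the JSON from `oc get prometheusrules -n openshift-sdn -o json` could be loaded with `json.load` and passed to the function in place of SAMPLE.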

Comment 1 hongyan li 2021-08-11 09:56:38 UTC
The following rules have the same issue:
$ oc get prometheusrules -n openshift-ingress-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
    creationTimestamp: "2021-08-10T23:12:03Z"
    generation: 1
    labels:
      role: alert-rules
    name: ingress-operator
    namespace: openshift-ingress-operator
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
    resourceVersion: "1790"
    uid: 0efb31c0-440b-408b-aee6-aba2fa472459
  spec:
    groups:
    - name: openshift-ingress.rules
      rules:
      - alert: HAProxyReloadFail
        annotations:
          message: HAProxy reloads are failing on {{ $labels.pod }}. Router is not
            respecting recently created or modified routes
        expr: template_router_reload_failure == 1
        for: 5m
        labels:
          severity: warning
      - alert: HAProxyDown
        annotations:
          message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace
            }} / {{ $labels.pod }}
        expr: haproxy_up == 0
        for: 5m
        labels:
          severity: critical
      - alert: IngressControllerDegraded
        annotations:
          message: |
            The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is
            degraded: {{ $labels.reason }}.
        expr: ingress_controller_conditions{condition="Degraded"} == 1
        for: 5m
        labels:
          severity: warning
      - alert: IngressControllerUnavailable
        annotations:
          message: |
            The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is
            unavailable: {{ $labels.reason }}.
        expr: ingress_controller_conditions{condition="Available"} == 0
        for: 5m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 3 Ying Wang 2021-08-18 11:16:05 UTC
Checked on the version below: the prometheusrules for openshift-sdn and openshift-ovn now have the summary annotation.

$ oc get prometheusrules -n openshift-sdn -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      networkoperator.openshift.io/ignore-errors: ""
    creationTimestamp: "2021-08-18T10:18:25Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    managedFields:
    - apiVersion: monitoring.coreos.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:networkoperator.openshift.io/ignore-errors: {}
          f:labels:
            .: {}
            f:prometheus: {}
            f:role: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"923b3b1a-ad7a-48b7-84f2-cf96afa79aaa"}: {}
        f:spec:
          .: {}
          f:groups: {}
      manager: cluster-network-operator
      operation: Update
      time: "2021-08-18T10:18:25Z"
    name: networking-rules
    namespace: openshift-sdn
    ownerReferences:
    - apiVersion: operator.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Network
      name: cluster
      uid: 923b3b1a-ad7a-48b7-84f2-cf96afa79aaa
    resourceVersion: "2771"
    uid: 0d3e5958-76ee-4b19-9f1c-bb3e4e3fb81c
  spec:
    groups:
    - name: cluster-network-operator-sdn.rules
      rules:
      - alert: NodeWithoutSDNPod
        annotations:
          summary: All nodes should be running an sdn pod, {{ $labels.node }} is not.
        expr: |
          (kube_node_info unless on(node) topk by (node) (1, kube_pod_info{namespace="openshift-sdn",  pod=~"sdn.*"})) > 0
        for: 10m
        labels:
          severity: warning
      - alert: NodeProxyApplySlow
        annotations:
          summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is taking too long, on average, to apply kubernetes service rules to iptables.
        expr: "histogram_quantile(.95, kubeproxy_sync_proxy_rules_duration_seconds_bucket) \n* on(namespace, pod) group_right topk by (namespace, pod) (1, kube_pod_info{namespace=\"openshift-sdn\",  pod=~\"sdn-[^-]*\"}) > 15\n"
        labels:
          severity: warning
      - alert: ClusterProxyApplySlow
        annotations:
          summary: The cluster is taking too long, on average, to apply kubernetes service rules to iptables.
        expr: |
          histogram_quantile(0.95, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le)) > 10
        labels:
          severity: warning
      - alert: NodeProxyApplyStale
        annotations:
          summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes service rules in iptables.
        expr: |
          (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds - kubeproxy_sync_proxy_rules_last_timestamp_seconds)
          * on(namespace, pod) group_right() topk by (namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
          > 30
        for: 5m
        labels:
          severity: warning
      - alert: SDNPodNotReady
        annotations:
          summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
        expr: |
          kube_pod_status_ready{namespace='openshift-sdn', condition='true'} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
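
In the output above, each openshift-sdn rule carries a "summary" annotation where the original report showed "message"; no "description" annotation appears in this listing. The change can be compared programmatically. The following is a rough Python sketch, assuming both outputs are available as parsed JSON; the helper names and the one-rule sample data are hypothetical and only illustrate the comparison.

```python
from typing import Dict, Set, Tuple


def annotation_keys(rule_list: dict) -> Dict[str, Set[str]]:
    """Collect the annotation keys of every alert in a PrometheusRule list."""
    keys: Dict[str, Set[str]] = {}
    for item in rule_list.get("items", []):
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" in rule:
                    keys[rule["alert"]] = set(rule.get("annotations") or {})
    return keys


def annotation_changes(before: dict, after: dict) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """For alerts present in both lists, return (removed keys, added keys)."""
    old, new = annotation_keys(before), annotation_keys(after)
    changes: Dict[str, Tuple[Set[str], Set[str]]] = {}
    for alert in sorted(old.keys() & new.keys()):
        removed = old[alert] - new[alert]
        added = new[alert] - old[alert]
        if removed or added:
            changes[alert] = (removed, added)
    return changes


def one_rule_list(alert: str, annotations: dict) -> dict:
    """Hypothetical helper: build a one-rule list shaped like `-o json` output."""
    return {"items": [{"spec": {"groups": [{"rules": [
        {"alert": alert, "annotations": annotations},
    ]}]}}]}


# Sample data modelled on the SDNPodNotReady rule before and after the fix.
BEFORE = one_rule_list("SDNPodNotReady", {"message": "SDN pod is not ready."})
AFTER = one_rule_list("SDNPodNotReady", {"summary": "SDN pod is not ready."})

if __name__ == "__main__":
    for alert, (removed, added) in annotation_changes(BEFORE, AFTER).items():
        print(f"{alert}: removed {sorted(removed)}, added {sorted(added)}")
```

Comparing key sets rather than annotation text keeps the check independent of wording changes in the messages themselves.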

Comment 4 Ying Wang 2021-08-18 11:23:32 UTC
Adding the version used for verification:

$ oc version
Client Version: 4.7.5
Server Version: 4.9.0-0.nightly-2021-08-17-122812
Kubernetes Version: v1.22.0-rc.0+3dfed96

Comment 7 errata-xmlrpc 2021-10-18 17:45:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759