Bug 1992507
| Summary: | all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | hongyan li <hongyli> |
| Component: | Networking | Assignee: | Christoph Stäbler <cstabler> |
| Networking sub component: | openshift-sdn | QA Contact: | Ying Wang <yingwang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | astoycos |
| Version: | 4.9 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:45:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The following rules also have this issue:
$ oc get prometheusrules -n openshift-ingress-operator -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
creationTimestamp: "2021-08-10T23:12:03Z"
generation: 1
labels:
role: alert-rules
name: ingress-operator
namespace: openshift-ingress-operator
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 9fc7b5b6-6c23-4335-be07-ecfe1b9a142f
resourceVersion: "1790"
uid: 0efb31c0-440b-408b-aee6-aba2fa472459
spec:
groups:
- name: openshift-ingress.rules
rules:
- alert: HAProxyReloadFail
annotations:
message: HAProxy reloads are failing on {{ $labels.pod }}. Router is not
respecting recently created or modified routes
expr: template_router_reload_failure == 1
for: 5m
labels:
severity: warning
- alert: HAProxyDown
annotations:
message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace
}} / {{ $labels.pod }}
expr: haproxy_up == 0
for: 5m
labels:
severity: critical
- alert: IngressControllerDegraded
annotations:
message: |
The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is
degraded: {{ $labels.reason }}.
expr: ingress_controller_conditions{condition="Degraded"} == 1
for: 5m
labels:
severity: warning
- alert: IngressControllerUnavailable
annotations:
message: |
The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is
unavailable: {{ $labels.reason }}.
expr: ingress_controller_conditions{condition="Available"} == 0
for: 5m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
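Per the OpenShift alerting guidelines, each of these rules needs the bare message annotation replaced with a short summary and a longer description. As an illustration only (the wording below is a sketch, not the shipped fix), a compliant HAProxyDown rule could look like:

- alert: HAProxyDown
  annotations:
    # summary: a short, one-line headline of the problem (illustrative wording)
    summary: HAProxy is down on pod {{ $labels.namespace }}/{{ $labels.pod }}.
    # description: fuller detail for the responder (illustrative wording)
    description: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }}/{{ $labels.pod }}.
  expr: haproxy_up == 0
  for: 5m
  labels:
    severity: critical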
Checked on the version below; the prometheusrules for openshift-sdn and openshift-ovn now include the summary annotation:
lilia@liliadeMacBook-Pro mytest % oc get prometheusrules -n openshift-sdn -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
annotations:
networkoperator.openshift.io/ignore-errors: ""
creationTimestamp: "2021-08-18T10:18:25Z"
generation: 1
labels:
prometheus: k8s
role: alert-rules
managedFields:
- apiVersion: monitoring.coreos.com/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:networkoperator.openshift.io/ignore-errors: {}
f:labels:
.: {}
f:prometheus: {}
f:role: {}
f:ownerReferences:
.: {}
k:{"uid":"923b3b1a-ad7a-48b7-84f2-cf96afa79aaa"}: {}
f:spec:
.: {}
f:groups: {}
manager: cluster-network-operator
operation: Update
time: "2021-08-18T10:18:25Z"
name: networking-rules
namespace: openshift-sdn
ownerReferences:
- apiVersion: operator.openshift.io/v1
blockOwnerDeletion: true
controller: true
kind: Network
name: cluster
uid: 923b3b1a-ad7a-48b7-84f2-cf96afa79aaa
resourceVersion: "2771"
uid: 0d3e5958-76ee-4b19-9f1c-bb3e4e3fb81c
spec:
groups:
- name: cluster-network-operator-sdn.rules
rules:
- alert: NodeWithoutSDNPod
annotations:
summary: All nodes should be running an sdn pod, {{ $labels.node }} is not.
expr: |
(kube_node_info unless on(node) topk by (node) (1, kube_pod_info{namespace="openshift-sdn", pod=~"sdn.*"})) > 0
for: 10m
labels:
severity: warning
- alert: NodeProxyApplySlow
annotations:
summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is taking too long, on average, to apply kubernetes service rules to iptables.
expr: "histogram_quantile(.95, kubeproxy_sync_proxy_rules_duration_seconds_bucket) \n* on(namespace, pod) group_right topk by (namespace, pod) (1, kube_pod_info{namespace=\"openshift-sdn\", pod=~\"sdn-[^-]*\"}) > 15\n"
labels:
severity: warning
- alert: ClusterProxyApplySlow
annotations:
summary: The cluster is taking too long, on average, to apply kubernetes service rules to iptables.
expr: |
histogram_quantile(0.95, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le)) > 10
labels:
severity: warning
- alert: NodeProxyApplyStale
annotations:
summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes service rules in iptables.
expr: |
(kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds - kubeproxy_sync_proxy_rules_last_timestamp_seconds)
* on(namespace, pod) group_right() topk by (namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
> 30
for: 5m
labels:
severity: warning
- alert: SDNPodNotReady
annotations:
summary: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
expr: |
kube_pod_status_ready{namespace='openshift-sdn', condition='true'} == 0
for: 10m
labels:
severity: warning
kind: List
metadata:
resourceVersion: ""
selfLink: ""
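One way to spot-check a namespace for alerting rules that still lack both annotations (this jq filter is illustrative and was not part of the original verification):

$ oc get prometheusrules -n openshift-sdn -o json \
    | jq -r '.items[].spec.groups[].rules[]
             | select(.alert != null and .annotations.summary == null and .annotations.description == null)
             | .alert'

An empty result means every alerting rule in the namespace carries at least one of the two annotations.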
Add version for verification:

lilia@liliadeMacBook-Pro mytest % oc version
Client Version: 4.7.5
Server Version: 4.9.0-0.nightly-2021-08-17-122812
Kubernetes Version: v1.22.0-rc.0+3dfed96

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
Description of problem:
All the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-07-175228

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
$ oc get prometheusrules -n openshift-sdn -oyaml
apiVersion: v1
items:
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    annotations:
      networkoperator.openshift.io/ignore-errors: ""
    creationTimestamp: "2021-08-10T23:12:52Z"
    generation: 1
    labels:
      prometheus: k8s
      role: alert-rules
    name: networking-rules
    namespace: openshift-sdn
    ownerReferences:
    - apiVersion: operator.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Network
      name: cluster
      uid: f3f79f33-0ad6-4115-bb39-3dbd18324808
    resourceVersion: "2834"
    uid: 92537d98-cb35-4f74-9291-e1b6f3952277
  spec:
    groups:
    - name: cluster-network-operator-sdn.rules
      rules:
      - alert: NodeWithoutSDNPod
        annotations:
          message: |
            All nodes should be running an sdn pod, {{ $labels.node }} is not.
        expr: |
          (kube_node_info unless on(node) topk by (node) (1, kube_pod_info{namespace="openshift-sdn", pod=~"sdn.*"})) > 0
        for: 10m
        labels:
          severity: warning
      - alert: NodeProxyApplySlow
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is taking too long, on average, to apply kubernetes service rules to iptables.
        expr: "histogram_quantile(.95, kubeproxy_sync_proxy_rules_duration_seconds_bucket) \n* on(namespace, pod) group_right topk by (namespace, pod) (1, kube_pod_info{namespace=\"openshift-sdn\", pod=~\"sdn-[^-]*\"}) > 15\n"
        labels:
          severity: warning
      - alert: ClusterProxyApplySlow
        annotations:
          message: The cluster is taking too long, on average, to apply kubernetes service rules to iptables.
        expr: |
          histogram_quantile(0.95, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])) by (le)) > 10
        labels:
          severity: warning
      - alert: NodeProxyApplyStale
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes service rules in iptables.
        expr: |
          (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds - kubeproxy_sync_proxy_rules_last_timestamp_seconds)
          * on(namespace, pod) group_right() topk by (namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
          > 30
        for: 5m
        labels:
          severity: warning
      - alert: SDNPodNotReady
        annotations:
          message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
        expr: |
          kube_pod_status_ready{namespace='openshift-sdn', condition='true'} == 0
        for: 10m
        labels:
          severity: warning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Expected results:
Alert rules have the annotations "summary" and "description".

Additional info:
The "summary" and "description" annotations should comply with the OpenShift alerting guidelines [1].

[1] https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required
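For reference, the change verified above amounts to renaming the non-compliant annotation key; for NodeWithoutSDNPod, the before and after (taken from the two oc outputs in this report) is:

# 4.9.0-0.nightly-2021-08-07-175228 (non-compliant)
annotations:
  message: |
    All nodes should be running an sdn pod, {{ $labels.node }} is not.

# 4.9.0-0.nightly-2021-08-17-122812 (verified)
annotations:
  summary: All nodes should be running an sdn pod, {{ $labels.node }} is not.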