Description of problem:
It may be good to rethink the alerts based on the kubeproxy_sync_proxy_rules_duration_seconds_bucket metric. For example, NodeProxyApplySlow works on a percentile of the whole histogram, so it needs enough "bad values" to accumulate before the percentile grows above 10s and the alert fires, and it then needs to wait until enough "good values" accumulate before the percentile drops back under 10s. That gives it unreasonably slow reaction times.

Version-Release number of selected component (if applicable):
4.8.20

How reproducible:
Always

Steps to Reproduce:
1. Alert firing

Actual results:
Slow reaction time; the alert doesn't cool down even when it should.

Expected results:
The alert should reflect a reality closer to the instant values of the kube-proxy sync time.

Additional info:
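To illustrate the lag described above (a hypothetical sketch, not data from a cluster): a 95th percentile computed over a whole sample window stays high as long as more than 5% of the samples in the window are slow, even after sync times have fully recovered.

```shell
# Hypothetical illustration: 20 old "slow" syncs (15s) still sit in the
# window next to 80 recent "fast" syncs (1s). The p95 over the whole
# window is still 15s, so an alert thresholded at 10s keeps firing even
# though every recent sync is fast.
samples() {
  i=0; while [ $i -lt 20 ]; do echo 15; i=$((i+1)); done
  i=0; while [ $i -lt 80 ]; do echo 1; i=$((i+1)); done
}

# Naive p95: sort ascending and take the value at the 95% index.
p95=$(samples | sort -n | awk '{ v[NR] = $1 } END { print v[int(NR * 0.95)] }')
echo "p95 over the window: ${p95}s"
```

Only once the slow samples fall below 5% of the window does the p95 drop, which is why the alert takes so long to cool down.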
We don't have the resources to complete this request right now.
Closing due to lack of resources to solve this low-priority issue.
To add a little context: given the performance improvements we see from https://bugzilla.redhat.com/show_bug.cgi?id=2058444, we closed this as low priority, since addressing the root cause seemed more appropriate. However, given that the current behaviour hides useful alerts, I bumped the priority and reopened the bug.
Added an initial patch upstream to improve the sensitivity of the NodeProxyApplySlow alert.
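For reference, one way to make such an alert track near-instant behaviour (a sketch only; I have not confirmed this matches the upstream patch) is to alert on the recent average sync duration, using the histogram's _sum and _count series instead of a percentile over _bucket:

```
rate(kubeproxy_sync_proxy_rules_duration_seconds_sum[5m])
  /
rate(kubeproxy_sync_proxy_rules_duration_seconds_count[5m]) > 10
```

An expression like this reacts within roughly one rate window, since old samples drop out of the numerator and denominator together.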
Martin Kennelly, could you give some suggestions on how to verify this bug? I guess we need to create a lot of services.
Test file:

cat list.json
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "v1",
      "kind": "ReplicationController",
      "metadata": {
        "labels": { "name": "test-rc" },
        "name": "test-rc"
      },
      "spec": {
        "replicas": 30,
        "template": {
          "metadata": {
            "labels": { "name": "test-pods" }
          },
          "spec": {
            "containers": [
              {
                "image": "quay.io/openshifttest/hello-sdn@sha256:2af5b5ec480f05fda7e9b278023ba04724a3dd53a296afcd8c13f220dec52197",
                "name": "test-pod",
                "imagePullPolicy": "IfNotPresent",
                "resources": {
                  "limits": { "memory": "340Mi" }
                }
              }
            ]
          }
        }
      }
    },
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "labels": { "name": "test-service" },
        "name": "test-service"
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "port": 27017,
            "protocol": "TCP",
            "targetPort": 8080
          }
        ],
        "selector": { "name": "test-pods" }
      }
    }
  ]
}

After applying the above JSON file:

2. Create 2000 services with the following script:

i=0
while [ $i -le 2000 ]
do
  echo '
  {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
      "labels": { "name": "test-service" },
      "name": '\"test-service-$i\"'
    },
    "spec": {
      "ports": [
        {
          "name": "http",
          "port": 27017,
          "protocol": "TCP",
          "targetPort": 8080
        }
      ],
      "selector": { "name": "test-pods" }
    }
  }
  ' | oc create -f -
  i=$(($i+1))
done

3.
Then from the alert console we can see the 'NodeProxyApplySlow' alert firing on the expression:

histogram_quantile(0.95, sum by(le, namespace, pod) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) * on(namespace, pod) group_right() topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"}) > 15

Firing instances (pod / node / value):
- sdn-rwz74 / ip-10-0-132-209.us-east-2.compute.internal (10.0.132.209) / 15.7696
- sdn-r6nhv / ip-10-0-132-224.us-east-2.compute.internal (10.0.132.224) / 15.769599999999999

4. Then scale the test rc down to 1 replica to clear the alert: oc scale rc test-rc --replicas=1
5. After 5 minutes, the 'NodeProxyApplySlow' alert was removed.
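To tidy up after the test (a sketch assuming the test-service-$i naming from the creation script above), the generated services can be removed with a mirror loop. This version only prints the oc commands so the list can be reviewed first; pipe gen_cleanup into sh to actually run them.

```shell
# Dry-run cleanup for the 2000 generated services: prints one delete
# command per service rather than executing it. The base test-service
# and test-rc objects from list.json must be deleted separately.
gen_cleanup() {
  i=0
  while [ $i -le 2000 ]; do
    echo "oc delete service test-service-$i"
    i=$((i+1))
  done
}

gen_cleanup | head -n 2   # preview the first two commands
```

Running `gen_cleanup | sh` would then execute the deletes against the cluster.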
Adding the tested version: 4.12.0-0.nightly-2022-07-11-015414
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399