Bug 2060079 - Re-think kubeproxy_sync_proxy_rules_duration_seconds_bucket alerts
Summary: Re-think kubeproxy_sync_proxy_rules_duration_seconds_bucket alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Martin Kennelly
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-02 16:38 UTC by Pablo Alonso Rodriguez
Modified: 2023-01-17 19:48 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:48 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1491 0 None open WIP: Bug 2060079: Enhance sensitivity of SDN alert NodeProxyApplySlow 2022-06-17 13:32:48 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:48:02 UTC

Description Pablo Alonso Rodriguez 2022-03-02 16:38:22 UTC
Description of problem:

It may be worth re-thinking the alerts based on the kubeproxy_sync_proxy_rules_duration_seconds_bucket metric.

For example, NodeProxyApplySlow computes a percentile over the whole histogram. Enough "bad values" must accumulate for the percentile to climb above 10s and fire the alert, and then enough "good values" must accumulate for it to drop back below 10s and resolve. This gives the alert unreasonably slow reaction times in both directions.
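The inertia can be illustrated by computing a Prometheus-style quantile by hand. The helper below is a simplified re-implementation of histogram_quantile() (linear interpolation inside the winning bucket, which is what Prometheus does); the bucket boundaries and counts are made up for illustration, not taken from a real cluster:

```python
def histogram_quantile(q, buckets):
    """Simplified Prometheus-style quantile.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    i.e. the shape of a *_bucket metric at one instant.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_upper, prev_cum = 0.0, 0
    for upper, cum in buckets:
        if cum >= rank:
            width = cum - prev_cum
            if width == 0:
                return upper
            # Linear interpolation within the bucket that contains the rank.
            return prev_upper + (upper - prev_upper) * (rank - prev_cum) / width
        prev_upper, prev_cum = upper, cum
    return buckets[-1][0]

# Illustrative boundaries (seconds): 100 historical syncs under 1s,
# followed by 5 consecutive slow syncs between 10s and 30s.
history = [(1.0, 100), (10.0, 100), (30.0, 105)]
p95 = histogram_quantile(0.95, history)          # ~1s: the alert stays silent

# The same 5 slow syncs considered on their own (old samples dropped):
recent = [(1.0, 0), (10.0, 0), (30.0, 5)]
p95_recent = histogram_quantile(0.95, recent)    # 29s: clearly slow
```

Even though every recent sync is slow, the whole-history p95 sits near 1s because the 100 old fast samples dominate the rank calculation, which is exactly the sluggishness described above.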

Version-Release number of selected component (if applicable):

4.8.20

How reproducible:

Always

Steps to Reproduce:
1. Alert firing

Actual results:

Slow reaction time; the alert does not cool down even when it should.

Expected results:

The alert should reflect something closer to the instantaneous kube-proxy sync times.

Additional info:

Comment 1 Martin Kennelly 2022-05-03 10:16:29 UTC
We don't have the resources to complete this request right now.

Comment 2 Martin Kennelly 2022-05-25 11:24:58 UTC
Closing due to lack of resources to solve this low priority issue.

Comment 3 Ben Bennett 2022-06-16 14:20:16 UTC
To add a little context: Given the performance improvements we see from https://bugzilla.redhat.com/show_bug.cgi?id=2058444, we closed this because it was low priority, and addressing the root cause seemed more appropriate.  However, given that we are hiding useful alerts, I bumped the priority and reopened the bug.

Comment 4 Martin Kennelly 2022-06-17 13:32:48 UTC
Added initial patch upstream to improve sensitivity of alert NodeProxyApplySlow.

Comment 7 zhaozhanqi 2022-07-07 10:19:22 UTC
Martin Kennelly, could you suggest how to verify this bug? I guess we need to create a lot of services.

Comment 9 zhaozhanqi 2022-07-11 10:19:33 UTC
test yaml file:

cat list.json

{
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {
                "labels": {
                    "name": "test-rc"
                },
                "name": "test-rc"
            },
            "spec": {
                "replicas": 30,
                "template": {
                    "metadata": {
                        "labels": {
                            "name": "test-pods"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "image": "quay.io/openshifttest/hello-sdn@sha256:2af5b5ec480f05fda7e9b278023ba04724a3dd53a296afcd8c13f220dec52197",
                                "name": "test-pod",
                                "imagePullPolicy": "IfNotPresent",
                                "resources":{
                                  "limits":{
                                    "memory":"340Mi"
                                  }
                                }
                            }
                        ]
                    }
                }
            }
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": "test-service"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }
    ]
}


1. Apply the JSON file above.

2. Use the following script to create 2000 more services:

i=0
while [ $i -le 2000 ]
do
echo '

        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": '\"test-service-$i\"'
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }
' | oc create -f -
i=$(($i+1))
done
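As an aside, a loop of one `oc create` per service makes 2001 API round trips. A sketch of an alternative (file name is illustrative) is to generate all the Services into a single v1 List manifest and create them in one call with `oc create -f services.json`:

```python
import json

def make_service(i):
    # Same Service shape as in the script above, with a unique name per index.
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "labels": {"name": "test-service"},
            "name": f"test-service-{i}",
        },
        "spec": {
            "ports": [{"name": "http", "port": 27017,
                       "protocol": "TCP", "targetPort": 8080}],
            "selector": {"name": "test-pods"},
        },
    }

# range(2001) matches the script's `while [ $i -le 2000 ]`: indices 0..2000.
manifest = {"apiVersion": "v1", "kind": "List",
            "items": [make_service(i) for i in range(2001)]}

with open("services.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Then `oc create -f services.json` submits the whole batch at once.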



3. Then from the alert console we can see the alert 'NodeProxyApplySlow' firing, based on the expression:

histogram_quantile(0.95, sum by(le, namespace, pod) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) * on(namespace, pod) group_right() topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"}) > 15

kube-rbac-proxy-main	DaemonSet	sdn	https-main	10.0.132.209	true	kube-state-metrics	openshift-sdn	ip-10-0-132-209.us-east-2.compute.internal	sdn-rwz74	10.0.132.209	system-node-critical	openshift-monitoring/k8s	kube-state-metrics	461ee347-6345-4fde-be38-e4341e6d3842	15.7696
kube-rbac-proxy-main	DaemonSet	sdn	https-main	10.0.132.224	true	kube-state-metrics	openshift-sdn	ip-10-0-132-224.us-east-2.compute.internal	sdn-r6nhv	10.0.132.224	system-node-critical	openshift-monitoring/k8s	kube-state-metrics	56b95e52-c7d6-4ccc-bcb2-3cff67593ec6	15.769599999999999


4. Then scale the test rc down to 1 replica to clear the alert:

oc scale rc test-rc --replicas=1


5. After about 5 minutes, the 'NodeProxyApplySlow' alert cleared.
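The roughly five-minute clear time follows from the rate(...[5m]) term in the alert expression in comment 9: rate() only sees counter increments inside the window, so slow samples age out instead of weighing on the quantile forever. A sketch with made-up counter values:

```python
def histogram_quantile(q, buckets):
    # Simplified Prometheus-style quantile over sorted
    # (upper_bound, count) pairs, with linear interpolation
    # inside the winning bucket.
    total = buckets[-1][1]
    rank = q * total
    prev_upper, prev_cum = 0.0, 0
    for upper, cum in buckets:
        if cum >= rank:
            width = cum - prev_cum
            if width == 0:
                return upper
            return prev_upper + (upper - prev_upper) * (rank - prev_cum) / width
        prev_upper, prev_cum = upper, cum
    return buckets[-1][0]

# Cumulative bucket counters (le -> count) sampled 5 minutes apart.
# The counters' history includes 50 slow syncs, but the last window
# saw only 60 fast ones.
earlier = {1.0: 1000, 10.0: 1000, 30.0: 1050}
now     = {1.0: 1060, 10.0: 1060, 30.0: 1110}

# What rate(...[5m]) effectively feeds the quantile: per-window increments.
window = sorted((le, now[le] - earlier[le]) for le in now)
p95 = histogram_quantile(0.95, window)   # well under 1s, so the alert clears
```

Once the window contains only fast syncs, the windowed p95 drops immediately, regardless of how many slow syncs the cumulative counters still remember.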

Comment 10 zhaozhanqi 2022-07-11 10:20:16 UTC
Tested on version 4.12.0-0.nightly-2022-07-11-015414.

Comment 14 errata-xmlrpc 2023-01-17 19:47:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

