Description of problem:
It may be good to rethink the alerts based on the kubeproxy_sync_proxy_rules_duration_seconds_bucket metric. For example, NodeProxyApplySlow works on a percentile of the whole histogram, so it needs enough "bad values" to accumulate before the percentile grows above 10s and the alert fires, and it then needs to wait until enough "good values" accumulate before the percentile drops back under 10s. That gives it unreasonably slow reaction times.

Version-Release number of selected component (if applicable):
4.8.20

How reproducible:
Always

Steps to Reproduce:
1. Alert firing

Actual results:
Slow reaction time; the alert doesn't cool down even when it should.

Expected results:
The alert should reflect a reality closer to the instant values of the kube-proxy sync time.

Additional info:
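To illustrate the lag described above (a hypothetical sketch, not data from a cluster): a 95th percentile computed over a whole sample window stays high as long as more than 5% of the samples in the window are slow, even after sync times have fully recovered.

```shell
# Hypothetical illustration: 20 old "slow" syncs (15s) still sit in the
# window next to 80 recent "fast" syncs (1s). The p95 over the whole
# window is still 15s, so an alert thresholded at 10s keeps firing even
# though every recent sync is fast.
samples() {
  i=0; while [ $i -lt 20 ]; do echo 15; i=$((i+1)); done
  i=0; while [ $i -lt 80 ]; do echo 1; i=$((i+1)); done
}

# Naive p95: sort ascending and take the value at the 95% index.
p95=$(samples | sort -n | awk '{ v[NR] = $1 } END { print v[int(NR * 0.95)] }')
echo "p95 over the window: ${p95}s"
```

Only once the slow samples fall below 5% of the window does the p95 drop, which is why the alert takes so long to cool down.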
We don't have the resources to complete this request right now.
Closing due to lack of resources to solve this low-priority issue.
To add a little context: given the performance improvements we see from https://bugzilla.redhat.com/show_bug.cgi?id=2058444, we closed this as low priority, since addressing the root cause seemed more appropriate. However, given that the current behaviour hides useful alerts, I bumped the priority and reopened the bug.
Added an initial patch upstream to improve the sensitivity of the NodeProxyApplySlow alert.
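For reference, one way to make such an alert track near-instant behaviour (a sketch only; I have not confirmed this matches the upstream patch) is to alert on the recent average sync duration, using the histogram's _sum and _count series instead of a percentile over _bucket:

```
rate(kubeproxy_sync_proxy_rules_duration_seconds_sum[5m])
  /
rate(kubeproxy_sync_proxy_rules_duration_seconds_count[5m]) > 10
```

An expression like this reacts within roughly one rate window, since old samples drop out of the numerator and denominator together.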
Martin Kennelly, could you give some suggestions on how to verify this bug? I guess we need to create a lot of services.
Test file:

cat list.json
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "v1",
      "kind": "ReplicationController",
      "metadata": {
        "labels": { "name": "test-rc" },
        "name": "test-rc"
      },
      "spec": {
        "replicas": 30,
        "template": {
          "metadata": {
            "labels": { "name": "test-pods" }
          },
          "spec": {
            "containers": [
              {
                "image": "quay.io/openshifttest/hello-sdn@sha256:2af5b5ec480f05fda7e9b278023ba04724a3dd53a296afcd8c13f220dec52197",
                "name": "test-pod",
                "imagePullPolicy": "IfNotPresent",
                "resources": {
                  "limits": { "memory": "340Mi" }
                }
              }
            ]
          }
        }
      }
    },
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "labels": { "name": "test-service" },
        "name": "test-service"
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "port": 27017,
            "protocol": "TCP",
            "targetPort": 8080
          }
        ],
        "selector": { "name": "test-pods" }
      }
    }
  ]
}

After applying the above JSON file:

2. Create 2000 services with the following script:

i=0
while [ $i -le 2000 ]
do
  echo '
  {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
      "labels": { "name": "test-service" },
      "name": '\"test-service-$i\"'
    },
    "spec": {
      "ports": [
        {
          "name": "http",
          "port": 27017,
          "protocol": "TCP",
          "targetPort": 8080
        }
      ],
      "selector": { "name": "test-pods" }
    }
  }
  ' | oc create -f -
  i=$(($i+1))
done

3.
Then from the alert console we can see the 'NodeProxyApplySlow' alert firing on the expression:

histogram_quantile(0.95, sum by(le, namespace, pod) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) * on(namespace, pod) group_right() topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"}) > 15

Firing instances (pod / node / value):
- sdn-rwz74 / ip-10-0-132-209.us-east-2.compute.internal (10.0.132.209) / 15.7696
- sdn-r6nhv / ip-10-0-132-224.us-east-2.compute.internal (10.0.132.224) / 15.769599999999999

4. Then scale the test rc down to 1 replica to clear the alert: oc scale rc test-rc --replicas=1
5. After 5 minutes, the 'NodeProxyApplySlow' alert was removed.
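To tidy up after the test (a sketch assuming the test-service-$i naming from the creation script above), the generated services can be removed with a mirror loop. This version only prints the oc commands so the list can be reviewed first; pipe gen_cleanup into sh to actually run them.

```shell
# Dry-run cleanup for the 2000 generated services: prints one delete
# command per service rather than executing it. The base test-service
# and test-rc objects from list.json must be deleted separately.
gen_cleanup() {
  i=0
  while [ $i -le 2000 ]; do
    echo "oc delete service test-service-$i"
    i=$((i+1))
  done
}

gen_cleanup | head -n 2   # preview the first two commands
```

Running `gen_cleanup | sh` would then execute the deletes against the cluster.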
Adding the tested version: 4.12.0-0.nightly-2022-07-11-015414
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399