Bug 2060079

Summary:	Re-think kubeproxy_sync_proxy_rules_duration_seconds_bucket alerts
Product:	OpenShift Container Platform	Reporter:	Pablo Alonso Rodriguez <palonsor>
Component:	Networking	Assignee:	Martin Kennelly <mkennell>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	urgent	CC:	bbennett, mkennell, rravaiol
Version:	4.8	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-01-17 19:47:48 UTC	Type:	Bug

Description Pablo Alonso Rodriguez 2022-03-02 16:38:22 UTC

Description of problem:

It may be good to re-think the alerts related to the kubeproxy_sync_proxy_rules_duration_seconds_bucket metric.

For example, NodeProxyApplySlow works on the percentile of the whole histogram, so it fires only after enough "bad values" have accumulated to push the percentile above 10s, and it clears only after enough "good values" have accumulated to pull the percentile back under 10s. That makes its reaction times unreasonably slow.
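The effect described above can be illustrated with a small sketch. It is not the alert's actual implementation: the bucket bounds, the sample durations, and the helper names (`cumulative_buckets`, `histogram_quantile`) are assumptions made for illustration, and the "recent window" is taken by sample count rather than by time as a Prometheus range selector would do. It compares a 95th percentile estimated over the whole accumulated histogram with one estimated over only the most recent observations.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; the real metric's buckets may differ.
BOUNDS = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]


def cumulative_buckets(samples):
    """Return cumulative counts per upper bound, similar to Prometheus `le` buckets."""
    counts = [0] * len(BOUNDS)
    for s in samples:
        # First bucket whose upper bound is >= the sample; larger values go in the last bucket.
        idx = min(bisect_left(BOUNDS, s), len(BOUNDS) - 1)
        counts[idx] += 1
    cumulative, total = [], 0
    for c in counts:
        total += c
        cumulative.append(total)
    return cumulative


def histogram_quantile(q, cumulative):
    """Estimate quantile q (0 < q <= 1) by linear interpolation inside the target bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, cum in enumerate(cumulative):
        if cum >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            return lower + (BOUNDS[i] - lower) * (rank - prev) / (cum - prev)
    return BOUNDS[-1]


# A long healthy period (assumed 0.5s syncs) followed by 30 slow syncs (assumed 12s).
history = [0.5] * 1000 + [12.0] * 30

whole = histogram_quantile(0.95, cumulative_buckets(history))
recent = histogram_quantile(0.95, cumulative_buckets(history[-30:]))
print(f"whole-history p95: {whole:.2f}s, recent-window p95: {recent:.2f}s")
```

Under these assumptions the whole-history estimate stays below the 10s threshold while the recent-window estimate exceeds it, which is the slow reaction this report describes.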

Version-Release number of selected component (if applicable):

4.8.20

How reproducible:

Always

Steps to Reproduce:
1. Alert firing

Actual results:

Slow reaction time; the alert doesn't cool down even when it should.

Expected results:

The alert should reflect something closer to the instantaneous values of the kube-proxy sync time.

Additional info:

Comment 1 Martin Kennelly 2022-05-03 10:16:29 UTC

We don't have the resources to complete this request right now.

Comment 2 Martin Kennelly 2022-05-25 11:24:58 UTC

Closing due to lack of resources to solve this low priority issue.

Comment 3 Ben Bennett 2022-06-16 14:20:16 UTC

To add a little context: Given the performance improvements we see from https://bugzilla.redhat.com/show_bug.cgi?id=2058444, we closed this because it was low priority, and addressing the root cause seemed more appropriate.  However, given that we are hiding useful alerts, I bumped the priority and reopened the bug.

Comment 4 Martin Kennelly 2022-06-17 13:32:48 UTC

Added initial patch upstream to improve sensitivity of alert NodeProxyApplySlow.

Comment 7 zhaozhanqi 2022-07-07 10:19:22 UTC

Martin Kennelly, could you give some suggestions on how to verify this bug? I guess we need to create a lot of services.

Comment 9 zhaozhanqi 2022-07-11 10:19:33 UTC

Test JSON file:

cat list.json

{
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {
                "labels": {
                    "name": "test-rc"
                },
                "name": "test-rc"
            },
            "spec": {
                "replicas": 30,
                "template": {
                    "metadata": {
                        "labels": {
                            "name": "test-pods"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "image": "quay.io/openshifttest/hello-sdn@sha256:2af5b5ec480f05fda7e9b278023ba04724a3dd53a296afcd8c13f220dec52197",
                                "name": "test-pod",
                                "imagePullPolicy": "IfNotPresent",
                                "resources":{
                                  "limits":{
                                    "memory":"340Mi"
                                  }
                                }
                            }
                        ]
                    }
                }
            }
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": "test-service"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }
    ]
}


1. Apply the JSON file above.

2. Create 2000 services with the following script:

i=0

while [ "$i" -lt 2000 ]

do 

echo '

        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": '\"test-service-$i\"'
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }
' | oc create -f -

i=$(($i+1))
done



3. The alert console then shows the 'NodeProxyApplySlow' alert firing:

histogram_quantile(0.95, sum by(le, namespace, pod) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) * on(namespace, pod) group_right() topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"}) > 15

kube-rbac-proxy-main	DaemonSet	sdn	https-main	10.0.132.209	true	kube-state-metrics	openshift-sdn	ip-10-0-132-209.us-east-2.compute.internal	sdn-rwz74	10.0.132.209	system-node-critical	openshift-monitoring/k8s	kube-state-metrics	461ee347-6345-4fde-be38-e4341e6d3842	15.7696
kube-rbac-proxy-main	DaemonSet	sdn	https-main	10.0.132.224	true	kube-state-metrics	openshift-sdn	ip-10-0-132-224.us-east-2.compute.internal	sdn-r6nhv	10.0.132.224	system-node-critical	openshift-monitoring/k8s	kube-state-metrics	56b95e52-c7d6-4ccc-bcb2-3cff67593ec6	15.769599999999999


4. Then scale the test pods down to 1 replica to clear the alert:

oc scale rc test-rc --replicas=1


5. After 5 minutes, the 'NodeProxyApplySlow' alert was cleared.
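As a possible alternative to the shell loop in step 2, the Service manifests could be generated as a single `List` object and created in one call. This is only a sketch and was not part of the verification run; the function names and the script file name in the usage note are hypothetical, while the manifest fields are copied from the template used above.

```python
import json


def make_service(index):
    """Build one Service manifest matching the template used in step 2."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "labels": {"name": "test-service"},
            "name": f"test-service-{index}",
        },
        "spec": {
            "ports": [
                {"name": "http", "port": 27017, "protocol": "TCP", "targetPort": 8080}
            ],
            "selector": {"name": "test-pods"},
        },
    }


def make_service_list(count):
    """Wrap `count` Service manifests (indexes 0..count-1) in a v1 List."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [make_service(i) for i in range(count)],
    }


if __name__ == "__main__":
    # Emit the whole List as JSON on stdout.
    print(json.dumps(make_service_list(2000)))
```

Assuming the script is saved as gen_services.py (a hypothetical name), its output could be piped to `oc create -f -` in the same way as the per-service JSON in step 2.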

Comment 10 zhaozhanqi 2022-07-11 10:20:16 UTC

Adding the tested version: 4.12.0-0.nightly-2022-07-11-015414

Comment 14 errata-xmlrpc 2023-01-17 19:47:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399