Bug 2060079
| Summary: | Re-think kubeproxy_sync_proxy_rules_duration_seconds_bucket alerts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> |
| Component: | Networking | Assignee: | Martin Kennelly <mkennell> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | urgent | CC: | bbennett, mkennell, rravaiol |
| Version: | 4.8 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:47:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pablo Alonso Rodriguez
2022-03-02 16:38:22 UTC
We don't have the resources to complete this request right now. Closing due to lack of resources to solve this low-priority issue.

To add a little context: given the performance improvements we see from https://bugzilla.redhat.com/show_bug.cgi?id=2058444, we closed this because it was low priority, and addressing the root cause seemed more appropriate. However, given that we are hiding useful alerts, I bumped the priority and reopened the bug.

Added an initial patch upstream to improve the sensitivity of the alert NodeProxyApplySlow.

Martin Kennelly, could you suggest how to verify this bug? I guess we need to create a lot of services.

1. Test manifest file (JSON):
cat list.json
{
"apiVersion": "v1",
"kind": "List",
"items": [
{
"apiVersion": "v1",
"kind": "ReplicationController",
"metadata": {
"labels": {
"name": "test-rc"
},
"name": "test-rc"
},
"spec": {
"replicas": 30,
"template": {
"metadata": {
"labels": {
"name": "test-pods"
}
},
"spec": {
"containers": [
{
"image": "quay.io/openshifttest/hello-sdn@sha256:2af5b5ec480f05fda7e9b278023ba04724a3dd53a296afcd8c13f220dec52197",
"name": "test-pod",
"imagePullPolicy": "IfNotPresent",
"resources":{
"limits":{
"memory":"340Mi"
}
}
}
]
}
}
}
},
{
"apiVersion": "v1",
"kind": "Service",
"metadata": {
"labels": {
"name": "test-service"
},
"name": "test-service"
},
"spec": {
"ports": [
{
"name": "http",
"port": 27017,
"protocol": "TCP",
"targetPort": 8080
}
],
"selector": {
"name": "test-pods"
}
}
}
]
}
After applying the JSON file above:
2. Use the following script to create 2000 services:
i=0
while [ "$i" -lt 2000 ]
do
echo '
{
"apiVersion": "v1",
"kind": "Service",
"metadata": {
"labels": {
"name": "test-service"
},
"name": '\"test-service-$i\"'
},
"spec": {
"ports": [
{
"name": "http",
"port": 27017,
"protocol": "TCP",
"targetPort": 8080
}
],
"selector": {
"name": "test-pods"
}
}
}
' | oc create -f -
i=$(($i+1))
done
3. The alert 'NodeProxyApplySlow' then appears in the alert console, with this expression and these results:
histogram_quantile(0.95, sum by(le, namespace, pod) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) * on(namespace, pod) group_right() topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"}) > 15
kube-rbac-proxy-main DaemonSet sdn https-main 10.0.132.209 true kube-state-metrics openshift-sdn ip-10-0-132-209.us-east-2.compute.internal sdn-rwz74 10.0.132.209 system-node-critical openshift-monitoring/k8s kube-state-metrics 461ee347-6345-4fde-be38-e4341e6d3842 15.7696
kube-rbac-proxy-main DaemonSet sdn https-main 10.0.132.224 true kube-state-metrics openshift-sdn ip-10-0-132-224.us-east-2.compute.internal sdn-r6nhv 10.0.132.224 system-node-critical openshift-monitoring/k8s kube-state-metrics 56b95e52-c7d6-4ccc-bcb2-3cff67593ec6 15.769599999999999
4. Then scale the test pods down to 1 replica to clear this alert:
oc scale rc test-rc --replicas=1
5. After 5 minutes, the alert 'NodeProxyApplySlow' was cleared.
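
As a rough sketch, the per-service loop in step 2 could instead build a single List manifest, mirroring the structure of list.json, so that only one `oc` invocation is needed. The function names `gen_service` and `gen_service_list` are hypothetical and were not part of the original verification. The sketch only prints JSON and does not contact a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report):
# print one Service manifest named test-service-<index>.
gen_service() {
  cat <<EOF
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "labels": { "name": "test-service" },
    "name": "test-service-$1"
  },
  "spec": {
    "ports": [
      { "name": "http", "port": 27017, "protocol": "TCP", "targetPort": 8080 }
    ],
    "selector": { "name": "test-pods" }
  }
}
EOF
}

# Hypothetical helper: wrap <count> Services (indices 0..count-1) in one
# "kind: List" object. The body runs in a subshell so the loop variables
# do not leak into the caller.
gen_service_list() (
  count="$1"
  i=0
  sep=""
  printf '{ "apiVersion": "v1", "kind": "List", "items": [\n'
  while [ "$i" -lt "$count" ]; do
    # Emit a comma before every item except the first.
    printf '%s' "$sep"
    gen_service "$i"
    sep=","
    i=$((i + 1))
  done
  printf '] }\n'
)
```

With these helpers, `gen_service_list 2000 | oc create -f -` would submit all 2000 Services through a single `oc` invocation instead of 2000 separate ones. Whether this changes how quickly the alert fires has not been checked here.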
Adding the tested version: 4.12.0-0.nightly-2022-07-11-015414

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7399