Bug 1826339

Summary:	kube-proxy stale alerts incorrectly firing.
Product:	OpenShift Container Platform	Reporter:	Casey Callendrello <cdc>
Component:	Networking	Assignee:	Casey Callendrello <cdc>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aconstan, bbennett, ccoleman, juzhao, kgarriso, yhe
Version:	4.5
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	SDN-CI-IMPACT
Last Closed:	2020-07-13 17:29:42 UTC	Type:	Bug

Description Casey Callendrello 2020-04-21 13:17:24 UTC

Now that the changes to reduce unnecessary iptables syncs have landed, the kube-proxy stale alerts are firing when nothing is wrong. This is because we no longer sync every 30 seconds when no sync is needed.

Fix those alerts, and figure out if we can write good alerts.
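
To illustrate the failure mode, here is a minimal sketch in Python. It assumes the old alert compared the current time against the last sync timestamp; that rule is not shown in this report, so this is an assumption. The function names and sample timestamps are hypothetical. The second check is modeled on the kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds and kubeproxy_sync_proxy_rules_last_timestamp_seconds metrics used in the NodeProxyApplyStale rule quoted in comment 9.

```python
# Hypothetical helpers; only the 30 second threshold comes from this report.
STALE_THRESHOLD_SECONDS = 30


def stale_by_age(now, last_sync):
    """Assumed old-style check: fire when the last sync is older than the threshold."""
    return now - last_sync > STALE_THRESHOLD_SECONDS


def stale_by_queue(last_queued, last_sync):
    """Queue-based check: fire only when a queued sync has not been applied in time."""
    return last_queued - last_sync > STALE_THRESHOLD_SECONDS


# Idle proxy: the last change was queued at t=95 and synced at t=100.
# Nothing has changed since, and it is now t=400.
idle_age = stale_by_age(now=400, last_sync=100)            # fires although nothing is wrong
idle_queue = stale_by_queue(last_queued=95, last_sync=100)  # stays quiet

# Stuck proxy: a change was queued at t=200 but the last sync is still t=100.
stuck_queue = stale_by_queue(last_queued=200, last_sync=100)  # fires

print(idle_age, idle_queue, stuck_queue)
```

The age-based check cannot tell an idle proxy from a stuck one, while the queue-based check only reacts to pending work. The real rule additionally requires the condition to hold for 5m, which this sketch does not model.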

Comment 1 Casey Callendrello 2020-04-21 13:17:51 UTC

Filed https://github.com/kubernetes/kubernetes/pull/90175 to get the metrics we need into kube-proxy.

Comment 2 Casey Callendrello 2020-05-06 16:31:26 UTC

*** Bug 1832272 has been marked as a duplicate of this bug. ***

Comment 3 Casey Callendrello 2020-05-06 16:32:12 UTC

*** Bug 1830098 has been marked as a duplicate of this bug. ***

Comment 4 Casey Callendrello 2020-05-06 17:45:37 UTC

Next step: PR https://github.com/openshift/sdn/pull/138 to pull the upstream change into sdn.

Comment 9 zhaozhanqi 2020-05-20 08:23:09 UTC

Verified this bug on 4.5.0-0.nightly-2020-05-19-041951

alert: ClusterProxyApplySlow
expr: histogram_quantile(0.95,
  sum by(le) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) >
  10
labels:
  severity: warning
annotations:
  message: The cluster is taking too long, on average, to apply kubernetes service
    rules to iptables.
State: OK, last evaluation: 8.71s ago, evaluation time: 1.083ms
alert: NodeProxyApplyStale
expr: (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds) * on(namespace, pod) group_right()
  topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
  > 30
for: 5m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes
    service rules in iptables.
State: OK, last evaluation: 8.709s ago, evaluation time: 424.1us
alert: SDNPodNotReady
expr: kube_pod_status_ready{condition="true",namespace="openshift-sdn"}
  == 0
for: 10m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
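
For readers unfamiliar with the ClusterProxyApplySlow expression above, the following Python sketch approximates what histogram_quantile(0.95, ...) computes over cumulative buckets. It is a simplified re-implementation of the linear interpolation described in the Prometheus documentation, not the Prometheus code itself, and the bucket bounds and counts are made-up sample values rather than data from this cluster.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with the last upper_bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical per-second sync rates, already summed by "le" across all pods.
sample_buckets = [(1.0, 50.0), (5.0, 90.0), (15.0, 98.0), (math.inf, 100.0)]

p95 = histogram_quantile(0.95, sample_buckets)
alert_fires = p95 > 10  # same threshold as the ClusterProxyApplySlow rule

print(p95, alert_fires)
```

With these sample buckets the estimated 95th percentile lands between the 5 second and 15 second bounds, so the result depends heavily on where the bucket boundaries sit relative to the 10 second threshold.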

Comment 10 errata-xmlrpc 2020-07-13 17:29:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409