Bug 1826339 - kube-proxy stale alerts incorrectly firing.
Summary: kube-proxy stale alerts incorrectly firing.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1830098 1832272
Depends On:
Blocks:
 
Reported: 2020-04-21 13:17 UTC by Casey Callendrello
Modified: 2021-07-30 06:30 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:29:42 UTC
Target Upstream Version:
Embargoed:


Links
- GitHub openshift/cluster-network-operator pull 635 (closed): "Bug 1826339: openshift-sdn: rethink kube-proxy rules, fix spurious alerts" (last updated 2021-01-26 09:04:38 UTC)
- GitHub openshift/sdn pull 138 (closed): "Bug 1826339: vendor: bump our k8s vendor" (last updated 2021-01-26 09:05:22 UTC)
- Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:29:56 UTC)

Description Casey Callendrello 2020-04-21 13:17:24 UTC
Now that the changes to reduce unnecessary iptables syncs have landed, we are firing alerts spuriously. We no longer resync every 30 seconds whether or not anything changed, so alerts that assume a steady sync cadence fire even though no rules are actually stale.

Fix those alerts, and figure out whether we can write good ones.
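
For context, a staleness rule shaped like the hypothetical sketch below (the exact rule that was firing is not quoted in this bug) breaks as soon as the fixed resync cadence goes away, because an idle node legitimately has an old last-sync timestamp:

# Hypothetical naive staleness rule, shown only to illustrate the failure
# mode: once kube-proxy stops resyncing every 30 seconds, this fires on
# healthy nodes that simply had no service changes to apply.
expr: time() - kubeproxy_sync_proxy_rules_last_timestamp_seconds > 60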

Comment 1 Casey Callendrello 2020-04-21 13:17:51 UTC
Filed https://github.com/kubernetes/kubernetes/pull/90175 to get the metrics we need into kube-proxy.
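
Judging from the rule that eventually shipped (comment 9), the key upstream addition is a "last queued" timestamp next to the existing "last applied" one, so a staleness check can compare queued work against applied work rather than against wall-clock time. A minimal sketch of that idea:

# Stale only if an update was queued but not yet applied; an idle node
# keeps queued == applied and never trips the threshold.
expr: kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds > 30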

Comment 2 Casey Callendrello 2020-05-06 16:31:26 UTC
*** Bug 1832272 has been marked as a duplicate of this bug. ***

Comment 3 Casey Callendrello 2020-05-06 16:32:12 UTC
*** Bug 1830098 has been marked as a duplicate of this bug. ***

Comment 4 Casey Callendrello 2020-05-06 17:45:37 UTC
Next step: PR https://github.com/openshift/sdn/pull/138 to pull the upstream change into openshift/sdn.

Comment 9 zhaozhanqi 2020-05-20 08:23:09 UTC
Verified this bug on 4.5.0-0.nightly-2020-05-19-041951. The updated alert rules:

alert: ClusterProxyApplySlow
expr: histogram_quantile(0.95,
  sum by(le) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) >
  10
labels:
  severity: warning
annotations:
  message: The cluster is taking too long, on average, to apply kubernetes service
    rules to iptables.
State: OK (last evaluated 8.71s ago, in 1.083ms)
alert: NodeProxyApplyStale
expr: (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds) * on(namespace, pod) group_right()
  topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
  > 30
for: 5m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes
    service rules in iptables.
State: OK (last evaluated 8.709s ago, in 424.1us)
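
A note on the join in NodeProxyApplyStale above: kube_pod_info is an info metric whose value is always 1, so multiplying by it with on(namespace, pod) group_right() keeps the timestamp delta while taking the label set from kube_pod_info, which is what gives the annotation its node label. The topk by(namespace, pod) (1, ...) wrapper collapses kube_pod_info to one series per pod in case duplicate series briefly overlap. The same pattern in isolation, with some_value_metric as a placeholder:

# Info-metric join: value * 1 preserves the value; group_right() takes
# labels from the kube_pod_info side, which carries the node label.
some_value_metric
  * on(namespace, pod) group_right()
  topk by(namespace, pod) (1, kube_pod_info)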
alert: SDNPodNotReady
expr: kube_pod_status_ready{condition="true",namespace="openshift-sdn"}
  == 0
for: 10m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
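
To spot-check the underlying signal by hand (for example in the Prometheus console), the raw queued-vs-applied delta can be evaluated without the threshold or join; assuming the scrape attaches the usual namespace/pod target labels, as the alert's join implies, healthy idle pods should sit at or near zero:

# Seconds between the last queued service update and the last applied
# iptables sync, per openshift-sdn pod.
kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds{namespace="openshift-sdn"}
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds{namespace="openshift-sdn"}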

Comment 10 errata-xmlrpc 2020-07-13 17:29:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

