Bug 1826339 - kube-proxy stale alerts incorrectly firing.
Summary: kube-proxy stale alerts incorrectly firing.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1830098 1832272
Depends On:
Blocks:
 
Reported: 2020-04-21 13:17 UTC by Casey Callendrello
Modified: 2021-07-30 06:30 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:29:42 UTC
Target Upstream Version:
Embargoed:


Links
- GitHub openshift/cluster-network-operator pull 635 (closed): "Bug 1826339: openshift-sdn: rethink kube-proxy rules, fix spurious alerts" (last updated 2021-01-26 09:04:38 UTC)
- GitHub openshift/sdn pull 138 (closed): "Bug 1826339: vendor: bump our k8s vendor" (last updated 2021-01-26 09:05:22 UTC)
- Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:29:56 UTC)

Description Casey Callendrello 2020-04-21 13:17:24 UTC
Now that the changes to reduce unnecessary iptables syncs have landed, we are firing alerts spuriously. We no longer resync every 30 seconds whether or not anything changed, so alerts that assume a steady sync cadence fire even though no rules are actually stale.

Fix those alerts, and figure out whether we can write good ones.
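
For context, a staleness rule shaped like the hypothetical sketch below (the exact rule that was firing is not quoted in this bug) breaks as soon as the fixed resync cadence goes away, because an idle node legitimately has an old last-sync timestamp:

# Hypothetical naive staleness rule, shown only to illustrate the failure
# mode: once kube-proxy stops resyncing every 30 seconds, this fires on
# healthy nodes that simply had no service changes to apply.
expr: time() - kubeproxy_sync_proxy_rules_last_timestamp_seconds > 60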

Comment 1 Casey Callendrello 2020-04-21 13:17:51 UTC
Filed https://github.com/kubernetes/kubernetes/pull/90175 to get the metrics we need into kube-proxy.
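
Judging from the rule that eventually shipped (comment 9), the key upstream addition is a "last queued" timestamp next to the existing "last applied" one, so a staleness check can compare queued work against applied work rather than against wall-clock time. A minimal sketch of that idea:

# Stale only if an update was queued but not yet applied; an idle node
# keeps queued == applied and never trips the threshold.
expr: kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds > 30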

Comment 2 Casey Callendrello 2020-05-06 16:31:26 UTC
*** Bug 1832272 has been marked as a duplicate of this bug. ***

Comment 3 Casey Callendrello 2020-05-06 16:32:12 UTC
*** Bug 1830098 has been marked as a duplicate of this bug. ***

Comment 4 Casey Callendrello 2020-05-06 17:45:37 UTC
Next step: PR https://github.com/openshift/sdn/pull/138 to pull the upstream change into openshift/sdn.

Comment 9 zhaozhanqi 2020-05-20 08:23:09 UTC
Verified this bug on 4.5.0-0.nightly-2020-05-19-041951. The updated alert rules:

alert: ClusterProxyApplySlow
expr: histogram_quantile(0.95,
  sum by(le) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))) >
  10
labels:
  severity: warning
annotations:
  message: The cluster is taking too long, on average, to apply kubernetes service
    rules to iptables.
State: OK (last evaluated 8.71s ago, in 1.083ms)
alert: NodeProxyApplyStale
expr: (kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds) * on(namespace, pod) group_right()
  topk by(namespace, pod) (1, kube_pod_info{namespace="openshift-sdn",pod=~"sdn-[^-]*"})
  > 30
for: 5m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} has stale kubernetes
    service rules in iptables.
State: OK (last evaluated 8.709s ago, in 424.1us)
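
A note on the join in NodeProxyApplyStale above: kube_pod_info is an info metric whose value is always 1, so multiplying by it with on(namespace, pod) group_right() keeps the timestamp delta while taking the label set from kube_pod_info, which is what gives the annotation its node label. The topk by(namespace, pod) (1, ...) wrapper collapses kube_pod_info to one series per pod in case duplicate series briefly overlap. The same pattern in isolation, with some_value_metric as a placeholder:

# Info-metric join: value * 1 preserves the value; group_right() takes
# labels from the kube_pod_info side, which carries the node label.
some_value_metric
  * on(namespace, pod) group_right()
  topk by(namespace, pod) (1, kube_pod_info)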
alert: SDNPodNotReady
expr: kube_pod_status_ready{condition="true",namespace="openshift-sdn"}
  == 0
for: 10m
labels:
  severity: warning
annotations:
  message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
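
To spot-check the underlying signal by hand (for example in the Prometheus console), the raw queued-vs-applied delta can be evaluated without the threshold or join; assuming the scrape attaches the usual namespace/pod target labels, as the alert's join implies, healthy idle pods should sit at or near zero:

# Seconds between the last queued service update and the last applied
# iptables sync, per openshift-sdn pod.
kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds{namespace="openshift-sdn"}
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds{namespace="openshift-sdn"}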

Comment 10 errata-xmlrpc 2020-07-13 17:29:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

