Bug 1884258 - Node network alerts should work on ratio rather than absolute values
Summary: Node network alerts should work on ratio rather than absolute values
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-01 13:10 UTC by Simon Pasquier
Modified: 2021-02-24 15:22 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:22:14 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 963 0 None closed Bug 1890808: bump mixins to include new etcd alerts 2021-02-02 02:12:46 UTC
Github prometheus node_exporter pull 1861 0 None closed docs/node-mixin/alerts: use ratio for network alerts 2021-02-02 02:12:46 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:22:54 UTC

Description Simon Pasquier 2020-10-01 13:10:41 UTC
Both NodeNetworkTransmitErrs and NodeNetworkTransmitErrs alerts fire when more than 10 errors happen in the last 2 minutes. Depending on the amount of network traffic, the alerts might be too noisy. It would be better to measure errors against the total amount of traffic.

See https://github.com/openshift/cluster-monitoring-operator/issues/937#issuecomment-698191872

Comment 4 Junqi Zhao 2020-10-27 02:59:25 UTC
(In reply to Simon Pasquier from comment #0)
> Both NodeNetworkTransmitErrs and NodeNetworkTransmitErrs alerts fire when
> more than 10 errors happen in the last 2 minutes. 
The alert names should be NodeNetworkReceiveErrs and NodeNetworkTransmitErrs.

Comment 5 Junqi Zhao 2020-10-27 03:10:47 UTC
Tested with 4.7.0-0.nightly-2020-10-26-152308. The expr for both the NodeNetworkReceiveErrs and NodeNetworkTransmitErrs alerts now measures errors against the total amount of traffic.

alert: NodeNetworkTransmitErrs
expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
for: 1h
labels:
  severity: warning
annotations:
  description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
  summary: Network interface is reporting many transmit errors.

alert: NodeNetworkReceiveErrs
expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
for: 1h
labels:
  severity: warning
annotations:
  description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.'
  summary: Network interface is reporting many receive errors.
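
To illustrate why the ratio-based expr is less noisy than the absolute threshold from the description, here is a minimal Python sketch. It is not the operator's or Prometheus's code: the function names (rate, absolute_alert, ratio_alert) and the sample counter values are hypothetical. It treats rate() as a plain difference over a 2-minute window, ignores counter resets, and does not model the `for: 1h` clause.

```python
def rate(counter_start: float, counter_end: float, window_seconds: float) -> float:
    """Per-second increase of a counter over a window (simplified: no reset handling)."""
    return (counter_end - counter_start) / window_seconds


def absolute_alert(errs_start: float, errs_end: float, threshold: float = 10) -> bool:
    """Old behaviour per the description: more than 10 errors in the window."""
    return (errs_end - errs_start) > threshold


def ratio_alert(
    errs_start: float,
    errs_end: float,
    pkts_start: float,
    pkts_end: float,
    window_seconds: float = 120.0,
    threshold: float = 0.01,
) -> bool:
    """New behaviour: error rate divided by packet rate exceeds 1%."""
    error_rate = rate(errs_start, errs_end, window_seconds)
    packet_rate = rate(pkts_start, pkts_end, window_seconds)
    if packet_rate == 0:
        # Mirrors PromQL float division: 0/0 is NaN (comparison drops it),
        # while a positive value divided by 0 is +Inf (comparison keeps it).
        return error_rate > 0
    return error_rate / packet_rate > threshold


# Busy interface: 50 errors among 1,000,000 packets in two minutes.
busy_absolute = absolute_alert(0, 50)
busy_ratio = ratio_alert(0, 50, 0, 1_000_000)

# Quiet interface: 5 errors among 100 packets in two minutes.
quiet_absolute = absolute_alert(0, 5)
quiet_ratio = ratio_alert(0, 5, 0, 100)

print(busy_absolute, busy_ratio)    # absolute check fires, ratio check does not
print(quiet_absolute, quiet_ratio)  # absolute check is silent, ratio check fires
```

With these made-up numbers, the busy interface trips only the absolute check (an error ratio of 0.005%), while the quiet interface trips only the ratio check (an error ratio of 5%).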

Comment 9 errata-xmlrpc 2021-02-24 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

