Bug 1884258

Summary: Node network alerts should work on ratio rather than absolute values
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: low
Version: 4.6
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, surbania
Keywords: UpcomingSprint
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-02-24 15:22:14 UTC
Type: Bug

Description Simon Pasquier 2020-10-01 13:10:41 UTC
Both NodeNetworkTransmitErrs and NodeNetworkTransmitErrs alerts fire when more than 10 errors happen in the last 2 minutes. Depending on the amount of network traffic, the alerts might be too noisy. It would be better to measure errors against the total amount of traffic.

See https://github.com/openshift/cluster-monitoring-operator/issues/937#issuecomment-698191872
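
To illustrate the difference, here is a minimal Python sketch that compares an absolute threshold with a ratio threshold. The interface names, packet counts, and the 1% cutoff are hypothetical values chosen for illustration; only the "more than 10 errors" limit comes from the description above.

```python
# Hypothetical 2-minute samples: (interface, packets sent, transmit errors).
SAMPLES = [
    ("busy-nic", 5_000_000, 40),  # heavy traffic, a handful of errors
    ("quiet-nic", 500, 8),        # light traffic, many errors relative to volume
]

ABSOLUTE_LIMIT = 10  # "more than 10 errors in the last 2 minutes"
RATIO_LIMIT = 0.01   # assumed cutoff for illustration: 1% of packets


def absolute_alert(errors: int) -> bool:
    """Fire when the raw error count exceeds the fixed limit."""
    return errors > ABSOLUTE_LIMIT


def ratio_alert(packets: int, errors: int) -> bool:
    """Fire when errors exceed a fraction of the total traffic."""
    if packets == 0:
        # No traffic means no meaningful ratio, so do not fire.
        return False
    return errors / packets > RATIO_LIMIT


results = {
    name: (absolute_alert(errors), ratio_alert(packets, errors))
    for name, packets, errors in SAMPLES
}

for name, (by_count, by_ratio) in results.items():
    print(f"{name}: absolute={by_count} ratio={by_ratio}")
```

With these made-up numbers, the busy interface trips only the absolute check, while the quiet interface trips only the ratio check, which is the behavior this report asks for.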

Comment 4 Junqi Zhao 2020-10-27 02:59:25 UTC
(In reply to Simon Pasquier from comment #0)
> Both NodeNetworkTransmitErrs and NodeNetworkTransmitErrs alerts fire when
> more than 10 errors happen in the last 2 minutes. 
The alert names should be NodeNetworkReceiveErrs and NodeNetworkTransmitErrs.

Comment 5 Junqi Zhao 2020-10-27 03:10:47 UTC
Tested with 4.7.0-0.nightly-2020-10-26-152308: the expr for the NodeNetworkReceiveErrs and NodeNetworkTransmitErrs alerts now measures errors against the total amount of traffic.

alert: NodeNetworkTransmitErrs
expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
for: 1h
labels:
  severity: warning
annotations:
  description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
  summary: Network interface is reporting many transmit errors.

alert: NodeNetworkReceiveErrs
expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
for: 1h
labels:
  severity: warning
annotations:
  description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.'
  summary: Network interface is reporting many receive errors.
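
As a rough illustration of what the expr above evaluates, the following Python sketch derives an error ratio from two counter readings taken 2 minutes apart. The counter values are hypothetical, and the rate calculation is a simplification: it ignores counter resets and the extrapolation Prometheus applies in rate().

```python
WINDOW_SECONDS = 120  # matches the [2m] range in the expr
THRESHOLD = 0.01      # matches "> 0.01" in the expr


def per_second_rate(first: float, last: float, seconds: float = WINDOW_SECONDS) -> float:
    """Simplified stand-in for rate(): average per-second increase over the window."""
    return (last - first) / seconds


def error_ratio(errs: tuple, packets: tuple) -> float:
    """Ratio of error rate to packet rate; NaN when there was no traffic."""
    packet_rate = per_second_rate(*packets)
    if packet_rate == 0:
        return float("nan")
    return per_second_rate(*errs) / packet_rate


# Hypothetical counter readings at the start and end of the window.
ratio = error_ratio(errs=(100, 460), packets=(1_000_000, 1_018_000))
fires = ratio > THRESHOLD

# The description annotation formats $value with "%.0f"; the same format here:
rendered = "%.0f" % ratio

print(f"ratio={ratio:.4f} fires={fires} rendered={rendered}")
```

Under these assumptions the value is a fraction (about 0.02, or 2% of packets), not an error count, so a "%.0f" format of it renders as 0. The rule's "for: 1h" additionally requires the condition to hold for an hour before the alert fires.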

Comment 9 errata-xmlrpc 2021-02-24 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform
4.7.0 security, bug fix, and enhancement update), and where to find the
updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633