Bug 1888381

Summary: instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected
Product: OpenShift Container Platform
Reporter: Keith Wall <kwall>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: low
Version: 4.5
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, spasquie, surbania
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:26:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                              Flags
NetUtilisatiuonUseMethodCluster                          none
NetUtilisationUseMethodNode                              none
instance_node_network_receive_bytes_excluding_lo_rate    none
iperf-yamls.tar.gz                                       none
Capture requested by Simon 1                             none
Capture requested by Simon 2                             none
Capture requested by Simon 3                             none
USE Method / Node dashboard                              none
USE Method / Cluster dashboard                           none

Description Keith Wall 2020-10-14 17:44:01 UTC
OSD: 4.5.13

Description of problem:

I ran iperf3, arranging the deployments so that the server and client pods were on separate nodes, to generate a known amount of network traffic (4.95Gbit/s, 618Mi/s).

Examining the "USE Method / Node" and "USE Method / Cluster" dashboards, I noticed that the Network Utilisation for Received traffic is twice the expected value (1.227Gi/s). The Transmit figures are as expected (611Mi/s).

By querying Prometheus directly, I see that instance:node_network_receive_bytes_excluding_lo:rate1m is the rule reporting a doubled value.

instance:node_network_transmit_bytes_excluding_lo:rate1m, node_network_receive_bytes_total and node_network_transmit_bytes_total are unaffected. They give figures that tally with the work generated by iperf3.


Version-Release number of selected component (if applicable):

4.5.11 / 4.5.13


How reproducible:

100%

Steps to Reproduce:

1. Deploy iperf3 server/client (https://github.com/k-wall/iperf3-yamls)
2. Use the OpenShift Console to view the Grafana dashboards.

Actual results:

Received traffic is reported at twice the expected value.

Expected results:

Network utilisation to be reported faithfully.


Additional info:

Comment 1 Keith Wall 2020-10-14 17:46:17 UTC
Created attachment 1721546 [details]
NetUtilisatiuonUseMethodCluster

Comment 2 Keith Wall 2020-10-14 17:50:46 UTC
Created attachment 1721547 [details]
NetUtilisationUseMethodNode

Comment 3 Keith Wall 2020-10-14 17:51:24 UTC
Created attachment 1721548 [details]
instance_node_network_receive_bytes_excluding_lo_rate

Comment 4 Keith Wall 2020-10-14 18:00:56 UTC
Created attachment 1721550 [details]
iperf-yamls.tar.gz

Comment 5 Simon Pasquier 2020-10-15 07:32:55 UTC
Can you run the following queries in the Prometheus UI:

sum by(device,instance) (rate(node_network_receive_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))
sum by(device,instance) (rate(node_network_transmit_bytes_total{job="node-exporter",node="xxx",device!="lo"}[1m]))

That should help to see whether the network traffic is being accounted for on 2 devices.
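
To illustrate what the two queries above would reveal, here is a minimal Python sketch of a sum over per-device rates that skips only "lo". It only illustrates the hypothesis being tested (the same traffic visible on two devices); it is not a confirmed root cause. The device names "eth0" and "overlay0" and all sample rates are hypothetical and were not taken from this cluster.

```python
from typing import Dict


def rate_excluding_lo(per_device_rate: Dict[str, float]) -> float:
    """Sum per-device byte rates, skipping only the loopback device."""
    return sum(rate for dev, rate in per_device_rate.items() if dev != "lo")


# Hypothetical sample: one flow of 618.75e6 bytes/s that is counted on two
# devices of the same node (device names are made up for illustration).
sample_receive = {"lo": 1.0e3, "eth0": 618.75e6, "overlay0": 618.75e6}

total = rate_excluding_lo(sample_receive)
print(f"aggregate receive rate: {total:.0f} bytes/s")
print(f"ratio to the real flow: {total / 618.75e6:.1f}x")
```

If the per-device query shows the same rate on two devices, an aggregate of this shape would come out at twice the real traffic.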

Comment 6 Keith Wall 2020-10-15 10:24:16 UTC
I cleaned up my yamls to reproduce the problem.  Use these rather than the zip:
https://github.com/k-wall/iperf3-yamls

So running with these files, I have:

kwall@ovpn-113-108 iperf3-yamls % KUBECONFIG=~/src/mk-performance-tests/kafka-config  oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                           NOMINATED NODE   READINESS GATES
iperf3-client-6bbd6759f7-fxfl2   1/1     Running   0          8m21s   10.128.2.9   ip-10-0-182-239.ec2.internal   <none>           <none>
iperf3-server-749d7b7468-kwdgb   1/1     Running   0          8m20s   10.128.6.7   ip-10-0-143-179.ec2.internal   <none>           <none>


Producing this work:

[  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 347.00-348.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 348.00-349.00 sec   589 MBytes  4.94 Gbits/sec   14   2.41 MBytes
[  5] 349.00-350.00 sec   590 MBytes  4.95 Gbits/sec   14   2.41 MBytes
[  5] 350.00-351.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 351.00-352.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 352.00-353.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 353.00-354.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 354.00-355.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 355.00-356.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 356.00-357.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 357.00-358.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 358.00-359.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 359.00-360.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
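
As a cross-check of the units in the iperf3 output above, a short Python sketch converts the reported bit rate to byte rates. Reading iperf3's "MBytes" as binary (MiB) is an assumption based on the numbers lining up, not something stated in this report.

```python
# Convert the iperf3 bit rate into byte rates (decimal and binary units).
GBIT_PER_SEC = 4.95  # taken from the iperf3 output above

bytes_per_sec = GBIT_PER_SEC * 1e9 / 8
mb_per_sec = bytes_per_sec / 1e6      # decimal megabytes per second
mib_per_sec = bytes_per_sec / 2**20   # binary mebibytes per second

print(f"{mb_per_sec:.2f} MB/s")
print(f"{mib_per_sec:.2f} MiB/s")
```

Under that assumption, roughly 619 MB/s and 590 MiB/s describe the same traffic, which fits the 590 MBytes per one-second interval shown by iperf3.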


I attach screenshots of the queries you requested (with the instance being the node of the server pod).

Comment 7 Keith Wall 2020-10-15 10:24:58 UTC
Comment on attachment 1721550 [details]
iperf-yamls.tar.gz

Please use https://github.com/k-wall/iperf3-yamls instead.

Comment 8 Keith Wall 2020-10-15 10:28:40 UTC
Created attachment 1721788 [details]
Capture requested by Simon 1

Comment 9 Keith Wall 2020-10-15 10:29:12 UTC
Created attachment 1721789 [details]
Capture requested by Simon 2

Comment 10 Keith Wall 2020-10-15 10:29:51 UTC
Created attachment 1721790 [details]
Capture requested by Simon 3

Comment 17 Junqi Zhao 2020-12-22 07:25:44 UTC
Followed the steps in Comment 0 and checked the "USE Method / Cluster" and "USE Method / Node" dashboards for the same node on 4.7.0-0.nightly-2020-12-20-055006.
The result for "Net Utilisation (Bytes Receive)" is the same on both pages,
and the Prometheus query returns almost the same result.

Comment 18 Junqi Zhao 2020-12-22 07:29:11 UTC
Created attachment 1741291 [details]
USE Method / Node dashboard

Comment 19 Junqi Zhao 2020-12-22 07:29:54 UTC
Created attachment 1741292 [details]
USE Method / Cluster dashboard

Comment 22 errata-xmlrpc 2021-02-24 15:26:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633