Bug 1888381

Summary:

instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected

Product:

OpenShift Container Platform

Reporter:

Keith Wall <kwall>

Component:

Monitoring

Assignee:

Pawel Krupa <pkrupa>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

low

Docs Contact:

Priority:

low

Version:

4.5

CC:

alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, spasquie, surbania

Target Milestone:

---

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Last Closed:

2021-02-24 15:26:11 UTC

Type:

Bug

Attachments:

Description	Flags
NetUtilisatiuonUseMethodCluster	none
NetUtilisationUseMethodNode	none
instance_node_network_receive_bytes_excluding_lo_rate	none
iperf-yamls.tar.gz	none
Capture requested by Simon 1	none
Capture requested by Simon 2	none
Capture requested by Simon 3	none
USE Method / Node dashboard	none
USE Method / Cluster dashboard	none

Description Keith Wall 2020-10-14 17:44:01 UTC

OSD: 4.5.13

Description of problem:

I ran iperf3, arranging the deployments so that the server and client pods were on separate nodes and a known amount of network traffic was generated (4.95Gbit/s, 618Mi/s).

Examining the "USE Method / Node" and "USE Method / Cluster" dashboards, I noticed that the Network Utilisation for the Received traffic is twice the expected value (1.227Gi/s). The Transmit figures are as expected (611Mi/s).

By querying Prometheus directly, I see that it is instance:node_network_receive_bytes_excluding_lo:rate1m that is reporting a doubled value.

instance:node_network_transmit_bytes_excluding_lo:rate1m, node_network_receive_bytes_total and node_network_transmit_bytes_total are unaffected. They give figures that tally with the work generated by iperf3.
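
As a rough sanity check on the figures above, the following sketch converts the iperf3 bit rate to bytes per second and compares the reported receive and transmit values. It assumes that "Gbit/s" is decimal (10**9 bits), as iperf3 reports it, and that the two dashboard figures share the same unit base, so that 1.227G can be read as 1227M. The helper names are illustrative only.

```python
def gbit_per_sec_to_bytes_per_sec(gbit_per_sec: float) -> float:
    """Convert a decimal Gbit/s rate to bytes per second."""
    return gbit_per_sec * 1e9 / 8


def bytes_to_mib(n_bytes: float) -> float:
    """Convert bytes to binary mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


def receive_to_transmit_ratio(receive: float, transmit: float) -> float:
    """Ratio of the reported receive rate to the reported transmit rate."""
    return receive / transmit


# Traffic generated by iperf3 according to the description.
expected = gbit_per_sec_to_bytes_per_sec(4.95)
print(f"expected: {expected / 1e6:.2f} MB/s ({bytes_to_mib(expected):.2f} MiB/s)")

# Dashboard figures from the description, both expressed in "M" units
# (assumption: 1.227G is read as 1227M).
ratio = receive_to_transmit_ratio(1227, 611)
print(f"receive/transmit ratio: {ratio:.2f}")
```

The decimal figure is about 618.75 MB/s, which is about 590 MiB/s and agrees with the per-second transfer that iperf3 prints. The receive figure is roughly twice the transmit figure under either reading of the unit prefix.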


Version-Release number of selected component (if applicable):

4.5.11 / 4.5.13


How reproducible:

100%

Steps to Reproduce:

1. Deploy iperf3 server/client (https://github.com/k-wall/iperf3-yamls)
2. Use OpenShift Console to view the Grafana dashboard.

Actual results:

Received traffic is reported as twice the expected value.

Expected results:

Network utilisation to be reported faithfully.


Additional info:

Comment 1 Keith Wall 2020-10-14 17:46:17 UTC

Created attachment 1721546 [details]
NetUtilisatiuonUseMethodCluster

Comment 2 Keith Wall 2020-10-14 17:50:46 UTC

Created attachment 1721547 [details]
NetUtilisationUseMethodNode

Comment 3 Keith Wall 2020-10-14 17:51:24 UTC

Created attachment 1721548 [details]
instance_node_network_receive_bytes_excluding_lo_rate

Comment 4 Keith Wall 2020-10-14 18:00:56 UTC

Created attachment 1721550 [details]
iperf-yamls.tar.gz

Comment 5 Simon Pasquier 2020-10-15 07:32:55 UTC

Can you run the following queries in the Prometheus UI:

sum by(device,instance) (rate(node_network_receive_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))
sum by(device,instance) (rate(node_network_transmit_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))

That should help to show whether the network traffic is being accounted for on 2 devices.
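
For anyone who prefers to run these queries outside the UI, here is a minimal sketch that builds the request URL for the Prometheus HTTP API (`/api/v1/query`) and breaks an instant-query response down per device. The base URL, the device names `ens5` and `tun0`, and the sample values are made up for illustration; they are not taken from this cluster. Authentication, which an OpenShift Prometheus route would require, is left out.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, metric: str, instance: str, window: str = "1m") -> str:
    """Build an instant-query URL for the per-device rate of `metric`."""
    promql = (
        "sum by(device,instance) (rate("
        + metric
        + '{job="node-exporter",instance="' + instance + '",device!="lo"}['
        + window + "]))"
    )
    return base_url + "/api/v1/query?" + urlencode({"query": promql})


def rates_by_device(response_body: str) -> dict:
    """Map device name to rate (bytes/s) from an instant-query JSON response."""
    payload = json.loads(response_body)
    return {
        sample["metric"]["device"]: float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hypothetical response: the same traffic showing up on two devices.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"device": "ens5", "instance": "xxx"}, "value": [0, "618750000"]},
            {"metric": {"device": "tun0", "instance": "xxx"}, "value": [0, "618750000"]},
        ],
    },
})

rates = rates_by_device(sample_response)
total = sum(rates.values())
for device, rate in sorted(rates.items()):
    print(f"{device}: {rate / 1e6:.2f} MB/s")
print(f"sum over devices: {total / 1e6:.2f} MB/s")
```

If the same bytes are counted on two devices, as in this made-up sample, any rule that sums over all non-loopback devices reports twice the real rate.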

Comment 6 Keith Wall 2020-10-15 10:24:16 UTC

I cleaned up my yamls to reproduce the problem.  Use these rather than the attached tarball:
https://github.com/k-wall/iperf3-yamls

So running with these files, I have:

kwall@ovpn-113-108 iperf3-yamls % KUBECONFIG=~/src/mk-performance-tests/kafka-config  oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                           NOMINATED NODE   READINESS GATES
iperf3-client-6bbd6759f7-fxfl2   1/1     Running   0          8m21s   10.128.2.9   ip-10-0-182-239.ec2.internal   <none>           <none>
iperf3-server-749d7b7468-kwdgb   1/1     Running   0          8m20s   10.128.6.7   ip-10-0-143-179.ec2.internal   <none>           <none>


Producing this work:

[  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 347.00-348.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 348.00-349.00 sec   589 MBytes  4.94 Gbits/sec   14   2.41 MBytes
[  5] 349.00-350.00 sec   590 MBytes  4.95 Gbits/sec   14   2.41 MBytes
[  5] 350.00-351.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 351.00-352.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 352.00-353.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 353.00-354.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 354.00-355.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 355.00-356.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 356.00-357.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 357.00-358.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 358.00-359.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 359.00-360.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
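
To turn interval lines like the ones above into a single figure for comparison with Prometheus, a small parsing sketch follows. It assumes the client-side line layout shown here (interval, transfer in MBytes, bitrate in Gbits/sec, retransmits, congestion window); other iperf3 modes or unit settings would need a different pattern. The function names are illustrative.

```python
import re
from statistics import mean

# Matches lines such as:
# [  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
INTERVAL_RE = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-([\d.]+)\s+sec\s+"
    r"([\d.]+)\s+MBytes\s+([\d.]+)\s+Gbits/sec\s+(\d+)\s+([\d.]+)\s+MBytes"
)


def parse_interval(line: str):
    """Return a dict for one interval line, or None if the line does not match."""
    m = INTERVAL_RE.search(line)
    if m is None:
        return None
    start, end, transfer, bitrate, retr, cwnd = m.groups()
    return {
        "start_s": float(start),
        "end_s": float(end),
        "transfer_mbytes": float(transfer),
        "bitrate_gbits": float(bitrate),
        "retransmits": int(retr),
        "cwnd_mbytes": float(cwnd),
    }


def mean_bitrate_gbits(lines) -> float:
    """Average bitrate over all lines that parse as interval lines."""
    parsed = [p for p in map(parse_interval, lines) if p is not None]
    return mean(p["bitrate_gbits"] for p in parsed)


sample_lines = [
    "[  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes",
    "[  5] 347.00-348.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes",
]
print(f"mean bitrate: {mean_bitrate_gbits(sample_lines):.3f} Gbits/sec")
```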


I attach the screenshots of the queries you requested (with the instance being the server pod)

Comment 7 Keith Wall 2020-10-15 10:24:58 UTC

Comment on attachment 1721550 [details]
iperf-yamls.tar.gz

Please use https://github.com/k-wall/iperf3-yamls instead.

Comment 8 Keith Wall 2020-10-15 10:28:40 UTC

Created attachment 1721788 [details]
Capture requested by Simon 1

Comment 9 Keith Wall 2020-10-15 10:29:12 UTC

Created attachment 1721789 [details]
Capture requested by Simon 2

Comment 10 Keith Wall 2020-10-15 10:29:51 UTC

Created attachment 1721790 [details]
Capture requested by Simon 3

Comment 17 Junqi Zhao 2020-12-22 07:25:44 UTC

Followed the steps in Comment 0 and checked on 4.7.0-0.nightly-2020-12-20-055006. For the same node, the result for "Net Utilisation (Bytes Receive)" is the same on both the "USE Method / Cluster" and "USE Method / Node" dashboards, and almost the same as the result from the Prometheus query.

Comment 18 Junqi Zhao 2020-12-22 07:29:11 UTC

Created attachment 1741291 [details]
USE Method / Node dashboard

Comment 19 Junqi Zhao 2020-12-22 07:29:54 UTC

Created attachment 1741292 [details]
USE Method / Cluster dashboard

Comment 22 errata-xmlrpc 2021-02-24 15:26:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633