Bug 1888381 - instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected
Summary: instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-14 17:44 UTC by Keith Wall
Modified: 2021-02-24 15:26 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:26:11 UTC
Target Upstream Version:
Embargoed:


Attachments
NetUtilisatiuonUseMethodCluster (576.32 KB, image/png)
2020-10-14 17:46 UTC, Keith Wall
NetUtilisationUseMethodNode (536.59 KB, image/png)
2020-10-14 17:50 UTC, Keith Wall
instance_node_network_receive_bytes_excluding_lo_rate (803.11 KB, image/png)
2020-10-14 17:51 UTC, Keith Wall
iperf-yamls.tar.gz (3.31 KB, application/gzip)
2020-10-14 18:00 UTC, Keith Wall
Capture requested by Simon 1 (945.38 KB, image/png)
2020-10-15 10:28 UTC, Keith Wall
Capture requested by Simon 2 (813.40 KB, image/png)
2020-10-15 10:29 UTC, Keith Wall
Capture requested by Simon 3 (870.63 KB, image/png)
2020-10-15 10:29 UTC, Keith Wall
USE Method / Node dashboard (98.87 KB, image/png)
2020-12-22 07:29 UTC, Junqi Zhao
USE Method / Cluster dashboard (94.36 KB, image/png)
2020-12-22 07:29 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-monitoring kubernetes-mixin pull 456 0 None closed dashboards/pod: Fix usage of duplicate cAdvisor time-series 2021-01-13 08:19:15 UTC
Github kubernetes-monitoring kubernetes-mixin pull 512 0 None closed dashboards/statefulset: Fix usage of duplicate cAdvisor time-series 2021-01-13 08:19:16 UTC
Github prometheus-operator kube-prometheus pull 818 0 None closed Update grafana dashboards and prometheus rules from kubernetes-mixin 2021-01-13 08:19:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:26:40 UTC

Description Keith Wall 2020-10-14 17:44:01 UTC
OSD: 4.5.13

Description of problem:

I ran iperf3, arranging the deployments so that the server and client pods were on separate nodes, so that a known amount of network traffic was generated (4.95Gbit/s, 618Mi/s).

Examining the "USE Method / Node" and "USE Method / Cluster" dashboards, I noticed that the Network Utilisation for the Received traffic is twice the expected value (1.227Gi/s).  The Transmit figures are as expected (611Mi/s).

By querying Prometheus directly, I see that it is instance:node_network_receive_bytes_excluding_lo:rate1m that is reporting a doubled value.

instance:node_network_transmit_bytes_excluding_lo:rate1m and the raw counters node_network_receive_bytes_total/node_network_transmit_bytes_total are unaffected.  They give figures that tally with the work generated by iperf3.
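For reference, the recording rule behind this metric comes from the upstream node-exporter mixin and sums the per-device rates for each instance. The exact rule text below is quoted approximately, so treat it as an assumption:

```yaml
# Approximate sketch of the upstream node-exporter mixin recording rule:
- record: instance:node_network_receive_bytes_excluding_lo:rate1m
  expr: |
    sum without (device) (
      rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
    )
```

Because the rule sums across all non-lo devices, any duplication (the same traffic counted on two interfaces, or the same target scraped twice) doubles the recorded value while the raw per-device counters still look correct.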


Version-Release number of selected component (if applicable):

4.5.11 / 4.5.13


How reproducible:

100%

Steps to Reproduce:

1. Deploy iperf3 server/client (https://github.com/k-wall/iperf3-yamls)
2. Use the OpenShift Console to view the Grafana dashboards.

Actual results:

Received traffic is reported at twice the expected value.

Expected results:

Network utilisation is reported faithfully.


Additional info:

Comment 1 Keith Wall 2020-10-14 17:46:17 UTC
Created attachment 1721546 [details]
NetUtilisatiuonUseMethodCluster

Comment 2 Keith Wall 2020-10-14 17:50:46 UTC
Created attachment 1721547 [details]
NetUtilisationUseMethodNode

Comment 3 Keith Wall 2020-10-14 17:51:24 UTC
Created attachment 1721548 [details]
instance_node_network_receive_bytes_excluding_lo_rate

Comment 4 Keith Wall 2020-10-14 18:00:56 UTC
Created attachment 1721550 [details]
iperf-yamls.tar.gz

Comment 5 Simon Pasquier 2020-10-15 07:32:55 UTC
Can you run the following queries in the Prometheus UI:

sum by(device,instance) (rate(node_network_receive_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))
sum by(device,instance) (rate(node_network_transmit_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))

That should help to see whether the network traffic is being accounted for on 2 devices.
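A small sketch for running these diagnostic queries against the Prometheus HTTP API. The Prometheus route URL and the `instance="xxx"` value are placeholders; substitute your cluster's route and the node under test, and pass a bearer token when fetching:

```python
# Build /api/v1/query URLs for the two diagnostic queries above.
# PROM_URL is a placeholder; on OpenShift, use the prometheus-k8s route.
from urllib.parse import urlencode

PROM_URL = "https://prometheus-k8s.example.com"  # placeholder

def build_query_url(promql: str) -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': promql})}"

for direction in ("receive", "transmit"):
    promql = (
        f'sum by(device,instance) '
        f'(rate(node_network_{direction}_bytes_total'
        f'{{job="node-exporter",instance="xxx",device!="lo"}}[1m]))'
    )
    print(build_query_url(promql))
    # fetch with e.g.: curl -k -H "Authorization: Bearer $TOKEN" "<url>"
```

If the result shows the same traffic attributed to two devices on one instance, the `sum` in the recording rule will double the value.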

Comment 6 Keith Wall 2020-10-15 10:24:16 UTC
I cleaned up my yamls to reproduce the problem.  Use these rather than the zip:
https://github.com/k-wall/iperf3-yamls

So running with these files, I have:

kwall@ovpn-113-108 iperf3-yamls % KUBECONFIG=~/src/mk-performance-tests/kafka-config  oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                           NOMINATED NODE   READINESS GATES
iperf3-client-6bbd6759f7-fxfl2   1/1     Running   0          8m21s   10.128.2.9   ip-10-0-182-239.ec2.internal   <none>           <none>
iperf3-server-749d7b7468-kwdgb   1/1     Running   0          8m20s   10.128.6.7   ip-10-0-143-179.ec2.internal   <none>           <none>


Producing this work:

[  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 347.00-348.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 348.00-349.00 sec   589 MBytes  4.94 Gbits/sec   14   2.41 MBytes
[  5] 349.00-350.00 sec   590 MBytes  4.95 Gbits/sec   14   2.41 MBytes
[  5] 350.00-351.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 351.00-352.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 352.00-353.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 353.00-354.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 354.00-355.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 355.00-356.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 356.00-357.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 357.00-358.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 358.00-359.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 359.00-360.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
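As a quick sanity check on these numbers (assuming iperf3's "MBytes" column means MiB), the steady ~590 MBytes/sec matches the quoted 4.95 Gbit/s, and doubling it gives roughly the ~1.2 Gi/s the dashboard showed for receive:

```python
# Sanity check of the iperf3 figures (assumption: "MBytes" above is MiB).
MIB = 2 ** 20

rate_bytes = 590 * MIB                 # ~590 MBytes/sec per the iperf3 output
rate_gbit = rate_bytes * 8 / 1e9
print(f"throughput: {rate_gbit:.2f} Gbit/s")   # matches iperf3's ~4.95

expected_mib = rate_bytes / MIB        # what the dashboard should show
doubled_mib = 2 * expected_mib         # what the buggy recording rule reports
print(f"expected: {expected_mib:.0f} Mi/s, doubled: {doubled_mib / 1024:.3f} Gi/s")
```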


I attach the screenshots of the queries you requested (with the instance being the server pod).

Comment 7 Keith Wall 2020-10-15 10:24:58 UTC
Comment on attachment 1721550 [details]
iperf-yamls.tar.gz

Please use https://github.com/k-wall/iperf3-yamls instead.

Comment 8 Keith Wall 2020-10-15 10:28:40 UTC
Created attachment 1721788 [details]
Capture requested by Simon 1

Comment 9 Keith Wall 2020-10-15 10:29:12 UTC
Created attachment 1721789 [details]
Capture requested by Simon 2

Comment 10 Keith Wall 2020-10-15 10:29:51 UTC
Created attachment 1721790 [details]
Capture requested by Simon 3

Comment 17 Junqi Zhao 2020-12-22 07:25:44 UTC
Followed the steps in Comment 0 and checked on 4.7.0-0.nightly-2020-12-20-055006. For the same node, the "USE Method / Cluster" and "USE Method / Node" dashboards now report the same result for "Net Utilisation (Bytes Receive)", and almost the same result as a direct Prometheus query.

Comment 18 Junqi Zhao 2020-12-22 07:29:11 UTC
Created attachment 1741291 [details]
USE Method / Node dashboard

Comment 19 Junqi Zhao 2020-12-22 07:29:54 UTC
Created attachment 1741292 [details]
USE Method / Cluster dashboard

Comment 22 errata-xmlrpc 2021-02-24 15:26:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

