Bug 1888381 - instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected
Summary: instance:node_network_receive_bytes_excluding_lo:rate1m value twice expected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-14 17:44 UTC by Keith Wall
Modified: 2021-02-24 15:26 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:26:11 UTC
Target Upstream Version:
Embargoed:


Attachments
NetUtilisatiuonUseMethodCluster (576.32 KB, image/png)
2020-10-14 17:46 UTC, Keith Wall
NetUtilisationUseMethodNode (536.59 KB, image/png)
2020-10-14 17:50 UTC, Keith Wall
instance_node_network_receive_bytes_excluding_lo_rate (803.11 KB, image/png)
2020-10-14 17:51 UTC, Keith Wall
iperf-yamls.tar.gz (3.31 KB, application/gzip)
2020-10-14 18:00 UTC, Keith Wall
Capture requested by Simon 1 (945.38 KB, image/png)
2020-10-15 10:28 UTC, Keith Wall
Capture requested by Simon 2 (813.40 KB, image/png)
2020-10-15 10:29 UTC, Keith Wall
Capture requested by Simon 3 (870.63 KB, image/png)
2020-10-15 10:29 UTC, Keith Wall
USE Method / Node dashboard (98.87 KB, image/png)
2020-12-22 07:29 UTC, Junqi Zhao
USE Method / Cluster dashboard (94.36 KB, image/png)
2020-12-22 07:29 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-monitoring kubernetes-mixin pull 456 0 None closed dashboards/pod: Fix usage of duplicate cAdvisor time-series 2021-01-13 08:19:15 UTC
Github kubernetes-monitoring kubernetes-mixin pull 512 0 None closed dashboards/statefulset: Fix usage of duplicate cAdvisor time-series 2021-01-13 08:19:16 UTC
Github prometheus-operator kube-prometheus pull 818 0 None closed Update grafana dashboards and prometheus rules from kubernetes-mixin 2021-01-13 08:19:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:26:40 UTC

Description Keith Wall 2020-10-14 17:44:01 UTC
OSD: 4.5.13

Description of problem:

I ran iperf3, arranging the deployments so that the server and client pods were on separate nodes, so that a known amount of network traffic was generated (4.95Gbit/s, 618Mi/s).

Examining the "USE Method / Node" and "USE Method / Cluster" dashboards, I noticed that the Network Utilisation for the Received traffic is twice the expected value (1.227Gi/s).  The Transmit figures are as expected (611Mi/s).

By querying Prometheus directly, I see that it is instance:node_network_receive_bytes_excluding_lo:rate1m that is reporting a doubled value.

instance:node_network_transmit_bytes_excluding_lo:rate1m and the raw counters node_network_receive_bytes_total/node_network_transmit_bytes_total are unaffected.  They give figures that tally with the work generated by iperf3.
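For reference, the recording rule behind this metric comes from the upstream node-exporter mixin and sums the per-device rates for each instance. The exact rule text below is quoted approximately, so treat it as an assumption:

```yaml
# Approximate sketch of the upstream node-exporter mixin recording rule:
- record: instance:node_network_receive_bytes_excluding_lo:rate1m
  expr: |
    sum without (device) (
      rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
    )
```

Because the rule sums across all non-lo devices, any duplication (the same traffic counted on two interfaces, or the same target scraped twice) doubles the recorded value while the raw per-device counters still look correct.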


Version-Release number of selected component (if applicable):

4.5.11 / 4.5.13


How reproducible:

100%

Steps to Reproduce:

1. Deploy iperf3 server/client (https://github.com/k-wall/iperf3-yamls)
2. Use the OpenShift Console to view the Grafana dashboards.

Actual results:

Received traffic is reported at twice the expected value.

Expected results:

Network utilisation is reported faithfully.


Additional info:

Comment 1 Keith Wall 2020-10-14 17:46:17 UTC
Created attachment 1721546 [details]
NetUtilisatiuonUseMethodCluster

Comment 2 Keith Wall 2020-10-14 17:50:46 UTC
Created attachment 1721547 [details]
NetUtilisationUseMethodNode

Comment 3 Keith Wall 2020-10-14 17:51:24 UTC
Created attachment 1721548 [details]
instance_node_network_receive_bytes_excluding_lo_rate

Comment 4 Keith Wall 2020-10-14 18:00:56 UTC
Created attachment 1721550 [details]
iperf-yamls.tar.gz

Comment 5 Simon Pasquier 2020-10-15 07:32:55 UTC
Can you run the following queries in the Prometheus UI:

sum by(device,instance) (rate(node_network_receive_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))
sum by(device,instance) (rate(node_network_transmit_bytes_total{job="node-exporter",instance="xxx",device!="lo"}[1m]))

That should help to see whether the network traffic is being accounted for on 2 devices.
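A small sketch for running these diagnostic queries against the Prometheus HTTP API. The Prometheus route URL and the `instance="xxx"` value are placeholders; substitute your cluster's route and the node under test, and pass a bearer token when fetching:

```python
# Build /api/v1/query URLs for the two diagnostic queries above.
# PROM_URL is a placeholder; on OpenShift, use the prometheus-k8s route.
from urllib.parse import urlencode

PROM_URL = "https://prometheus-k8s.example.com"  # placeholder

def build_query_url(promql: str) -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': promql})}"

for direction in ("receive", "transmit"):
    promql = (
        f'sum by(device,instance) '
        f'(rate(node_network_{direction}_bytes_total'
        f'{{job="node-exporter",instance="xxx",device!="lo"}}[1m]))'
    )
    print(build_query_url(promql))
    # fetch with e.g.: curl -k -H "Authorization: Bearer $TOKEN" "<url>"
```

If the result shows the same traffic attributed to two devices on one instance, the `sum` in the recording rule will double the value.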

Comment 6 Keith Wall 2020-10-15 10:24:16 UTC
I cleaned up my yamls to reproduce the problem.  Use these rather than the zip:
https://github.com/k-wall/iperf3-yamls

So running with these files, I have:

kwall@ovpn-113-108 iperf3-yamls % KUBECONFIG=~/src/mk-performance-tests/kafka-config  oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                           NOMINATED NODE   READINESS GATES
iperf3-client-6bbd6759f7-fxfl2   1/1     Running   0          8m21s   10.128.2.9   ip-10-0-182-239.ec2.internal   <none>           <none>
iperf3-server-749d7b7468-kwdgb   1/1     Running   0          8m20s   10.128.6.7   ip-10-0-143-179.ec2.internal   <none>           <none>


Producing this work:

[  5] 346.00-347.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 347.00-348.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 348.00-349.00 sec   589 MBytes  4.94 Gbits/sec   14   2.41 MBytes
[  5] 349.00-350.00 sec   590 MBytes  4.95 Gbits/sec   14   2.41 MBytes
[  5] 350.00-351.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 351.00-352.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 352.00-353.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 353.00-354.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 354.00-355.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 355.00-356.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 356.00-357.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
[  5] 357.00-358.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 358.00-359.00 sec   590 MBytes  4.95 Gbits/sec    0   2.41 MBytes
[  5] 359.00-360.00 sec   589 MBytes  4.94 Gbits/sec    0   2.41 MBytes
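As a quick sanity check on these numbers (assuming iperf3's "MBytes" column means MiB), the steady ~590 MBytes/sec matches the quoted 4.95 Gbit/s, and doubling it gives roughly the ~1.2 Gi/s the dashboard showed for receive:

```python
# Sanity check of the iperf3 figures (assumption: "MBytes" above is MiB).
MIB = 2 ** 20

rate_bytes = 590 * MIB                 # ~590 MBytes/sec per the iperf3 output
rate_gbit = rate_bytes * 8 / 1e9
print(f"throughput: {rate_gbit:.2f} Gbit/s")   # matches iperf3's ~4.95

expected_mib = rate_bytes / MIB        # what the dashboard should show
doubled_mib = 2 * expected_mib         # what the buggy recording rule reports
print(f"expected: {expected_mib:.0f} Mi/s, doubled: {doubled_mib / 1024:.3f} Gi/s")
```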


I attach the screenshots of the queries you requested (with the instance being the server pod).

Comment 7 Keith Wall 2020-10-15 10:24:58 UTC
Comment on attachment 1721550 [details]
iperf-yamls.tar.gz

Please use https://github.com/k-wall/iperf3-yamls instead.

Comment 8 Keith Wall 2020-10-15 10:28:40 UTC
Created attachment 1721788 [details]
Capture requested by Simon 1

Comment 9 Keith Wall 2020-10-15 10:29:12 UTC
Created attachment 1721789 [details]
Capture requested by Simon 2

Comment 10 Keith Wall 2020-10-15 10:29:51 UTC
Created attachment 1721790 [details]
Capture requested by Simon 3

Comment 17 Junqi Zhao 2020-12-22 07:25:44 UTC
Followed the steps in Comment 0 and checked on 4.7.0-0.nightly-2020-12-20-055006. For the same node, the "USE Method / Cluster" and "USE Method / Node" dashboards now report the same result for "Net Utilisation (Bytes Receive)", and almost the same result as a direct Prometheus query.

Comment 18 Junqi Zhao 2020-12-22 07:29:11 UTC
Created attachment 1741291 [details]
USE Method / Node dashboard

Comment 19 Junqi Zhao 2020-12-22 07:29:54 UTC
Created attachment 1741292 [details]
USE Method / Cluster dashboard

Comment 22 errata-xmlrpc 2021-02-24 15:26:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

