Bug 1855556 - networking metrics include host network pods
Summary: networking metrics include host network pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Rastislav Wagner
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On: 1862885
Blocks: 1873053
 
Reported: 2020-07-10 07:54 UTC by Pablo Alonso Rodriguez
Modified: 2023-12-15 18:26 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Overview page used an incorrect Prometheus metric to show network utilization.
Consequence: Network utilization included container-level usage.
Fix: Use node-level metrics for network utilization.
Result: Network utilization shows correct data.
Clone Of:
: 1873053
Environment:
Last Closed: 2020-10-27 16:13:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift console pull 6051 0 None closed Bug 1855556: Include only node-level network traffic in Network Utilization item. 2021-01-04 09:39:40 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:13:49 UTC

Description Pablo Alonso Rodriguez 2020-07-10 07:54:48 UTC
Description of problem:

The monitoring system does not make any distinction between pods that are in the host network and pods that aren't. 

Since a pod on the host network does not have its own networking stack but simply sees the host's, the metrics reported for it are actually the node's metrics.

This can lead to issues such as:
- Node metrics are shown as if they were the individual metrics of some pods, which can be confusing.
- If the networking metrics of all projects are summed to get a total consumption (as the management console does), then for each host network pod the metrics of its node are added as if they were the metrics of an individual pod, completely distorting the total. For example, a node with three host network pods would have its entire node-level traffic counted three extra times.

So the best solution would be to not account for host network pods in these metrics.

Version-Release number of selected component (if applicable):

4.4

How reproducible:

Always

Steps to Reproduce:
1. Look at the networking metrics of any pod on the host network, such as many system pods.

Actual results:

Host network pod metrics are included and summed as if these were individual pods with their own networking stack, distorting the totals.

Expected results:

Only account for the metrics of pods that are not on the host network.

Additional info:

Comment 2 Frederic Branczyk 2020-07-10 13:25:29 UTC
There are a couple of things mixed together here, I think; let's take them apart. The initial observation that the cluster dashboard shows incorrect data is correct, and this should be fixed. As you mentioned, the container-level metrics expose whatever interfaces the pod has available, which works as intended. I do agree that this can be confusing, so I will ask upstream cAdvisor whether this may be reconsidered, but I am not hopeful, as all of cAdvisor works this way (I opened this upstream issue: https://github.com/google/cadvisor/issues/2615).

My suggestion for a fix in the console would be to have it sum up not the container metrics but the node-level metrics, i.e. these recording rules:

* instance:node_network_receive_bytes_excluding_lo:rate1m
* instance:node_network_transmit_bytes_excluding_lo:rate1m

As this would be done by console, I'm going to transfer this bugzilla to the console component.
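For illustration, a cluster-level utilization panel built on those recording rules might look something like the following. This is a sketch of the approach, not necessarily the exact queries the console ships:

```promql
# Cluster-wide network throughput from node-level recording rules.
# Loopback is already excluded, and host-network pods cannot be
# double-counted because only per-node totals are summed.
sum(instance:node_network_receive_bytes_excluding_lo:rate1m)
sum(instance:node_network_transmit_bytes_excluding_lo:rate1m)

# Per-node top consumers, grouped by the instance label:
topk(25, sort_desc(sum(instance:node_network_receive_bytes_excluding_lo:rate1m) BY (instance)))
```

Because each node's counters appear exactly once, the sum reflects actual node-level traffic regardless of how many host network pods run on each node.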

Comment 3 Pablo Alonso Rodriguez 2020-07-14 11:46:23 UTC
Thanks.

It is important to ensure that, if we sum node-level metrics, all the traffic from pods is still accounted for. That is, if two pods on the same node communicate with each other, that traffic may not be seen on the host's main interfaces (as there would be no need to send anything through VXLAN).

Comment 4 Frederic Branczyk 2020-07-14 12:07:42 UTC
I think you have a point, but I don't see how we could combine that into a single graph; any suggestions? I would propose that the existing graph be about total cluster-level network traffic, which node network statistics are most representative of. I think it would be valid to open an RFE for an *additional* graph covering all pod network traffic (including non-VXLAN traffic). Thoughts?

Comment 5 Pablo Alonso Rodriguez 2020-07-14 12:18:57 UTC
As long as we properly clarify that it is only node-level traffic, I can agree.

The rest of what I suggested may require an RFE, as it may require some degree of redesign.

Comment 6 Rastislav Wagner 2020-07-21 09:27:38 UTC
Shouldn't we also adjust the Top Consumer queries? Right now we use:

topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

Comment 7 Rastislav Wagner 2020-07-21 09:34:04 UTC
Queries are:
Top Project consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

Top Pod consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))

Top Node consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (node)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (node)))

Comment 8 Frederic Branczyk 2020-07-21 10:01:41 UTC
Yes, those queries exhibit the same problem. Top project and top pod are not possible with the current set of metrics. Top node would be fixed with the same metrics/recording rules that I referenced earlier.

Comment 11 Yadan Pei 2020-08-03 06:21:43 UTC
Verification is blocked by bug 1862885.

Comment 12 Yadan Pei 2020-08-24 06:00:53 UTC
Now the cluster utilization queries are:

By Node:
Network in breakdown query is: topk(25, sort_desc(sum(instance:node_network_receive_bytes_excluding_lo:rate1m) BY (instance)))
Network out breakdown query is: topk(25, sort_desc(sum(instance:node_network_transmit_bytes_excluding_lo:rate1m) BY (instance)))

By Project:
Network in breakdown query is: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
Network out breakdown query is: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

By Pod:
Network in breakdown query is: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))
Network out breakdown query is: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))


The By Node queries now use the instance:node_network_receive_bytes_excluding_lo:rate1m and instance:node_network_transmit_bytes_excluding_lo:rate1m recording rules as suggested.

Verified on 4.6.0-0.nightly-2020-08-23-185640

Let me know if this is wrong.

Comment 14 errata-xmlrpc 2020-10-27 16:13:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

