Bug 1855556
Summary: | networking metrics include host network pods | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> | |
Component: | Management Console | Assignee: | Rastislav Wagner <rawagner> | |
Status: | CLOSED ERRATA | QA Contact: | Yadan Pei <yapei> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 4.4 | CC: | alegrand, anpicker, aos-bugs, erooth, jokerman, kakkoyun, lcosic, mloibl, nmukherj, pkrupa, spadgett, spasquie, surbania, yapei | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: The Overview page used an incorrect Prometheus metric to show network utilization.
Consequence: Network utilization included container-level usage.
Fix: Use node-level metrics for network utilization.
Result: Network utilization shows correct data.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1873053 | Environment: | |
Last Closed: | 2020-10-27 16:13:44 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1862885 | |||
Bug Blocks: | 1873053 |
Description
Pablo Alonso Rodriguez
2020-07-10 07:54:48 UTC
There are a couple of things mixed here, I think; let's take them apart. I think the initial observation that the cluster dashboard shows incorrect data is correct, and this should be fixed, since, as you mentioned, the container-level metrics expose whatever interface they have available, which works as intended. I do agree that this can be confusing. I will ask upstream cAdvisor whether this may be reconsidered, but I am not hopeful, as all of cAdvisor works this way (I opened this upstream issue: https://github.com/google/cadvisor/issues/2615).

My suggestion for a fix of the console would be to have the console sum up not the container metrics but the node-level metrics, which would be these recording rules:
* instance:node_network_receive_bytes_excluding_lo:rate1m
* instance:node_network_transmit_bytes_excluding_lo:rate1m

As this would be done by console, I'm going to transfer this bugzilla to the console component. Thanks.

It is important to ensure that, if we sum node-level metrics, all the traffic from pods is still accounted for. I mean: if 2 pods on the same node communicate with each other, that traffic may not be seen on the host's main interfaces (as there would be no need to send anything through vxlan).

I think you have a point, but I don't see how we could combine that into a single graph. Any suggestions? I would propose that the graph we have is about total cluster-level network traffic, for which node network statistics would be most representative. I think it would be valid to open an RFE for an *additional* graph showing all pod network traffic (including non-vxlan traffic). Thoughts?

As long as we properly clarify that it is only node-level traffic, I can agree. The rest of what I suggested may require an RFE, as it may require some degree of redesign.

Shouldn't we also adjust Top Consumer queries?
Right now we use:
topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

Queries are:

Top Project consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

Top Pod consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))

Top Node consumers
IN: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (node)))
OUT: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (node)))

Yes, those queries are exhibiting the same problem. Top project and top pod are not possible with the current set of metrics. Top node would be fixed with the same metrics/recording rules that I referenced earlier.
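The queries in this thread differ only in the metric, the grouping labels, and whether they start from a raw counter or from a recording rule. The following is a minimal Python sketch, not the console's actual code, that rebuilds the query strings so the two shapes can be compared side by side. The helper names `top_consumers_from_counter` and `top_consumers_from_rule` are hypothetical, and grouping the node-level query by `instance` is an assumption based on the `instance:` prefix of the recording rule names.

```python
from typing import Sequence

# Sketch only: these helpers are hypothetical and are not part of the console.

# Label selector copied verbatim from the container-level queries in this bug.
CONTAINER_SELECTOR = '{ container="POD", pod!= ""}'


def top_consumers_from_counter(
    metric: str, by: Sequence[str], window: str = "5m", k: int = 25
) -> str:
    """Build a top-k query over the per-second rate of a raw counter."""
    labels = ", ".join(by)
    return (
        f"topk({k}, sort_desc(sum(rate({metric}{CONTAINER_SELECTOR}[{window}])) "
        f"BY ({labels})))"
    )


def top_consumers_from_rule(
    rule: str, by: Sequence[str] = ("instance",), k: int = 25
) -> str:
    """Build a top-k query over a recording rule that already holds a rate."""
    labels = ", ".join(by)
    return f"topk({k}, sort_desc(sum({rule}) BY ({labels})))"


# Container-level query, as currently used for "Top Project consumers" (IN).
project_in = top_consumers_from_counter(
    "container_network_receive_bytes_total", ("namespace",)
)

# Node-level queries built on the recording rules suggested in this bug.
node_in = top_consumers_from_rule(
    "instance:node_network_receive_bytes_excluding_lo:rate1m"
)
node_out = top_consumers_from_rule(
    "instance:node_network_transmit_bytes_excluding_lo:rate1m"
)

print(project_in)
print(node_in)
print(node_out)
```

The recording rule names end in `rate1m`, so the node-level helper applies no `rate()` of its own and only sums and ranks the precomputed values.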
Verification blocked by bz 1862885.

Now cluster utilization query is:

By Node:
Network in breakdown query is: topk(25, sort_desc(sum(instance:node_network_receive_bytes_excluding_lo:rate1m) BY (instance)))
Network out breakdown query is: topk(25, sort_desc(sum(instance:node_network_transmit_bytes_excluding_lo:rate1m) BY (instance)))

By Project:
Network in breakdown query is: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))
Network out breakdown query is: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace)))

By Pods:
Network in breakdown query is: topk(25, sort_desc(sum(rate(container_network_receive_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))
Network out breakdown query is: topk(25, sort_desc(sum(rate(container_network_transmit_bytes_total{ container="POD", pod!= ""}[5m])) BY (namespace, pod)))

The By Node metrics query uses node_network_receive_bytes_excluding_lo & node_network_transmit_bytes_excluding_lo, as suggested.

Verified on 4.6.0-0.nightly-2020-08-23-185640. Let me know if this is wrong.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196