Bug 1950810

Summary: zero for container_network_tcp_usage_total and container_network_udp_usage_total
Product: OpenShift Container Platform
Reporter: Venkata Tadimarri <ktadimar>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: alegrand, anpicker, aos-bugs, arajkuma, erooth, kakkoyun, lcosic, mahmad, mirollin, nagrawal, pkrupa, rphillips, spasquie, surbania
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-07 11:01:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Venkata Tadimarri 2021-04-18 22:27:23 UTC
Description of problem:

Cloned from: https://bugzilla.redhat.com/show_bug.cgi?id=1668315


Version-Release number of selected component (if applicable):
 3.11.306


Secure environment.


The customer is seeing non-zero values for container_network_tcp_usage_total and container_network_udp_usage_total.

As per the bug mentioned earlier (1668315) and https://github.com/google/cadvisor/issues/1925, these values are supposed to be zero and disabled. However, this doesn't seem to be the case.

Example 1: 

[openshift@master-1 ~]$ server=app-node-0.openshift.mydomain
[openshift@master-1 ~]$ curl -s -X GET -H "Authorization: Bearer $(oc whoami -t)" https://$server:10250/metrics/cadvisor |egrep '(container_network_tcp_usage_total|container_network_udp_usage_total)'  |wc -l
3694
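
The same count can be broken down per metric. The following is a minimal Python sketch, not part of the original report: it counts sample lines in Prometheus text-format output and skips the "# HELP" / "# TYPE" comment lines, which the grep above may also match. The SAMPLE text and the function name count_samples are hypothetical and only illustrate the idea.

```python
# Metric names taken from this bug report.
TARGET_METRICS = (
    "container_network_tcp_usage_total",
    "container_network_udp_usage_total",
)

def count_samples(exposition_text, metric_names=TARGET_METRICS):
    """Count sample lines per metric name in Prometheus text-format output."""
    counts = {name: 0 for name in metric_names}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or at the first whitespace.
        head = line.split("{", 1)[0].split(None, 1)
        if head and head[0] in counts:
            counts[head[0]] += 1
    return counts

# Hypothetical sample input, not captured from the affected cluster.
SAMPLE = """\
# HELP container_network_tcp_usage_total tcp connection usage statistic for container
# TYPE container_network_tcp_usage_total gauge
container_network_tcp_usage_total{id="/",tcp_state="close"} 0
container_network_tcp_usage_total{id="/",tcp_state="listen"} 0
container_network_udp_usage_total{id="/",udp_state="listen"} 0
container_cpu_usage_seconds_total{id="/"} 12.5
"""

if __name__ == "__main__":
    print(count_samples(SAMPLE))
```

In practice the input would be the body returned by the curl command above, saved to a file or piped in.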

Example 2:

The way to check is by running the following query in the Prometheus UI:


URL: https://prometheus-k8s-openshift-monitoring.apps.openshift.mydomain/graph?g0.range_input=1h&g0.expr=topk(10%2C%20count%20by%20(__name__)(%7B__name__%3D~%22.%2B%22%7D))&g0.tab=1


Query: topk(10, count by (__name__)({__name__=~".+"}))


Results: container_network_tcp_usage_total has a non-zero value of 175450 when it is supposed to be zero, which creates extra load on the monitoring solution.
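
For reference, the query counts series per metric name rather than reading any sample value, so the figure it returns is a number of series. A minimal Python sketch of that aggregation follows, assuming a hypothetical in-memory list of label sets (not data from this cluster); the function name topk_series_by_name is illustrative only.

```python
from collections import Counter

def topk_series_by_name(series, k=10):
    """Mimic topk(k, count by (__name__)(...)): series count per metric name."""
    counts = Counter(labels["__name__"] for labels in series)
    # most_common returns (name, count) pairs, highest count first.
    return counts.most_common(k)

# Hypothetical label sets; each dict stands for one time series.
EXAMPLE_SERIES = [
    {"__name__": "container_network_tcp_usage_total", "tcp_state": "close"},
    {"__name__": "container_network_tcp_usage_total", "tcp_state": "listen"},
    {"__name__": "container_network_tcp_usage_total", "tcp_state": "established"},
    {"__name__": "container_network_udp_usage_total", "udp_state": "listen"},
    {"__name__": "container_network_udp_usage_total", "udp_state": "dropped"},
    {"__name__": "up", "job": "kubelet"},
]

if __name__ == "__main__":
    for name, count in topk_series_by_name(EXAMPLE_SERIES, k=2):
        print(name, count)
```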

cAdvisor is producing these metrics even though it is not supposed to, which causes performance problems and, over time, affects the customer's ability to monitor their environments effectively.

Comment 4 Mohammad 2021-04-26 20:39:38 UTC
Sorry, re-opening this as we need the fix which was done for https://bugzilla.redhat.com/show_bug.cgi?id=1668315 in OCP 4.1 backported to OCP 3.11.

Basically, the stats (as per https://bugzilla.redhat.com/show_bug.cgi?id=1668315#c3) should be zero, but they are not. Happy to provide more info.

Comment 18 Junqi Zhao 2021-06-30 13:14:59 UTC
Tested with ose-cluster-monitoring-operator:v3.11.463; the container_network_tcp_usage_total and container_network_udp_usage_total metrics are removed.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep -E "container_network_tcp_usage_total|container_network_udp_usage_total"
no result
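
The same check can be scripted against the JSON that the label values endpoint returns. This is a minimal Python sketch, assuming the standard Prometheus response shape of {"status": "success", "data": [...]}; the RESPONSE_* strings and the function name unwanted_metric_names are hypothetical and not taken from the tested cluster.

```python
import json

# Metric names taken from this bug report.
UNWANTED = (
    "container_network_tcp_usage_total",
    "container_network_udp_usage_total",
)

def unwanted_metric_names(response_text, unwanted=UNWANTED):
    """Return the unwanted metric names still listed in a label values response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("label values query did not succeed")
    names = set(payload.get("data", []))
    return sorted(names.intersection(unwanted))

# Hypothetical responses: one after the fix, one before it.
RESPONSE_FIXED = '{"status": "success", "data": ["up", "container_cpu_usage_seconds_total"]}'
RESPONSE_BROKEN = '{"status": "success", "data": ["up", "container_network_tcp_usage_total"]}'

if __name__ == "__main__":
    print(unwanted_metric_names(RESPONSE_FIXED))
    print(unwanted_metric_names(RESPONSE_BROKEN))
```

An empty list corresponds to the "no result" outcome above.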

Comment 22 errata-xmlrpc 2021-07-07 11:01:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.465 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2639