Bug 1670330

Summary: node:node_net_utilisation:sum_irate recording errors due to locked interface
Product: OpenShift Container Platform
Reporter: Kim Borup <kborup>
Component: Monitoring
Assignee: Matthias Loibl <mloibl>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: adeshpan, calfonso, dcaldwel, grodrigu, jkaur, lserven, mloibl, mluther, mmariyan, rdiazgav, romank, sauchter, sponnaga, surbania
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Kim Borup 2019-01-29 09:56:04 UTC
Description of problem:
In the monitoring operator, the network recording rules are hardcoded to read traffic from interface eth0. Because this setting is static, the affected rules return no data on hosts whose interface has a different name, for example ens192.

Version-Release number of selected component (if applicable):
3.11.z

How reproducible:
Every install

Steps to Reproduce:
1. Install OCP with Monitoring stack on platform where iface name is not eth0
2. Check the network stats in Grafana / Prometheus

Actual results:
node:node_net_utilisation:sum_irate is missing in Prometheus because the network interface is not named eth0.

Expected results:
Network graph

Additional info:

I changed the following rules from device="eth0" to device="ens192" on my current cluster, after which network monitoring started working as intended.

      record: node:node_disk_saturation:avg_irate
    - expr: |
        sum(irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_utilisation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_net_utilisation:sum_irate
    - expr: |
        sum(irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_saturation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
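
For illustration, the cluster-wide utilisation rule with the interface name swapped would look roughly like this. The value ens192 is specific to this cluster and is only an example; other platforms will use other names.

```yaml
# Sketch of the manual workaround: same rule as above, with the
# hardcoded device label changed from eth0 to ens192 (cluster-specific).
- expr: |
    sum(irate(node_network_receive_bytes{job="node-exporter",device="ens192"}[1m])) +
    sum(irate(node_network_transmit_bytes{job="node-exporter",device="ens192"}[1m]))
  record: :node_net_utilisation:sum_irate
```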

Comment 1 minden 2019-01-29 14:40:11 UTC
The network interface selector is already configurable in the kubernetes-mixin project [1], defaulting to `eth0` [2].

This lets us adjust the interface selector, but at the wrong stage: when the cluster monitoring operator is compiled, not at run time.

Maybe mloibl knows of a case where we have templated rule values at run time before?



[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/rules.libsonnet#L328

[2] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/config.libsonnet#L15

Comment 2 minden 2019-01-29 14:55:21 UTC
After talking to Matthias, Casey and Frederic: we can change the default value 'device="eth0"' to a regex that excludes the interfaces we don't want. In the long term, the network operator could expose the interface names for us, and the cluster monitoring operator could then template them into the rules manifest.

Assigning to Matthias for now. Let me know if you want me to look into this further.
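
A rough sketch of what the regex approach could look like for the cluster-wide utilisation rule. The exact selector here is an assumption: the pattern "veth.+" is taken from the later verification comment on this bug, and the merged change may exclude further interfaces.

```yaml
# Assumed selector: exclude virtual ethernet devices instead of
# requiring a fixed interface name such as eth0.
- expr: |
    sum(irate(node_network_receive_bytes{job="node-exporter",device!~"veth.+"}[1m])) +
    sum(irate(node_network_transmit_bytes{job="node-exporter",device!~"veth.+"}[1m]))
  record: :node_net_utilisation:sum_irate
```

Prometheus regex matchers are fully anchored, so "veth.+" only excludes devices whose whole name is "veth" followed by at least one character. Every other device, including eth0 and ens192, is still counted.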

Comment 3 minden 2019-01-31 16:03:00 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/226 has been merged, so this fix will soon be available in OpenShift 4.0. Thanks for the report, and thanks Matthias for looking into this.

Comment 4 Junqi Zhao 2019-03-01 00:22:53 UTC
This seems to be the same bug as bug 1654907.

Comment 6 Junqi Zhao 2019-03-06 15:35:14 UTC
*** Bug 1654907 has been marked as a duplicate of this bug. ***

Comment 7 Junqi Zhao 2019-03-12 08:10:23 UTC
Tested with 4.0.0-0.nightly-2019-03-06-074438.
The device name is no longer restricted to eth0, "veth.+" devices are excluded, and network stats are shown in Grafana / Prometheus.

Comment 20 errata-xmlrpc 2019-06-04 10:42:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Comment 23 Red Hat Bugzilla 2023-09-15 00:15:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days