Bug 1670330 - node:node_net_utilisation:sum_irate recording errors due to locked interface [NEEDINFO]
Summary: node:node_net_utilisation:sum_irate recording errors due to locked interface
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Matthias Loibl
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1654907
Depends On:
Blocks:
Reported: 2019-01-29 09:56 UTC by Kim Borup
Modified: 2019-07-23 15:04 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:19 UTC
Target Upstream Version:
mmariyan: needinfo? (mloibl)


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:0758 | None | None | None | 2019-06-04 10:42:25 UTC

Description Kim Borup 2019-01-29 09:56:04 UTC
Description of problem:
In the monitoring operator, the network interface that components record traffic from is hard-coded to eth0. Because this setting is static, some monitors break when the interface has a different name, for example ens192.

Version-Release number of selected component (if applicable):
3.11.z

How reproducible:
Every install

Steps to Reproduce:
1. Install OCP with Monitoring stack on platform where iface name is not eth0
2. Check stats for network in grafana / Prometheus

Actual results:
node:node_net_utilisation:sum_irate is missing in Prometheus because the network interface is not named eth0.

Expected results:
Network graph

Additional info:

I changed the following records from device="eth0" to device="ens192" on my current cluster, which made network monitoring work as intended.

      record: node:node_disk_saturation:avg_irate
    - expr: |
        sum(irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_utilisation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_net_utilisation:sum_irate
    - expr: |
        sum(irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_saturation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:

Comment 1 minden 2019-01-29 14:40:11 UTC
The network interface selector is already configurable in the kubernetes-mixin project [1] and defaults to `eth0` [2].

This lets us adjust the interface selector, but at the wrong stage: when the cluster monitoring operator is compiled, not at run time.

Maybe mloibl@redhat.com knows of a case where we have templated rule values at run time before?



[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/rules.libsonnet#L328

[2] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/config.libsonnet#L15

Comment 2 minden 2019-01-29 14:55:21 UTC
After talking to Matthias, Casey and Frederic, the plan is to change the default value 'device="eth0"' to a regex that ignores the interfaces we don't want. In the long term, the network operator could expose the interface names for us, and the cluster monitoring operator could then template them into the rules manifest.

Assigning to Matthias for now. Let me know if you want me to look into this further.

Comment 3 minden 2019-01-31 16:03:00 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/226 has merged, so this fix will soon be available in OpenShift 4.0. Thanks for the report, and thanks Matthias for looking into this.

Comment 4 Junqi Zhao 2019-03-01 00:22:53 UTC
This seems to be the same bug as bug 1654907.

Comment 6 Junqi Zhao 2019-03-06 15:35:14 UTC
*** Bug 1654907 has been marked as a duplicate of this bug. ***

Comment 7 Junqi Zhao 2019-03-12 08:10:23 UTC
Tested with 4.0.0-0.nightly-2019-03-06-074438.
The device name is no longer restricted to eth0, "veth.+" devices are excluded, and network stats are shown in Grafana / Prometheus.
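
For illustration, a recording rule with the relaxed selector might look like the sketch below. The `device!~"veth.+"` matcher is an assumption inferred from the test note above, and the rest of the rule is copied from the description rather than from the merged PR, so the shipped rules may differ.

```yaml
# Sketch only: the negative regex matcher is assumed from the QA note,
# not taken from the merged cluster-monitoring-operator manifest.
# Prometheus label regexes are fully anchored, so "veth.+" excludes any
# device whose name starts with "veth" followed by at least one character.
- expr: |
    sum(irate(node_network_receive_bytes{job="node-exporter",device!~"veth.+"}[1m])) +
    sum(irate(node_network_transmit_bytes{job="node-exporter",device!~"veth.+"}[1m]))
  record: :node_net_utilisation:sum_irate
```

With a selector of this shape, interfaces such as eth0 and ens192 are both included without naming either one explicitly.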

Comment 20 errata-xmlrpc 2019-06-04 10:42:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

