1670330 – node:node_net_utilisation:sum_irate recording errors due to locked interface

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Matthias Loibl
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1654907
Depends On:
Blocks:

Reported:	2019-01-29 09:56 UTC by Kim Borup
Modified:	2023-09-15 00:15 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:42:19 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:42:25 UTC

Description Kim Borup 2019-01-29 09:56:04 UTC

Description of problem:
In the monitoring operator, the recording rules for network traffic are hard-coded to the interface eth0. Because this setting is static, some monitors break when the interface has a different name, such as ens192.

Version-Release number of selected component (if applicable):
3.11.z

How reproducible:
Every install

Steps to Reproduce:
1. Install OCP with Monitoring stack on platform where iface name is not eth0
2. Check the network stats in Grafana / Prometheus

Actual results:
node:node_net_utilisation:sum_irate is missing in Prometheus because the network interface is not named eth0.

Expected results:
The network graph is displayed.

Additional info:

I changed device="eth0" to device="ens192" in the following records (shown here as shipped, before the change) on my current cluster, which made network monitoring work as intended.

      record: node:node_disk_saturation:avg_irate
    - expr: |
        sum(irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_utilisation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_bytes{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_bytes{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_net_utilisation:sum_irate
    - expr: |
        sum(irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_saturation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_drop{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_drop{job="node-exporter",device="eth0"}[1m]))
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:

Comment 1 minden 2019-01-29 14:40:11 UTC

The network interface selector is already configurable in the kubernetes-mixin project [1], defaulting to `eth0` [2].

This lets us adjust the interface selector, but at the wrong stage: at cluster-monitoring-operator compile time rather than at run time.

Maybe mloibl knows of a case where we have templated rule values at run time before?



[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/rules/rules.libsonnet#L328

[2] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/config.libsonnet#L15

Comment 2 minden 2019-01-29 14:55:21 UTC

After talking to Matthias, Casey and Frederic, we can change the default value 'device="eth0"' to a regex ignoring the interfaces that we don't want. In the long term the network operator could expose the names for us, which could then be templated into the rules manifest by the cluster monitoring operator.

Assigning to Matthias for now. Let me know if you want me to further look into this.
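
For illustration, the regex approach described above could look roughly like the following. This is only a sketch, assuming the selector becomes a negative regex matcher that excludes `veth` interfaces (the pattern `veth.+` is taken from the verification in comment 7). The exact selector shipped in the fix may differ.

```yaml
# Sketch only: assumes device="eth0" is replaced by a negative regex
# matcher, so any interface name (eth0, ens192, ...) is counted while
# virtual veth pairs are ignored.
- expr: |
    sum(irate(node_network_receive_bytes{job="node-exporter",device!~"veth.+"}[1m])) +
    sum(irate(node_network_transmit_bytes{job="node-exporter",device!~"veth.+"}[1m]))
  record: :node_net_utilisation:sum_irate
```

With a matcher like this, the rule no longer depends on the host's interface naming scheme, which is why no per-cluster edit such as the ens192 change in the description would be needed.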

Comment 3 minden 2019-01-31 16:03:00 UTC

https://github.com/openshift/cluster-monitoring-operator/pull/226 has merged, so this fix will soon be available in OpenShift 4.0. Thanks for the report, and thanks Matthias for looking into this.

Comment 4 Junqi Zhao 2019-03-01 00:22:53 UTC

This seems to be the same issue as bug 1654907.

Comment 6 Junqi Zhao 2019-03-06 15:35:14 UTC

*** Bug 1654907 has been marked as a duplicate of this bug. ***

Comment 7 Junqi Zhao 2019-03-12 08:10:23 UTC

Tested with 4.0.0-0.nightly-2019-03-06-074438
The device name is no longer restricted to eth0, "veth.+" devices are excluded, and network stats are shown in Grafana / Prometheus.

Comment 20 errata-xmlrpc 2019-06-04 10:42:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Comment 23 Red Hat Bugzilla 2023-09-15 00:15:29 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
