Bug 1700057 - NodeNetworkInterfaceDown alerts probably should not be alerts
Summary: NodeNetworkInterfaceDown alerts probably should not be alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-15 17:22 UTC by Seth Jennings
Modified: 2019-06-04 10:47 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
prom-node-network-up.png (224.88 KB, image/png)
2019-04-15 17:22 UTC, Seth Jennings


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:47:43 UTC

Description Seth Jennings 2019-04-15 17:22:33 UTC
Created attachment 1555287 [details]
prom-node-network-up.png

node-exporter is reporting ~5 interfaces per node in my cluster are down triggering 35 alerts in alert manager (and the console).
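For context, node_exporter exposes per-interface link state as the node_network_up metric (1 = operationally up, 0 = down), which is what this alert keys off. A minimal sketch of what such a rule might look like follows; this is illustrative only, since the actual NodeNetworkInterfaceDown rule ships in cluster-monitoring-operator and its exact expression may differ:

```yaml
# Illustrative sketch only; the real NodeNetworkInterfaceDown rule lives in
# cluster-monitoring-operator and its exact expression may differ.
groups:
- name: node-network
  rules:
  - alert: NodeNetworkInterfaceDown
    # node_network_up comes from node_exporter: 1 = interface up, 0 = down.
    # With no device filter, intentionally-down interfaces such as br0 and
    # ovs-system also match, producing the alert noise described here.
    expr: node_network_up == 0
    for: 15m
    annotations:
      message: Network interface {{ $labels.device }} on {{ $labels.instance }} is down
```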

All nodes are healthy

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-04-15-011801   True        False         3h29m   Cluster version is 4.0.0-0.ci-2019-04-15-011801

$ oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   3h45m   v1.13.4+412644ac7
master-1   Ready    master   3h45m   v1.13.4+412644ac7
master-2   Ready    master   3h45m   v1.13.4+412644ac7
worker-0   Ready    worker   87m     v1.13.4+412644ac7
worker-1   Ready    worker   3h44m   v1.13.4+412644ac7
worker-2   Ready    worker   3h45m   v1.13.4+412644ac7

$ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.lab.variantweb.net:6443".
[sjennings@cerebellum ~]$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4qlpt              1/1     Running   0          3h45m
ovs-5zlhm              1/1     Running   0          3h45m
ovs-7v42x              1/1     Running   0          3h45m
ovs-9n8cg              1/1     Running   0          3h45m
ovs-dlj92              1/1     Running   0          3h45m
ovs-sz7lc              1/1     Running   0          87m
sdn-controller-4g8bq   1/1     Running   0          3h45m
sdn-controller-fxvm9   1/1     Running   0          3h45m
sdn-controller-js7mj   1/1     Running   0          3h45m
sdn-cx6n9              1/1     Running   0          87m
sdn-jvjr4              1/1     Running   0          3h45m
sdn-m7ks6              1/1     Running   0          3h45m
sdn-qp7pl              1/1     Running   0          3h45m
sdn-th7q5              1/1     Running   0          3h45m
sdn-vlfdk              1/1     Running   0          3h45m

Logging into a worker and running "ip addr" shows the interfaces in question:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5e:f9:72 brd ff:ff:ff:ff:ff:ff
    inet 10.42.10.213/24 brd 10.42.10.255 scope global dynamic noprefixroute ens3
       valid_lft 81087sec preferred_lft 81087sec
    inet6 fe80::eebb:b0d5:76f6:606a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:69:8f:50:35:14 brd ff:ff:ff:ff:ff:ff
4: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:7a:b0:33:41:46 brd ff:ff:ff:ff:ff:ff
5: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether a2:d8:c9:46:18:9e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a0d8:c9ff:fe46:189e/64 scope link 
       valid_lft forever preferred_lft forever
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:0b:b6:ef:4c:61 brd ff:ff:ff:ff:ff:ff
    inet 10.130.2.1/23 brd 10.130.3.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c0b:b6ff:feef:4c61/64 scope link 
       valid_lft forever preferred_lft forever
7: vethbfd58101@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ca:43:09:2b:0e:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c843:9ff:fe2b:e06/64 scope link 
       valid_lft forever preferred_lft forever
8: vethc52f30d5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 4a:90:4e:e2:42:59 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::4890:4eff:fee2:4259/64 scope link 
       valid_lft forever preferred_lft forever
35: veth45bcdc31@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 3e:25:ca:2e:ac:8b brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::3c25:caff:fe2e:ac8b/64 scope link 
       valid_lft forever preferred_lft forever
52: veth035b24fc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether fe:8b:00:58:3a:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::fc8b:ff:fe58:3af5/64 scope link 
       valid_lft forever preferred_lft forever
53: vethda9f00e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ae:c2:69:fd:3e:68 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::acc2:69ff:fefd:3e68/64 scope link 
       valid_lft forever preferred_lft forever

The interfaces are indeed down; however, that isn't a problem. Should these be alerts?

Comment 1 Frederic Branczyk 2019-04-16 09:46:06 UTC
I talked to the SDN team and indeed this alert is too aggressive. For now we're removing it, but we're working with the SDN team to specify exactly which interfaces are important and to alert only on those.
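One possible shape for such a scoped rule, as a sketch only: it assumes the SDN-internal device names seen in this report (veth pairs, br0, ovs-system, vxlan_sys_*, tun0); the real exclusion list would come from the SDN team.

```yaml
# Hypothetical device filter: ignore SDN-managed virtual interfaces and
# alert only when a remaining (e.g. physical) interface goes down.
expr: node_network_up{device!~"veth.+|br.+|ovs-system|vxlan_sys_.+|tun.+"} == 0
```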

Comment 3 Frederic Branczyk 2019-04-16 13:36:19 UTC
The PR to remove this alert has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/326

Comment 4 Frederic Branczyk 2019-04-17 12:13:30 UTC
The above PR has been merged.

Comment 6 Junqi Zhao 2019-04-18 02:51:03 UTC
There is no OCP payload available yet that includes the fix, so testing is postponed until a payload is available.

Comment 7 Junqi Zhao 2019-04-19 06:41:32 UTC
The NodeNetworkInterfaceDown alert has been removed from the Prometheus rules file.
payload: 4.0.0-0.nightly-2019-04-18-190537

Comment 9 errata-xmlrpc 2019-06-04 10:47:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

