Bug 1700057 - NodeNetworkInterfaceDown alerts probably should not be alerts
Summary: NodeNetworkInterfaceDown alerts probably should not be alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-15 17:22 UTC by Seth Jennings
Modified: 2019-06-04 10:47 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
prom-node-network-up.png (224.88 KB, image/png)
2019-04-15 17:22 UTC, Seth Jennings


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:47:43 UTC

Description Seth Jennings 2019-04-15 17:22:33 UTC
Created attachment 1555287 [details]
prom-node-network-up.png

node-exporter is reporting ~5 interfaces per node in my cluster are down triggering 35 alerts in alert manager (and the console).
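For context, node_exporter exposes per-interface link state as the node_network_up metric (1 = operationally up, 0 = down), which is what this alert keys off. A minimal sketch of what such a rule might look like follows; this is illustrative only, since the actual NodeNetworkInterfaceDown rule ships in cluster-monitoring-operator and its exact expression may differ:

```yaml
# Illustrative sketch only; the real NodeNetworkInterfaceDown rule lives in
# cluster-monitoring-operator and its exact expression may differ.
groups:
- name: node-network
  rules:
  - alert: NodeNetworkInterfaceDown
    # node_network_up comes from node_exporter: 1 = interface up, 0 = down.
    # With no device filter, intentionally-down interfaces such as br0 and
    # ovs-system also match, producing the alert noise described here.
    expr: node_network_up == 0
    for: 15m
    annotations:
      message: Network interface {{ $labels.device }} on {{ $labels.instance }} is down
```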

All nodes are healthy

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-04-15-011801   True        False         3h29m   Cluster version is 4.0.0-0.ci-2019-04-15-011801

$ oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   3h45m   v1.13.4+412644ac7
master-1   Ready    master   3h45m   v1.13.4+412644ac7
master-2   Ready    master   3h45m   v1.13.4+412644ac7
worker-0   Ready    worker   87m     v1.13.4+412644ac7
worker-1   Ready    worker   3h44m   v1.13.4+412644ac7
worker-2   Ready    worker   3h45m   v1.13.4+412644ac7

$ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.lab.variantweb.net:6443".
[sjennings@cerebellum ~]$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4qlpt              1/1     Running   0          3h45m
ovs-5zlhm              1/1     Running   0          3h45m
ovs-7v42x              1/1     Running   0          3h45m
ovs-9n8cg              1/1     Running   0          3h45m
ovs-dlj92              1/1     Running   0          3h45m
ovs-sz7lc              1/1     Running   0          87m
sdn-controller-4g8bq   1/1     Running   0          3h45m
sdn-controller-fxvm9   1/1     Running   0          3h45m
sdn-controller-js7mj   1/1     Running   0          3h45m
sdn-cx6n9              1/1     Running   0          87m
sdn-jvjr4              1/1     Running   0          3h45m
sdn-m7ks6              1/1     Running   0          3h45m
sdn-qp7pl              1/1     Running   0          3h45m
sdn-th7q5              1/1     Running   0          3h45m
sdn-vlfdk              1/1     Running   0          3h45m

Logging into a worker and running "ip addr" shows the interfaces in question:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5e:f9:72 brd ff:ff:ff:ff:ff:ff
    inet 10.42.10.213/24 brd 10.42.10.255 scope global dynamic noprefixroute ens3
       valid_lft 81087sec preferred_lft 81087sec
    inet6 fe80::eebb:b0d5:76f6:606a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:69:8f:50:35:14 brd ff:ff:ff:ff:ff:ff
4: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:7a:b0:33:41:46 brd ff:ff:ff:ff:ff:ff
5: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether a2:d8:c9:46:18:9e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a0d8:c9ff:fe46:189e/64 scope link 
       valid_lft forever preferred_lft forever
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:0b:b6:ef:4c:61 brd ff:ff:ff:ff:ff:ff
    inet 10.130.2.1/23 brd 10.130.3.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c0b:b6ff:feef:4c61/64 scope link 
       valid_lft forever preferred_lft forever
7: vethbfd58101@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ca:43:09:2b:0e:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c843:9ff:fe2b:e06/64 scope link 
       valid_lft forever preferred_lft forever
8: vethc52f30d5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 4a:90:4e:e2:42:59 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::4890:4eff:fee2:4259/64 scope link 
       valid_lft forever preferred_lft forever
35: veth45bcdc31@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 3e:25:ca:2e:ac:8b brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::3c25:caff:fe2e:ac8b/64 scope link 
       valid_lft forever preferred_lft forever
52: veth035b24fc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether fe:8b:00:58:3a:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::fc8b:ff:fe58:3af5/64 scope link 
       valid_lft forever preferred_lft forever
53: vethda9f00e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ae:c2:69:fd:3e:68 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::acc2:69ff:fefd:3e68/64 scope link 
       valid_lft forever preferred_lft forever

The interfaces are indeed down; however, that isn't a problem. Should these be alerts?

Comment 1 Frederic Branczyk 2019-04-16 09:46:06 UTC
I talked to the SDN team and indeed this alert is too aggressive. For now we're removing it, but we're working with the SDN team to specify exactly which interfaces are important and to alert only on those.
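One possible shape for such a scoped rule, as a sketch only: it assumes the SDN-internal device names seen in this report (veth pairs, br0, ovs-system, vxlan_sys_*, tun0); the real exclusion list would come from the SDN team.

```yaml
# Hypothetical device filter: ignore SDN-managed virtual interfaces and
# alert only when a remaining (e.g. physical) interface goes down.
expr: node_network_up{device!~"veth.+|br.+|ovs-system|vxlan_sys_.+|tun.+"} == 0
```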

Comment 3 Frederic Branczyk 2019-04-16 13:36:19 UTC
The PR to remove this alert has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/326

Comment 4 Frederic Branczyk 2019-04-17 12:13:30 UTC
The above PR has been merged.

Comment 6 Junqi Zhao 2019-04-18 02:51:03 UTC
There is no OCP payload available yet that includes the fix, so testing is postponed until a payload is available.

Comment 7 Junqi Zhao 2019-04-19 06:41:32 UTC
The NodeNetworkInterfaceDown alert has been removed from the Prometheus rules file.
payload: 4.0.0-0.nightly-2019-04-18-190537

Comment 9 errata-xmlrpc 2019-06-04 10:47:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

