Bug 1700057

Summary: NodeNetworkInterfaceDown alerts probably should not be alerts
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Type: Bug
Reporter: Seth Jennings <sjenning>
Assignee: Frederic Branczyk <fbranczy>
QA Contact: Junqi Zhao <juzhao>
CC: anpicker, erooth, mloibl, pkrupa, pweil, surbania
Last Closed: 2019-06-04 10:47:37 UTC
Attachments: prom-node-network-up.png

Description Seth Jennings 2019-04-15 17:22:33 UTC
Created attachment 1555287 [details]
prom-node-network-up.png

node-exporter is reporting ~5 interfaces per node in my cluster as down, triggering 35 alerts in Alertmanager (and the console).

All nodes are healthy

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-04-15-011801   True        False         3h29m   Cluster version is 4.0.0-0.ci-2019-04-15-011801

$ oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   3h45m   v1.13.4+412644ac7
master-1   Ready    master   3h45m   v1.13.4+412644ac7
master-2   Ready    master   3h45m   v1.13.4+412644ac7
worker-0   Ready    worker   87m     v1.13.4+412644ac7
worker-1   Ready    worker   3h44m   v1.13.4+412644ac7
worker-2   Ready    worker   3h45m   v1.13.4+412644ac7

$ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.lab.variantweb.net:6443".
[sjennings@cerebellum ~]$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4qlpt              1/1     Running   0          3h45m
ovs-5zlhm              1/1     Running   0          3h45m
ovs-7v42x              1/1     Running   0          3h45m
ovs-9n8cg              1/1     Running   0          3h45m
ovs-dlj92              1/1     Running   0          3h45m
ovs-sz7lc              1/1     Running   0          87m
sdn-controller-4g8bq   1/1     Running   0          3h45m
sdn-controller-fxvm9   1/1     Running   0          3h45m
sdn-controller-js7mj   1/1     Running   0          3h45m
sdn-cx6n9              1/1     Running   0          87m
sdn-jvjr4              1/1     Running   0          3h45m
sdn-m7ks6              1/1     Running   0          3h45m
sdn-qp7pl              1/1     Running   0          3h45m
sdn-th7q5              1/1     Running   0          3h45m
sdn-vlfdk              1/1     Running   0          3h45m

Logging into a worker and running ip a:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5e:f9:72 brd ff:ff:ff:ff:ff:ff
    inet 10.42.10.213/24 brd 10.42.10.255 scope global dynamic noprefixroute ens3
       valid_lft 81087sec preferred_lft 81087sec
    inet6 fe80::eebb:b0d5:76f6:606a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:69:8f:50:35:14 brd ff:ff:ff:ff:ff:ff
4: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:7a:b0:33:41:46 brd ff:ff:ff:ff:ff:ff
5: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether a2:d8:c9:46:18:9e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a0d8:c9ff:fe46:189e/64 scope link 
       valid_lft forever preferred_lft forever
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:0b:b6:ef:4c:61 brd ff:ff:ff:ff:ff:ff
    inet 10.130.2.1/23 brd 10.130.3.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c0b:b6ff:feef:4c61/64 scope link 
       valid_lft forever preferred_lft forever
7: vethbfd58101@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ca:43:09:2b:0e:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c843:9ff:fe2b:e06/64 scope link 
       valid_lft forever preferred_lft forever
8: vethc52f30d5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 4a:90:4e:e2:42:59 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::4890:4eff:fee2:4259/64 scope link 
       valid_lft forever preferred_lft forever
35: veth45bcdc31@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 3e:25:ca:2e:ac:8b brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::3c25:caff:fe2e:ac8b/64 scope link 
       valid_lft forever preferred_lft forever
52: veth035b24fc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether fe:8b:00:58:3a:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::fc8b:ff:fe58:3af5/64 scope link 
       valid_lft forever preferred_lft forever
53: vethda9f00e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether ae:c2:69:fd:3e:68 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::acc2:69ff:fefd:3e68/64 scope link 
       valid_lft forever preferred_lft forever

It is true that those interfaces are down; however, it isn't a problem. Should these be alerts?
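For context, node_exporter exposes a per-interface node_network_up metric (1 when the interface's operstate is "up"), so a rule keyed on that metric fires for devices such as ovs-system and br0, which OVS keeps administratively DOWN by design. A minimal sketch of the kind of query and rule involved, assuming the alert is based on node_network_up; the exact expression, duration, and annotations shipped in cluster-monitoring-operator may differ:

# PromQL: list interfaces currently reported as down
node_network_up == 0

# Illustrative alerting rule of this shape (not the shipped rule; "for" and the annotation are placeholders)
- alert: NodeNetworkInterfaceDown
  expr: node_network_up == 0
  for: 10m
  annotations:
    message: Interface {{ $labels.device }} on {{ $labels.instance }} is reporting down.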

Comment 1 Frederic Branczyk 2019-04-16 09:46:06 UTC
I talked to the SDN team and indeed this alert is too aggressive. For now we're removing it, but we're working with the SDN team to specify exactly which interfaces are important and to alert only on those.
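One illustrative way to scope such a rule later would be to filter the device label so that virtual devices created by OVS/SDN are ignored. This is a sketch only; the excluded interface list below is a hypothetical example, not a set agreed with the SDN team:

# Sketch: ignore known virtual devices (device list is illustrative)
node_network_up{device!~"lo|veth.*|ovs-system|br0|tun0|vxlan_sys_.*"} == 0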

Comment 3 Frederic Branczyk 2019-04-16 13:36:19 UTC
The PR to remove this alert has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/326

Comment 4 Frederic Branczyk 2019-04-17 12:13:30 UTC
The above PR has been merged.

Comment 6 Junqi Zhao 2019-04-18 02:51:03 UTC
There is no OCP payload available yet that packages the fix, so testing is postponed until a payload is available.

Comment 7 Junqi Zhao 2019-04-19 06:41:32 UTC
The NodeNetworkInterfaceDown alert has been removed from the Prometheus rules file.
payload: 4.0.0-0.nightly-2019-04-18-190537
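
One way to spot-check a cluster, assuming the rules are published as PrometheusRule objects in the openshift-monitoring namespace (object names vary, so grep across all of them):

$ oc -n openshift-monitoring get prometheusrules -o yaml | grep NodeNetworkInterfaceDown

No matches are expected once the rule has been removed.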

Comment 9 errata-xmlrpc 2019-06-04 10:47:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758