Created attachment 1555287 prom-node-network-up.png

node-exporter is reporting ~5 interfaces per node in my cluster are down, triggering 35 alerts in alert manager (and the console).

All nodes are healthy.

$ ogcv
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-04-15-011801   True        False         3h29m   Cluster version is 4.0.0-0.ci-2019-04-15-011801

$ oc get nodes
NAME       STATUS   ROLES    AGE     VERSION
master-0   Ready    master   3h45m   v1.13.4+412644ac7
master-1   Ready    master   3h45m   v1.13.4+412644ac7
master-2   Ready    master   3h45m   v1.13.4+412644ac7
worker-0   Ready    worker   87m     v1.13.4+412644ac7
worker-1   Ready    worker   3h44m   v1.13.4+412644ac7
worker-2   Ready    worker   3h45m   v1.13.4+412644ac7

$ oc project openshift-sdn
Now using project "openshift-sdn" on server "https://api.lab.variantweb.net:6443".

[sjennings@cerebellum ~]$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4qlpt              1/1     Running   0          3h45m
ovs-5zlhm              1/1     Running   0          3h45m
ovs-7v42x              1/1     Running   0          3h45m
ovs-9n8cg              1/1     Running   0          3h45m
ovs-dlj92              1/1     Running   0          3h45m
ovs-sz7lc              1/1     Running   0          87m
sdn-controller-4g8bq   1/1     Running   0          3h45m
sdn-controller-fxvm9   1/1     Running   0          3h45m
sdn-controller-js7mj   1/1     Running   0          3h45m
sdn-cx6n9              1/1     Running   0          87m
sdn-jvjr4              1/1     Running   0          3h45m
sdn-m7ks6              1/1     Running   0          3h45m
sdn-qp7pl              1/1     Running   0          3h45m
sdn-th7q5              1/1     Running   0          3h45m
sdn-vlfdk              1/1     Running   0          3h45m

Logging into a worker:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5e:f9:72 brd ff:ff:ff:ff:ff:ff
    inet 10.42.10.213/24 brd 10.42.10.255 scope global dynamic noprefixroute ens3
       valid_lft 81087sec preferred_lft 81087sec
    inet6 fe80::eebb:b0d5:76f6:606a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:69:8f:50:35:14 brd ff:ff:ff:ff:ff:ff
4: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
    link/ether 1a:7a:b0:33:41:46 brd ff:ff:ff:ff:ff:ff
5: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether a2:d8:c9:46:18:9e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a0d8:c9ff:fe46:189e/64 scope link
       valid_lft forever preferred_lft forever
6: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:0b:b6:ef:4c:61 brd ff:ff:ff:ff:ff:ff
    inet 10.130.2.1/23 brd 10.130.3.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c0b:b6ff:feef:4c61/64 scope link
       valid_lft forever preferred_lft forever
7: vethbfd58101@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether ca:43:09:2b:0e:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c843:9ff:fe2b:e06/64 scope link
       valid_lft forever preferred_lft forever
8: vethc52f30d5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether 4a:90:4e:e2:42:59 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::4890:4eff:fee2:4259/64 scope link
       valid_lft forever preferred_lft forever
35: veth45bcdc31@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether 3e:25:ca:2e:ac:8b brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::3c25:caff:fe2e:ac8b/64 scope link
       valid_lft forever preferred_lft forever
52: veth035b24fc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether fe:8b:00:58:3a:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::fc8b:ff:fe58:3af5/64 scope link
       valid_lft forever preferred_lft forever
53: vethda9f00e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether ae:c2:69:fd:3e:68 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::acc2:69ff:fefd:3e68/64 scope link
       valid_lft forever preferred_lft forever

It is true that the interfaces are down, however, it isn't a problem. Should these be alerts?
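For anyone who wants to check which interfaces a listing like the one above reports as down, here is a minimal Python sketch. It is not part of node-exporter or the alert rule. The regular expression and the function names (`interface_states`, `down_interfaces`) are illustrative assumptions, and the sample is a subset of the header lines from the worker output in this report.

```python
import re

# Assumed shape of an interface header line in the listing, e.g.
# "3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN ..."
HEADER = re.compile(r"\b(\d+): (\S+?): <[^>]*>.*?\bstate (\w+)")


def interface_states(text):
    """Return {interface name: operational state} parsed from the listing."""
    states = {}
    for match in HEADER.finditer(text):
        # Drop the "@if3" peer suffix that appears on veth names.
        name = match.group(2).split("@")[0]
        states[name] = match.group(3)
    return states


def down_interfaces(text):
    """Return the names of interfaces whose state is DOWN, in listing order."""
    return [name for name, state in interface_states(text).items() if state == "DOWN"]


# Header lines copied from the worker output in this report (subset).
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
4: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default qlen 1000
7: vethbfd58101@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
"""

print(down_interfaces(SAMPLE))
```

On this sample only ovs-system and br0 come back as DOWN; lo reports UNKNOWN and ens3 reports UP.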
I talked to the SDN team and indeed this alert is too aggressive. For now we're removing it, but we're working with the SDN team to specify exactly which interfaces are important and alert only on those.
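To illustrate the idea of alerting only on interfaces that matter, here is a rough Python sketch. It is not the actual Prometheus rule or the eventual fix. The pattern list `IMPORTANT_PATTERNS` and the function `interfaces_to_alert` are hypothetical, and which interfaces count as important is exactly what still has to be decided with the SDN team.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist, for illustration only: the real set of important
# interfaces has not been decided in this thread.
IMPORTANT_PATTERNS = ["ens*", "eth*"]


def interfaces_to_alert(states, important_patterns=IMPORTANT_PATTERNS):
    """Return DOWN interfaces whose names match one of the important patterns."""
    return sorted(
        name
        for name, state in states.items()
        if state == "DOWN"
        and any(fnmatchcase(name, pattern) for pattern in important_patterns)
    )


# Interface states as seen on the worker in the original report.
worker_states = {
    "lo": "UNKNOWN",
    "ens3": "UP",
    "ovs-system": "DOWN",
    "br0": "DOWN",
    "vxlan_sys_4789": "UNKNOWN",
    "tun0": "UNKNOWN",
}

# ovs-system and br0 are DOWN but do not match the allowlist, so nothing fires.
print(interfaces_to_alert(worker_states))

# If the physical NIC went down, it would be reported.
print(interfaces_to_alert({**worker_states, "ens3": "DOWN"}))
```

With an allowlist like this, the DOWN state of ovs-system and br0 would no longer raise alerts, while a failure of the node's physical interface still would.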
The PR to remove this alert has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/326
The above PR has been merged.
There is no OCP payload available yet that packages the fix, so testing is postponed until one is available.
The NodeNetworkInterfaceDown alert has been removed from the Prometheus rules file.
Payload: 4.0.0-0.nightly-2019-04-18-190537
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758