Description of problem:
On a fully patched OCP 3.6/CNS 3.6 cluster, receiving "No such device" messages on the nodes.

Version-Release number of selected component (if applicable):
3.6

How reproducible:
100% on this cluster

Steps to Reproduce:
1. On each node: ovs-vsctl show

Actual results:
[...]
        Port "vethbcdb039b"
            Interface "vethbcdb039b"
                error: "could not open network device vethbcdb039b (No such device)"
[...]

Expected results:
Listing of the Open vSwitch database without errors

Additional info:
sosreports will be added in private attachments
sosreports are too large for attachments
Saw the same error in v3.7.9:

[root@host-172-16-120-67 ~]# ovs-vsctl show
8e6c5352-1338-4e22-ad1a-5e3a905b4159
    Bridge "br0"
        fail_mode: secure
        Port "veth6cf0fa55"
            Interface "veth6cf0fa55"
        Port "veth0bf8145d"
            Interface "veth0bf8145d"
        Port "vethe68eec9b"
            Interface "vethe68eec9b"
                error: "could not open network device vethe68eec9b (No such device)"
        Port "veth5dabac94"
            Interface "veth5dabac94"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "vethd6279c2b"
            Interface "vethd6279c2b"
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "veth98c02cf9"
            Interface "veth98c02cf9"
    ovs_version: "2.7.3"
[root@host-172-16-120-67 ~]# oc version
oc v3.7.9
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
[root@host-172-16-120-67 ~]#
Weibin: can you attach the result of "ovs-ofctl -O OpenFlow13 show br0" and "ovs-ofctl -O OpenFlow13 dump-flows br0" as well?
Created attachment 1360987 [details]
Log from ovs-vsctl and ovs-ofctl commands
OK, so "ovs-ofctl show" shows veths attached to ports 4, 8, 10, 12, and 13, but "ovs-ofctl dump" shows flows for ports 4, 7, 8, 10, 12, and 13. Meaning, we still have a flow for port 7 despite not having a veth attached to it, presumably corresponding to the missing veth in the "ovs-vsctl" output. So, this is some sort of pod cleanup error. Possibly related to bug 1518912. Weibin: can you put the atomic-openshift-node logs for this node somewhere? As far back as they go on this node. (And let me know what loglevel they're at.)
Although there is no evidence either way as to whether this error causes any other issues, a workaround supplied by Dan removes these messages:

1) oadm drain <<node_name>>
2) Reboot the node
3) oadm uncordon <<node_name>>

Note that you must have sufficient capacity in your cluster to absorb the containers evacuated from the node.
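For reference, a sketch of how the workaround would typically be run; the drain flags shown are the usual oc/kubectl drain options and may or may not be needed depending on what is running on the node:

# on a master: evacuate the affected node
oadm drain <node_name> --ignore-daemonsets --delete-local-data
# on the node itself: reboot to clear the stale OVS ports/flows
reboot
# back on the master, once the node is up again: allow it to take workloads
oadm uncordon <node_name>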
Created attachment 1361101 [details]
node log and OPTIONS=--loglevel=5
Tested and verified on v3.9.0-0.41.0

[root@host-172-16-120-139 Sanity-Test]# ovs-vsctl show
451601d1-2b65-4e88-8be4-189491cdd333
    Bridge "br0"
        fail_mode: secure
        Port "vethf90cbbbf"
            Interface "vethf90cbbbf"
        Port "veth06984ca2"
            Interface "veth06984ca2"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "vethf35a42c9"
            Interface "vethf35a42c9"
        Port "vethe1ee7155"
            Interface "vethe1ee7155"
        Port "br0"
            Interface "br0"
                type: internal
        Port "veth65346a6c"
            Interface "veth65346a6c"
        Port "veth65a33588"
            Interface "veth65a33588"
        Port "veth573462cb"
            Interface "veth573462cb"
    ovs_version: "2.7.3"
[root@host-172-16-120-139 Sanity-Test]#
[root@host-172-16-120-139 Sanity-Test]#
[root@host-172-16-120-139 Sanity-Test]# oc version
oc v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.139:8443
openshift v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489
Still found in OCP 3.9.14.
@Dan @Ben,

One of our TAM partners has requested that this be backported to 3.7.x. Although they know that rebooting the host works around the issue, it is difficult for them to accept that. Could you please consider backporting the fix to 3.7.x? If that is not possible, we need to explain that this issue is completely harmless, so can you advise us? (e.g. they have already observed that a bunch of stale Open vSwitch port/flow rules remain on each node. Will that not hit any limit?)
Adding more info regarding the issue.

Hit the issue in OCP 3.7.72-1, where we see about 666 ports showing this:

        Port "veth942fc505"
            Interface "veth942fc505"
                error: "could not open network device veth942fc505 (No such device)"
        Port "veth14fb4836"
            Interface "veth14fb4836"
                error: "could not open network device veth14fb4836 (No such device)"
    ovs_version: "2.9.0"

What ended up happening is that the SDN created an eth0 and failed to place it in the container.

# ip -s link | grep 960
1199: veth34fe960f@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP mode DEFAULT
1200: eth0@veth34fe960f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT

From there the default was set to eth0 with the IP (an address from the SDN CIDR). The node became NotReady due to failing to connect to the master.

Rebooting the machine works around the issue.
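To gauge how many stale ports have accumulated on a node, a quick sketch (assuming the bridge is br0):

# number of OVS ports whose backing veth no longer exists
ovs-vsctl show | grep -c "No such device"
# total ports currently on the bridge, for comparison
ovs-vsctl list-ports br0 | wc -l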
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days