Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1518684

Summary: "ovs-vsctl show" on OCP nodes returns multiple "No such device" messages
Product: OpenShift Container Platform
Reporter: Thom Carlin <tcarlin>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.6.0
CC: aos-bugs, atragler, bbennett, dcbw, fshaikh, knakayam, ksuzumur, rhowe, weliang, xingweiyang
Target Milestone: ---
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 14:13:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1724792
Attachments:
    Log from ovs-vsctl and ovs-ofctl commands (flags: none)
    node log and OPTIONS=--loglevel=5 (flags: none)

Description Thom Carlin 2017-11-29 13:06:27 UTC
Description of problem:

On a fully patched OCP 3.6/CNS 3.6 cluster, "ovs-vsctl show" on the nodes returns "No such device" messages.

Version-Release number of selected component (if applicable):

3.6

How reproducible:

100% on this cluster

Steps to Reproduce:
1. On each node: ovs-vsctl show

Actual results:

[...]
      Port "vethbcdb039b"
            Interface "vethbcdb039b"
                error: "could not open network device vethbcdb039b (No such device)"
[...]



Expected results:

Listing of the Open vSwitch database without errors

Additional info:

sosreports will be added in private attachments

Comment 1 Thom Carlin 2017-11-29 14:09:03 UTC
sosreports are too large for attachments

Comment 2 Weibin Liang 2017-11-30 15:41:24 UTC
Saw the same error in v3.7.9:

[root@host-172-16-120-67 ~]# ovs-vsctl show
8e6c5352-1338-4e22-ad1a-5e3a905b4159
    Bridge "br0"
        fail_mode: secure
        Port "veth6cf0fa55"
            Interface "veth6cf0fa55"
        Port "veth0bf8145d"
            Interface "veth0bf8145d"
        Port "vethe68eec9b"
            Interface "vethe68eec9b"
                error: "could not open network device vethe68eec9b (No such device)"
        Port "veth5dabac94"
            Interface "veth5dabac94"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "vethd6279c2b"
            Interface "vethd6279c2b"
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "veth98c02cf9"
            Interface "veth98c02cf9"
    ovs_version: "2.7.3"
[root@host-172-16-120-67 ~]# oc version
oc v3.7.9
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
[root@host-172-16-120-67 ~]#

Comment 3 Dan Winship 2017-11-30 16:00:13 UTC
Weibin: can you attach the result of "ovs-ofctl -O OpenFlow13 show br0" and "ovs-ofctl -O OpenFlow13 dump-flows br0" as well?

Comment 4 Weibin Liang 2017-11-30 16:18:46 UTC
Created attachment 1360987 [details]
Log from ovs-vsctl and ovs-ofctl commands

Comment 5 Dan Winship 2017-11-30 16:50:22 UTC
OK, so "ovs-ofctl show" shows veths attached to ports 4, 8, 10, 12, and 13, but "ovs-ofctl dump" shows flows for ports 4, 7, 8, 10, 12, and 13. Meaning, we still have a flow for port 7 despite not having a veth attached to it, presumably corresponding to the missing veth in the "ovs-vsctl" output.

So, this is some sort of pod cleanup error. Possibly related to bug 1518912.
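
For reference, a minimal shell sketch of that comparison (assuming the OpenShift SDN bridge is named br0; ovs-ofctl output formats vary between OVS versions, and this only matches flows by in_port):

    # OpenFlow port numbers that still have an interface attached
    ovs-ofctl -O OpenFlow13 show br0 | sed -n 's/^ *\([0-9][0-9]*\)(.*/\1/p' | sort -u > /tmp/attached_ports
    # OpenFlow port numbers referenced as in_port in the flow table
    ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -oE 'in_port=[0-9]+' | cut -d= -f2 | sort -u > /tmp/flow_ports
    # Ports that have flows but no attached interface (stale pod-cleanup candidates)
    comm -13 /tmp/attached_ports /tmp/flow_ports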

Weibin: can you put the atomic-openshift-node logs for this node somewhere? As far back as they go on this node. (And let me know what loglevel they're at.)

Comment 6 Thom Carlin 2017-11-30 18:29:53 UTC
Although there is no evidence either way as to whether this error causes any other issues,
a workaround supplied by Dan removes these messages:

1) oadm drain <<node_name>>
2) Reboot node
3) oadm uncordon <<node_name>>

Note that you must have sufficient capacity in your cluster to absorb the containers evacuated from the node.
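
As a quick sanity check, the stale entries can be counted directly from the ovs-vsctl output before and after the workaround (a minimal sketch; the error string is the one shown in the bug description):

    ovs-vsctl show | grep -c 'No such device'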

Comment 7 Weibin Liang 2017-11-30 18:47:39 UTC
Created attachment 1361101 [details]
node log and OPTIONS=--loglevel=5

Comment 12 Weibin Liang 2018-02-09 14:52:27 UTC
Tested and verified on v3.9.0-0.41.0

[root@host-172-16-120-139 Sanity-Test]# ovs-vsctl show
451601d1-2b65-4e88-8be4-189491cdd333
    Bridge "br0"
        fail_mode: secure
        Port "vethf90cbbbf"
            Interface "vethf90cbbbf"
        Port "veth06984ca2"
            Interface "veth06984ca2"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "vethf35a42c9"
            Interface "vethf35a42c9"
        Port "vethe1ee7155"
            Interface "vethe1ee7155"
        Port "br0"
            Interface "br0"
                type: internal
        Port "veth65346a6c"
            Interface "veth65346a6c"
        Port "veth65a33588"
            Interface "veth65a33588"
        Port "veth573462cb"
            Interface "veth573462cb"
    ovs_version: "2.7.3"
[root@host-172-16-120-139 Sanity-Test]# 
[root@host-172-16-120-139 Sanity-Test]# 
[root@host-172-16-120-139 Sanity-Test]# oc version
oc v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.139:8443
openshift v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657

Comment 15 errata-xmlrpc 2018-03-28 14:13:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 16 xingweiyang 2018-05-30 01:08:02 UTC
Still found in OCP 3.9.14.

Comment 17 Kenjiro Nakayama 2018-12-18 04:18:41 UTC
@Dan @Ben,

One of our TAM partners has requested that this fix be backported to 3.7.x. Although they know that rebooting the host is a workaround, it is difficult for them to accept.
Could you please consider backporting the fix to 3.7.x? If that is not possible, we need to explain that this issue is completely harmless, so could you advise us? (e.g., they have already observed that a large number of stale Open vSwitch ports/flow rules remain on each node. Will this not hit any limit?)
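
For reference, a rough way to gauge how many ports and flow entries have accumulated on a node (a minimal sketch, assuming the default OpenShift SDN bridge name br0):

    ovs-vsctl list-ports br0 | wc -l
    ovs-ofctl -O OpenFlow13 dump-flows br0 | wc -l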

Comment 18 Ryan Howe 2019-01-19 01:13:17 UTC
Adding more info regarding the issue.


Hit this issue in OCP 3.7.72-1, where we see about 666 ports showing this:

        Port "veth942fc505"
            Interface "veth942fc505"
                error: "could not open network device veth942fc505 (No such device)"
        Port "veth14fb4836"
            Interface "veth14fb4836"
                error: "could not open network device veth14fb4836 (No such device)"
    ovs_version: "2.9.0"


What ended up happening was that the SDN created an eth0 interface and failed to place it in the container.

# ip -s link | grep 960
1199: veth34fe960f@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP mode DEFAULT 
1200: eth0@veth34fe960f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT 

From there, the default route was set to eth0 with an IP from the SDN CIDR. The node became NotReady due to failing to connect to the master.

Rebooting the machine works around the issue.
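
A quick check for this failure mode, based on the ip output above (a sketch; a leftover veth peer shows up as "eth0@<peer>" in the host namespace, so a node's real eth0, which has no "@" suffix, will not match):

    ip link | grep -E '^[0-9]+: eth0@'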

Comment 19 Red Hat Bugzilla 2023-09-14 04:12:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days