Description of problem:
On a fully patched OCP 3.6/CNS 3.6 cluster, receiving "No such device" messages on the nodes.

Version-Release number of selected component (if applicable):
3.6

How reproducible:
100% on this cluster

Steps to Reproduce:
1. On each node: ovs-vsctl show

Actual results:
[...]
        Port "vethbcdb039b"
            Interface "vethbcdb039b"
                error: "could not open network device vethbcdb039b (No such device)"
[...]

Expected results:
Listing of the Open vSwitch database without errors

Additional info:
sosreports will be added in private attachments
sosreports are too large for attachments
Saw the same error in v3.7.9:

[root@host-172-16-120-67 ~]# ovs-vsctl show
8e6c5352-1338-4e22-ad1a-5e3a905b4159
    Bridge "br0"
        fail_mode: secure
        Port "veth6cf0fa55"
            Interface "veth6cf0fa55"
        Port "veth0bf8145d"
            Interface "veth0bf8145d"
        Port "vethe68eec9b"
            Interface "vethe68eec9b"
                error: "could not open network device vethe68eec9b (No such device)"
        Port "veth5dabac94"
            Interface "veth5dabac94"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "vethd6279c2b"
            Interface "vethd6279c2b"
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "veth98c02cf9"
            Interface "veth98c02cf9"
    ovs_version: "2.7.3"
[root@host-172-16-120-67 ~]# oc version
oc v3.7.9
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
[root@host-172-16-120-67 ~]#
Weibin: can you attach the result of "ovs-ofctl -O OpenFlow13 show br0" and "ovs-ofctl -O OpenFlow13 dump-flows br0" as well?
Created attachment 1360987 [details]
Log from ovs-vsctl and ovs-ofctl commands
OK, so "ovs-ofctl show" shows veths attached to ports 4, 8, 10, 12, and 13, but "ovs-ofctl dump" shows flows for ports 4, 7, 8, 10, 12, and 13. Meaning, we still have a flow for port 7 despite not having a veth attached to it, presumably corresponding to the missing veth in the "ovs-vsctl" output. So, this is some sort of pod cleanup error. Possibly related to bug 1518912. Weibin: can you put the atomic-openshift-node logs for this node somewhere? As far back as they go on this node. (And let me know what loglevel they're at.)
Although there is no evidence either way as to whether this error causes any other issues, a workaround supplied by Dan removes these messages:

1) oadm drain <<node_name>>
2) Reboot the node
3) oadm uncordon <<node_name>>

Note that you must have sufficient capacity in your cluster to absorb the containers evacuated from the node.
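For reference, a sketch of how the workaround would typically be run; the drain flags shown are the usual oc/kubectl drain options and may or may not be needed depending on what is running on the node:

# on a master: evacuate the affected node
oadm drain <node_name> --ignore-daemonsets --delete-local-data
# on the node itself: reboot to clear the stale OVS ports/flows
reboot
# back on the master, once the node is up again: allow it to take workloads
oadm uncordon <node_name>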
Created attachment 1361101 [details]
node log and OPTIONS=--loglevel=5
Tested and verified on v3.9.0-0.41.0

[root@host-172-16-120-139 Sanity-Test]# ovs-vsctl show
451601d1-2b65-4e88-8be4-189491cdd333
    Bridge "br0"
        fail_mode: secure
        Port "vethf90cbbbf"
            Interface "vethf90cbbbf"
        Port "veth06984ca2"
            Interface "veth06984ca2"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "vethf35a42c9"
            Interface "vethf35a42c9"
        Port "vethe1ee7155"
            Interface "vethe1ee7155"
        Port "br0"
            Interface "br0"
                type: internal
        Port "veth65346a6c"
            Interface "veth65346a6c"
        Port "veth65a33588"
            Interface "veth65a33588"
        Port "veth573462cb"
            Interface "veth573462cb"
    ovs_version: "2.7.3"
[root@host-172-16-120-139 Sanity-Test]#
[root@host-172-16-120-139 Sanity-Test]#
[root@host-172-16-120-139 Sanity-Test]# oc version
oc v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.139:8443
openshift v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489
Still found in OCP 3.9.14.
@Dan @Ben,

One of our TAM partners has requested that this be backported to 3.7.x. Although they know that rebooting the host works around the issue, it is difficult for them to accept that. Could you please consider backporting the fix to 3.7.x? If that is not possible, we need to explain that this issue is completely harmless, so can you advise us? (e.g. they have already observed that a bunch of stale Open vSwitch port/flow rules remain on each node. Will that not hit any limit?)
Adding more info regarding the issue.

Hit the issue in OCP 3.7.72-1, where we see about 666 ports showing this:

        Port "veth942fc505"
            Interface "veth942fc505"
                error: "could not open network device veth942fc505 (No such device)"
        Port "veth14fb4836"
            Interface "veth14fb4836"
                error: "could not open network device veth14fb4836 (No such device)"
    ovs_version: "2.9.0"

What ended up happening is that the SDN created an eth0 and failed to place it in the container.

# ip -s link | grep 960
1199: veth34fe960f@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP mode DEFAULT
1200: eth0@veth34fe960f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT

From there the default was set to eth0 with the IP (an address from the SDN CIDR). The node became NotReady due to failing to connect to the master.

Rebooting the machine works around the issue.
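To gauge how many stale ports have accumulated on a node, a quick sketch (assuming the bridge is br0):

# number of OVS ports whose backing veth no longer exists
ovs-vsctl show | grep -c "No such device"
# total ports currently on the bridge, for comparison
ovs-vsctl list-ports br0 | wc -l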
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days