Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1611044

Summary: Guest fails to start sometimes when an interface name is already occupied by ovs port and the interface doesn't exist actually
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Fangge Jin <fjin>
Component: libvirt
Assignee: Laine Stump <laine>
Status: CLOSED WONTFIX
QA Contact: yalzhang <yalzhang>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 8.0
CC: dyuan, laine, lmen, qguo, rbalakri, xuzhang, yalzhang
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-15 07:41:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
syslog and libvirtd.log none

Description Fangge Jin 2018-08-02 02:09:47 UTC
Created attachment 1472215 [details]
syslog and libvirtd.log

Description of problem:
The guest occasionally fails to start when the interface name it needs is already taken by an OVS port, even though the interface itself does not exist.

Version-Release number of selected component (if applicable):
libvirt-4.5.0-6.virtcov.el7.x86_64
kernel-3.10.0-924.el7.x86_64
qemu-kvm-rhev-2.12.0-8.el7.x86_64
openvswitch-2.9.0-19.el7fdp.x86_64
NetworkManager-1.12.0-1.el7.x86_64

How reproducible:
About 30% in my env

Steps to Reproduce:
1. Prepare an OVS bridge and add a port for a non-existent interface to it:
# systemctl start openvswitch
# ovs-vsctl add-br ovsbr0
# ovs-vsctl add-port ovsbr0 vnet6
# ip a |grep vnet
  (In the output, vnet0~5 exist, vnet6 doesn't exist)
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
...
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
...
        Port "vnet6"
            Interface "vnet6"
                error: "could not open network device vnet6 (No such device)"

2. Prepare a guest with an interface:
# virsh domiflist rhel7-min-1 
Interface  Type       Source     Model       MAC
-------------------------------------------------------
-          network    default    virtio      52:54:00:b4:e4:88

3. Start and destroy the guest in a loop; the guest sometimes fails to start:
# while true; do virsh start rhel7-min-1; sleep 1; virsh destroy rhel7-min-1; done
Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy

error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running

Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy

error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running

Domain rhel7-min-1 started

Domain rhel7-min-1 destroyed

error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy

error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running

4. When the guest starts successfully, check OVS:
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
 ...
        Port "vnet6"
            Interface "vnet6"
                error: "could not add network device vnet6 to ofproto (Device or resource busy)"
...


Actual results:
As shown in step 3, the guest sometimes fails to start.

Expected results:
Guest can always start successfully

Additional info:

Comment 2 Laine Stump 2018-08-02 22:39:59 UTC
When you say you get this 30% of the time, do you mean under normal operation *without* artificially/manually adding the port for the not-yet-existing device to the OVS bridge? If that's the case, then we should be looking for the chain of events that causes the port to already be created (most probably the teardown of a previous guest's plumbing was interrupted/incomplete).

Beyond that, it will probably be okay to check for an existing connection and remove it prior to adding the port to the new bridge (I recently added an internal function to libvirt that can report the current master on an OVS bridge for a tap device - virNetDevOpenvswitchInterfaceGetMaster(). It should probably be used in combination with virNetDevGetMaster(), similar to what's done in bridge_driver.c).

Comment 3 Fangge Jin 2018-08-03 01:54:59 UTC
(In reply to Laine Stump from comment #2)
> When you say you get this 30% of the time, do you mean under normal
> operation *without* artificially/manually adding the port for the
> not-yet-existing device to the OVS bridge? If that's the case, then we
> should be looking for what is the chain of events that causes the port to
> already be created (most probably the teardown of a previous guest's
> plumbing was interrupted/incomplete.
> 
At first, I didn't *manually* add the port for the not-yet-existing device to the OVS bridge, and the guest occasionally failed to start with the error "Unable to add bridge virbr0 port vnet6: Device or resource busy". I checked ovs-vsctl and saw a port for the not-yet-existing device; it was left over from a previous guest that had definitely been shut off. But I don't know how to make this happen again.

Then I found that I can *manually* add such a port and reproduce the guest start failure.

By 30%, I mean that after I *manually* add such a port, the guest fails to start about 30% of the time.

I will do more testing and try to find a way to shut off a guest while leaving a non-existent device attached to the OVS bridge.

> Beyond that, it will probably be okay to check for an existing connection
> and remove it prior to adding the port to the new bridge (I recently added
> an internal function to libvirt that can report the current master on an OVS
> bridge for a tap device - virNetDevOpenvswitchInterfaceGetMaster(). It
> should probably be used in combination with virNetDevGetMaster(), similar to
> whats done in bridge_driver.c)
So virNetDevOpenvswitchInterfaceGetMaster() will resolve the issue that "*manually* add a such port, and guest fails to start", right?

Comment 4 Laine Stump 2018-08-03 18:06:18 UTC
(In reply to Fangge Jin from comment #3)
> So virNetDevOpenvswitchInterfaceGetMaster() will resolve the issue that
> "*manually* add a such port, and guest fails to start", right?


That was just a note for whoever implements the fix, in case it isn't me. That function can be used to learn if there is already a port by that name and, if so, what bridge it is attached to. We would call that function prior to attempting to attach the port to the bridge, then use a *different* function to remove any pre-existing connection prior to attaching it to the desired bridge.

It would still be good to know about any possible circumstance where an orphaned port connection might remain. We should try to prevent that as much as possible.
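
The check-then-detach sequence described in this comment can be approximated from the shell for manual cleanup. This is only a sketch of the idea, not the libvirt fix: it relies on "ovs-vsctl port-to-br" and "ovs-vsctl --if-exists del-port", while detach_from_ovs and the OVS_VSCTL override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: shell analogue of "find out which OVS bridge already holds a
# port with this name, then remove that connection before attaching the
# device to the desired bridge".
# OVS_VSCTL is a hypothetical override so the command can be stubbed in tests.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

# detach_from_ovs PORT
# Prints the bridge the port was removed from, or nothing if no OVS bridge
# has a port with that name.
detach_from_ovs() {
    port="$1"
    # port-to-br exits non-zero when no bridge has a port with this name.
    bridge=$("$OVS_VSCTL" port-to-br "$port" 2>/dev/null) || return 0
    # Remove the existing connection, stale or not.
    "$OVS_VSCTL" --if-exists del-port "$bridge" "$port" || return 1
    printf '%s\n' "$bridge"
}
```

A caller would run something like "detach_from_ovs vnet6" before attaching vnet6 to virbr0. Note that this removes the port unconditionally, so it is only appropriate when the device really is meant to move to another bridge.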

Comment 5 Fangge Jin 2018-08-04 00:55:16 UTC
I just found a way to shut off a guest and leave its OVS port behind, but the scenario is something of a corner case:

1. Prepare a running guest with an interface (vnet0 in my case) attached to an OVS bridge:
# virsh domiflist rhel7-min
Interface     Type     Source        Model      MAC
-------------------------------------------------------
vnet0      bridge     ovs-net    rtl8139     52:54:00:3c:f8:25
vnet1      network    default    virtio      52:54:00:f1:2c:ed

2. Stop openvswitch.service:
# systemctl stop openvswitch.service

3. Destroy guest:
# virsh destroy rhel7-min

4. Start openvswitch.service:
# systemctl start openvswitch.service

5. Check ovs bridge, vnet0 is shown:
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "vnet2"
            Interface "vnet2"
                error: "could not open network device vnet2 (No such device)"
        Port "vnet3"
            Interface "vnet3"
                error: "could not open network device vnet3 (No such device)"
        Port "vnet6"
            Interface "vnet6"
                error: "could not open network device vnet6 (No such device)"
        Port "vnet5"
            Interface "vnet5"
                error: "could not open network device vnet5 (No such device)"
        Port "vnet0"
            Interface "vnet0"
                error: "could not open network device vnet0 (No such device)"

6. # ip a |grep vnet0
(no output)
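
Leftover ports like the ones in step 5 can be picked out of the "ovs-vsctl show" output mechanically. The following is a minimal sketch, assuming the output layout shown above (an "Interface" line followed by an "error:" line mentioning "No such device"); stale_ovs_ports is a hypothetical helper name, not part of OVS or libvirt.

```shell
#!/bin/sh
# Sketch only: read `ovs-vsctl show` output on stdin and print the name of
# every interface whose error line says "No such device".
stale_ovs_ports() {
    awk '
        # Remember the most recent interface name, with any quotes stripped.
        $1 == "Interface"                  { name = $2; gsub(/"/, "", name) }
        # An error line belongs to the interface seen just before it.
        $1 == "error:" && /No such device/ { if (name != "") print name }
    '
}
```

Usage would look like "ovs-vsctl show | stale_ovs_ports". Each printed name could then be reviewed and, if it really is orphaned, removed with "ovs-vsctl --if-exists del-port NAME".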

Comment 6 Laine Stump 2019-10-24 13:32:13 UTC
There is a further twist to this that I recently discovered - if a tap device has been previously attached to an OVS bridge, and then for some reason didn't get detached (maybe someone did it manually, maybe libvirt crashed, etc.), then in the future any time a new tap device with that same name is created, it will be created already attached to the OVS bridge. *This persists across reboots of the host system!*

Here's how to make this happen:

ip tuntap add mode tap name vnet0
ovs-vsctl add-port ovsbr0 vnet0
ip link del vnet0

After doing this, vnet0 no longer exists, but the OVS db still has an entry connecting it to ovsbr0. Any time vnet0 is re-created, it will show "master ovs-system" in the output of "ip link show vnet0".
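
One way to check for that symptom from a script is to pull the master name out of the one-line form of the link listing ("ip -o link show dev vnet0"). This is a minimal sketch; link_master is a hypothetical helper name, and it assumes the usual layout where the word "master" is followed by the master device name.

```shell
#!/bin/sh
# Sketch only: read one line of `ip -o link show dev NAME` on stdin and print
# the word that follows "master", or nothing if the device has no master.
link_master() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "master") { print $(i + 1); exit }
    }'
}
```

In the situation described above, "ip -o link show dev vnet0 | link_master" should print ovs-system for a freshly re-created vnet0, and print nothing for a tap device that has no master.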

As recently as a month ago, when I tried to start a guest that created, e.g., vnet0, and the guest wanted to attach it to the host bridge br0 while it was still registered as attached to an OVS bridge (see above for how to set up that scenario), the tap device creation/attach would fail with something like "Device in Use" (I've forgotten the exact error log, and can't reproduce it now).

But today when I do that with libvirt from git master, on a Fedora 30 system with kernel 5.2.17-200.fc30.x86_64, the guest startup *succeeds* (and the tap device is attached to the proper bridge). On the other hand, if I attempt to do the same thing manually (using "ip tuntap add...; brctl addif virbr0 vnet0")  it fails.

I've looked through the most recent code and see that we simply call ioctl(fd, SIOCBRADDIF,...), and therefore we should be failing.

Can you retest this on your system and see if you're still able to make it fail with up-to-date packages?

Comment 7 Laine Stump 2019-10-24 16:04:55 UTC
Okay, never mind. I just tried with the latest of everything in RHEL8 and got this error:

14929: error : virNetDevBridgeAddPort:597 : Unable to add bridge virbr0 port vnet3: Device or resource busy

Not only that, but when I went back to my Fedora 31 machine with newer kernel+libvirt, and used a simpler config, I once again received the above error there as well!

But when I start up a machine with a more complicated config (multiple network interfaces, vfio-assigned GPU and SR-IOV VF) suddenly *the same code* is able to attach the tap device to virbr0 *even though it is already attached to ovs-system*.

(Of course this is academically interesting, but in the end it is unimportant. The fact is that we need to detach tap devices from any existing master when we want to attach them to some other master)

Comment 10 RHEL Program Management 2021-02-15 07:41:12 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 11 Laine Stump 2021-02-15 18:21:55 UTC
Just for future reference - testing on a more recent RHEL8 machine, I've found that even for a simple configuration, if I run the test in Comment 6, the tap device will be successfully attached to the host bridge. However, there is also still a port with the name of the tested tap device listed in the output of "ovs-vsctl". (I did try restarting the openvswitch service, and the tap device remained attached to the host bridge afterwards, so it seems there is no danger of a "delayed failure".)

So while this doesn't cause a problem for libvirt users of OVS, it does permit "pollution" of the list of OVS ports. However, this is neither caused by, nor a problem for, libvirt.

(I do still see that attempting to manually (using "ip link set vnetX master virbrX") add the tap device to a host bridge when it is on the list of ports for an OVS bridge *does* still result in an error. Not sure if that is intended behavior or not.)