Bug 1611044
| Summary: | Guest fails to start sometimes when an interface name is already occupied by ovs port and the interface doesn't exist actually | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Fangge Jin <fjin> |
| Component: | libvirt | Assignee: | Laine Stump <laine> |
| Status: | CLOSED WONTFIX | QA Contact: | yalzhang <yalzhang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | dyuan, laine, lmen, qguo, rbalakri, xuzhang, yalzhang |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-15 07:41:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

When you say you get this 30% of the time, do you mean under normal operation *without* artificially/manually adding the port for the not-yet-existing device to the OVS bridge? If that's the case, then we should be looking for what is the chain of events that causes the port to already be created (most probably the teardown of a previous guest's plumbing was interrupted/incomplete).

Beyond that, it will probably be okay to check for an existing connection and remove it prior to adding the port to the new bridge. (I recently added an internal function to libvirt that can report the current master on an OVS bridge for a tap device - virNetDevOpenvswitchInterfaceGetMaster(). It should probably be used in combination with virNetDevGetMaster(), similar to what's done in bridge_driver.c.)

(In reply to Laine Stump from comment #2)
> When you say you get this 30% of the time, do you mean under normal
> operation *without* artificially/manually adding the port for the
> not-yet-existing device to the OVS bridge? If that's the case, then we
> should be looking for what is the chain of events that causes the port to
> already be created (most probably the teardown of a previous guest's
> plumbing was interrupted/incomplete).

At first, I didn't *manually* add the port for the not-yet-existing device to the OVS bridge, and the guest occasionally failed to start with the error "Unable to add bridge virbr0 port vnet6: Device or resource busy". I checked ovs-vsctl and saw the not-yet-existing device; it was from a previous guest that had definitely been shut off. But I don't know how to make this happen again.

Then I found I can *manually* add such a port and reproduce the issue that the guest fails to start. By 30%, I mean that when I *manually* add such a port, the guest fails to start about 30% of the time.

I will do more testing and try to find a way to make a guest shut off but leave a non-existing device attached to the OVS bridge.

> Beyond that, it will probably be okay to check for an existing connection
> and remove it prior to adding the port to the new bridge. (I recently added
> an internal function to libvirt that can report the current master on an OVS
> bridge for a tap device - virNetDevOpenvswitchInterfaceGetMaster(). It
> should probably be used in combination with virNetDevGetMaster(), similar to
> what's done in bridge_driver.c.)

So virNetDevOpenvswitchInterfaceGetMaster() will resolve the issue that "*manually* add such a port, and guest fails to start", right?

(In reply to Fangge Jin from comment #3)
> So virNetDevOpenvswitchInterfaceGetMaster() will resolve the issue that
> "*manually* add such a port, and guest fails to start", right?

That was just a note for whoever implements the fix, in case it isn't me. That function can be used to learn if there is already a port by that name and, if so, what bridge it is attached to. We would call that function prior to attempting to attach the port to the bridge, then use a *different* function to remove any pre-existing connection prior to attaching it to the desired bridge.

It would still be good to know about any possible circumstance where an orphaned port connection might remain. We should try to prevent that as much as possible.

I just found a method to "make guest shutoff with ovs port left", but the scenario is a bit of a corner case:
1. Prepare a running guest with an interface(vnet0 in my case) attached to ovs bridge:
# virsh domiflist rhel7-min
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet0      bridge     ovs-net    rtl8139     52:54:00:3c:f8:25
vnet1      network    default    virtio      52:54:00:f1:2c:ed
2. Stop openvswitch.service:
# systemctl stop openvswitch.service
3. Destroy guest:
# virsh destroy rhel7-min
4. Start openvswitch.service:
# systemctl start openvswitch.service
5. Check ovs bridge, vnet0 is shown:
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "vnet2"
            Interface "vnet2"
                error: "could not open network device vnet2 (No such device)"
        Port "vnet3"
            Interface "vnet3"
                error: "could not open network device vnet3 (No such device)"
        Port "vnet6"
            Interface "vnet6"
                error: "could not open network device vnet6 (No such device)"
        Port "vnet5"
            Interface "vnet5"
                error: "could not open network device vnet5 (No such device)"
        Port "vnet0"
            Interface "vnet0"
                error: "could not open network device vnet0 (No such device)"
6. Check the host, vnet0 doesn't exist:
# ip a |grep vnet0
(no output)
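
The orphaned entries in the "ovs-vsctl show" output above can be picked out mechanically. The following is only a minimal sketch, not anything libvirt or OVS ships: the helper name stale_ovs_ports is hypothetical, and it assumes the output layout shown in step 5 (a "Port" line followed by an "error:" line containing "No such device").

```shell
#!/bin/sh
# stale_ovs_ports (hypothetical helper): read `ovs-vsctl show` output on stdin
# and print the name of every port whose interface reports "No such device".
stale_ovs_ports() {
    awk '
        # Remember the most recent port name; strip quotes if present.
        $1 == "Port" { port = $2; gsub(/"/, "", port) }
        # An error line about a missing device belongs to that port.
        $1 == "error:" && /No such device/ { print port }
    '
}

# Possible use on a live host (not run here; it modifies the OVS database):
#   ovs-vsctl show | stale_ovs_ports | while read -r p; do
#       ovs-vsctl --if-exists del-port "$p"
#   done
```

Ports whose device exists but could not be added (for example the "Device or resource busy" error) are deliberately not matched, since the device is present in that case.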
There is a further twist to this that I recently discovered - if a tap device has been previously attached to an OVS bridge, and then for some reason didn't get detached (maybe someone did it manually, maybe libvirt crashed, etc), then in the future any time a new tap device with that same name is created, it will be created attached to the OVS bridge. *This persists across reboots of the host system!* Here's how to make this happen:

  ip tuntap add mode tap name vnet0
  ovs-vsctl add-port ovsbr0 vnet0
  ip link del vnet0

After doing this, vnet0 no longer exists, but the OVS db still has an entry connecting it to ovsbr0. Any time vnet0 is re-created, it will show "master ovs-system" in the output of "ip link show vnet0".

As recently as a month ago, when I tried to start a guest that created, e.g. vnet0, and the guest wanted to attach it to the host bridge br0, but it was still registered as attached to an OVS bridge (see above for how to set up that scenario), then the tap device creation/attach would fail with something like "Device in Use" (I've forgotten the exact error log, and can't reproduce it now).

But today when I do that with libvirt from git master, on a Fedora 30 system with kernel 5.2.17-200.fc30.x86_64, the guest startup *succeeds* (and the tap device is attached to the proper bridge). On the other hand, if I attempt to do the same thing manually (using "ip tuntap add...; brctl addif virbr0 vnet0") it fails. I've looked through the most recent code and see that we simply call ioctl(fd, SIOCBRADDIF,...), and therefore we should be failing.

Can you retest this on your system and see if you're still able to make it fail with up-to-date packages?

Okay, nevermind.
I just tried with the latest of everything in RHEL8 and get this error:

  14929: error : virNetDevBridgeAddPort:597 : Unable to add bridge virbr0 port vnet3: Device or resource busy

Not only that, but when I went back to my Fedora 31 machine with newer kernel+libvirt, and used a simpler config, I once again received the above error there as well! But when I start up a machine with a more complicated config (multiple network interfaces, vfio-assigned GPU and SR-IOV VF) suddenly *the same code* is able to attach the tap device to virbr0 *even though it is already attached to ovs-system*.

(Of course this is academically interesting, but in the end it is unimportant. The fact is that we need to detach tap devices from any existing master when we want to attach them to some other master.)

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Just for future reference - testing on a more recent RHEL8 machine, I've found that even for a simple configuration, if I run the test in Comment 6, the tap device will be successfully attached to the host bridge. However, there is also still a port with the name of the tested tap device listed in the output of "ovs-vsctl". (I did try restarting the openvswitch service, and the tap device remained attached to the host bridge afterwards, so it seems there is no danger of a "delayed failure".)

So while this doesn't cause a problem for libvirt users of OVS, it does permit "pollution" of the list of OVS ports. However this is neither caused by, nor a problem for, libvirt.

(I do still see that attempting to manually (using "ip link set vnetX master virbrX") add the tap device to a host bridge when it is on the list of ports for an OVS bridge *does* still result in an error. Not sure if that is intended behavior or not.)

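For the manual case, the "detach from any existing master first" idea from the comments can be approximated from the shell. This is a rough sketch under stated assumptions, not libvirt's implementation: the helper name link_master is hypothetical, and it relies only on the "master <name>" token that "ip link show" prints for an enslaved device, as described in the comments above.

```shell
#!/bin/sh
# link_master (hypothetical helper): read `ip -o link show dev <tap>` output
# on stdin and print the name of the device's current master, if any.
link_master() {
    awk '{
        # The master, when present, is the field following the word "master".
        for (i = 1; i < NF; i++)
            if ($i == "master") { print $(i + 1); exit }
    }'
}

# Possible use on a live host before attaching vnet0 elsewhere (not run here):
#   m=$(ip -o link show dev vnet0 | link_master)
#   case "$m" in
#       ovs-system) ovs-vsctl --if-exists del-port vnet0 ;;
#       "")         : ;;                              # no master, nothing to do
#       *)          ip link set dev vnet0 nomaster ;;
#   esac
```

The function prints nothing when the device has no master, so an empty result can be treated as "safe to attach".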
Created attachment 1472215
syslog and libvirtd.log

Description of problem:
Guest fails to start occasionally when an interface name is already occupied by ovs port and the interface doesn't exist actually

Version-Release number of selected component (if applicable):
libvirt-4.5.0-6.virtcov.el7.x86_64
kernel-3.10.0-924.el7.x86_64
qemu-kvm-rhev-2.12.0-8.el7.x86_64
openvswitch-2.9.0-19.el7fdp.x86_64
NetworkManager-1.12.0-1.el7.x86_64

How reproducible:
About 30% in my env

Steps to Reproduce:
1. Prepare a ovs and add a non-existing port to it:
# systemctl start openvswitch
# ovs-vsctl add-br ovsbr0
# ovs-vsctl add-port ovsbr0 vnet6
# ip a |grep vnet
(In the output, vnet0~5 exist, vnet6 doesn't exist)
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
        ...
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        ...
        Port "vnet6"
            Interface "vnet6"
                error: "could not open network device vnet6 (No such device)"

2. Prepare a guest with an interface:
# virsh domiflist rhel7-min-1
Interface  Type       Source     Model       MAC
-------------------------------------------------------
-          network    default    virtio      52:54:00:b4:e4:88

3. Start&&destroy guest in a loop, guest fails to start sometimes:
# while true; do virsh start rhel7-min-1; sleep 1; virsh destroy rhel7-min-1; done
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy
error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy
error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running
Domain rhel7-min-1 started
Domain rhel7-min-1 destroyed
error: Failed to start domain rhel7-min-1
error: Unable to add bridge virbr0 port vnet6: Device or resource busy
error: Failed to destroy domain rhel7-min-1
error: Requested operation is not valid: domain is not running

4. When guest starts successfully, check ovs:
# ovs-vsctl show
9ae0e2e4-bf4b-4c60-8227-7e3f64dac912
    Bridge "ovsbr0"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        ...
        Port "vnet6"
            Interface "vnet6"
                error: "could not add network device vnet6 to ofproto (Device or resource busy)"
        ...

Actual results:
As step3, guest fails to start sometimes.

Expected results:
Guest can always start successfully

Additional info: