Bug 1603155
Summary: | Guest fails to resume after paused due to I/O error when macTableManager='libvirt' | |
---|---|---|---
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Fangge Jin <fjin>
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar>
Status: | CLOSED ERRATA | QA Contact: | yalzhang <yalzhang>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 8.0 | CC: | dyuan, jdenemar, jsuchane, lmen, rbalakri, xuzhang, yalzhang
Target Milestone: | rc | Keywords: | Reopened, Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | libvirt-7.4.0-1.el8 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-11-16 07:49:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: | 7.4.0
Embargoed: | | |
Attachments: | libvirtd log (attachment 1459974) | |

I can also reproduce this bug with steps: https://bugzilla.redhat.com/show_bug.cgi?id=1560854#c10

*** Bug 1647058 has been marked as a duplicate of this bug. ***

Wherever the guest is being paused by the error, the fdb needs to be updated at that point to remove the entry for the guest's MAC. That way when the guest is restarted, attempting to re-add the entry for the guest's MAC won't generate an error.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

This is now fixed upstream with

commit 241c22a9a531cb39d2b6b892561fe856f32f310d
Refs: v7.3.0-5-g241c22a9a5
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Apr 30 17:25:29 2021 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon May 3 11:12:58 2021 +0200

    virnetdevbridge: Ignore EEXIST when adding an entry to fdb

    When updating entries in a bridge forwarding database (i.e., when
    macTableManager='libvirt' is configured for the bridge), we may end up
    in a situation when the entry we want to add is already present. Let's
    just ignore the error in such a case.

    This fixes an error to resume a domain when fdb entries were not
    properly removed when the domain was paused:

        virsh # resume test
        error: Failed to resume domain test
        error: error adding fdb entry for vnet2: File exists

    For some reason, fdb entries are only removed when libvirt explicitly
    stops CPUs, but nothing happens when we just get STOP event from QEMU.
    An alternative approach would be to make sure we always remove the
    entries regardless on why a domain was paused (e.g., during migration),
    but that would be a significantly more disruptive change with possible
    side effects.

    https://bugzilla.redhat.com/show_bug.cgi?id=1603155

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Laine Stump <laine>

Reproduce this bug on libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64 with steps in bug 1647058#c0:

1. create a libvirt NAT network with
   <bridge name='kvm' stp='off' delay='0' macTableManager='libvirt'/>
2. create a domain with an interface connected to the network from step 1
3. start the domain
4. run below command:
   # virsh qemu-monitor-command rhel '{"execute":"stop"}'
   {"return":{},"id":"libvirt-13"}
5. check the vm status, it is paused:
   # virsh list
    Id   Name   State
   ---------------------
    1    rhel   paused
6. try to resume the vm, failed:
   # virsh resume rhel
   error: Failed to resume domain 'rhel'
   error: error adding fdb entry for vnet0: File exists

   # cat /var/log/libvirt/libvirtd.log | grep error
   2021-05-08 10:30:58.375+0000: 24359: error : virNetDevBridgeFDBAddDel:1071 : error adding fdb entry for vnet0: File exists

Test on v7.3.0-163-g156315cff4 with the same steps as above, the bug is fixed

➜ ~ virsh qemu-monitor-command pc '{"execute":"stop"}'
{"return":{},"id":"libvirt-379"}

➜ ~ virsh list
 Id   Name   State
---------------------
 1    pc     paused

➜ ~ virsh resume pc
Domain 'pc' resumed

➜ ~ virsh list
 Id   Name   State
----------------------
 1    pc     running

Test on libvirt-7.4.0-1.module+el8.5.0+11218+83343022.x86_64 with the same steps in comment 10:

1. Configure the network with "macTableManager='libvirt'":
   <bridge name='virbr0' stp='on' delay='0' macTableManager='libvirt'/>
2. Start a vm with an interface connected to the network;
3. # virsh qemu-monitor-command rhel '{"execute":"stop"}'
   {"return":{},"id":"libvirt-13"}
4. check the vm is paused, then try to resume it, succeed, check the logs, no errors or failures.
   # virsh list
    Id   Name   State
   ---------------------
    1    rhel   paused

   # virsh resume rhel
   Domain 'rhel' resumed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

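The upstream fix quoted in the commit message above amounts to treating EEXIST as success when an fdb entry is added. The following is only a minimal Python sketch of that idea, not the libvirt change itself (which is C code in virnetdevbridge). The `FakeFdb` class and the `add_ignoring_eexist` helper are hypothetical names invented for illustration; the MAC address and device name are taken from the original report.

```python
import errno


class FakeFdb:
    """Hypothetical in-memory stand-in for a bridge forwarding database."""

    def __init__(self):
        self.entries = set()

    def add(self, mac, dev):
        # Mimic a duplicate add being refused with EEXIST ("File exists").
        if (mac, dev) in self.entries:
            raise OSError(errno.EEXIST, "File exists")
        self.entries.add((mac, dev))


def add_ignoring_eexist(fdb, mac, dev):
    """Add an fdb entry, treating 'entry already present' as success."""
    try:
        fdb.add(mac, dev)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise  # any other failure is still reported to the caller


fdb = FakeFdb()
add_ignoring_eexist(fdb, "52:54:00:c6:3b:95", "vnet2")
# A stale entry left over from a pause no longer causes an error:
add_ignoring_eexist(fdb, "52:54:00:c6:3b:95", "vnet2")
```

Only EEXIST is swallowed; every other error still propagates, which keeps the change narrow in the same spirit as the commit message describes.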
Created attachment 1459974
libvirtd log

Description of problem:
Guest fails to resume after being paused due to an I/O error when macTableManager='libvirt'

Version-Release number of selected component:
libvirt-4.5.0-3.virtcov.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a virtual network, set macTableManager='libvirt'
# virsh net-dumpxml default
<network>
  <name>default</name>
  <uuid>ac4e8219-6225-40f7-95b2-7d731a91ea75</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0' macTableManager='libvirt'/>
  <mtu size='9000'/>
  <mac address='52:54:00:70:1f:6f'/>
  <bandwidth>
    <inbound average='1000' peak='5000' burst='5120'/>
    <outbound average='128' peak='256' burst='256'/>
  </bandwidth>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>

2. Start a guest with virtual disk and interface as below:
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' rerror_policy='stop' io='threads' discard='unmap'/>
      <source file='/nfs/RHEL-7.6-x86_64-latest.qcow2'/>
      <target dev='vda' bus='virtio'/>
...
    <interface type='network'>
      <mac address='52:54:00:c6:3b:95'/>
      <source network='default'/>
      <bandwidth>
        <inbound average='1000' peak='5000' floor='200' burst='1024'/>
        <outbound average='128' peak='256' burst='256'/>
      </bandwidth>
      <model type='virtio'/>
      <driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off' queues='5'>
        <host csum='off' gso='off' tso4='off' tso6='off' ecn='off' ufo='off' mrg_rxbuf='off'/>
        <guest csum='off' tso4='off' tso6='off' ecn='off' ufo='off'/>
      </driver>
      <link state='up'/>
      <mtu size='9000'/>
      <coalesce>
        <rx>
          <frames max='7'/>
        </rx>
      </coalesce>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

3. Start guest

4. Change the owner of /nfs/RHEL-7.6-x86_64-latest.qcow2 to root:root
# chown root:root /nfs/RHEL-7.6-x86_64-latest.qcow2

5. Check guest status, it is paused
# virsh list
 Id    name                           status
----------------------------------------------------
 5     rhel7.5                        paused

6. Try to resume guest:
# chown qemu:qemu /nfs/RHEL-7.6-x86_64-latest.qcow2
# virsh resume 5
error:Resume domain 5 failed
error:error adding fdb entry for vnet2: File exists

Actual results:
As in step 6, the guest fails to resume

Expected results:
Guest can resume successfully

Additional info:
Guest can resume successfully if it is suspended manually:
# virsh suspend $guest
# virsh resume $guest
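
The difference between a manual suspend (resume works) and a pause triggered by an I/O error (resume fails) can be modelled with a small Python simulation. This is a sketch under stated assumptions, not libvirt code: the `SimDomain` class and its methods are hypothetical, and the behaviour it encodes is only what this report describes, namely that the fdb entry is removed on an explicit suspend but left behind when QEMU itself stops the guest.

```python
import errno


class SimDomain:
    """Hypothetical model of a guest NIC tracked in a libvirt-managed fdb."""

    def __init__(self, mac, dev, fdb):
        self.mac, self.dev, self.fdb = mac, dev, fdb
        self.state = "shutoff"

    def _add_entry(self, ignore_eexist):
        key = (self.mac, self.dev)
        if key in self.fdb:
            if ignore_eexist:
                return  # behaviour after the fix: duplicate is not an error
            raise OSError(errno.EEXIST, f"error adding fdb entry for {self.dev}")
        self.fdb.add(key)

    def start(self):
        self._add_entry(ignore_eexist=False)
        self.state = "running"

    def suspend(self):
        # Explicit pause (virsh suspend): the entry is removed.
        self.fdb.discard((self.mac, self.dev))
        self.state = "paused"

    def on_stop_event(self):
        # Pause reported by QEMU (e.g. I/O error): the entry is left behind.
        self.state = "paused"

    def resume(self, ignore_eexist):
        self._add_entry(ignore_eexist)
        self.state = "running"


fdb = set()
dom = SimDomain("52:54:00:c6:3b:95", "vnet2", fdb)
dom.start()

# Manual suspend followed by resume works even without the fix.
dom.suspend()
dom.resume(ignore_eexist=False)

# Pause via a STOP event leaves a stale entry, so the old resume path fails.
dom.on_stop_event()
try:
    dom.resume(ignore_eexist=False)
    outcome = "resumed"
except OSError:
    outcome = "failed"

# With EEXIST ignored, the same resume succeeds.
dom.resume(ignore_eexist=True)
```

The `ignore_eexist` flag exists only to contrast the behaviour before and after the fix within a single model.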