Bug 1603155 - Guest fails to resume after paused due to I/O error when macTableManager='libvirt'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: yalzhang@redhat.com
URL:
Whiteboard:
Duplicates: 1647058
Depends On:
Blocks:
Reported: 2018-07-19 10:32 UTC by Fangge Jin
Modified: 2021-11-16 07:53 UTC (History)

Fixed In Version: libvirt-7.4.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 07:49:54 UTC
Type: Bug
Target Upstream Version: 7.4.0
Embargoed:


Attachments
libvirtd log (176.26 KB, application/x-gzip)
2018-07-19 10:32 UTC, Fangge Jin


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2021:4684  0        None      None    None     2021-11-16 07:50:58 UTC

Description Fangge Jin 2018-07-19 10:32:20 UTC
Created attachment 1459974
libvirtd log

Description of problem:
The guest fails to resume after being paused due to an I/O error when macTableManager='libvirt' is set.

Version-Release number of selected component:
libvirt-4.5.0-3.virtcov.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a virtual network with macTableManager='libvirt' set on the bridge:
# virsh net-dumpxml default
<network>
  <name>default</name>
  <uuid>ac4e8219-6225-40f7-95b2-7d731a91ea75</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0' macTableManager='libvirt'/>
  <mtu size='9000'/>
  <mac address='52:54:00:70:1f:6f'/>
  <bandwidth>
    <inbound average='1000' peak='5000' burst='5120'/>
    <outbound average='128' peak='256' burst='256'/>
  </bandwidth>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>

2. Define a guest with a virtual disk and interface as below:
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' rerror_policy='stop' io='threads' discard='unmap'/>
      <source file='/nfs/RHEL-7.6-x86_64-latest.qcow2'/>
      <target dev='vda' bus='virtio'/>
...
    <interface type='network'>
      <mac address='52:54:00:c6:3b:95'/>
      <source network='default'/>
      <bandwidth>
        <inbound average='1000' peak='5000' floor='200' burst='1024'/>
        <outbound average='128' peak='256' burst='256'/>
      </bandwidth>
      <model type='virtio'/>
      <driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='off' queues='5'>
        <host csum='off' gso='off' tso4='off' tso6='off' ecn='off' ufo='off' mrg_rxbuf='off'/>
        <guest csum='off' tso4='off' tso6='off' ecn='off' ufo='off'/>
      </driver>
      <link state='up'/>
      <mtu size='9000'/>
      <coalesce>
        <rx>
          <frames max='7'/>
        </rx>
      </coalesce>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

3. Start the guest

4. Change the owner of /nfs/RHEL-7.6-x86_64-latest.qcow2 to root:root so that the guest hits an I/O error:
# chown root:root /nfs/RHEL-7.6-x86_64-latest.qcow2

5. Check the guest status; it is paused:
# virsh list
 Id    name                         status
----------------------------------------------------
 5     rhel7.5                        paused

6. Restore the ownership of the image and try to resume the guest:
# chown qemu:qemu /nfs/RHEL-7.6-x86_64-latest.qcow2

# virsh resume 5
error:Resume domain 5 failed
error:error adding fdb entry for vnet2: File exists


Actual results:
As shown in step 6, the guest fails to resume.

Expected results:
The guest resumes successfully.


Additional info:
The guest resumes successfully if it was suspended manually:
# virsh suspend $guest
# virsh resume $guest

Comment 2 Fangge Jin 2018-07-26 10:45:52 UTC
I can also reproduce this bug with the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1560854#c10

Comment 3 Laine Stump 2019-10-22 22:23:04 UTC
*** Bug 1647058 has been marked as a duplicate of this bug. ***

Comment 4 Laine Stump 2020-02-11 02:46:39 UTC
Wherever the guest is being paused by the error, the fdb needs to be updated at that point to remove the entry for the guest's MAC. That way when the guest is restarted, attempting to re-add the entry for the guest's MAC won't generate an error.
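
A toy Python model may make the failure and the suggestion above easier to follow. It is not libvirt code: `BridgeFdb`, `Guest`, and the `cleanup` flag are hypothetical names, and the model assumes the fdb entry is added at guest start and again on every resume.

```python
import errno


class BridgeFdb:
    """Toy stand-in for a bridge forwarding database (not libvirt code)."""

    def __init__(self):
        self.entries = set()

    def add(self, mac):
        # Adding a MAC that is already present fails with EEXIST,
        # matching the "File exists" error in the report.
        if mac in self.entries:
            raise OSError(errno.EEXIST, "File exists")
        self.entries.add(mac)

    def remove(self, mac):
        self.entries.discard(mac)


class Guest:
    def __init__(self, fdb, mac):
        self.fdb, self.mac, self.paused = fdb, mac, False
        fdb.add(mac)  # assumption: the entry is added when the guest starts

    def suspend(self):
        # Explicit pause (like "virsh suspend"): the entry is removed.
        self.fdb.remove(self.mac)
        self.paused = True

    def stop_event(self, cleanup=False):
        # Pause triggered outside an explicit suspend (e.g. an I/O error).
        # cleanup=True models the suggestion: remove the entry here too.
        if cleanup:
            self.fdb.remove(self.mac)
        self.paused = True

    def resume(self):
        self.fdb.add(self.mac)  # raises EEXIST if the entry was left behind
        self.paused = False
```

In this model, `suspend()` followed by `resume()` works, `stop_event()` followed by `resume()` raises `OSError` with `errno.EEXIST`, and `stop_event(cleanup=True)` followed by `resume()` works again.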

Comment 7 RHEL Program Management 2021-02-15 07:40:46 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 8 Jiri Denemark 2021-05-03 09:19:14 UTC
This is now fixed upstream with

commit 241c22a9a531cb39d2b6b892561fe856f32f310d
Refs: v7.3.0-5-g241c22a9a5
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Apr 30 17:25:29 2021 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon May 3 11:12:58 2021 +0200

    virnetdevbridge: Ignore EEXIST when adding an entry to fdb

    When updating entries in a bridge forwarding database (i.e., when
    macTableManager='libvirt' is configured for the bridge), we may end up
    in a situation when the entry we want to add is already present. Let's
    just ignore the error in such a case.

    This fixes an error to resume a domain when fdb entries were not
    properly removed when the domain was paused:

       	virsh # resume test
       	error: Failed to resume domain test
       	error: error adding fdb entry for vnet2: File exists

    For some reason, fdb entries are only removed when libvirt explicitly
    stops CPUs, but nothing happens when we just get STOP event from QEMU.
    An alternative approach would be to make sure we always remove the
    entries regardless on why a domain was paused (e.g., during migration),
    but that would be a significantly more disruptive change with possible
    side effects.

    https://bugzilla.redhat.com/show_bug.cgi?id=1603155

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Laine Stump <laine>
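
The idea in this commit message can be sketched in a few lines of Python. This is only an illustration of the "ignore EEXIST" pattern, not the actual C change in virnetdevbridge; `add_fdb_entry` and the `add_entry` callable are hypothetical names.

```python
import errno


def add_fdb_entry(add_entry, ifname, mac):
    """Add an fdb entry, treating "already present" as success.

    add_entry is a hypothetical callable that raises OSError on failure.
    """
    try:
        add_entry(ifname, mac)
    except OSError as err:
        if err.errno != errno.EEXIST:
            raise  # any other failure is still reported to the caller
        # EEXIST: the entry is already there, which is the desired state.
```

Only the "already exists" case is swallowed, so unrelated failures such as missing permissions would still surface as errors.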

Comment 10 yalzhang@redhat.com 2021-05-08 10:38:44 UTC
Reproduced this bug on libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64 with the steps in bug 1647058#c0:
1. create a libvirt NAT network with
   <bridge name='kvm' stp='off' delay='0' macTableManager='libvirt'/>
2. create a domain with an interface connected to the network from step 1
3. start the domain
4. run below command:
# virsh qemu-monitor-command rhel '{"execute":"stop"}'
{"return":{},"id":"libvirt-13"}
5. check the vm status, it is paused:
# virsh list 
 Id   Name   State
---------------------
 1    rhel   paused
6. try to resume the vm; it fails:
# virsh resume rhel 
error: Failed to resume domain 'rhel'
error: error adding fdb entry for vnet0: File exists
# cat /var/log/libvirt/libvirtd.log  | grep error
2021-05-08 10:30:58.375+0000: 24359: error : virNetDevBridgeFDBAddDel:1071 : error adding fdb entry for vnet0: File exists
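
The log check in step 6 can be narrowed to the fdb failure with a small Python helper. This is a sketch: the pattern is derived only from the single log line quoted above, other libvirt versions may word the message differently, and `find_fdb_errors` is a hypothetical name.

```python
import re

# Built from the one log line shown in this comment; treat it as an assumption.
FDB_ERROR = re.compile(
    r"error adding fdb entry for (?P<ifname>\S+): (?P<reason>.+)$"
)


def find_fdb_errors(lines):
    """Return (interface, reason) pairs for fdb add failures in log lines."""
    found = []
    for line in lines:
        match = FDB_ERROR.search(line)
        if match:
            found.append((match.group("ifname"), match.group("reason").strip()))
    return found
```

It could be fed the lines of /var/log/libvirt/libvirtd.log, for example by passing an open file object as `lines`.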

Comment 11 yalzhang@redhat.com 2021-05-13 10:18:50 UTC
Tested on v7.3.0-163-g156315cff4 with the same steps as above; the bug is fixed:
➜  ~ virsh qemu-monitor-command pc '{"execute":"stop"}'
{"return":{},"id":"libvirt-379"}

➜  ~ virsh list 
 Id   Name   State
---------------------
 1    pc     paused

➜  ~ virsh resume pc
Domain 'pc' resumed

➜  ~ virsh list 
 Id   Name   State
----------------------
 1    pc     running

Comment 14 yalzhang@redhat.com 2021-06-17 02:21:36 UTC
Tested on libvirt-7.4.0-1.module+el8.5.0+11218+83343022.x86_64 with the same steps as in comment 10:

1. Configure the network with "macTableManager='libvirt'":
<bridge name='virbr0' stp='on' delay='0' macTableManager='libvirt'/>

2. Start a vm with an interface connected to the network;

3. Stop the vm through the QEMU monitor:
# virsh qemu-monitor-command rhel '{"execute":"stop"}'
{"return":{},"id":"libvirt-13"}

4. Check that the vm is paused, then try to resume it. The resume succeeds, and the logs show no errors or failures:
# virsh list 
 Id   Name   State
---------------------
 1    rhel   paused

# virsh resume rhel 
Domain 'rhel' resumed

Comment 16 errata-xmlrpc 2021-11-16 07:49:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

