Bug 1857967

Summary: Broken migration with a host-passthrough CPU
Product: Red Hat Enterprise Linux Advanced Virtualization
Component: libvirt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Jiri Denemark <jdenemar>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Luyao Huang <lhuang>
CC: drjones, dyuan, fjin, jdenemar, lmen, virt-maint, xuzhang, yalzhang
Keywords: Regression, Triaged
Target Milestone: rc
Target Release: 8.3
Fixed In Version: libvirt-6.6.0-1.el8
Type: Bug
Last Closed: 2020-11-17 17:50:17 UTC

Description Jiri Denemark 2020-07-16 19:28:24 UTC
Description of problem:

Second migration of a domain originally started by libvirt older than 6.5.0
may fail with an error similar to

    unable to execute QEMU command 'migrate': State blocked by non-migratable
    CPU device (invtsc flag)

Version-Release number of selected component (if applicable):

libvirt-6.5.0-1.el8

How reproducible:

always

Steps to Reproduce:
1. start libvirtd older than 6.5.0
2. start a domain with host-passthrough CPU
3. upgrade libvirtd to 6.5.0
4. migrate the domain to a host running libvirt 6.5.0 (or newer)
5. migrate the domain back to the original host

Both hosts should have identical HW and SW, specifically microcode version,
kernel version and its command line options, and kvm{,_intel,_amd} module
options. Otherwise migration with host-passthrough CPU may be impossible.
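
One way to compare the two hosts before trying the reproducer is to collect
the settings listed above on each host and diff the results. The following is
only a sketch: host_fingerprint() and fingerprint_diff() are hypothetical
helper names, and the script assumes a Linux host that exposes /proc/cmdline,
the "microcode" field in /proc/cpuinfo (x86), and /sys/module/<name>/parameters
for loaded modules.

```python
import platform
from pathlib import Path


def read_text(path: Path) -> str:
    """Return stripped file contents, or '' if the file is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def host_fingerprint(root: str = "/") -> dict:
    """Collect the settings that should match on both migration hosts."""
    base = Path(root)
    info = {
        "kernel": platform.release(),
        "cmdline": read_text(base / "proc/cmdline"),
    }
    # Use the first "microcode" line of /proc/cpuinfo, if there is one.
    for line in read_text(base / "proc/cpuinfo").splitlines():
        if line.startswith("microcode"):
            info["microcode"] = line.split(":", 1)[1].strip()
            break
    # Record the options of whichever kvm modules are loaded.
    for module in ("kvm", "kvm_intel", "kvm_amd"):
        params = base / "sys/module" / module / "parameters"
        if params.is_dir():
            for param in sorted(params.iterdir()):
                info[f"{module}.{param.name}"] = read_text(param)
    return info


def fingerprint_diff(a: dict, b: dict) -> dict:
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Run host_fingerprint() on each host, transfer one result to the other host
(for example as JSON), and pass both to fingerprint_diff(). An empty result
means the collected settings match. It does not prove that the hardware is
identical.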

Actual results:

The bug may be observed in any step starting with step 3:

- after step 3 "virsh dumpxml" and "virsh dumpxml --inactive" show different
  values for the migratable attribute of the <cpu> element:
  virsh dumpxml:            <cpu mode='host-passthrough' check='none' migratable='off'/>
  virsh dumpxml --inactive: <cpu mode='host-passthrough' check='none' migratable='on'/>

- after step 4 the domain XML shows migratable='off' and the domain log or ps
  can show the QEMU process was started with -cpu host,migratable=off

- the domain either fails to migrate in step 5 or it is again started with
  migratable='off' (depending on the host capabilities)
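
The mismatch described in the first item can be checked mechanically. The
sketch below parses two XML dumps and compares the migratable attribute of
the <cpu> element. cpu_migratable() and migratable_mismatch() are hypothetical
helper names, and the script assumes the two strings are the complete output
of "virsh dumpxml DOMAIN" and "virsh dumpxml DOMAIN --inactive".

```python
import xml.etree.ElementTree as ET


def cpu_migratable(domain_xml: str):
    """Return the migratable attribute of <cpu>, or None if it is absent."""
    cpu = ET.fromstring(domain_xml).find("cpu")
    return None if cpu is None else cpu.get("migratable")


def migratable_mismatch(active_xml: str, inactive_xml: str) -> bool:
    """True when the active and inactive definitions disagree."""
    return cpu_migratable(active_xml) != cpu_migratable(inactive_xml)
```

With the affected libvirt the two dumps from step 3 would make
migratable_mismatch() return True. With the expected behavior it returns
False.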

Expected results:

Both "virsh dumpxml" and "virsh dumpxml --inactive" should contain

    <cpu mode='host-passthrough' check='none' migratable='on'/>

after step 3.

In step 4 the domain should be started with -cpu host,migratable=on and the
domain XML should be similar to the one in step 3, i.e., with migratable='on'.

In step 5 the domain should be successfully migrated and started with
migratable=on.
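
To confirm how the QEMU process was started, the -cpu argument can be pulled
out of the command line shown by ps or the domain log. This is a minimal
sketch: cpu_option() is a hypothetical helper, and it assumes a
space-separated command line as printed by ps (not the NUL-separated
/proc/PID/cmdline).

```python
import shlex


def cpu_option(cmdline: str) -> dict:
    """Parse the value of QEMU's -cpu argument into model and properties."""
    args = shlex.split(cmdline)
    # list.index() raises ValueError if -cpu is not on the command line.
    value = args[args.index("-cpu") + 1]
    model, *props = value.split(",")
    result = {"model": model}
    for prop in props:
        key, _, val = prop.partition("=")
        result[key] = val
    return result
```

For a command line containing "-cpu host,migratable=on" the function returns
a dictionary with model "host" and migratable "on", which is the expected
state in steps 4 and 5.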

Additional info:

This regression is caused by the following upstream commit:

commit 201bd5db639c063862b0c1b1abfab9a9a7c92591
Refs: v6.4.0-61-g201bd5db63
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue Jun 2 15:34:07 2020 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jun 9 20:32:50 2020 +0200

    qemu: Fill default value in //cpu/@migratable attribute

    Before QEMU introduced migratable CPU property, "-cpu host" included all
    features that could be enabled on the host, even those which would block
    migration. In other words, the default was equivalent to migratable=off.
    When the migratable property was introduced, the default changed to
    migratable=on. Let's record the default in domain XML.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Michal Privoznik <mprivozn>

Comment 1 Jiri Denemark 2020-07-16 20:11:18 UTC
Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2020-July/msg01236.html

Comment 2 Jiri Denemark 2020-07-21 13:45:49 UTC
This bug is now fixed upstream by

commit c7afaa69cdd712d74d98e3cb37afd1b46aef7e42
Refs: v6.5.0-274-gc7afaa69cd
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Jul 15 22:33:07 2020 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jul 21 15:40:01 2020 +0200

    qemu_monitor: Add API for checking CPU migratable property

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Daniel Henrique Barboza <danielhb413>

commit 4872ad27aae6b24a441e7bd59bd7ae234ef33b5b
Refs: v6.5.0-275-g4872ad27aa
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Jul 15 11:33:05 2020 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jul 21 15:40:01 2020 +0200

    qemu: Do not set //cpu/@migratable for running domains in post-parse

    Commit v6.4.0-61-g201bd5db63 started to fill the default value for
    //cpu/@migratable attribute according to QEMU support. However, active
    domains either have the migratable attribute already set or the
    capabilities we use for checking the QEMU support were created by older
    libvirt which didn't probe for this specific capability. Thus we should
    leave active domains alone when parsing their XMLs.

    https://bugzilla.redhat.com/show_bug.cgi?id=1857967

    Reported-by: Mark Mielke <mark.mielke>
    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Daniel Henrique Barboza <danielhb413>

commit 1031db36003c34d0291f3573f7d39cfae25e2cd7
Refs: v6.5.0-276-g1031db3600
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Jul 15 17:54:07 2020 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jul 21 15:40:01 2020 +0200

    qemu: Properly set //cpu/@migratable default value for running domains

    Since active domains which do not have the attribute already set were
    not started by libvirt that probed for CPU migratable property, we need
    to check this property on reconnect and update the domain definition
    accordingly.

    https://bugzilla.redhat.com/show_bug.cgi?id=1857967

    Reported-by: Mark Mielke <mark.mielke>
    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Daniel Henrique Barboza <danielhb413>
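
The behavior described by the last two commit messages can be illustrated
with a small model. This is not libvirt code: CpuDef, post_parse() and
reconnect() are hypothetical names, and query_migratable stands in for
whatever asks the running QEMU process for the CPU's migratable property.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CpuDef:
    mode: str
    migratable: Optional[str] = None  # "on", "off", or None when unset


def post_parse(cpu: CpuDef, active: bool, qemu_has_migratable: bool) -> None:
    """Fill the default only for inactive definitions; leave running domains alone."""
    if active or cpu.mode != "host-passthrough" or cpu.migratable is not None:
        return
    if qemu_has_migratable:
        cpu.migratable = "on"


def reconnect(cpu: CpuDef, query_migratable: Callable[[], bool]) -> None:
    """For a running domain with the attribute unset, record what QEMU reports."""
    if cpu.mode != "host-passthrough" or cpu.migratable is not None:
        return
    cpu.migratable = "on" if query_migratable() else "off"
```

In this model a domain started by an older libvirt keeps the attribute unset
through post_parse() and only gets a value in reconnect(), based on the
running process rather than on possibly stale capabilities.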

Comment 5 Luyao Huang 2020-10-19 08:36:44 UTC
Verified this bug with libvirt-daemon-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64:

1. prepare a host with old libvirt (<6.5.0):

# rpm -q libvirt-daemon
libvirt-daemon-6.0.0-17.3.module+el8.2.0+6907+6abdb1b6.x86_64

2. start a host-passthrough cpu mode guest:

# virsh dumpxml vm1
...
  <cpu mode='host-passthrough' check='partial'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
...

# virsh start vm1
Domain vm1 started

3. update host to latest 8.3 virt module:

4. check guest's active xml and inactive xml

# virsh dumpxml vm1

  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>

# virsh dumpxml vm1 --inactive

  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>

5. migrate guest to another host which has the same test environment:

# virsh migrate vm1 qemu+ssh://host1/system --live


6. check guest xml and qemu command line on target host:

# virsh dumpxml vm1
...
  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
...

# virsh dumpxml vm1 --inactive
...
  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
...

# ps aux|grep qemu
...-cpu host,migratable=on...

7. migrate back to source host:

# virsh migrate vm1 qemu+ssh://host0/system --live

8. check guest xml and qemu command line on source host:

# virsh dumpxml vm1
...
  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>
...

# virsh dumpxml vm1 --inactive

  <cpu mode='host-passthrough' check='partial' migratable='on'>
    <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>

# ps uax|grep qemu
...-cpu host,migratable=on...

Comment 8 errata-xmlrpc 2020-11-17 17:50:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137