Bug 1495171
Summary: Post libvirt upgrade to 3.2.0-14, migration fails with -- "can't apply global Haswell-noTSX-x86_64-cpu.cmt=off: Property '.cmt' not found"
Product: Red Hat Enterprise Linux 7
Component: libvirt
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 7.2
Target Milestone: pre-dev-freeze
Target Release: ---
Hardware: x86_64
OS: Linux
Reporter: Sergii Mykhailushko <smykhail>
Assignee: Jiri Denemark <jdenemar>
QA Contact: zhe peng <zpeng>
CC: aglotov, anrussel, berrange, chhu, dasmith, dhill, dvd, dyuan, eglynn, fabian, jdenemar, jherrman, jishao, kchamart, mkalinin, pneedle, rbalakri, rbryant, rhodain, sbauza, sferdjao, sgordon, srevivo, thomas.oulevey, toneata, vromanso, xuzhang, yafu, zpeng
Keywords: ZStream
Fixed In Version: libvirt-3.9.0-1.el7
Doc Type: Bug Fix
Doc Text: Previously, the libvirt service in some cases added the "cmt" CPU feature, which is incompatible with the QEMU emulator, to KVM guest virtual machines with CPU set to "host-model". As a consequence, migrating or restoring these guests failed. With this update, libvirt no longer adds "cmt" to domain features and automatically removes "cmt" from the guest configuration if present. As a result, the affected guests can be migrated and restored correctly.
Story Points: ---
Clones: 1508010, 1508549
Last Closed: 2018-04-10 10:57:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1199452, 1508549

Description
Sergii Mykhailushko
2017-09-25 12:03:22 UTC
First off, from looking at the logs, the source libvirt is NOT properly upgraded to libvirt-3.2.0-14.el7_4.3.x86_64. Let's see why.

You see the libvirt version on the source host in the logs as 'libvirt-3.2.0-14.el7_4.3.x86_64' because the upgrade was performed _while_ the Nova instance was still running on the source host with the old libvirt / QEMU versions.

So, the following are the versions of libvirt and QEMU that you see on the source and destination hosts for the Nova instance 92a68a77-6e75-43e5-9e7f-1392838ef18f ('instance-000010c8') that's being migrated. (Found by looking at the relevant QEMU command-lines for the Nova instance in /var/log/libvirt/qemu/instance-000010c8.log on the source and destination hosts):

  - Source (cl-ra15-n8.mgt.cluster):
    - libvirt version: 1.2.17, package: 13.el7_2.4
    - qemu-kvm-rhev-2.3.0-31.el7_2.13

  - Destination (cl-ra15-n24.mgt.cluster):
    - libvirt version: 3.2.0, package: 14.el7_4.3
    - qemu-kvm-rhev-2.9.0-10.el7

The following is the QEMU command-line for the Nova instance '92a68a77-6e75-43e5-9e7f-1392838ef18f' ('instance-000010c8') from both the source and destination hosts.

QEMU invocation on source host:

-----------------------------------------------------------------------
2017-01-20 10:06:45.332+0000: starting up libvirt version: 1.2.17, package: 13.el7_2.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-03-02-11:10:27, x86-034.build.eng.bos.redhat.com), qemu version: 2.3.0 (qemu-kvm-rhev-2.3.0-31.el7_2.13)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name instance-000010c8 -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 92a68a77-6e75-43e5-9e7f-1392838ef18f -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=12.0.3-1.el7ost,serial=d336b2a1-4abd-4602-9bc4-4aeb20f778e7,uuid=92a68a77-6e75-43e5-9e7f-1392838ef18f,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-instance-000010c8/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/92a68a77-6e75-43e5-9e7f-1392838ef18f/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:12:82:a5,bus=pci.0,addr=0x3 -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:df:df:79,bus=pci.0,addr=0x4 -chardev file,id=charserial0,path=/var/lib/nova/instances/92a68a77-6e75-43e5-9e7f-1392838ef18f/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
char device redirected to /dev/pts/0 (label charserial1)
-----------------------------------------------------------------------

QEMU invocation on destination host:

-----------------------------------------------------------------------
2017-09-22 15:29:52.078+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-08-22-08:54:01, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-10.el7), hostname: cl-ra15-n24.mgt.cluster
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-000010c8,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-instance-000010c8/master-key.aes -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX,vme=on,ds=off,acpi=off,ss=on,ht=off,tm=off,pbe=off,dtes64=off,monitor=off,ds_cpl=off,vmx=off,smx=off,est=off,tm2=off,xtpr=off,pdcm=off,dca=off,osxsave=off,f16c=on,rdrand=on,arat=off,tsc_adjust=off,cmt=off,xsaveopt=on,pdpe1gb=on,abm=on,hypervisor=on -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 92a68a77-6e75-43e5-9e7f-1392838ef18f -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=12.0.3-1.el7ost,serial=d336b2a1-4abd-4602-9bc4-4aeb20f778e7,uuid=92a68a77-6e75-43e5-9e7f-1392838ef18f,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-11-instance-000010c8/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/92a68a77-6e75-43e5-9e7f-1392838ef18f/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:12:82:a5,bus=pci.0,addr=0x3 -netdev tap,fd=34,id=hostnet1,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:df:df:79,bus=pci.0,addr=0x4 -netdev tap,fd=36,id=hostnet2,vhost=on,vhostfd=37 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=fa:16:3e:5a:b8:c4,bus=pci.0,addr=0x7 -add-fd set=6,fd=39 -chardev file,id=charserial0,path=/dev/fdset/6,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:5 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
2017-09-22T15:29:52.179516Z qemu-kvm: -chardev pty,id=charserial1: char device redirected to /dev/pts/5 (label charserial1)
2017-09-22T15:29:52.198139Z qemu-kvm: can't apply global Haswell-noTSX-x86_64-cpu.cmt=off: Property '.cmt' not found
2017-09-22 15:29:52.228+0000: shutting down, reason=failed
-----------------------------------------------------------------------

Created attachment 1332074 [details]
To be migrated Nova instance XML from source Compute node
[Drilling down a bit further. Also thanks to Daniel Berrangé for help in investigating this so far.]

If you see the Nova instance XML (https://bugzilla.redhat.com/attachment.cgi?id=1332074) prepared by libvirt on the source host, you'll see:

    <cpu mode="custom" match="exact" check="partial">
      <model fallback="allow">Haswell-noTSX</model>
      [...]
      <feature policy="require" name="cmt"/>
      [...]

And the original arguments used to boot the guest did not say anything about the 'cmt' feature, it just ran:

    [...]
    -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
    [...]

But when migrated to the destination host, libvirt is trying to run:

    [...]
    -cpu Haswell-noTSX,vme=on,ds=off,acpi=off,ss=on,ht=off,tm=off,pbe=off,dtes64=off,monitor=off,ds_cpl=off,vmx=off,smx=off,est=off,tm2=off,xtpr=off,pdcm=off,dca=off,osxsave=off,f16c=on,rdrand=on,arat=off,tsc_adjust=off,cmt=off,xsaveopt=on,pdpe1gb=on,abm=on,hypervisor=on
    [...]

Notice the "cmt=off" bit above.

So despite having "<feature policy="require" name="cmt"/>", libvirt seems to be still setting 'cmt' to 'off' on the destination.

Jiri: any insights here?

From comment#5, it seems that the new libvirt on the source somehow thinks it needs 'cmt=on' (refer: https://bugzilla.redhat.com/attachment.cgi?id=1332074) in the guest XML when it wasn't there originally -- notice this from the '-cpu' QEMU command-line of the Nova instance on the source host.

---

Just noting for context, a related (fixed) bug: https://bugzilla.redhat.com/show_bug.cgi?id=1365500 -- CPU feature cmt not found with 2.0.0-1

Created attachment 1332325 [details]
Nova instance guest XML at the time of launch (2017-01-20 10:06:44) on the source host
This was obtained from:
cl-ra15-n8.mgt.cluster/etc/libvirt/qemu/instance-000010c8.xml
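
To make the difference between the two '-cpu' arguments quoted above easier to see, here is a minimal Python sketch that parses both strings and lists the features that appear only on the destination command line. The helper name parse_cpu_arg is hypothetical and not part of libvirt or QEMU; the two strings are copied from the command lines in this report, and the parser assumes only the "+name", "-name" and "name=on|off" spellings.

```python
def parse_cpu_arg(arg):
    """Split a QEMU -cpu value into (model, {feature_name: enabled}).

    Hypothetical helper for illustration only. It understands the
    "+feat"/"-feat" spelling seen on the source host and the
    "feat=on"/"feat=off" spelling seen on the destination host.
    """
    model, *flags = arg.split(",")
    features = {}
    for flag in flags:
        if flag.startswith("+"):
            features[flag[1:]] = True
        elif flag.startswith("-"):
            features[flag[1:]] = False
        else:
            name, _, value = flag.partition("=")
            features[name] = (value == "on")
    return model, features


# -cpu value from the source host (libvirt 1.2.17, QEMU 2.3.0)
SOURCE = ("Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,"
          "+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,"
          "+ss,+acpi,+ds,+vme")

# -cpu value from the destination host (libvirt 3.2.0, QEMU 2.9.0)
DEST = ("Haswell-noTSX,vme=on,ds=off,acpi=off,ss=on,ht=off,tm=off,pbe=off,"
        "dtes64=off,monitor=off,ds_cpl=off,vmx=off,smx=off,est=off,tm2=off,"
        "xtpr=off,pdcm=off,dca=off,osxsave=off,f16c=on,rdrand=on,arat=off,"
        "tsc_adjust=off,cmt=off,xsaveopt=on,pdpe1gb=on,abm=on,hypervisor=on")

src_model, src = parse_cpu_arg(SOURCE)
dst_model, dst = parse_cpu_arg(DEST)

# Features mentioned on the destination command line but not on the source
only_on_destination = sorted(set(dst) - set(src))
print(only_on_destination)
```

For these two strings the destination-only features are arat, cmt, hypervisor, tsc_adjust and xsaveopt; of those, 'cmt' is the one QEMU rejects with "Property '.cmt' not found".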
I was going to suggest shutting down the VM on the source and removing "cmt" from the domain XML. However, this is interesting:

(In reply to Kashyap Chamarthy from comment #5)
[...]
> But when migrated to the destination host, libvirt is trying to run:
>
> [...]
> -cpu
> Haswell-noTSX,vme=on,ds=off,acpi=off,ss=on,ht=off,tm=off,pbe=off,dtes64=off,
> monitor=off,ds_cpl=off,vmx=off,smx=off,est=off,tm2=off,xtpr=off,pdcm=off,
> dca=off,osxsave=off,f16c=on,rdrand=on,arat=off,tsc_adjust=off,cmt=off,
> xsaveopt=on,pdpe1gb=on,abm=on,hypervisor=on
> [...]
>
> Notice the "cmt=off" bit above.
>
> So despite having "<feature policy="require" name="cmt"/>", libvirt
> seems to be still setting 'cmt' to 'off' on the destination.
>

It looks like libvirt knows it needs to remove "cmt" when migrating, which is really good news, so maybe we don't need to shut down the VM and edit the domain XML on the source to fix the problem. But for some reason libvirt believes "cmt=off" is necessary on the destination.

I hope Jiri can help find a workaround or a fix for this.

The domain was started on libvirt-1.2.17-13.el7_2.4 and qemu-kvm-rhev-2.3.0-31.el7_2.13. While the domain was running, the host was upgraded to 7.4, which means libvirt-3.2.0-14.el7_4.3 and qemu-kvm-rhev-2.9.0-10.el7. At this point the new libvirt reconnected to the running domain (which was still using qemu-kvm-rhev-2.3.0-31.el7_2.13).

Because the old libvirt didn't replace the host-model CPU with the actual custom CPU definition which was used when starting a domain (old libvirt used to do this replacement on demand rather than doing it just once and storing the result in the live definition), the new libvirt needs to do this replacement while reconnecting to the running domain (this was addressed in bug 1470582). We mimic what the old libvirt did when starting the domain by using the host CPU from the capabilities XML as a base. Since the new libvirt knows such a CPU model would almost never match the CPU QEMU actually provided to a guest, we ask QEMU for all enabled CPUID bits and disable all features in the base CPU model which were not enabled by QEMU. This way we can get a CPU definition which matches the virtual CPU seen by the guest OS.

The thing is, "To be migrated Nova instance XML from source Compute node" contains no disabled features and yet, there are several of them disabled when the domain gets started on the destination.

I'll try to reproduce this issue once I get a machine with cmt.

I see, the "to be migrated XML" is a migratable XML rather than the active XML of the domain. This XML is transferred during migration to the destination host for backward compatibility. The libvirt from RHEL-7.4 on the destination host will use the active XML which contains the disabled features and thus we can see a lot of disabled features on the command line there.

So the main problem here is that the process of translating host-model to a custom mode CPU when libvirt reconnects to existing domains started by old libvirt does not account for the bloody cmt feature, which is only known to libvirt.

There is a simple, although very hackish workaround:

1. systemctl stop libvirtd
2. edit /var/run/libvirt/qemu/instance-000010c8.xml and remove all lines with <feature ... name='cmt'/>
3. systemctl start libvirtd

(In reply to Jiri Denemark from comment #11)
> I see, the "to be migrated XML" is a migratable XML rather than the active
> XML of the domain.

Indeed, that's the correct interpretation. I tried to be clear when describing that XML. :-)

> This XML is transferred during migration to the destination
> host for backward compatibility. The libvirt from RHEL-7.4 on the destination
> host will use the active XML which contains the disabled features and thus we
> can see a lot of disabled features on the command line there.

Ah-ha, thanks for confirming that.
> So the main problem here is that the process of translating host-model to a
> custom mode CPU when libvirt reconnects to existing domains started by old
> libvirt does not account for the bloody cmt feature which is only known to
> libvirt.

Heh, indeed.

And indeed your workaround is what I recommended to the bug reporter on IRC. That seems to be the only possible way to fix this _without_ guest down-time.

Thanks for the overall analysis.

*** Bug 1497320 has been marked as a duplicate of this bug. ***

(In reply to Kashyap Chamarthy from comment #13)
> And indeed your workaround is what I recommended to the bug reporter on IRC.
> That seems to be the only possible way to fix this _without_ guest down-time.

Yeah, until this bug is fixed, in which case upgrading libvirt should solve the issue automatically.

Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2017-October/msg00432.html

This bug is now fixed upstream by

commit 4b87b3675ffd7794542de70da4391f3787ed76a2
Refs: v3.8.0-132-g4b87b3675f
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 6 12:57:15 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Oct 17 15:08:05 2017 +0200

    qemu: Separate CPU updating code from qemuProcessReconnect

    The new function is called qemuProcessRefreshCPU.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Pavel Hrdina <phrdina>

commit 3276416904393a06df664c5d849ee805d07688d8
Refs: v3.8.0-133-g3276416904
Author:     Jiri Denemark <jdenemar>
AuthorDate: Mon Oct 9 16:20:43 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Oct 17 15:08:05 2017 +0200

    conf: Introduce virCPUDefFindFeature

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Pavel Hrdina <phrdina>

commit e26cc8f82ff346c9ec90409bac06581b64e42b20
Refs: v3.8.0-134-ge26cc8f82f
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 6 13:23:36 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Oct 17 15:08:05 2017 +0200

    qemu: Filter CPU features when using host CPU

    When reconnecting to a domain started with a host-model CPU which was
    started by old libvirt that did not replace host-model with the real
    CPU definition, libvirt replaces the host-model CPU with the CPU from
    capabilities (because this is what the old libvirt did when it started
    the domain). Without this patch libvirt could use features unknown to
    QEMU in the CPU definition which replaced the original host-model CPU.
    Such domain would keep running just fine, but any attempt to migrate
    it will fail and once the domain is saved or snapshotted, restoring it
    would fail too.

    In other words whenever we want to use the CPU definition from host
    capabilities as a guest CPU definition, we have to filter the unknown
    features.

    https://bugzilla.redhat.com/show_bug.cgi?id=1495171

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Pavel Hrdina <phrdina>

commit 6a6f6b91e0e76480ea961f83135efcb4faf3284a
Refs: v3.8.0-135-g6a6f6b91e0
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 6 14:49:07 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Oct 17 15:08:05 2017 +0200

    qemu: Fix CPU model broken by older libvirt

    When libvirt older than 3.9.0 reconnected to a running domain started
    by old libvirt it could have messed up the expansion of host-model by
    adding features QEMU does not support (such as cmt). Thus whenever we
    reconnect to a running domain, revert to an active snapshot, or
    restore a saved domain we need to check the guest CPU model and remove
    the CPU features unknown to QEMU. We can do this because we know the
    domain was successfully started, which means the CPU did not contain
    the features when libvirt started the domain.

    https://bugzilla.redhat.com/show_bug.cgi?id=1495171

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Pavel Hrdina <phrdina>

Hi all,

I have cloned this Bugzilla as follows, for RHEL 7.4.z:

Bug 1508010 - Post libvirt upgrade to 3.2.0-14, migration fails with -- "can't apply global Haswell-noTSX-x86_64-cpu.cmt=off: Property '.cmt' not found" [RHEL 7.4.z]
https://bugzilla.redhat.com/show_bug.cgi?id=1508010

Kind regards,
Paul.

I can reproduce this.

Steps:
1. prepare a host with cmt
2. install rhel7.2.z (including libvirt, qemu-kvm-rhev)
3. prepare a guest with host-model cpu setting and start it
   check guest xml:
   ....
   <cpu mode='host-model'>
     <model fallback='allow'>Haswell-noTSX</model>
     <vendor>Intel</vendor>
     <topology sockets='2' cores='1' threads='1'/>
   </cpu>
   ....
4. update libvirt to the rhel7.4.z build (libvirt-3.2.0-14.el7_4.3.x86_64)
5. check guest xml
   # virsh dumpxml $guest
   ...
   <cpu mode='custom' match='exact' check='full'>
     <model fallback='allow'>Haswell-noTSX</model>
     <vendor>Intel</vendor>
     <topology sockets='2' cores='1' threads='1'/>
     ...
     <feature policy='disable' name='cmt'/>
     ...
   </cpu>
   ...
6. migrating the guest to the dst host (rhel7.4.z installed) gives this error:
   # virsh migrate --live rhel7.2 qemu+ssh://$target/system --verbose
   error: internal error: qemu unexpectedly closed the monitor: 2017-11-03T10:07:45.618330Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
   2017-11-03T10:07:45.626584Z qemu-kvm: can't apply global Haswell-noTSX-x86_64-cpu.cmt=off: Property '.cmt' not found

Updating only libvirt and then migrating the vm is enough to cause the error.

Verified with build: libvirt-3.9.0-3.el7.x86_64

Steps are the same as in comment 28.

# virsh migrate --live rhel7.2 qemu+ssh://$target_host/system --verbose
Migration: [100 %]

check guest xml on target host:
....
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <topology sockets='2' cores='1' threads='1'/>
  <feature policy='require' name='vme'/>
  <feature policy='disable' name='ds'/>
  <feature policy='disable' name='acpi'/>
  <feature policy='require' name='ss'/>
  <feature policy='disable' name='ht'/>
  <feature policy='disable' name='tm'/>
  <feature policy='disable' name='pbe'/>
  <feature policy='disable' name='dtes64'/>
  <feature policy='disable' name='monitor'/>
  <feature policy='disable' name='ds_cpl'/>
  <feature policy='disable' name='vmx'/>
  <feature policy='disable' name='smx'/>
  <feature policy='disable' name='est'/>
  <feature policy='disable' name='tm2'/>
  <feature policy='disable' name='xtpr'/>
  <feature policy='disable' name='pdcm'/>
  <feature policy='disable' name='dca'/>
  <feature policy='disable' name='osxsave'/>
  <feature policy='require' name='f16c'/>
  <feature policy='require' name='rdrand'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='require' name='xsaveopt'/>
  <feature policy='require' name='pdpe1gb'/>
  <feature policy='require' name='abm'/>
  <feature policy='require' name='hypervisor'/>
</cpu>
....

move to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704
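
For hosts that cannot yet take the fixed build, step 2 of the manual workaround given earlier in this thread (removing every <feature ... name='cmt'/> line from the domain's XML) could be scripted roughly as below. This is only a sketch under stated assumptions: strip_cpu_features is a hypothetical helper, not libvirt code and not how the upstream patches implement the filtering, and SAMPLE is a trimmed illustrative fragment modelled on the reproduction steps above rather than a real status file. It uses only the Python standard library; try it on a copy of the XML, with libvirtd stopped as the workaround says.

```python
import xml.etree.ElementTree as ET


def strip_cpu_features(domain_xml, unsupported):
    """Return (cleaned_xml, removed_names).

    Hypothetical helper: removes every <feature name='X'/> child of any
    <cpu> element whose name is in `unsupported`. Using iter() means it
    also finds a <cpu> element nested below the document root.
    """
    root = ET.fromstring(domain_xml)
    removed = []
    for cpu in root.iter("cpu"):
        # Copy the list first so removal does not disturb iteration
        for feature in list(cpu.findall("feature")):
            if feature.get("name") in unsupported:
                cpu.remove(feature)
                removed.append(feature.get("name"))
    return ET.tostring(root, encoding="unicode"), removed


# Trimmed, illustrative fragment (assumption: not a complete domain XML)
SAMPLE = """
<domain type='kvm'>
  <name>rhel7.2</name>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='allow'>Haswell-noTSX</model>
    <feature policy='require' name='vme'/>
    <feature policy='disable' name='cmt'/>
  </cpu>
</domain>
"""

cleaned, removed = strip_cpu_features(SAMPLE, {"cmt"})
print(removed)
print(cleaned)
```

The set of unsupported names is passed in explicitly because this sketch has no way of asking QEMU which features it knows; in this bug the only offending feature identified was 'cmt'.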