Bug 1439078
Summary: After migration, VM crash in dst host with "qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff"
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Version: 7.4
Target Milestone: rc
Hardware: x86_64
OS: Linux
Reporter: xianwang <xianwang>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: xianwang <xianwang>
CC: berrange, bmordeha, cbf123, chayang, dgilbert, hhuang, jdenemar, juzhang, michen, peterx, quintela, qzhang, virt-maint, xianwang, zhengtli
Last Closed: 2017-04-20 11:23:11 UTC
Type: Bug
Description
xianwang
2017-04-05 07:49:23 UTC
Created attachment 1268880 [details]
bug1439078_gdb_info.txt

1) I have added gdb information as an attachment.

2) src host:
[root@dell-per630-01 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 1790.250
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.30
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

dst host:
[root@dell-per630-02 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4799.74
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

3) If "-cpu host" is dropped from the qemu command line on both src and dst, migration succeeds and the VM works well.

Hi,
38F is 'IA32_PERF_GLOBAL_CTRL'.
Can I check a few things please:
a) Can you confirm this is native on the host - you're not trying to run nested?
b) Can you please attach the output of 'dmesg' from the boot of the host prior to starting the guest.

An E5-2630 v3 is a Haswell. If I read the docs right it hasn't got 38F - but I need to check.

Please also provide the output of:
x86info -a

on both hosts and the guest

(In reply to Dr. David Alan Gilbert from comment #5)
> Please also provide the output of:
> x86info -a
>
> on both hosts and the guest

I have re-tested this scenario on my local Intel hosts, but the bug can't be reproduced there, so I have submitted jobs in Beaker to reserve Haswell hosts and will update with test information later.
I tried reproducing it and I can't on the similar box I have:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping : 2
microcode : 0x38
cpu MHz : 2999.812
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips : 4794.20
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

using /usr/libexec/qemu-kvm -M pc,accel=kvm -m 8G rhel-guest-image-7.4-106.x86_64.qcow2 -vnc :0 -monitor stdio -cpu host -smp 4

with either HEAD qemu or 2.9.0-rc based packages.

(In reply to Dr. David Alan Gilbert from comment #7)
> I tried reproducing it and I can't on the similar box I have:
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 63
> model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> stepping : 2
> microcode : 0x38
> cpu MHz : 2999.812
> cache size : 20480 KB
> physical id : 0
> siblings : 16
> core id : 0
> cpu cores : 8
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 15
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
> est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln
> pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
> avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
> bogomips : 4794.20
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> using /usr/libexec/qemu-kvm -M pc,accel=kvm -m 8G
> rhel-guest-image-7.4-106.x86_64.qcow2 -vnc :0 -monitor stdio -cpu host -smp 4
>
> with either HEAD qemu or 2.9.0-rc based packages.

Hi Dave,
I have re-tested this issue and I can reproduce it. The qemu command line used to boot the guest is as below:

/usr/libexec/qemu-kvm \
-name 'vm1' \
-sandbox off \
-machine pc-i440fx-rhel7.4.0 \
-nodefaults \
-device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=04 \
-device usb-ehci,id=usb1,bus=pci.0,addr=06 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=09 \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=unsafe,format=qcow2,file=/root/rhel74-64-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bus=pci.0,bootindex=0 \
-device virtio-net-pci,mac=9a:4f:50:51:52:53,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0 \
-netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 4G \
-smp 4 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device usb-mouse,id=input1,bus=usb1.0,port=2 \
-device usb-kbd,id=input2,bus=usb1.0,port=3 \
-vnc :1 \
-qmp tcp:0:8881,server,nowait \
-vga std \
-cpu host \
-monitor stdio \
-rtc base=localtime \
-boot order=cdn,once=c,menu=on,strict=off \
-enable-kvm \
-watchdog i6300esb \
-watchdog-action reset \
-device virtio-balloon-pci,id=balloon0,bus=pci.0

I have uploaded the debug info as attachments.

Created attachment 1272855 [details]
dsthost_dmesg
Created attachment 1272856 [details]
dsthost_x86info
Created attachment 1272857 [details]
gdb_dst_boot_guest
Created attachment 1272858 [details]
guest_dmesg
Created attachment 1272859 [details]
guest_x86info
Created attachment 1272860 [details]
srchost_dmesg
Created attachment 1272861 [details]
srchost_x86info
Interesting, I think this is a real bug.
dell-per630-01 has hyperthreading disabled - it shows 16 CPUs
dell-per630-02 has hyperthreading enabled - it shows 32 CPUs
One of the differences of the x86info is:
190c190
< eax in: 0x0000000a, eax = 07300803 ebx = 00000000 ecx = 00000000 edx = 00000603
---
> eax in: 0x0000000a, eax = 07300403 ebx = 00000000 ecx = 00000000 edx = 00000603
Intel's table 3-8 describes eax bits 8-15 as:
Number of general-purpose performance monitoring counters per logical processor.
So each logical processor on the host with hyperthreading has half the counters (4) of one on the host without hyperthreading (8); that makes sense, since the counters have been split between the threads.
Each low bit in MSR 38f is an 'enable' for one of those counters; the value we're trying to write (...ff) tries to enable 8 counters, which our hyperthreaded destination doesn't have.
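To make that arithmetic concrete, here is a minimal Python sketch that decodes the two CPUID leaf 0x0a values from the x86info diff and checks whether the rejected MSR value fits each host. The helper names are hypothetical, and the bit layout is an assumption based on my reading of the Intel SDM: eax bits 8-15 hold the general-purpose counter count, edx bits 0-4 hold the fixed counter count, and IA32_PERF_GLOBAL_CTRL uses bits 0 upward for general-purpose counters and bits 32 upward for fixed counters.

```python
# Sketch only: helper names are hypothetical; field layout follows the
# Intel SDM description of CPUID leaf 0x0a as understood here.

def decode_cpuid_0a(eax: int, edx: int) -> dict:
    """Split CPUID.0AH register values into the PMU fields used below."""
    return {
        "version": eax & 0xFF,              # eax bits 0-7
        "gp_counters": (eax >> 8) & 0xFF,   # eax bits 8-15
        "gp_width": (eax >> 16) & 0xFF,     # eax bits 16-23
        "fixed_counters": edx & 0x1F,       # edx bits 0-4
    }

def global_ctrl_fits(value: int, gp_counters: int, fixed_counters: int) -> bool:
    """True if 'value' only sets enable bits for counters that exist."""
    gp_mask = (1 << gp_counters) - 1
    fixed_mask = ((1 << fixed_counters) - 1) << 32
    return value & ~(gp_mask | fixed_mask) == 0

MSR_VALUE = 0x7000000FF  # value from the qemu-kvm error message

# The two "eax in: 0x0000000a" lines from the x86info diff.
eight_counter_host = decode_cpuid_0a(0x07300803, 0x00000603)
four_counter_host = decode_cpuid_0a(0x07300403, 0x00000603)

for name, pmu in (("8-counter host", eight_counter_host),
                  ("4-counter host", four_counter_host)):
    ok = global_ctrl_fits(MSR_VALUE, pmu["gp_counters"], pmu["fixed_counters"])
    print(name, pmu["gp_counters"], "GP counters, value fits:", ok)
```

Under these assumptions the value fits the host reporting 8 counters and sets four enable bits (4-7) that the host reporting 4 counters does not have, which matches the explanation above.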
Closing as not-a-bug because:
a) The use of -cpu host requires identical cpus
b) The source CPU in this system was configured without hyperthreading while the destination was configured with hyperthreading; this changes not only the number of CPUs but also some of the characteristics of the CPU (such as the number of counters)

Note that we'll hit similar problems if you use a non-host cpu but enable the PMU; but that's already a problem where migration is known not to succeed with perf counters.

(In reply to Dr. David Alan Gilbert from comment #17)
> Closing as not-a-bug because:
> a) The use of -cpu host requires identical cpus
> b) The source CPU in this system was configured without hyperthreading
> while the destination was configured with hyperthreading; this changes not
> only the number of CPUs but also some of the characteristics of the CPU
> (such as the number of counters)
>
> Note that we'll hit similar problems if you use a non-host cpu but enable
> the PMU; but that's already a problem where migration is known not to
> succeed with perf counters.

1) Before doing migration, do we need to check the following two requirements?
a) The use of -cpu host requires identical cpus
b) The hyperthreading configuration of the src host and dst host is the same, i.e., both src and dst host enable hyperthreading or both disable it
2) If these two requirements are not met, migration may go wrong, but that is not a bug, right?
3) Could you help me check whether the following methods for checking these two requirements are right?
a) The use of -cpu host requires identical cpus
# cat /proc/cpuinfo | grep processor
If this value is the same on the src host and dst host, it indicates they have identical cpus, yes?
b) The CPU configuration for hyperthreading
# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
If the "Thread Count" is double the "Core Count", this indicates hyperthreading is enabled; if the "Thread Count" is the same as the "Core Count", this indicates hyperthreading is disabled, yes?

(In reply to xianwang from comment #18)
> (In reply to Dr. David Alan Gilbert from comment #17)
> > Closing as not-a-bug because:
> > a) The use of -cpu host requires identical cpus
> > b) The source CPU in this system was configured without hyperthreading
> > while the destination was configured with hyperthreading; this changes not
> > only the number of CPUs but also some of the characteristics of the CPU
> > (such as the number of counters)
> >
> > Note that we'll hit similar problems if you use a non-host cpu but enable
> > the PMU; but that's already a problem where migration is known not to
> > succeed with perf counters.
>
> 1) Before doing migration, do we need to check the following two requirements?
> a) The use of -cpu host requires identical cpus

Correct.

> b) The hyperthreading configuration of the src host and dst host is the
> same, i.e., both src and dst host enable hyperthreading or both disable it

Yes, when using -cpu host

> 2) If these two requirements are not met, migration may go wrong, but that
> is not a bug, right?

I don't understand this question.

> 3) Could you help me check whether the following methods for checking these
> two requirements are right?
> a) The use of -cpu host requires identical cpus
> # cat /proc/cpuinfo | grep processor
> If this value is the same on the src host and dst host, it indicates they
> have identical cpus, yes?

It's probably best to use the 'model name' field

> b) The CPU configuration for hyperthreading
> # dmidecode -t processor | grep -E '(Core Count|Thread Count)'
> If the "Thread Count" is double the "Core Count", this indicates
> hyperthreading is enabled; if the "Thread Count" is the same as the "Core
> Count", this indicates hyperthreading is disabled, yes?

Yes, I think that's right.

Will calling virConnectCompareCPU() detect that the destination host is incompatible with the source host? If not, perhaps it would be reasonable to call this a bug?

It depends. virConnectCompareCPU will just check that the CPUs are compatible with respect to the provided CPU features. So as long as both CPUs report the same features they will be reported as compatible and it won't be a bug. If they report different features, virConnectCompareCPU should report they are incompatible.

NB, virConnectCompareCPU only compares the features exposed by the physical CPUs. When running a guest, limitations of KVM and/or QEMU may prevent some features being exposed to the guest. This filtering may vary between KVM/QEMU versions. So even if virConnectCompareCPU says the hosts are identical, it has not verified whether KVM/QEMU expose the same features to the guests. If you have the same KVM/QEMU versions on each host this isn't a problem, but be aware of this edge case if you have differing versions of KVM/QEMU.

As I understand it, the issue in this case is that HT is available on both source and dest, but is only enabled on the dest, causing the number of performance monitoring counters per logical processor to be smaller on the destination than on the source. If this is the case, then it is a host CPU mismatch and arguably virConnectCompareCPU() should catch it. Alternatively, qemu should make it clear in the docs that live migration with "-cpu host" not only requires the CPUs to be identical, but requires them to be configured identically by the BIOS/OS, specifically with respect to HT.
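The pre-migration checks discussed above (same 'model name', same hyperthreading setting) can be sketched in Python. This is only an illustration: the function names are hypothetical, it reads 'siblings' and 'cpu cores' from /proc/cpuinfo rather than using dmidecode, and the sample text is a trimmed stand-in for real host output, with the source-host 'siblings' value assumed from its lscpu output (1 thread per core). Passing this check is necessary for -cpu host migration but does not prove the hosts are compatible.

```python
# Sketch only: names are hypothetical; parses the first processor block
# of /proc/cpuinfo-style text.

def summarize_cpuinfo(text: str) -> dict:
    """Extract the model name and hyperthreading state from cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break          # blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    siblings = int(fields["siblings"])   # logical CPUs per package
    cores = int(fields["cpu cores"])     # physical cores per package
    return {
        "model_name": fields["model name"],
        "siblings": siblings,
        "cpu_cores": cores,
        "hyperthreading": siblings > cores,
    }

def hosts_look_compatible(src: dict, dst: dict) -> bool:
    """Coarse check: same CPU model and same hyperthreading setting."""
    return (src["model_name"] == dst["model_name"]
            and src["hyperthreading"] == dst["hyperthreading"])

# Illustrative samples, not captured output from the hosts in this bug.
DST_SAMPLE = """\
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
siblings : 16
cpu cores : 8
"""
SRC_SAMPLE = DST_SAMPLE.replace("siblings : 16", "siblings : 8")

src = summarize_cpuinfo(SRC_SAMPLE)
dst = summarize_cpuinfo(DST_SAMPLE)
print("src HT:", src["hyperthreading"], "dst HT:", dst["hyperthreading"])
print("look compatible:", hosts_look_compatible(src, dst))
```

On a real host the text would come from something like open("/proc/cpuinfo").read() on each machine, with the two summaries compared before starting the migration.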
I'm sympathetic that something should catch it somewhere; I suspect there are other situations as well (e.g. if you explicitly enable performance counters on a non-'host' cpu and your source and destination are different generations with different numbers of counters). We also thought it would be nice if we could pass the number of supported counters to qemu so that you could set up a minimum that would enable you to migrate between these types of hosts.

> I'm sympathetic that something should catch it somewhere; I suspect there are other situations as well (e.g. if you explicitly enable performance counters on a non-'host' cpu and your source and destination are different generations with different numbers of counters). We also thought it would be nice if we could pass the number of supported counters to qemu so that you could set up a minimum that would enable you to migrate between these types of hosts.

Hey David,
I'm investigating a similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=2066222
I was wondering, was anything done to mitigate this problem since your last update here?
Thanks,
Barak.

Hi Barak,
No, I don't think we have anything more to help there. I'd generally advise against using the host cpu type, because it's so sensitive to *anything* that's different - especially if you're enabling performance counters. Without the performance counters you'd probably get away with the HT difference.